
Comments (6)

haoawesome commented on August 23, 2024

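Discussion
AixinSG: As I understand it, undersampling has limited effect overall.

刘知远THU: For imbalanced classification, especially when there are very many labeled positives, almost no labeled negatives, but a large amount of unlabeled data, how should this be handled? This problem is very common in relation extraction. For now, the only option is to randomly sample from the large pool of unlabeled data and treat the samples as negatives.

xierqi: I surveyed this area for a while. About 90% of the work is sampling, and the biggest problem is that the evaluation methods do not fit real-world scenarios. I personally recommend Domingos's MetaCost: it is very practical, and you just set the costs from experience. http://t.cn/RPiexE9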
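
To make the "set the costs from experience" remark concrete, here is a minimal sketch of the relabeling rule at the core of MetaCost: each training example is assigned the class with the lowest expected cost under a cost matrix. The full method also estimates the class probabilities by bagging and then retrains on the relabeled data, which this sketch leaves out. The cost values and probability estimates below are hypothetical, chosen only for illustration.

```python
def min_expected_cost_label(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[j]   : estimated P(true class = j | x)
    cost[i][j] : cost of predicting class i when the true class is j
    """
    n_classes = len(probs)
    expected = [
        sum(probs[j] * cost[i][j] for j in range(n_classes))
        for i in range(n_classes)
    ]
    return min(range(n_classes), key=lambda i: expected[i])


# Hypothetical costs set by experience: class 0 = majority, class 1 = minority.
# Missing a minority example (predict 0, truth 1) costs 10x a false alarm.
COST = [
    [0.0, 10.0],  # predict 0
    [1.0, 0.0],   # predict 1
]

# Hypothetical probability estimates for three training examples.
estimates = [[0.95, 0.05], [0.85, 0.15], [0.60, 0.40]]
relabeled = [min_expected_cost_label(p, COST) for p in estimates]
print(relabeled)  # [0, 1, 1]
```

With these costs, an example is relabeled as the minority class as soon as its estimated minority probability exceeds 1/11, which is how the cost matrix compensates for the imbalance without any resampling.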

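eacl_newsmth: In relation extraction, are positives really that abundant, with no negatives? I would have thought that in many cases positives are limited and negatives are plentiful (though of course you can argue that negatives are hard to define)...

刘知远THU: Reply to @eacl_newsmth: For example, a knowledge graph can supply many positives, but negatives have to be generated by randomly replacing an entity in a positive. That easily ends up treating examples that are actually correct as negatives.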
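
The entity-replacement procedure described here can be sketched as follows. This is an illustrative sketch, not code from any of the systems mentioned: the function name `corrupt_triple`, the toy triples, and the `max_tries` parameter are all assumptions. It rejects candidates that are already known positives, which reduces the false-negative problem but cannot remove it, because a true fact missing from the knowledge graph can still be sampled.

```python
import random


def corrupt_triple(triple, entities, known_positives, rng, max_tries=100):
    """Replace the head or the tail with a random entity to make a negative.

    Candidates found in known_positives are rejected. Returns None if no
    acceptable candidate is found within max_tries attempts.
    """
    head, relation, tail = triple
    for _ in range(max_tries):
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            candidate = (replacement, relation, tail)  # corrupt the head
        else:
            candidate = (head, relation, replacement)  # corrupt the tail
        if candidate != triple and candidate not in known_positives:
            return candidate
    return None


# Toy knowledge graph, for illustration only.
positives = {
    ("Beijing", "capital_of", "China"),
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "France"),
}
entities = sorted({e for h, _, t in positives for e in (h, t)})

rng = random.Random(0)
negative = corrupt_triple(("Beijing", "capital_of", "China"), entities, positives, rng)
print(negative)
```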

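eacl_newsmth: Reply to @刘知远THU: Right, I guessed you would bring up that example, which is why I added that it depends on how you define negatives, haha. I struggled with this for a long time too, and in the end concluded that positives are still the scarce ones. And in many cases, can you guarantee that the positives are correct?

刘知远THU: Reply to @eacl_newsmth: The positives are basically correct, for example those from Freebase, but the negatives have a large impact on results. :) The TransH paper from MSRA at this year's AAAI proposes a trick for selecting negatives that works remarkably well.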
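
The trick mentioned here is presumably the Bernoulli sampling strategy from the TransH paper: instead of corrupting the head or the tail with equal probability, the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail for that relation. A minimal sketch of computing that probability, with hypothetical entity names and a hypothetical function name:

```python
from collections import defaultdict


def head_replacement_probs(triples):
    """For each relation, return tph / (tph + hpt).

    tph: average number of distinct tails per distinct head
    hpt: average number of distinct heads per distinct tail
    """
    tails_of = defaultdict(lambda: defaultdict(set))  # relation -> head -> tails
    heads_of = defaultdict(lambda: defaultdict(set))  # relation -> tail -> heads
    for head, relation, tail in triples:
        tails_of[relation][head].add(tail)
        heads_of[relation][tail].add(head)

    probs = {}
    for relation in tails_of:
        by_head = tails_of[relation]
        by_tail = heads_of[relation]
        tph = sum(len(tails) for tails in by_head.values()) / len(by_head)
        hpt = sum(len(heads) for heads in by_tail.values()) / len(by_tail)
        probs[relation] = tph / (tph + hpt)
    return probs


# Hypothetical many-to-one relation: several people born in the same city.
triples = [
    ("p1", "born_in", "city_x"),
    ("p2", "born_in", "city_x"),
    ("p3", "born_in", "city_x"),
    ("p4", "born_in", "city_y"),
]
probs = head_replacement_probs(triples)
print(probs["born_in"])  # tph = 1, hpt = 2, so about 0.333
```

For a many-to-one relation like this, swapping the head is likely to land on another true fact, so the rule shifts most of the probability toward replacing the tail.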

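eacl_newsmth: Reply to @刘知远THU: Right, the instances in the KB are indeed correct, but the samples found by matching those instances against massive document collections are not necessarily correct. Judging from current work, a lot of the work that focuses on negatives does improve results somewhat. Last year a student at the linguistics institute (语言所) exploited properties of the "relations" to select better training samples, which also improved performance. But for this particular problem, the reliability of the positives cannot be sidestepped.

刘知远THU: Reply to @eacl_newsmth: Could you tell me the title of that paper? What I am working on at the moment is not extracting relations from text but knowledge graph completion, which is somewhat like link prediction on a graph, except that the links to predict are relations of different types.

eacl_newsmth: Reply to @刘知远THU: http://t.cn/RPX75A3 Right. I watched a talk by a young guy from your group, and it seemed closely related to Sebastian's earlier work. Or maybe it was just the way he presented it? When are you back in Beijing? We could have a proper discussion.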

from hao.

haoawesome commented on August 23, 2024

search keywords

Positive only
Imbalanced data

readings

http://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf (recommended by @xierqi) Domingos, MetaCost: A General Method for Making Classifiers Cost-Sensitive

http://www.aclweb.org/anthology/P/P13/P13-2141.pdf (recommended by @eacl_newsmth) Towards Accurate Distant Supervision for Relational Facts Extraction

http://cseweb.ucsd.edu/~elkan/posonly.pdf Learning Classifiers from Only Positive and Unlabeled Data

http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf Haibo He, Edwardo A. Garcia. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.

http://www.computer.org/csdl/proceedings/icnc/2008/3304/04/3304d192-abs.html Guo, X., Yin, Y., Dong, C., Yang, G., & Zhou, G. (2008). On the Class Imbalance Problem. 2008 Fourth International Conference on Natural Computation (pp. 192-201).
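
For the positive-and-unlabeled setting raised in the discussion, the central result of the Elkan and Noto paper listed above can be sketched in a few lines. Under their assumption that labeled positives are selected completely at random from all positives, a classifier g(x) trained to separate labeled from unlabeled examples satisfies P(y=1 | x) = g(x) / c, where c = P(s=1 | y=1) can be estimated as the mean of g(x) over held-out labeled positives. The function names and scores below are hypothetical, and clipping to 1.0 is a practical addition of this sketch.

```python
def estimate_label_frequency(scores_on_heldout_positives):
    """Estimate c = P(s=1 | y=1) as the mean of g(x) over held-out labeled positives."""
    return sum(scores_on_heldout_positives) / len(scores_on_heldout_positives)


def corrected_probability(g_x, c):
    """Convert g(x) = P(s=1 | x) into an estimate of P(y=1 | x), clipped at 1.0."""
    return min(1.0, g_x / c)


# Hypothetical outputs of a labeled-vs-unlabeled classifier on held-out positives.
heldout_scores = [0.5, 0.3, 0.4]
c = estimate_label_frequency(heldout_scores)

# An unlabeled example scored 0.2 by g is about 50% likely to be a true positive.
print(corrected_probability(0.2, c))
```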

tools

http://www.nltk.org/_modules/nltk/classify/positivenaivebayes.html nltk

http://weka.wikispaces.com/MetaCost Weka


haoawesome commented on August 23, 2024

datasets

http://pages.cs.wisc.edu/~dpage/kddcup2001/ Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin

https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table UCI dataset repo, classification category

More datasets:
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html


haoawesome commented on August 23, 2024

I have put together a draft on imbalanced data classification. Please take a look and see whether anything needs to be added.
https://github.com/memect/hao/blob/master/awesome/imbalanced-data-classification.md

Related discussion record: #47


haoawesome commented on August 23, 2024

[Resource roundup] Imbalanced data classification: http://memect.co/hIYTr7R Classic papers: MetaCost (Domingos, 1999), SMOTE (Chawla, 2002), and the 2004 survey by Yanjun Qi of CMU (now a professor at UVA); tools and datasets (WEKA, NLTK); SMOTE implementations on GitHub. Thanks to @AixinSG @刘知远THU @xierqi @eacl_newsmth

http://www.weibo.com/5220650532/BiZQEloKK?ref=#_rnd1408426979569


haoawesome commented on August 23, 2024

好东西传送门: Reply to @朱小强_Bigeye_THU: http://t.cn/RP8jyzY "The most interesting compromise in terms of model complexity and AUC is MetaCost using PART as the base classification algorithm. AdaBoost yields higher AUC values but high complexity models."

