Giter VIP home page Giter VIP logo

lda4j's Introduction

LDA4j

A Java implemention of LDA(Latent Dirichlet Allocation). Inference topics from a set of documents with few lines of Java code.

How To Use

  • code
public static void main(String[] args)
{
    // 1. Load corpus from disk
    Corpus corpus = Corpus.load("data/mini");
    // 2. Create a LDA sampler
    LdaGibbsSampler ldaGibbsSampler = new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
    // 3. Train it
    ldaGibbsSampler.gibbs(10);
    // 4. The phi matrix is a LDA model, you can use LdaUtil to explain it.
    double[][] phi = ldaGibbsSampler.getPhi();
    Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
    LdaUtil.explain(topicMap);
}
  • output
topic 0 :
公司=0.009538408630174017
市场=0.008848009751698062
**=0.008756489189917975
企业=0.0068280510303913395
发展=0.005991900977658479
目前=0.004408401842957633
产品=0.0041981128106208625
服务=0.003756081561227181
已经=0.003410105744626914
记者=0.003289155629929911

topic 1 :
专业=0.00872496522205349
工作=0.008108171408190876
学生=0.00793944661866665
学校=0.006307480899983371
考生=0.005295205701518912
大学=0.0052671267600129445
教育=0.0051547106121291805
考试=0.00507254577329609
人才=0.004037747449851247
招聘=0.003913811857165103

topic 2 :
医院=0.006197066939127888
治疗=0.0048149451145789455
患者=0.0032264139617756145
健康=0.0026521203697810374
手术=0.0025525793863978826
女性=0.0023724111474892357
专家=0.0021711200905248276
发现=0.0021645199996586885
病人=0.0021567877663232846
医生=0.002155356316589454

topic 3 :
没有=0.008818728535385055
问题=0.00476170232225101
**=0.00476161560515722
工作=0.004610303190509696
生活=0.004283310385880329
文化=0.0036558079614339278
孩子=0.003327977201447208
不能=0.0032901108349775716
知道=0.003127437274214269
已经=0.0030419673256694545

topic 4 :
公司=0.018241005428669386
股东=0.009281048036676322
股份=0.0078638937643388
搜狐=0.0065617441267974705
有限公司=0.006139808167975946
直播员=0.005439495997416965
股权=0.005353954615162839
项目=0.004984451830097043
发行=0.004511099443364358
改革=0.004489038403046334

topic 5 :
旅游=0.013331508385667979
游客=0.004296589238778804
城市=0.0032312276892446116
文化=0.0026831367778820704
旅行社=0.002242817493567529
世界=0.0021001546909288965
成都=0.001991337289815279
活动=0.001894687770595843
北京=0.0017106388886854072
公园=0.0016134766410937638

topic 6 :
美国=0.007679518424242107
日本=0.004777746687572576
训练=0.003947682941243526
系统=0.003926562149803556
飞机=0.0038757503504304267
部队=0.00365041154980242
进行=0.003644226666909795
军事=0.003637873811678725
作战=0.003407296869780034
装备=0.003319112427162246

topic 7 :
比赛=0.0092171879571152
队员=0.0036851386114063237
联赛=0.0032845199043377146
球队=0.0029432131822116707
冠军=0.0024090127058022104
俱乐部=0.002348957542679953
球员=0.0022159606741087795
决赛=0.002192739194333911
赛季=0.0020352324832133267
对手=0.001974829226645783

topic 8 :
The=0.002190604616155811
意思=0.001186435720799536
It=0.0011515962078723501
理解=0.0010433831740419728
What=9.560997173453189E-4
They=9.345962358267594E-4
听力=8.362275772461826E-4
In=8.166984660263638E-4
阅读=7.775969918239417E-4
译文=7.568900132152651E-4

topic 9 :
***=0.002633793448326645
曹操=0.0018832599387516155
曹丕=0.0016567353952110328
皇帝=0.001629990508040292
甄洛=0.0012930147890964736
**=0.0012783947883529055
蒋介石=0.0010732052837016102
曹睿=8.511476483731437E-4
女王=8.125680914406854E-4
皇后=8.013815303127338E-4
  • corpus The data/mini is some documents included in this project, which use space to segment words. Feel free to replace it with yours.
  • algorithm Mainly depend on Gregor Heinrich's great work. Read more about this implementation on《LDA入门与Java实现》

lda4j's People

Contributors

dhj199268 avatar hankcs avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

lda4j's Issues

憨批

LdaUtil
Map<Double, String> rankMap = new TreeMap<Double, String>(Collections.reverseOrder());
把double和string互换一下谢谢

有相关的api文档吗,如何获得新文档的主题分布?

有相关的api文档吗,请问如何获得新文档的主题分布?
int[] document = Corpus.loadDocument("data/mini/军事_510.txt", corpus.getVocabulary());
double[] tp = LdaGibbsSampler.inference(phi, document);
Map<String, Double> topic = LdaUtil.translate(tp, phi, corpus.getVocabulary(), 100);
LdaUtil.explain(topic);

Inference新文档中有生单词问题

首先感谢hankcs博主的分享,在使用LDA4j的过程中,我重写了Corpus类里的load和loadDocument方法,从数据库中读写数据测试成功。
但是测试的过程中遇到了一个问题,就是先用训练集训练出来phi,然后拿来一个新文档使用这个phi推断其概率分布发现报数组越界的错误,我初步调试发现一旦新文档中包含训练集中没有的生单词,你写的Inference便无法使用,这个问题希望博主能进行一下异常处理。

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.