sudar / yahoo_lda
Yahoo!'s topic modelling framework using Latent Dirichlet Allocation
License: Apache License 2.0
The Yahoo_LDA project uses several 3rd party open source libraries and tools. This file summarizes the tools used, their purpose, and the licenses under which they're released. Except as specifically stated below, the 3rd party software packages are not distributed as part of this project, but instead are separately downloaded and built on the developer's machine as a pre-build step.

* Ice-3.4.1 (GNU GENERAL PUBLIC LICENSE)
  * An efficient inter-process communication framework, used for the distributed storage of (topic, word) tables.
  * http://www.zeroc.com/
* cppunit-1.12.1 (GNU LESSER GENERAL PUBLIC LICENSE)
  * C++ unit testing framework. We use this for unit tests.
  * http://cppunit.sourceforge.net
* glog-0.3.0 (BSD)
  * Logfile generation (Google's logging library).
  * http://code.google.com/p/google-glog/
* mcpp-2.7.2 (BSD)
  * C++ preprocessor.
  * http://mcpp.sourceforge.net/
* tbb22_20090809oss (GNU GENERAL PUBLIC LICENSE)
  * Intel Threading Building Blocks, a multithreaded processing library. Much easier to use than pthreads; we use the pipeline class.
  * http://threadingbuildingblocks.org
* bzip2-1.0.5 (BSD)
  * Data compression.
  * http://www.bzip.org/
* gflags-1.2 (BSD)
  * Google's flag-processing library (used for command-line options).
  * http://code.google.com/p/google-gflags/
* protobuf-2.2.0a (BSD)
  * Protocol buffers, Google's serialization library (used for serializing data to disk and as the internal key data structure).
  * http://code.google.com/p/protobuf/
* boost-1.46.0 (Boost Software License - Version 1.0 - August 17th, 2003)
  * Boost libraries (various datatypes).
  * http://www.boost.org/

Please refer to the HTML or PDF documentation at docs/html/index.html and docs/latex/refman.pdf respectively for more information.
How do I decide how many topics should be returned by LDA? Does Expectation Maximization help in determining the topic count? If yes, how do I calculate it?
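For context: Expectation Maximization does not pick the topic count by itself — K is an input to training. A common heuristic is to train with several candidate K values and compare held-out per-word perplexity, exp(-loglik/tokens), preferring the K past which it stops improving. A minimal sketch; the (K, log-likelihood, token-count) triples below are invented purely for illustration — in practice loglik would come from evaluating each trained model on held-out documents:

```shell
# Compare candidate topic counts by held-out per-word perplexity.
# The numbers are made up; lower perplexity is better.
printf '%s\n' \
  "50 -1200000 150000" \
  "100 -1150000 150000" \
  "200 -1160000 150000" |
awk '{ ppl = exp(-$2 / $3); printf "K=%s perplexity=%.1f\n", $1, ppl }'
```

If K=100 beats K=50 but K=200 does not beat K=100, then K around 100 is a reasonable choice under this criterion.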
Hi
When I run Yahoo LDA on my Hadoop cluster, I hit the following problems:
${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global
it cannot download the file and the whole process crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.
Finally I hit the following problem, which is not related to the running script, so how can I recover from this situation?
W1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer
Should I modify the LDA.sh script to check the error code of each module's execution and retry until the error code indicates success?
Thank you!
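For what it's worth, a retry wrapper along the lines the post suggests is straightforward. This is only a sketch — whether the real LDA.sh modules also need their checkpoint/temporary state cleaned between attempts is an assumption to verify against the script:

```shell
# Retry a command until it exits 0, up to a fixed number of attempts.
retry() {
  max=$1; shift
  attempt=1
  until "$@"; do
    status=$?                     # exit code of the failed attempt
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts (exit code $status)" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep 2                       # back off a little before retrying
  done
}

# Usage sketch: in LDA.sh this would wrap the learntopics invocation;
# 'true' stands in for the module here.
retry 3 true && echo "module succeeded"
```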
Hi
I ran train mode on a corpus of about 1 GB. I tried twice, each time with 500 topics and 500 iterations, but I got two quite different results. That is, the two "lda.topToWor.txt" files from the two training runs are quite different. I compared the words in each topic (ignoring weights); only 250 topics in the two files are similar (where I call two topics similar if more than 10 words match).
This means I get quite different results from each random initialization. Is there a way to get a stable result?
Should I increase the number of topics, or the number of iterations? I tried 1000 iterations but saw no big change.
Thanks!
Yanbo
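The by-hand comparison in the post above (two topics count as similar when enough top words match) can be scripted. A small sketch, assuming each topic is given as a whitespace-separated list of its top words:

```shell
# Count how many words two topics share (each argument is a
# space-separated top-word list); duplicate words count once.
topic_overlap() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x); for (i = 1; i <= n; i++) seen[x[i]] = 1
    m = split(b, y); c = 0
    for (i = 1; i <= m; i++) if (y[i] in seen) { c++; delete seen[y[i]] }
    print c
  }'
}

topic_overlap "surf board wave ocean fin" "surf wave sand ocean"   # prints 3
```

On the stability question itself: collapsed Gibbs sampling is stochastic, so run-to-run variation like this is expected; topics from two runs can only be compared up to a matching like the one sketched here.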
DM_Client::add_server: DM_Server_0:default -h 10.101.173.51 -p 25342
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_0'
facet:
operation: ice_isA
terminate called after throwing an instance of 'IceUtil::Exception'
what(): Outgoing.cpp:424: IceUtil::Exception
*** Aborted at 1325947898 (unix time) try "date -d @1325947898" if you are using GNU date ***
PC: @ 0x3d0dc30265 (unknown)
*** SIGABRT (@0xc3b800005280) received by PID 21120 (TID 0x2b1b54a9d8e0) from PID 21120; stack trace: ***
@ 0x3d0e40eb10 (unknown)
@ 0x3d0dc30265 (unknown)
@ 0x3d0dc31d10 (unknown)
@ 0x2b1b54858d14 (unknown)
@ 0x2b1b54856e16 (unknown)
@ 0x2b1b54856e43 (unknown)
@ 0x2b1b54856f2a (unknown)
@ 0x43faa4 DM_Client::add_server()
@ 0x44042e DM_Client::DM_Client()
@ 0x49da95 Unigram_Model_Synchronizer_Helper::Unigram_Model_Synchronizer_Helper()
@ 0x49d4b5 Unigram_Model_Synchronized_Training_Builder::create_execution_strategy()
@ 0x447d04 Model_Director::build_model()
@ 0x4396ab main
@ 0x3d0dc1d994 (unknown)
@ 0x40d8c9 (unknown)
/data/1/mr/local/taskTracker/jwang/jobcache/job_201112201444_4463/attempt_201112201444_4463_m_000000_0/work/./LDA.sh: line 103: 21120 Aborted $LDALIBS/learntopics --model=$model --iter=$iters --topics=$topics --servers="$servers" --chkptdir="${MY_INP_DIR}" $flags 1>&2
Synch directory: hdfs://hadooprsonn001.bo1.shopzilla.sea/user/jwang/workspace/ldanew/temporary/synchronize/learntopics
Num of map tasks: 10
Found 8 items
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/0
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/1
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/3
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/4
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/5
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/6
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/7
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/8
Num of clients done: 7
Sleeping
I run
./runLDA.sh 1 "data" train default "/user/jwang/simulation/data/offers/current" "/user/jwang/workspace/lda/" 3244 1000 5 /user/jwang/software/YLDA/LDALibs.jar 20 "/user/jwang/workspace/ldatemp"
There are text files in /user/jwang/simulation/data/offers/current, where each line is a document.
It invoked a streaming job whose mappers ran successfully, but the reducers failed. Here is the error from one of the reducers:
java.io.IOException: subprocess exited with error code 1
R/W/S=540/0/0 in:77=540/7 [rec/s] out:0=0/7 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=root
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Thu Jan 05 16:04:36 PST 2012
Broken pipe
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:131)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Any idea?
I have a Hadoop distribution running locally.
I followed the multi-machine setup in the distribution, uploaded LDALibs.jar to HDFS, and also put a ut_out folder there.
When running the script -
./runLDA.sh 1 "" train default ./ut_out/ydir_1k.txt ./ld 8000 25 500 hdfs://localhost:9000/.../LDALibs.jar 1
it fails.
This is the output I get:
and the error log for the task that fails is:
model=1 flags=" " trained_data=""
/tmp/hadoop-datastore/hadoop/mapred/local/taskTracker/jobcache/job_201107240937_0039/attempt_201107240937_0039_r_000000_0/work/./Formatter.sh: line 31: /tmp/hadoop-datastore/hadoop/mapred/local/taskTracker/jobcache/job_201107240937_0039/attempt_201107240937_0039_r_000000_0/work/LDALibs/formatter: Permission denied
Formatter returned an error code of 126
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 126
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
Any ideas/thoughts on what's wrong?
Thanks
Yuval
Hello,
I wasn't sure which was the best forum to post this issue/question to - the Yahoo groups or here. Issues seem to have more activity than the groups. (I've cross-posted: http://tech.groups.yahoo.com/group/y_lda/message/15)
I'm a total newbie to LDA, so please forgive me if I don't quite formulate this
question concisely.
From the single machine instructions for "Using the Model"
(/Yahoo_LDA/docs/html/single__machine__usage.html#using_model) it indicates that
you can run in either batch OR streaming mode.
In batch mode, the output is several files: lda.docToTop.txt, lda.topToWor.txt, and lda.worToTop.txt.
lda.docToTop.txt is what I want - document-to-topic assignments.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (65,0.138889)
(54,0.111111) (9,0.0833333) (21,0.0833333) (27,0.0833333) (87,0.0833333)
(29,0.0555556) (52,0.0555556) (56,0.0555556) (72,0.0555556)
However, in streaming mode it seems to return document word-to-topic assignments, similar to batch mode's lda.worToTop.txt.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,87)
(past,87) (months,72) (noticed,21) (guy,52) (surf,27) (magazine,87)
(published,10) (finally,21) (run,21) (copyright,54) (surfboards,27) (rights,54)
(reserved,54) (june,72) (launches,73) (improved,9) (site,54) (order,73)
(custom,56) (surfboards,27) (online,52) (improvements,9) (top,9) (selling,6)
(models,29) (middot,65) (rocket,44) (fish,56) (middot,65) (speed,65) (egg,95)
(middot,65) (classic,29) (middot,65) (squash,55)
Can I make streaming mode return doc-topic assignments?
If not, can I compute the doc-topic assignments easily from the word-to-topic assignment output?
I would like to call the streaming mode from a Java process.
Please help. :)
Thanks!
-John
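On the second question in the post above: yes, doc-topic proportions can be recovered from the streaming output by counting how often each topic appears among a document's (word,topic) pairs and normalizing. A sketch in awk; the field layout (id/label fields first, then (word,topic) pairs) is an assumption based on the examples in the post:

```shell
# Read "id labels (word,topic) ..." lines and emit "id labels (topic,share) ...".
to_doc_topics() {
  awk '{
    split("", count); total = 0; id = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^\(.+,[0-9]+\)$/) {            # a (word,topic) pair
        split($i, a, ","); t = a[2]; sub(/\)$/, "", t)
        count[t]++; total++
      } else {
        id = (id == "" ? $i : id " " $i)       # still in the id/label prefix
      }
    }
    printf "%s", id
    for (t in count) printf " (%s,%.4f)", t, count[t] / total
    print ""
  }'
}

echo "doc1 label (watch,87) (past,87) (months,72)" | to_doc_topics
```

The topic order in the output follows awk's unspecified array iteration order; pipe through sort if a stable order matters.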
./runLDA.sh 1 "" train default "/user/chengmingbo/input/ydir.txt" "/user/chengmingbo/output" -1 100 5 "/user/chengmingbo/LDALibs.jar" 3
I find that if the output directory has a trailing '/', the training process deletes all the files generated by the formatter.
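Until the scripts guard against this themselves, a defensive workaround is to strip the trailing '/' before passing the output directory in. A one-line sketch, using the path from the post:

```shell
# ${var%/} drops a single trailing slash, so downstream path joins
# like "$output_dir/temporary" behave the same either way.
output_dir="/user/chengmingbo/output/"
output_dir="${output_dir%/}"
echo "$output_dir"   # prints /user/chengmingbo/output
```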
I don't know how to handle the exception I encountered.
hadoop version:0.20.2
os:Linux version 2.6.18-164.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 03:28:30 EDT 2009
The information is as follows:
Deleted hdfs://h253014:9000/user/chengmingbo/output
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
stderr logs
oop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/out/learnTopics.*
W0517 14:36:01.712434 20074 Controller.cpp:100] ----------------------------------------------------------------------
W0517 14:36:01.712734 20074 Controller.cpp:115] You have chosen multi machine training mode
W0517 14:36:01.713021 20074 Unigram_Model_Training_Builder.cpp:60] Initializing Dictionary from lda.dict.dump
W0517 14:36:01.713249 20074 Unigram_Model_Training_Builder.cpp:62] Dictionary Initialized
W0517 14:36:01.713443 20074 Unigram_Model_Trainer.cpp:49] Initializing Word-Topic counts table from docs lda.wor, lda.top using 0 words & 100 topics.
W0517 14:36:01.713533 20074 Unigram_Model_Trainer.cpp:53] Initialized Word-Topic counts table
W0517 14:36:01.713568 20074 Unigram_Model_Trainer.cpp:57] Initializing Alpha vector from Alpha_bar = 50
W0517 14:36:01.713608 20074 Unigram_Model_Trainer.cpp:60] Alpha vector initialized
W0517 14:36:01.713624 20074 Unigram_Model_Trainer.cpp:63] Initializing Beta Parameter from specified Beta = 0.01
W0517 14:36:01.713641 20074 Unigram_Model_Trainer.cpp:67] Beta param initialized
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_0'
facet:
operation: ice_isA
terminate called after throwing an instance of 'IceUtil::Exception'
what(): Outgoing.cpp:424: IceUtil::Exception
*** Aborted at 1337236561 (unix time) try "date -d @1337236561" if you are using GNU date ***
PC: @ 0x3293430265 (unknown)
*** SIGABRT (@0x1f400004e6a) received by PID 20074 (TID 0x2b77acda63b0) from PID 20074; stack trace: ***
@ 0x329400e7c0 (unknown)
@ 0x3293430265 (unknown)
@ 0x3293431d10 (unknown)
@ 0x3299cbec44 (unknown)
@ 0x3299cbcdb6 (unknown)
@ 0x3299cbcde3 (unknown)
@ 0x3299cbceca (unknown)
@ 0x44470e DM_Client::add_server()
@ 0x44509e DM_Client::DM_Client()
@ 0x4a3f95 Unigram_Model_Synchronizer_Helper::Unigram_Model_Synchronizer_Helper()
@ 0x4a39b5 Unigram_Model_Synchronized_Training_Builder::create_execution_strategy()
@ 0x447d04 Model_Director::build_model()
@ 0x4396ab main
@ 0x329341d994 (unknown)
@ 0x40d8c9 (unknown)
/d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/./LDA.sh: line 103: 20074 Aborted $LDALIBS/learntopics --model=$model --iter=$iters --topics=$topics --servers="$servers" --chkptdir="${MY_INP_DIR}" $flags 1>&2
Synch directory: hdfs://h253014:9000/user/chengmingbo/output/temporary/synchronize/learntopics
Num of map tasks: 3
Found 3 items
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/0
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/1
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/2
Num of clients done: 3
All clients done!
learntopics returned an error code of 134
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
syslog logs
2012-05-17 14:35:49,106 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-05-17 14:35:49,272 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/distcache/-5116373793315176501_-731644028_1402464899/h253014/user/chengmingbo/LDALibs.jar <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/LDALibs
2012-05-17 14:35:49,289 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/functions.sh <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/functions.sh
2012-05-17 14:35:49,299 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/job.jar <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/job.jar
2012-05-17 14:35:49,308 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/LDA.sh <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/LDA.sh
2012-05-17 14:35:49,316 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/.job.jar.crc <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/.job.jar.crc
2012-05-17 14:35:49,397 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-05-17 14:35:49,578 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2012-05-17 14:35:49,578 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2012-05-17 14:35:49,586 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2012-05-17 14:35:49,694 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/./LDA.sh, 1, , 100, 5]
2012-05-17 14:35:49,775 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2012-05-17 14:36:06,632 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2012-05-17 14:36:06,713 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-05-17 14:36:06,717 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-05-17 14:36:06,721 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
I followed the setup instructions for a single machine, and when I try the streaming mode example, if I input the same string multiple times I get a different topic categorization every time:
java Tokenizer | ../learntopics -teststream -dumpprefix=../ut_out/lda --topics=100 --dictionary=../ut_out/lda.dict.dump
W0720 12:28:20.250313 2803 Controller.cpp:115] ----------------------------------------------------------------------
W0720 12:28:20.250712 2803 Controller.cpp:117] Log files are being stored at /lda/ut_out/learnTopics.*
W0720 12:28:20.250731 2803 Controller.cpp:119] ----------------------------------------------------------------------
W0720 12:28:20.251055 2803 Controller.cpp:140] You have chosen single machine testing mode
W0720 12:28:20.251379 2803 Unigram_Model_Streaming_Builder.cpp:56] Initializing global dictionary from ../ut_out/lda.dict.dump
W0720 12:28:20.308131 2803 Unigram_Model_Streaming_Builder.cpp:59] Dictionary initialized and has 17208
W0720 12:28:20.308279 2803 Unigram_Model_Streaming_Builder.cpp:86] Estimating the words that will fit in 2048 MB
W0720 12:28:20.408761 2803 Unigram_Model_Streaming_Builder.cpp:91] 17208 will fit in 1.06012 MB of memory
W0720 12:28:20.408906 2803 Unigram_Model_Streaming_Builder.cpp:93] Initializing Local Dictionary from ../ut_out/lda.dict.dump with 17208 words.
W0720 12:28:20.491570 2803 Unigram_Model_Streaming_Builder.cpp:122] Local Dictionary Initialized. Size: 34416
W0720 12:28:20.494669 2803 Unigram_Model_Streamer.cpp:64] Initializing Word-Topic counts table from dump ../ut_out/lda.ttc.dump using 17208 words & 100 topics.
W0720 12:28:20.549022 2803 Unigram_Model_Streamer.cpp:88] Initialized Word-Topic counts table
W0720 12:28:20.549149 2803 Unigram_Model_Streamer.cpp:91] Initializing Alpha vector from dumpfile ../ut_out/lda.par.dump
W0720 12:28:20.549247 2803 Unigram_Model_Streamer.cpp:94] Alpha vector initialized
W0720 12:28:20.549309 2803 Unigram_Model_Streamer.cpp:97] Initializing Beta Parameter from specified Beta = 0.01
W0720 12:28:20.549383 2803 Unigram_Model_Streamer.cpp:101] Beta param initialized
W0720 12:28:20.557430 2803 Testing_Execution_Strategy.cpp:64] Starting Parallel testing Pipeline
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,83) (past,86) (months,77) (noticed,15) (guy,93) (surf,35) (magazine,86) (published,92) (finally,49) (run,21) (copyright,62) (surfboards,27) (rights,90) (reserved,59) (june,63) (launches,26) (improved,40) (site,26) (order,72) (custom,36) (surfboards,11) (online,68) (improvements,67) (top,29) (selling,82) (models,30) (middot,62) (rocket,23) (fish,67) (middot,35) (speed,29) (egg,2) (middot,22) (classic,58) (middot,69) (squash,67)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,93) (past,56) (months,11) (noticed,42) (guy,29) (surf,73) (magazine,21) (published,19) (finally,84) (run,37) (copyright,98) (surfboards,24) (rights,15) (reserved,70) (june,13) (launches,26) (improved,91) (site,80) (order,56) (custom,73) (surfboards,62) (online,70) (improvements,96) (top,81) (selling,5) (models,25) (middot,84) (rocket,27) (fish,36) (middot,5) (speed,46) (egg,29) (middot,13) (classic,57) (middot,24) (squash,95)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,82) (past,45) (months,14) (noticed,67) (guy,34) (surf,64) (magazine,43) (published,50) (finally,87) (run,8) (copyright,76) (surfboards,78) (rights,88) (reserved,84) (june,3) (launches,51) (improved,54) (site,99) (order,32) (custom,60) (surfboards,76) (online,68) (improvements,39) (top,12) (selling,26) (models,86) (middot,94) (rocket,39) (fish,95) (middot,70) (speed,34) (egg,78) (middot,67) (classic,1) (middot,97) (squash,2)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,17) (past,92) (months,52) (noticed,56) (guy,1) (surf,80) (magazine,86) (published,41) (finally,65) (run,89) (copyright,44) (surfboards,19) (rights,40) (reserved,29) (june,31) (launches,17) (improved,97) (site,71) (order,81) (custom,75) (surfboards,9) (online,27) (improvements,67) (top,56) (selling,97) (models,53) (middot,86) (rocket,65) (fish,6) (middot,83) (speed,19) (egg,24) (middot,28) (classic,71) (middot,32) (squash,29)
Hi,
I am using Y!LDA in Hadoop with 3 computers.
I got the results of train mode and found them a little confusing. I ran the script with --topics=20 and found that the files lda.docToTop.txt, lda.topToWor.txt, and lda.worToTop.txt exist in 3 different directories. Each directory has 20 topics. Is that correct?
How am I supposed to get the "test" result from the trained model? Will it also be spread across 3 different directories?
Hope somebody can help me. Thanks a lot!
Yanbo
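One hedged note on the 3-directories question: in a multi-machine run each client normally writes doc-topic lines only for its own shard of documents, while the word-topic files reflect the shared global model — worth verifying on your own output, but if so, concatenating the per-client lda.docToTop.txt files gives the full corpus. A local simulation of that merge (the /tmp layout is invented for the sketch; on HDFS the same merge would be a dfs -cat over the per-client lda.docToTop.txt paths):

```shell
# Simulate per-client output directories and merge their doc-topic
# files into one corpus-wide file.
demo=/tmp/ylda_merge_demo
mkdir -p "$demo/0" "$demo/1" "$demo/2"
echo "docA (1,0.5) (2,0.5)" > "$demo/0/lda.docToTop.txt"
echo "docB (3,1.0)"         > "$demo/1/lda.docToTop.txt"
echo "docC (2,1.0)"         > "$demo/2/lda.docToTop.txt"
cat "$demo"/*/lda.docToTop.txt > "$demo/lda.docToTop.all.txt"
awk 'END { print NR " documents merged" }' "$demo/lda.docToTop.all.txt"   # prints 3 documents merged
rm -rf "$demo"
```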
I followed the single machine setup and ran it successfully. However, running on multiple machines I hit a problem similar to yuvalye's, though not the same.
A checkpointed directory exists. Do you want to start from this checkpoint?
[: 71: ==: unexpected operator
Deleted hdfs://xx:xx/xx/xx/yahoo_lda/output_0
2011-12-08 14:30:52,576 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201112081253_0005_r_000000_0: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
From yuvalye's issue description we can see that his was a "Permission denied" problem, but I have no idea what problem I've got. Could someone help me?
Thanks
YimingChen
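One hedged guess about the "[: 71: ==: unexpected operator" line above: that is the message dash (which /bin/sh often is) prints when a script uses the bash-only '==' inside '[ ]', so line 71 of the script may need the POSIX '=' form. A sketch of the difference (the variable name here is made up):

```shell
# Bash-only form that dash rejects with "unexpected operator":
#   if [ "$answer" == "y" ]; then ...
# Portable POSIX form accepted by dash, bash, and ksh:
answer="y"
if [ "$answer" = "y" ]; then
  echo "portable comparison works"
fi
```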
After a couple of manual tweaks, I got the Yahoo! LDA distributed setup running on top of Hadoop 1.0.4. It is a simple 2-node Hadoop configuration. When I executed runLDA.sh, I used "2" for the number of machines. Everything ran OK, and I also checked the Hadoop logs to make sure everything looked normal. After the run completed, I got two output directories with all the files described in the Yahoo! LDA documentation. So far so good...
Then I ran a single machine setup against the same corpus / documents with the same number of topics and iterations. After the run completed, I again got all the files described in the Yahoo! LDA documentation. So far so good...
However, when I started comparing the results between the distributed and single machine setups, they look quite different to me. Though the number of topics is the same, the topics look very different, and the document-to-topic outputs look quite different too.
Here is an example.
Below are the topics that contain the word "portion" from the distributed setup.
Topic 0: (portion,0.150299) (hous,0.129272) (top,0.113109) (member,0.0838274) (bottom,0.0588456) (featur,0.0582807) (configur,0.0468569) (materi,0.0431222) (seal,0.0416157) (cover,0.0299408) (perimet,0.0279322) (mechan,0.0273359) (port,0.0266141) (interior,0.0252332) (caviti,0.0248566) (membran,0.0238209) (factor,0.0233815) (electron,0.0233501) (latch,0.0211846) (coupl,0.0211219)
Topic 7: (magnet,0.173736) (portion,0.115939) (surfac,0.106755) (direct,0.0771537) (guid,0.0469494) (field,0.0426735) (coil,0.0386558) (side,0.0377519) (face,0.0375079) (shield,0.0358291) (medium,0.0347386) (end,0.0340499) (layer,0.0312805) (form,0.030477) (part,0.029817) (main,0.0278081) (pole,0.0272916) (section,0.0268181) (shape,0.0247949) (head,0.0199737)
Topic 12: (member,0.151508) (roller,0.0885818) (form,0.0709548) (imag,0.0678395) (sheet,0.0652739) (develop,0.0566723) (fix,0.0477156) (rotat,0.0468222) (toner,0.0461121) (portion,0.0448522) (direct,0.0416338) (belt,0.0406029) (side,0.0360559) (transfer,0.0329291) (surfac,0.0320128) (posit,0.0302947) (drum,0.0264349) (unit,0.0246825) (press,0.0245451) (apparatu,0.0244763)
Topic 18: (line,0.231612) (displai,0.110021) (electrod,0.0977109) (pixel,0.0831908) (crystal,0.0775907) (liquid,0.0751507) (panel,0.0452805) (portion,0.0372104) (plural,0.0269503) (direct,0.0253603) (substrat,0.0251203) (align,0.0219703) (form,0.0201103) (connect,0.0197803) (common,0.0188203) (view,0.0173602) (arrang,0.0172402) (gate,0.0171002) (polar,0.0165002) (respect,0.0159202)
Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138) (structur,0.0431049) (gate,0.0370144) (implant,0.0332915) (sourc,0.0312804) (diffus,0.0297256) (dope,0.0259315) (present,0.0252183) (illustr,0.0249045) (charg,0.0243625) (zone,0.0241771) (form,0.0241057) (trench,0.0236636) (channel,0.0228648) (impur,0.0218521) (ion,0.0212958) (concentr,0.0208964) (drain,0.0206111)
Topic 54: (transfer,0.293593) (properti,0.0975047) (pre,0.0648832) (medium,0.0623078) (identifi,0.0547478) (instruct,0.0422863) (class,0.0393509) (diagram,0.0378002) (process,0.0360002) (sensit,0.0304894) (portion,0.0298802) (set,0.0274987) (match,0.026391) (inform,0.0247294) (oper,0.0234002) (classifi,0.0228187) (number,0.0221264) (determin,0.0216556) (label,0.021351) (step,0.0211848)
Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923) (connect,0.0669996) (pad,0.0622883) (conduct,0.0585963) (substrat,0.0582906) (circuit,0.0577923) (surfac,0.0551535) (packag,0.0491625) (semiconductor,0.047577) (board,0.0425033) (plural,0.0378487) (bond,0.0329676) (form,0.0302948) (contact,0.0297399) (portion,0.0287886) (mount,0.0269879) (compon,0.0266368) (side,0.0252325)
Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734) (end,0.0669684) (support,0.0604946) (posit,0.0601663) (assembl,0.0473205) (mount,0.0467772) (wall,0.0426122) (member,0.0421821) (surfac,0.0403939) (engag,0.0394771) (view,0.0380171) (plate,0.0377681) (connect,0.0358893) (front,0.035878) (open,0.0349726) (cover,0.032709) (attach,0.0325619) (rotat,0.0312377)
Below are the topics that contain the word "portion" from the single machine setup.
Topic 6: (electr,0.158046) (structur,0.127875) (conduct,0.106806) (circuit,0.0866087) (connect,0.0616967) (interconnect,0.0502754) (connector,0.0435378) (integr,0.0398189) (mechan,0.0348396) (conductor,0.0320855) (illustr,0.0320388) (coupl,0.0315253) (ground,0.0308562) (contact,0.0285066) (portion,0.0271684) (fuse,0.0240252) (carrier,0.0237296) (isol,0.0217379) (substrat,0.0213022) (shown,0.017521)
Topic 7: (portion,0.270355) (side,0.108089) (end,0.0707633) (bodi,0.059409) (surfac,0.0554205) (direct,0.0473438) (form,0.0400563) (section,0.037248) (guid,0.0355345) (view,0.0340486) (upper,0.0292069) (shown,0.0286381) (shape,0.0268464) (main,0.0264341) (front,0.026107) (open,0.0252183) (extend,0.0208885) (lower,0.0202913) (hole,0.019694) (case,0.0184072)
Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799) (laser,0.0274319) (direct,0.0259627) (overlap,0.0255685) (vertic,0.025246) (differ,0.0227913) (edg,0.020677) (posit,0.0195661) (width,0.018509) (adjac,0.0180611) (plural,0.0169681) (standard,0.0163947) (layout,0.0156422) (beam,0.0144955) (extend,0.0142984) (background,0.0139042) (specif,0.0122916) (abov,0.0122558)
Topic 82: (film,0.161729) (semiconductor,0.126738) (form,0.105283) (gate,0.094141) (insul,0.0559734) (silicon,0.0484004) (oxid,0.0421204) (electrod,0.0409827) (sourc,0.0344884) (transistor,0.0342963) (substrat,0.0337643) (drain,0.0326857) (conduct,0.0289546) (surfac,0.0261471) (region,0.0252235) (portion,0.0246842) (thin,0.0241079) (impur,0.0209088) (etch,0.0197341) (diffus,0.019638)
Topic 89: (electrod,0.309474) (wire,0.171766) (electr,0.0730991) (resist,0.0727857) (connect,0.0579722) (insul,0.0379882) (form,0.037917) (protect,0.0311655) (present,0.0265505) (discharg,0.0264081) (view,0.0192577) (section,0.0184458) (appli,0.0175627) (illustr,0.0160386) (contact,0.0155543) (addit,0.015127) (portion,0.0137311) (dispos,0.013546) (shown,0.0130902) (side,0.0125204)
Should the distributed and single-machine setups produce similar results? And is there a good way to compare the results between the two setups systematically?
Thanks!
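One rough, model-agnostic way to compare two runs is to match topics by the similarity of their word distributions, since topic IDs are arbitrary and LDA's Gibbs sampling is stochastic (exact agreement isn't expected even from identical code). The sketch below is a minimal illustration in plain Python, not part of Yahoo_LDA: it parses the `Topic N: (word,prob) ...` dump format shown above (the sample lines are truncated from the dumps) and greedily pairs each topic from one run with its most similar topic from the other by cosine similarity.

```python
import re
from math import sqrt

def parse_topic(line):
    """Parse a 'Topic N: (word,prob) ...' dump line into a {word: prob} dict."""
    return {w: float(p) for w, p in re.findall(r"\(([^,()]+),([\d.eE+-]+)\)", line)}

def cosine(a, b):
    """Cosine similarity between two sparse word->probability vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_topics(run_a, run_b):
    """Greedily pair each topic in run_a with its most similar topic in run_b."""
    pairs, unused = [], set(range(len(run_b)))
    for i, ta in enumerate(run_a):
        j = max(unused, key=lambda k: cosine(ta, run_b[k]))
        pairs.append((i, j, cosine(ta, run_b[j])))
        unused.discard(j)
    return pairs

# Abbreviated topic lines, truncated from the dumps above for illustration:
dist = [parse_topic("Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734)"),
        parse_topic("Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923)")]
single = [parse_topic("Topic 7: (portion,0.270355) (side,0.108089) (bodi,0.059409)"),
          parse_topic("Topic 6: (electr,0.158046) (conduct,0.106806) (circuit,0.0866087)")]

for i, j, sim in match_topics(dist, single):
    print(f"distributed topic {i} <-> single-machine topic {j}: cosine={sim:.3f}")
```

A run is considered "similar" when most topics find a high-similarity partner; low scores across the board suggest the two runs converged to genuinely different topic decompositions. Hungarian (optimal) matching or symmetric KL/Jensen-Shannon divergence would be more rigorous choices than this greedy cosine pairing.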
The web documentation for the single-machine setup references a "--maxmem=[size in MB]" command option. However, the option actually appears to be "-maxmemory=[size in MB]". Can someone confirm this? The command-line help shows: "-maxmemory (The max memory that can be used) type: int32 default: 2048"
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_19'
facet:
operation: ice_isA
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
*** Aborted at 1325992166 (unix time) try "date -d @1325992166" if you are using GNU date ***
PC: @ 0x3246430265 (unknown)
*** SIGABRT (@0xc3b8000042ff) received by PID 17151 (TID 0x2b0867e588c0) from PID 17151; stack trace: ***
@ 0x3246c0eb10 (unknown)
@ 0x3246430265 (unknown)
@ 0x3246431d10 (unknown)
@ 0x2b0867c13d14 (unknown)
@ 0x2b0867c11e16 (unknown)
@ 0x2b0867c11e43 (unknown)
@ 0x2b0867c11f2a (unknown)
@ 0x2b0867c12239 (unknown)
@ 0x2b0867c122f9 (unknown)
@ 0x44cafe TypeTopicCounts::TypeTopicCounts()
@ 0x445d10 main
@ 0x324641d994 (unknown)
@ 0x40cda9 (unknown)
/data/1/mr/local/taskTracker/jwang/jobcache/job_201112201444_4493/attempt_201112201444_4493_m_000010_0/work/./LDA.sh: line 160: 17151 Aborted $LDALIBS/Merge_Topic_Counts --topics=$topics --clientid=${mapred_task_partition} --servers="$servers" --globaldictionary="lda.dict.dump.global"
Synch directory: hdfs://hadooprsonn001.bo1.shopzilla.sea/user/jwang/workspace/ldanew/temporary/synchronize/merge_topcnts
Num of map tasks: 20
Found 1 items
-rw-r--r-- 3 mapred supergroup 10 2012-01-07 19:09 /user/jwang/workspace/ldanew/temporary/synchronize/merge_topcnts/10
Num of clients done: 1
Sleeping
Num of clients done: 15
Sleeping
Num of clients done: 20
All clients done!
put: File lda.ttc.dump does not exist.
dput lda.ttc.dump returned an error code of 255
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 255
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)