sudar / yahoo_lda
Yahoo!'s topic modelling framework using Latent Dirichlet Allocation
License: Apache License 2.0
The Yahoo_LDA project uses several 3rd party open source libraries and tools. This file summarizes the tools used, their purpose, and the licenses under which they're released. Except as specifically stated below, the 3rd party software packages are not distributed as part of this project, but instead are separately downloaded and built on the developer's machine as a pre-build step.

* Ice-3.4.1 (GNU GENERAL PUBLIC LICENSE)
  * An efficient inter-process communication framework, used for the distributed storage of (topic, word) tables.
  * http://www.zeroc.com/
* cppunit-1.12.1 (GNU LESSER GENERAL PUBLIC LICENSE)
  * C++ unit testing framework. We use this for unit tests.
  * http://cppunit.sourceforge.net
* glog-0.3.0 (BSD)
  * Logfile generation (Google's logging library).
  * http://code.google.com/p/google-glog/
* mcpp-2.7.2 (BSD)
  * C++ preprocessor.
  * http://mcpp.sourceforge.net/
* tbb22_20090809oss (GNU GENERAL PUBLIC LICENSE)
  * Intel Threading Building Blocks, a multithreaded processing library. Much easier to use than pthreads; we use the pipeline class.
  * http://threadingbuildingblocks.org
* bzip2-1.0.5 (BSD)
  * Data compression.
  * http://www.bzip.org/
* gflags-1.2 (BSD)
  * Google's flag-processing library (used for command-line options).
  * http://code.google.com/p/google-gflags/
* protobuf-2.2.0a (BSD)
  * Protocol buffers, Google's serialization library (used for serializing data to disk and as the internal key data structure).
  * http://code.google.com/p/protobuf/
* boost-1.46.0 (Boost Software License - Version 1.0 - August 17th, 2003)
  * Boost libraries (various datatypes).
  * http://www.boost.org/

Please refer to the HTML or PDF documentation at docs/html/index.html and docs/latex/refman.pdf respectively for more information.
How do I decide how many topics should be returned by LDA? Does Expectation Maximization help in determining the topic count? If yes, how do I calculate it?
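For context: Expectation Maximization does not pick the topic count by itself — K is an input to training. A common heuristic is to train with several candidate K values and compare held-out per-word perplexity, exp(-loglik/tokens), preferring the K past which it stops improving. A minimal sketch; the (K, log-likelihood, token-count) triples below are invented purely for illustration — in practice loglik would come from evaluating each trained model on held-out documents:

```shell
# Compare candidate topic counts by held-out per-word perplexity.
# The numbers are made up; lower perplexity is better.
printf '%s\n' \
  "50 -1200000 150000" \
  "100 -1150000 150000" \
  "200 -1160000 150000" |
awk '{ ppl = exp(-$2 / $3); printf "K=%s perplexity=%.1f\n", $1, ppl }'
```

If K=100 beats K=50 but K=200 does not beat K=100, then K around 100 is a reasonable choice under this criterion.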
Hi
When I run Yahoo LDA on my Hadoop cluster, I hit the following problems:
${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global
it cannot download the file and the whole process crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.
Finally I hit the following problem, which is not related to the running script, so how can I recover from this situation?
W1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer
Should I modify the LDA.sh script to check the error code of each module's execution and retry until the error code indicates success?
Thank you!
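For what it's worth, a retry wrapper along the lines the post suggests is straightforward. This is only a sketch — whether the real LDA.sh modules also need their checkpoint/temporary state cleaned between attempts is an assumption to verify against the script:

```shell
# Retry a command until it exits 0, up to a fixed number of attempts.
retry() {
  max=$1; shift
  attempt=1
  until "$@"; do
    status=$?                     # exit code of the failed attempt
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts (exit code $status)" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep 2                       # back off a little before retrying
  done
}

# Usage sketch: in LDA.sh this would wrap the learntopics invocation;
# 'true' stands in for the module here.
retry 3 true && echo "module succeeded"
```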
Hi
I ran train mode on a corpus of about 1 GB. I tried twice, each time with 500 topics and 500 iterations, but I got two quite different results. That is, the two "lda.topToWor.txt" files from the two training runs are quite different. I compared the words in each topic (ignoring weights); only 250 topics in the two files are similar (where I call two topics similar if more than 10 words match).
This means I get quite different results from each random initialization. Is there a way to get a stable result?
Should I increase the number of topics, or the number of iterations? I tried 1000 iterations but saw no big change.
Thanks!
Yanbo
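The by-hand comparison in the post above (two topics count as similar when enough top words match) can be scripted. A small sketch, assuming each topic is given as a whitespace-separated list of its top words:

```shell
# Count how many words two topics share (each argument is a
# space-separated top-word list); duplicate words count once.
topic_overlap() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x); for (i = 1; i <= n; i++) seen[x[i]] = 1
    m = split(b, y); c = 0
    for (i = 1; i <= m; i++) if (y[i] in seen) { c++; delete seen[y[i]] }
    print c
  }'
}

topic_overlap "surf board wave ocean fin" "surf wave sand ocean"   # prints 3
```

On the stability question itself: collapsed Gibbs sampling is stochastic, so run-to-run variation like this is expected; topics from two runs can only be compared up to a matching like the one sketched here.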
DM_Client::add_server: DM_Server_0:default -h 10.101.173.51 -p 25342
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_0'
facet:
operation: ice_isA
terminate called after throwing an instance of 'IceUtil::Exception'
what(): Outgoing.cpp:424: IceUtil::Exception
*** Aborted at 1325947898 (unix time) try "date -d @1325947898" if you are using GNU date ***
PC: @ 0x3d0dc30265 (unknown)
*** SIGABRT (@0xc3b800005280) received by PID 21120 (TID 0x2b1b54a9d8e0) from PID 21120; stack trace: ***
@ 0x3d0e40eb10 (unknown)
@ 0x3d0dc30265 (unknown)
@ 0x3d0dc31d10 (unknown)
@ 0x2b1b54858d14 (unknown)
@ 0x2b1b54856e16 (unknown)
@ 0x2b1b54856e43 (unknown)
@ 0x2b1b54856f2a (unknown)
@ 0x43faa4 DM_Client::add_server()
@ 0x44042e DM_Client::DM_Client()
@ 0x49da95 Unigram_Model_Synchronizer_Helper::Unigram_Model_Synchronizer_Helper()
@ 0x49d4b5 Unigram_Model_Synchronized_Training_Builder::create_execution_strategy()
@ 0x447d04 Model_Director::build_model()
@ 0x4396ab main
@ 0x3d0dc1d994 (unknown)
@ 0x40d8c9 (unknown)
/data/1/mr/local/taskTracker/jwang/jobcache/job_201112201444_4463/attempt_201112201444_4463_m_000000_0/work/./LDA.sh: line 103: 21120 Aborted $LDALIBS/learntopics --model=$model --iter=$iters --topics=$topics --servers="$servers" --chkptdir="${MY_INP_DIR}" $flags 1>&2
Synch directory: hdfs://hadooprsonn001.bo1.shopzilla.sea/user/jwang/workspace/ldanew/temporary/synchronize/learntopics
Num of map tasks: 10
Found 8 items
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/0
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/1
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/3
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/4
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/5
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/6
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/7
-rw-r--r-- 3 mapred supergroup 9 2012-01-07 06:51 /user/jwang/workspace/ldanew/temporary/synchronize/learntopics/8
Num of clients done: 7
Sleeping
I run
./runLDA.sh 1 "data" train default "/user/jwang/simulation/data/offers/current" "/user/jwang/workspace/lda/" 3244 1000 5 /user/jwang/software/YLDA/LDALibs.jar 20 "/user/jwang/workspace/ldatemp"
There are text files in /user/jwang/simulation/data/offers/current, where each line is a document.
It invoked a streaming job whose mappers ran successfully, but the reducers failed. Here is the error from one of the reducers:
java.io.IOException: subprocess exited with error code 1
R/W/S=540/0/0 in:77=540/7 [rec/s] out:0=0/7 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=root
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Thu Jan 05 16:04:36 PST 2012
Broken pipe
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:131)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Any idea?
I have a Hadoop distribution running locally.
I followed the multi-machine setup in the distribution, uploaded LDALibs.jar to HDFS, and also put a ut_out folder there.
When running the script -
./runLDA.sh 1 "" train default ./ut_out/ydir_1k.txt ./ld 8000 25 500 hdfs://localhost:9000/.../LDALibs.jar 1
it fails.
This is the output I get:
and the error log for the task that fails is:
model=1 flags=" " trained_data=""
/tmp/hadoop-datastore/hadoop/mapred/local/taskTracker/jobcache/job_201107240937_0039/attempt_201107240937_0039_r_000000_0/work/./Formatter.sh: line 31: /tmp/hadoop-datastore/hadoop/mapred/local/taskTracker/jobcache/job_201107240937_0039/attempt_201107240937_0039_r_000000_0/work/LDALibs/formatter: Permission denied
Formatter returned an error code of 126
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 126
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
Any ideas/thoughts on what's wrong?
Thanks
Yuval
Hello,
I wasn't sure which was the best forum to post this issue/question to - the Yahoo groups or here. Issues seem to have more activity than the groups. (I've cross-posted: http://tech.groups.yahoo.com/group/y_lda/message/15)
I'm a total newbie to LDA, so please forgive me if I don't quite formulate this
question concisely.
From the single machine instructions for "Using the Model"
(/Yahoo_LDA/docs/html/single__machine__usage.html#using_model) it indicates that
you can run in either batch OR streaming mode.
In batch mode, the output is several files: lda.docToTop.txt, lda.topToWor.txt, and lda.worToTop.txt.
lda.docToTop.txt is what I want - document-to-topic assignments.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (65,0.138889)
(54,0.111111) (9,0.0833333) (21,0.0833333) (27,0.0833333) (87,0.0833333)
(29,0.0555556) (52,0.0555556) (56,0.0555556) (72,0.0555556)
However, in streaming mode it seems to return document word-to-topic assignments, similar to batch mode's lda.worToTop.txt.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,87)
(past,87) (months,72) (noticed,21) (guy,52) (surf,27) (magazine,87)
(published,10) (finally,21) (run,21) (copyright,54) (surfboards,27) (rights,54)
(reserved,54) (june,72) (launches,73) (improved,9) (site,54) (order,73)
(custom,56) (surfboards,27) (online,52) (improvements,9) (top,9) (selling,6)
(models,29) (middot,65) (rocket,44) (fish,56) (middot,65) (speed,65) (egg,95)
(middot,65) (classic,29) (middot,65) (squash,55)
Can I make streaming mode return doc-topic assignments?
If not, can I compute the doc-topic assignments easily from the word-to-topic assignment output?
I would like to call the streaming mode from a Java process.
Please help. :)
Thanks!
-John
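On the second question in the post above: yes, doc-topic proportions can be recovered from the streaming output by counting how often each topic appears among a document's (word,topic) pairs and normalizing. A sketch in awk; the field layout (id/label fields first, then (word,topic) pairs) is an assumption based on the examples in the post:

```shell
# Read "id labels (word,topic) ..." lines and emit "id labels (topic,share) ...".
to_doc_topics() {
  awk '{
    split("", count); total = 0; id = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^\(.+,[0-9]+\)$/) {            # a (word,topic) pair
        split($i, a, ","); t = a[2]; sub(/\)$/, "", t)
        count[t]++; total++
      } else {
        id = (id == "" ? $i : id " " $i)       # still in the id/label prefix
      }
    }
    printf "%s", id
    for (t in count) printf " (%s,%.4f)", t, count[t] / total
    print ""
  }'
}

echo "doc1 label (watch,87) (past,87) (months,72)" | to_doc_topics
```

The topic order in the output follows awk's unspecified array iteration order; pipe through sort if a stable order matters.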
./runLDA.sh 1 "" train default "/user/chengmingbo/input/ydir.txt" "/user/chengmingbo/output" -1 100 5 "/user/chengmingbo/LDALibs.jar" 3
I find that if the output directory has a trailing '/', the training process deletes all the files generated by the formatter.
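Until the scripts guard against this themselves, a defensive workaround is to strip the trailing '/' before passing the output directory in. A one-line sketch, using the path from the post:

```shell
# ${var%/} drops a single trailing slash, so downstream path joins
# like "$output_dir/temporary" behave the same either way.
output_dir="/user/chengmingbo/output/"
output_dir="${output_dir%/}"
echo "$output_dir"   # prints /user/chengmingbo/output
```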
I don't know how to handle the exception I encountered.
hadoop version:0.20.2
os:Linux version 2.6.18-164.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 03:28:30 EDT 2009
The information is as follows:
Deleted hdfs://h253014:9000/user/chengmingbo/output
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
stderr logs
oop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/out/learnTopics.*
W0517 14:36:01.712434 20074 Controller.cpp:100] ----------------------------------------------------------------------
W0517 14:36:01.712734 20074 Controller.cpp:115] You have chosen multi machine training mode
W0517 14:36:01.713021 20074 Unigram_Model_Training_Builder.cpp:60] Initializing Dictionary from lda.dict.dump
W0517 14:36:01.713249 20074 Unigram_Model_Training_Builder.cpp:62] Dictionary Initialized
W0517 14:36:01.713443 20074 Unigram_Model_Trainer.cpp:49] Initializing Word-Topic counts table from docs lda.wor, lda.top using 0 words & 100 topics.
W0517 14:36:01.713533 20074 Unigram_Model_Trainer.cpp:53] Initialized Word-Topic counts table
W0517 14:36:01.713568 20074 Unigram_Model_Trainer.cpp:57] Initializing Alpha vector from Alpha_bar = 50
W0517 14:36:01.713608 20074 Unigram_Model_Trainer.cpp:60] Alpha vector initialized
W0517 14:36:01.713624 20074 Unigram_Model_Trainer.cpp:63] Initializing Beta Parameter from specified Beta = 0.01
W0517 14:36:01.713641 20074 Unigram_Model_Trainer.cpp:67] Beta param initialized
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_0'
facet:
operation: ice_isA
terminate called after throwing an instance of 'IceUtil::Exception'
what(): Outgoing.cpp:424: IceUtil::Exception
*** Aborted at 1337236561 (unix time) try "date -d @1337236561" if you are using GNU date ***
PC: @ 0x3293430265 (unknown)
*** SIGABRT (@0x1f400004e6a) received by PID 20074 (TID 0x2b77acda63b0) from PID 20074; stack trace: ***
@ 0x329400e7c0 (unknown)
@ 0x3293430265 (unknown)
@ 0x3293431d10 (unknown)
@ 0x3299cbec44 (unknown)
@ 0x3299cbcdb6 (unknown)
@ 0x3299cbcde3 (unknown)
@ 0x3299cbceca (unknown)
@ 0x44470e DM_Client::add_server()
@ 0x44509e DM_Client::DM_Client()
@ 0x4a3f95 Unigram_Model_Synchronizer_Helper::Unigram_Model_Synchronizer_Helper()
@ 0x4a39b5 Unigram_Model_Synchronized_Training_Builder::create_execution_strategy()
@ 0x447d04 Model_Director::build_model()
@ 0x4396ab main
@ 0x329341d994 (unknown)
@ 0x40d8c9 (unknown)
/d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/./LDA.sh: line 103: 20074 Aborted $LDALIBS/learntopics --model=$model --iter=$iters --topics=$topics --servers="$servers" --chkptdir="${MY_INP_DIR}" $flags 1>&2
Synch directory: hdfs://h253014:9000/user/chengmingbo/output/temporary/synchronize/learntopics
Num of map tasks: 3
Found 3 items
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/0
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/1
-rw-rw-r-- 3 hadoop supergroup 9 2012-05-17 14:36 /user/chengmingbo/output/temporary/synchronize/learntopics/2
Num of clients done: 3
All clients done!
learntopics returned an error code of 134
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
syslog logs
2012-05-17 14:35:49,106 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-05-17 14:35:49,272 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/distcache/-5116373793315176501_-731644028_1402464899/h253014/user/chengmingbo/LDALibs.jar <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/LDALibs
2012-05-17 14:35:49,289 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/functions.sh <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/functions.sh
2012-05-17 14:35:49,299 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/job.jar <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/job.jar
2012-05-17 14:35:49,308 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/LDA.sh <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/LDA.sh
2012-05-17 14:35:49,316 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/jars/.job.jar.crc <- /d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/.job.jar.crc
2012-05-17 14:35:49,397 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-05-17 14:35:49,578 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2012-05-17 14:35:49,578 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2012-05-17 14:35:49,586 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2012-05-17 14:35:49,694 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/d2/hadoop_data/file_data_dir/mapred_local_dir/taskTracker/hadoop/jobcache/job_201205161512_0159/attempt_201205161512_0159_m_000001_0/work/./LDA.sh, 1, , 100, 5]
2012-05-17 14:35:49,775 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2012-05-17 14:36:06,632 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2012-05-17 14:36:06,713 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-05-17 14:36:06,717 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-05-17 14:36:06,721 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
I followed the setup instructions for a single machine, and when I try the streaming mode example, if I input the same string multiple times I get a different topic categorization every time:
java Tokenizer | ../learntopics -teststream -dumpprefix=../ut_out/lda --topics=100 --dictionary=../ut_out/lda.dict.dump
W0720 12:28:20.250313 2803 Controller.cpp:115] ----------------------------------------------------------------------
W0720 12:28:20.250712 2803 Controller.cpp:117] Log files are being stored at /lda/ut_out/learnTopics.*
W0720 12:28:20.250731 2803 Controller.cpp:119] ----------------------------------------------------------------------
W0720 12:28:20.251055 2803 Controller.cpp:140] You have chosen single machine testing mode
W0720 12:28:20.251379 2803 Unigram_Model_Streaming_Builder.cpp:56] Initializing global dictionary from ../ut_out/lda.dict.dump
W0720 12:28:20.308131 2803 Unigram_Model_Streaming_Builder.cpp:59] Dictionary initialized and has 17208
W0720 12:28:20.308279 2803 Unigram_Model_Streaming_Builder.cpp:86] Estimating the words that will fit in 2048 MB
W0720 12:28:20.408761 2803 Unigram_Model_Streaming_Builder.cpp:91] 17208 will fit in 1.06012 MB of memory
W0720 12:28:20.408906 2803 Unigram_Model_Streaming_Builder.cpp:93] Initializing Local Dictionary from ../ut_out/lda.dict.dump with 17208 words.
W0720 12:28:20.491570 2803 Unigram_Model_Streaming_Builder.cpp:122] Local Dictionary Initialized. Size: 34416
W0720 12:28:20.494669 2803 Unigram_Model_Streamer.cpp:64] Initializing Word-Topic counts table from dump ../ut_out/lda.ttc.dump using 17208 words & 100 topics.
W0720 12:28:20.549022 2803 Unigram_Model_Streamer.cpp:88] Initialized Word-Topic counts table
W0720 12:28:20.549149 2803 Unigram_Model_Streamer.cpp:91] Initializing Alpha vector from dumpfile ../ut_out/lda.par.dump
W0720 12:28:20.549247 2803 Unigram_Model_Streamer.cpp:94] Alpha vector initialized
W0720 12:28:20.549309 2803 Unigram_Model_Streamer.cpp:97] Initializing Beta Parameter from specified Beta = 0.01
W0720 12:28:20.549383 2803 Unigram_Model_Streamer.cpp:101] Beta param initialized
W0720 12:28:20.557430 2803 Testing_Execution_Strategy.cpp:64] Starting Parallel testing Pipeline
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,83) (past,86) (months,77) (noticed,15) (guy,93) (surf,35) (magazine,86) (published,92) (finally,49) (run,21) (copyright,62) (surfboards,27) (rights,90) (reserved,59) (june,63) (launches,26) (improved,40) (site,26) (order,72) (custom,36) (surfboards,11) (online,68) (improvements,67) (top,29) (selling,82) (models,30) (middot,62) (rocket,23) (fish,67) (middot,35) (speed,29) (egg,2) (middot,22) (classic,58) (middot,69) (squash,67)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,93) (past,56) (months,11) (noticed,42) (guy,29) (surf,73) (magazine,21) (published,19) (finally,84) (run,37) (copyright,98) (surfboards,24) (rights,15) (reserved,70) (june,13) (launches,26) (improved,91) (site,80) (order,56) (custom,73) (surfboards,62) (online,70) (improvements,96) (top,81) (selling,5) (models,25) (middot,84) (rocket,27) (fish,36) (middot,5) (speed,46) (egg,29) (middot,13) (classic,57) (middot,24) (squash,95)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,82) (past,45) (months,14) (noticed,67) (guy,34) (surf,64) (magazine,43) (published,50) (finally,87) (run,8) (copyright,76) (surfboards,78) (rights,88) (reserved,84) (june,3) (launches,51) (improved,54) (site,99) (order,32) (custom,60) (surfboards,76) (online,68) (improvements,39) (top,12) (selling,26) (models,86) (middot,94) (rocket,39) (fish,95) (middot,70) (speed,34) (egg,78) (middot,67) (classic,1) (middot,97) (squash,2)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,17) (past,92) (months,52) (noticed,56) (guy,1) (surf,80) (magazine,86) (published,41) (finally,65) (run,89) (copyright,44) (surfboards,19) (rights,40) (reserved,29) (june,31) (launches,17) (improved,97) (site,71) (order,81) (custom,75) (surfboards,9) (online,27) (improvements,67) (top,56) (selling,97) (models,53) (middot,86) (rocket,65) (fish,6) (middot,83) (speed,19) (egg,24) (middot,28) (classic,71) (middot,32) (squash,29)
Hi,
I am using Y!LDA in Hadoop with 3 computers.
I got the results of train mode and found them a little confusing. I ran the script with --topics=20 and found that the files lda.docToTop.txt, lda.topToWor.txt, and lda.worToTop.txt exist in 3 different directories. Each directory has 20 topics. Is that correct?
How am I supposed to get the "test" result from the trained model? Will it also be spread across 3 different directories?
Hope somebody can help me. Thanks a lot!
Yanbo
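One hedged note on the 3-directories question: in a multi-machine run each client normally writes doc-topic lines only for its own shard of documents, while the word-topic files reflect the shared global model — worth verifying on your own output, but if so, concatenating the per-client lda.docToTop.txt files gives the full corpus. A local simulation of that merge (the /tmp layout is invented for the sketch; on HDFS the same merge would be a dfs -cat over the per-client lda.docToTop.txt paths):

```shell
# Simulate per-client output directories and merge their doc-topic
# files into one corpus-wide file.
demo=/tmp/ylda_merge_demo
mkdir -p "$demo/0" "$demo/1" "$demo/2"
echo "docA (1,0.5) (2,0.5)" > "$demo/0/lda.docToTop.txt"
echo "docB (3,1.0)"         > "$demo/1/lda.docToTop.txt"
echo "docC (2,1.0)"         > "$demo/2/lda.docToTop.txt"
cat "$demo"/*/lda.docToTop.txt > "$demo/lda.docToTop.all.txt"
awk 'END { print NR " documents merged" }' "$demo/lda.docToTop.all.txt"   # prints 3 documents merged
rm -rf "$demo"
```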
I followed the single machine setup and ran it successfully. However, running on multiple machines I hit a problem similar to yuvalye's, though not the same.
A checkpointed directory exists. Do you want to start from this checkpoint?
[: 71: ==: unexpected operator
Deleted hdfs://xx:xx/xx/xx/yahoo_lda/output_0
2011-12-08 14:30:52,576 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201112081253_0005_r_000000_0: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
From yuvalye's issue description we can see that his was a "Permission denied" problem, but I have no idea what problem I've got. Could someone help me?
Thanks
YimingChen
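One hedged guess about the "[: 71: ==: unexpected operator" line above: that is the message dash (which /bin/sh often is) prints when a script uses the bash-only '==' inside '[ ]', so line 71 of the script may need the POSIX '=' form. A sketch of the difference (the variable name here is made up):

```shell
# Bash-only form that dash rejects with "unexpected operator":
#   if [ "$answer" == "y" ]; then ...
# Portable POSIX form accepted by dash, bash, and ksh:
answer="y"
if [ "$answer" = "y" ]; then
  echo "portable comparison works"
fi
```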
After a couple of manual tweaks, I got the Yahoo! LDA distributed setup running on top of Hadoop 1.0.4. It is a simple 2-node Hadoop configuration. When I executed runLDA.sh, I used "2" for the number of machines. Everything ran OK, and I also checked the Hadoop logs to make sure everything looked normal. After the run completed, I got two output directories with all the files described in the Yahoo! LDA documentation. So far so good...
Then I ran a single machine setup against the same corpus / documents with the same number of topics and iterations. After the run completed, I again got all the files described in the Yahoo! LDA documentation. So far so good...
However, when I started comparing the results between the distributed and single machine setups, they look quite different to me. Though the number of topics is the same, the topics look very different, and the document-to-topic outputs look quite different too.
Here is an example.
Below are the topics that contain the word "portion" from the distributed setup.
Topic 0: (portion,0.150299) (hous,0.129272) (top,0.113109) (member,0.0838274) (bottom,0.0588456) (featur,0.0582807) (configur,0.0468569) (materi,0.0431222) (seal,0.0416157) (cover,0.0299408) (perimet,0.0279322) (mechan,0.0273359) (port,0.0266141) (interior,0.0252332) (caviti,0.0248566) (membran,0.0238209) (factor,0.0233815) (electron,0.0233501) (latch,0.0211846) (coupl,0.0211219)
Topic 7: (magnet,0.173736) (portion,0.115939) (surfac,0.106755) (direct,0.0771537) (guid,0.0469494) (field,0.0426735) (coil,0.0386558) (side,0.0377519) (face,0.0375079) (shield,0.0358291) (medium,0.0347386) (end,0.0340499) (layer,0.0312805) (form,0.030477) (part,0.029817) (main,0.0278081) (pole,0.0272916) (section,0.0268181) (shape,0.0247949) (head,0.0199737)
Topic 12: (member,0.151508) (roller,0.0885818) (form,0.0709548) (imag,0.0678395) (sheet,0.0652739) (develop,0.0566723) (fix,0.0477156) (rotat,0.0468222) (toner,0.0461121) (portion,0.0448522) (direct,0.0416338) (belt,0.0406029) (side,0.0360559) (transfer,0.0329291) (surfac,0.0320128) (posit,0.0302947) (drum,0.0264349) (unit,0.0246825) (press,0.0245451) (apparatu,0.0244763)
Topic 18: (line,0.231612) (displai,0.110021) (electrod,0.0977109) (pixel,0.0831908) (crystal,0.0775907) (liquid,0.0751507) (panel,0.0452805) (portion,0.0372104) (plural,0.0269503) (direct,0.0253603) (substrat,0.0251203) (align,0.0219703) (form,0.0201103) (connect,0.0197803) (common,0.0188203) (view,0.0173602) (arrang,0.0172402) (gate,0.0171002) (polar,0.0165002) (respect,0.0159202)
Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138) (structur,0.0431049) (gate,0.0370144) (implant,0.0332915) (sourc,0.0312804) (diffus,0.0297256) (dope,0.0259315) (present,0.0252183) (illustr,0.0249045) (charg,0.0243625) (zone,0.0241771) (form,0.0241057) (trench,0.0236636) (channel,0.0228648) (impur,0.0218521) (ion,0.0212958) (concentr,0.0208964) (drain,0.0206111)
Topic 54: (transfer,0.293593) (properti,0.0975047) (pre,0.0648832) (medium,0.0623078) (identifi,0.0547478) (instruct,0.0422863) (class,0.0393509) (diagram,0.0378002) (process,0.0360002) (sensit,0.0304894) (portion,0.0298802) (set,0.0274987) (match,0.026391) (inform,0.0247294) (oper,0.0234002) (classifi,0.0228187) (number,0.0221264) (determin,0.0216556) (label,0.021351) (step,0.0211848)
Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923) (connect,0.0669996) (pad,0.0622883) (conduct,0.0585963) (substrat,0.0582906) (circuit,0.0577923) (surfac,0.0551535) (packag,0.0491625) (semiconductor,0.047577) (board,0.0425033) (plural,0.0378487) (bond,0.0329676) (form,0.0302948) (contact,0.0297399) (portion,0.0287886) (mount,0.0269879) (compon,0.0266368) (side,0.0252325)
Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734) (end,0.0669684) (support,0.0604946) (posit,0.0601663) (assembl,0.0473205) (mount,0.0467772) (wall,0.0426122) (member,0.0421821) (surfac,0.0403939) (engag,0.0394771) (view,0.0380171) (plate,0.0377681) (connect,0.0358893) (front,0.035878) (open,0.0349726) (cover,0.032709) (attach,0.0325619) (rotat,0.0312377)
Below are the topics that contain the word "portion" from the single machine setup.
Topic 6: (electr,0.158046) (structur,0.127875) (conduct,0.106806) (circuit,0.0866087) (connect,0.0616967) (interconnect,0.0502754) (connector,0.0435378) (integr,0.0398189) (mechan,0.0348396) (conductor,0.0320855) (illustr,0.0320388) (coupl,0.0315253) (ground,0.0308562) (contact,0.0285066) (portion,0.0271684) (fuse,0.0240252) (carrier,0.0237296) (isol,0.0217379) (substrat,0.0213022) (shown,0.017521)
Topic 7: (portion,0.270355) (side,0.108089) (end,0.0707633) (bodi,0.059409) (surfac,0.0554205) (direct,0.0473438) (form,0.0400563) (section,0.037248) (guid,0.0355345) (view,0.0340486) (upper,0.0292069) (shown,0.0286381) (shape,0.0268464) (main,0.0264341) (front,0.026107) (open,0.0252183) (extend,0.0208885) (lower,0.0202913) (hole,0.019694) (case,0.0184072)
Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799) (laser,0.0274319) (direct,0.0259627) (overlap,0.0255685) (vertic,0.025246) (differ,0.0227913) (edg,0.020677) (posit,0.0195661) (width,0.018509) (adjac,0.0180611) (plural,0.0169681) (standard,0.0163947) (layout,0.0156422) (beam,0.0144955) (extend,0.0142984) (background,0.0139042) (specif,0.0122916) (abov,0.0122558)
Topic 82: (film,0.161729) (semiconductor,0.126738) (form,0.105283) (gate,0.094141) (insul,0.0559734) (silicon,0.0484004) (oxid,0.0421204) (electrod,0.0409827) (sourc,0.0344884) (transistor,0.0342963) (substrat,0.0337643) (drain,0.0326857) (conduct,0.0289546) (surfac,0.0261471) (region,0.0252235) (portion,0.0246842) (thin,0.0241079) (impur,0.0209088) (etch,0.0197341) (diffus,0.019638)
Topic 89: (electrod,0.309474) (wire,0.171766) (electr,0.0730991) (resist,0.0727857) (connect,0.0579722) (insul,0.0379882) (form,0.037917) (protect,0.0311655) (present,0.0265505) (discharg,0.0264081) (view,0.0192577) (section,0.0184458) (appli,0.0175627) (illustr,0.0160386) (contact,0.0155543) (addit,0.015127) (portion,0.0137311) (dispos,0.013546) (shown,0.0130902) (side,0.0125204)
Should the distributed and single-machine setups produce similar results? And is there a good way to compare the results between the two setups systematically?
Thanks!
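One rough, model-agnostic way to compare two runs is to match topics by the similarity of their word distributions, since topic IDs are arbitrary and LDA's Gibbs sampling is stochastic (exact agreement isn't expected even from identical code). The sketch below is a minimal illustration in plain Python, not part of Yahoo_LDA: it parses the `Topic N: (word,prob) ...` dump format shown above (the sample lines are truncated from the dumps) and greedily pairs each topic from one run with its most similar topic from the other by cosine similarity.

```python
import re
from math import sqrt

def parse_topic(line):
    """Parse a 'Topic N: (word,prob) ...' dump line into a {word: prob} dict."""
    return {w: float(p) for w, p in re.findall(r"\(([^,()]+),([\d.eE+-]+)\)", line)}

def cosine(a, b):
    """Cosine similarity between two sparse word->probability vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_topics(run_a, run_b):
    """Greedily pair each topic in run_a with its most similar topic in run_b."""
    pairs, unused = [], set(range(len(run_b)))
    for i, ta in enumerate(run_a):
        j = max(unused, key=lambda k: cosine(ta, run_b[k]))
        pairs.append((i, j, cosine(ta, run_b[j])))
        unused.discard(j)
    return pairs

# Abbreviated topic lines, truncated from the dumps above for illustration:
dist = [parse_topic("Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734)"),
        parse_topic("Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923)")]
single = [parse_topic("Topic 7: (portion,0.270355) (side,0.108089) (bodi,0.059409)"),
          parse_topic("Topic 6: (electr,0.158046) (conduct,0.106806) (circuit,0.0866087)")]

for i, j, sim in match_topics(dist, single):
    print(f"distributed topic {i} <-> single-machine topic {j}: cosine={sim:.3f}")
```

A run is considered "similar" when most topics find a high-similarity partner; low scores across the board suggest the two runs converged to genuinely different topic decompositions. Hungarian (optimal) matching or symmetric KL/Jensen-Shannon divergence would be more rigorous choices than this greedy cosine pairing.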
The web documentation for the single-machine setup references a "--maxmem=[size in MB]" command option. However, the option actually appears to be "-maxmemory=[size in MB]". Can someone confirm this? The command-line help shows: "-maxmemory (The max memory that can be used) type: int32 default: 2048"
Outgoing.cpp:424: Ice::ObjectNotExistException:
object does not exist:
identity: `DM_Server_19'
facet:
operation: ice_isA
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
*** Aborted at 1325992166 (unix time) try "date -d @1325992166" if you are using GNU date ***
PC: @ 0x3246430265 (unknown)
*** SIGABRT (@0xc3b8000042ff) received by PID 17151 (TID 0x2b0867e588c0) from PID 17151; stack trace: ***
@ 0x3246c0eb10 (unknown)
@ 0x3246430265 (unknown)
@ 0x3246431d10 (unknown)
@ 0x2b0867c13d14 (unknown)
@ 0x2b0867c11e16 (unknown)
@ 0x2b0867c11e43 (unknown)
@ 0x2b0867c11f2a (unknown)
@ 0x2b0867c12239 (unknown)
@ 0x2b0867c122f9 (unknown)
@ 0x44cafe TypeTopicCounts::TypeTopicCounts()
@ 0x445d10 main
@ 0x324641d994 (unknown)
@ 0x40cda9 (unknown)
/data/1/mr/local/taskTracker/jwang/jobcache/job_201112201444_4493/attempt_201112201444_4493_m_000010_0/work/./LDA.sh: line 160: 17151 Aborted $LDALIBS/Merge_Topic_Counts --topics=$topics --clientid=${mapred_task_partition} --servers="$servers" --globaldictionary="lda.dict.dump.global"
Synch directory: hdfs://hadooprsonn001.bo1.shopzilla.sea/user/jwang/workspace/ldanew/temporary/synchronize/merge_topcnts
Num of map tasks: 20
Found 1 items
-rw-r--r-- 3 mapred supergroup 10 2012-01-07 19:09 /user/jwang/workspace/ldanew/temporary/synchronize/merge_topcnts/10
Num of clients done: 1
Sleeping
Num of clients done: 15
Sleeping
Num of clients done: 20
All clients done!
put: File lda.ttc.dump does not exist.
dput lda.ttc.dump returned an error code of 255
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 255
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)