Comments (12)
According to the Hadoop job log, the last "Job not successful" failure is closely related to the following post:
http://www.curiousattemptbunny.com/2009/10/hadoop-streaming-javalangruntimeexcepti.html
It smells like virtualenv is causing trouble when running Dumbo (can the worker nodes not find the right version of Python?).
I've tried hardcoding the shebang as the post suggests, but it didn't help :-S
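One way to narrow this down is a throwaway diagnostic mapper that reports which interpreter and environment each task actually runs with; the job output then shows the per-node values. This is only a sketch (the wiring to Dumbo's entry point is omitted, and the printed records just illustrate what one task would emit):

```python
import os
import sys

def mapper(key, value):
    # Report the interpreter and environment the streaming task actually
    # uses; with many tasks, the job output reveals per-node differences.
    yield "executable", sys.executable
    yield "version", sys.version.split()[0]
    yield "PYTHONPATH", os.environ.get("PYTHONPATH", "(unset)")

# Locally, just print what a single task would emit.
for key, value in mapper(None, "dummy line"):
    print(key, value)
```

If the "executable" values differ from the virtualenv's Python, the tasks are not using the interpreter you expect.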
from dumbo.
The actual Hadoop exception is different from the one in the post:
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102242242_0018_r_000000" TASK_ATTEMPT_ID="attempt_201102242242_0018_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298749520679" HOSTNAME="$HOST" ERROR="java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:478)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:234)"
I am running this clustered Hadoop environment without root privileges (just as a regular user).
The problem really is the way in which you installed Dumbo: it has to be installed as an egg (one that hasn't been unzipped into a directory). Commenting out the fileopt code hides the symptoms somewhat, but it definitely won't fix anything; it actually makes things worse.
When you start a Dumbo job, Dumbo sends itself along with the job by internally adding the option "-file path_to_egg", which won't work when Dumbo isn't installed as an egg or when you disable the -file option (though the latter might indeed lead to less explicit errors, as you discovered).
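As a quick sanity check on the driver machine, one can walk up from a package's `__file__` to see whether it was loaded from a zipped `.egg` archive (the form that `-file path_to_egg` can ship) or from an unpacked directory. The helper name here is made up and assumes nothing about Dumbo's internals:

```python
import os

def egg_archive(module):
    """Walk up from module.__file__ looking for a '.egg' path component.

    Returns (path, is_zipped). is_zipped is True only when the .egg is an
    actual zip file -- shippable with "-file" -- rather than an egg that
    was unpacked into a directory during installation.
    """
    path = os.path.abspath(getattr(module, "__file__", "") or "")
    while os.path.dirname(path) != path:
        if path.endswith(".egg"):
            return path, os.path.isfile(path)
        path = os.path.dirname(path)
    return None, False

# A stdlib module is not installed as an egg at all:
import json
print(egg_archive(json))
```

Running `egg_archive` on the dumbo package itself would tell you whether your installation is in the zipped-egg form this requires.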
Thanks indeed! I just ran "python setup.py install" to generate an egg, and it works without commenting out the code, but it fails the same way on the Hadoop side:
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102262100_0002_r_000000" TASK_ATTEMPT_ID="attempt_201102262100_0002_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298885499654" HOSTNAME="$HOSTNAME" ERROR="java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:478)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:234)"
Any ideas why? Other Hadoop examples (the pi estimator) work fine :-S
Sounds like a bug in your Dumbo script. The Hadoop Java exceptions are rarely useful in that case; you need to check the stderr logs instead (in the web UI, click on the job ID -> the number of failed tasks -> "last 4KB" under "Logs").
Yes, here it is:
stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
I'm running dumbo as the tutorial states:
dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts
Must it be something with my Python virtual environment not being able to import ipcount.py and the Dumbo egg, then? Is ipcount.py supposed to be bundled into the Hadoop job somehow, like the Dumbo egg?
The ipcount.py script should be submitted along with the job as well (by adding "-file ipcount.py" under the hood). Are you sure you re-enabled all of the fileopt code?
I removed every dumbo/typedbytes file and library lying around in site-packages and re-installed the egg via "python setup.py install" (rolling back the commented-out lines), and it seems that the eggs and "ipcount.py" are passed to the job:
(...) -cmdenv 'PYTHONPATH=dumbo-0.21.30-py2.6.egg:typedbytes-0.3.6-py2.6.egg' -file 'PATH_TO/ipcount.py' -file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/dumbo-0.21.30-py2.6.egg' -file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/typedbytes-0.3.6-py2.6.egg'
Same result though:
/usr/bin/python: module ipcount not found
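For what it's worth, putting the egg archives themselves on PYTHONPATH works because Python's zipimport can import modules straight out of a zip file, so the eggs being listed there is not itself the problem. A small self-contained demo (the archive and module names are invented):

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip that stands in for a zipped egg.
tmpdir = tempfile.mkdtemp()
eggpath = os.path.join(tmpdir, "demo-0.1-py2.6.egg")
with zipfile.ZipFile(eggpath, "w") as z:
    z.writestr("demomod.py", "VALUE = 42\n")

# Putting the archive itself on sys.path is what
# PYTHONPATH=dumbo-...egg:typedbytes-...egg achieves inside the task:
# zipimport then makes the modules inside the archive importable.
sys.path.insert(0, eggpath)
import demomod
assert demomod.VALUE == 42
```

So the eggs on PYTHONPATH should be importable; the failing lookup is the ipcount module itself.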
I tried hardcoding the shebang to my virtualenv's Python, as the post suggests:
PATH_TO/.virtualenvs/devel/bin/python
But same effect on the Hadoop job:
/usr/bin/python: module ipcount not found
:-(
Thanks for your support !
I've been trying to adjust sys.path inside ipcount.py, but it still cannot find the ipcount(.py) module when running:
2011-03-01 13:26:07,241 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python, -m, ipcount, red, 0, 262144000]
Any further ideas ?
stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
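That exec line shows where the lookup happens: with "/usr/bin/python -m ipcount", the module has to be importable from the task's working directory or via PYTHONPATH. A small simulation of that launch, with throwaway paths and a made-up module name:

```python
import os
import subprocess
import sys
import tempfile

def try_dash_m(module_name, workdir, pythonpath=""):
    """Launch `<python> -m <module>` roughly the way Hadoop streaming
    does, from a given working directory with a given PYTHONPATH.
    Returns the subprocess exit code."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    proc = subprocess.run([sys.executable, "-m", module_name],
                          cwd=workdir, env=env, capture_output=True)
    return proc.returncode

# Hypothetical layout: the module lives outside the task's working dir.
libdir = tempfile.mkdtemp()
workdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "ipcount_demo.py"), "w") as f:
    f.write("print('found')\n")

# Neither in the cwd nor on PYTHONPATH: the import fails, much like
# "/usr/bin/python: module ipcount not found" in the task logs.
assert try_dash_m("ipcount_demo", workdir) != 0

# With the directory on PYTHONPATH, the same launch succeeds.
assert try_dash_m("ipcount_demo", workdir, pythonpath=libdir) == 0
```

This suggests the task either isn't running from the directory where ipcount.py was unpacked, or its PYTHONPATH doesn't cover it.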
Now I tried passing the "-pypath" and "-python" flags explicitly to dumbo:
$ dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts -pypath '.:path/to/.virtualenvs/devel/lib/python2.6/site-packages' -python '/path/to/.virtualenvs/devel/bin/python'
The "." in -pypath allows Dumbo to find the ipcount "module".
But now the error refers to the Python importer:
'import site' failed; use -v for traceback
Could not import runpy module
I added the -v flag in dumbo/backends/common.py, but I couldn't see any clear clues as to why "site" doesn't get imported correctly...
Did you manage to get Dumbo flying with virtualenv in -hadoop mode? From your post, it seems this has only been tested in local mode:
http://dumbotics.com/2009/05/24/virtual-pythonenvironments/
What am I doing wrong? :-S
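An "'import site' failed" at startup usually means the interpreter cannot locate its own standard library, which on a worker node would happen if the virtualenv's directory simply doesn't exist there. A hedged check one could run on each node (the function name is made up):

```python
import subprocess
import sys

def can_bootstrap(python):
    """Check that an interpreter can import what a streaming launch
    needs: 'site' at startup and 'runpy' for the -m invocation."""
    proc = subprocess.run([python, "-c", "import site, runpy"],
                          capture_output=True)
    return proc.returncode == 0

# The driver machine's own interpreter should pass; running the same
# check with the virtualenv's python path on each worker node would
# show whether that environment actually exists there.
assert can_bootstrap(sys.executable)
```

If the virtualenv interpreter fails this check on the workers, no amount of -pypath tuning will help until the environment is present on every node.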
I've moved the issue to the dumbo-user mailing list:
http://groups.google.com/group/dumbo-user/t/c9d368625daa2629
I fixed this issue with the following patch:
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -76,7 +76,7 @@ class StreamingIteration(Iteration):
if modpath.endswith('.egg'):
addedopts.add('libegg', modpath)
else:
- opts.add('file', modpath)
+ opts.add('file', 'file://' + modpath)
opts.add('jobconf', 'stream.map.input=typedbytes')
opts.add('jobconf', 'stream.reduce.input=typedbytes')