
Comments (12)

brainstorm commented on July 29, 2024

According to the Hadoop job log, the last "Job not successful" failure looks closely related to the following post:

http://www.curiousattemptbunny.com/2009/10/hadoop-streaming-javalangruntimeexcepti.html

Smells like virtualenv is causing trouble when running dumbo (maybe it cannot find the right version of python on the worker nodes?).

I've tried hardcoding the sh-bang as the post suggests, but it didn't help :-S

from dumbo.

brainstorm commented on July 29, 2024

The actual Hadoop exception is different from the one in the post:

ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102242242_0018_r_000000" TASK_ATTEMPT_ID="attempt_201102242242_0018_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298749520679" HOSTNAME="$HOST" ERROR="java\.lang\.RuntimeException: PipeMapRed\.waitOutputThreads(): subprocess failed with code 2
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:362)
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:572)
    at org\.apache\.hadoop\.streaming\.PipeReducer\.close(PipeReducer\.java:137)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.runOldReducer(ReduceTask\.java:478)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.run(ReduceTask\.java:416)
    at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:240)
    at java\.security\.AccessController\.doPrivileged(Native Method)
    at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
    at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1115)
    at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:234)

I am running this clustered hadoop environment without root privileges (just with a regular user).

klbostee commented on July 29, 2024

The problem really is the way in which you install Dumbo -- it has to be installed as an egg (that hasn't been unzipped into a directory). Commenting out the fileopt stuff hides the symptoms somewhat, but it definitely won't fix anything; it actually makes things worse.

When you start a Dumbo job, Dumbo will send itself along with the job by using the option "-file path_to_egg" internally, which won't work when it's not installed as an egg or when you disable the -file option (but the latter might indeed lead to less explicit errors, as you discovered).
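
A quick way to check the install style described above is to look at where the module was loaded from. The following is a minimal sketch, not part of Dumbo: the helper names `is_zipped_egg` and `describe_install` are hypothetical, and it assumes that a zipped egg shows up as a single `.egg` file in the module's path while an unzipped one shows up as a `.egg` directory.

```python
import os
import zipfile


def is_zipped_egg(path):
    """Return True if path is a single .egg zip archive, not a directory."""
    return (
        path.endswith(".egg")
        and os.path.isfile(path)
        and zipfile.is_zipfile(path)
    )


def describe_install(module):
    """Report whether an imported module appears to come from a zipped egg.

    Hypothetical helper for illustration; it only inspects module.__file__.
    """
    file_attr = getattr(module, "__file__", None)
    if not file_attr:
        return "not installed as an egg"
    # For a module inside an egg, __file__ looks like
    # .../site-packages/pkg-1.0-py2.6.egg/pkg/__init__.py, so walk up the
    # path until a component ending in ".egg" is found.
    path = os.path.abspath(file_attr)
    while path != os.path.dirname(path):
        if path.endswith(".egg"):
            if is_zipped_egg(path):
                return "zipped egg"
            return "unzipped egg directory"
        path = os.path.dirname(path)
    return "not installed as an egg"
```

With Dumbo importable, calling `describe_install(dumbo)` after `import dumbo` should report "zipped egg" for the kind of install that can be shipped with "-file path_to_egg".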

brainstorm commented on July 29, 2024

Thanks indeed! I just ran "python setup.py install" to generate an egg, and it now works without commenting out the code, but it fails the same way on the Hadoop side:

ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102262100_0002_r_000000" TASK_ATTEMPT_ID="attempt_201102262100_0002_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298885499654" HOSTNAME="$HOSTNAME" ERROR="java\.lang\.RuntimeException: PipeMapRed\.waitOutputThreads(): subprocess failed with code 2
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:362)
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:572)
    at org\.apache\.hadoop\.streaming\.PipeReducer\.close(PipeReducer\.java:137)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.runOldReducer(ReduceTask\.java:478)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.run(ReduceTask\.java:416)
    at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:240)
    at java\.security\.AccessController\.doPrivileged(Native Method)
    at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
    at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1115)
    at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:234)
" .

Any ideas why ? Other hadoop examples (pi estimator) work fine :-S

klbostee commented on July 29, 2024

Sounds like a bug in your Dumbo script. The Hadoop Java exceptions are rarely useful in that case; you need to check the stderr logs instead (in the web UI, click on jobid -> failed tasks number -> last 4KB (under "logs")).

brainstorm commented on July 29, 2024

Yes, here it is:

stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)

I'm running dumbo as the tutorial states:

dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts

So it must be something with my python virtual environment not being able to import ipcount.py and the dumbo egg? Is ipcount.py supposed to be bundled into the hadoop job somehow, like the dumbo egg?

klbostee commented on July 29, 2024

The ipcount.py script should be submitted along with the job as well (by adding "-file ipcount.py" under the hood). Are you sure you enabled all of the fileopt code again?
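
To make the "under the hood" part concrete, here is a rough sketch of how a script and its eggs could be turned into "-file" and "-cmdenv" options. It is an illustration only, not Dumbo's actual code: the function `shipping_options` is hypothetical, and the PYTHONPATH convention (egg basenames, because shipped files are expected in the task's working directory) is an assumption based on the option list quoted later in this thread.

```python
import os


def shipping_options(script, eggs):
    """Build (option, value) pairs that would ship a script and eggs with a job.

    Hypothetical helper for illustration; not taken from Dumbo.
    """
    # The script itself is shipped as a plain file.
    opts = [("file", os.path.abspath(script))]
    # Each egg is shipped as a single file as well.
    for egg in eggs:
        opts.append(("file", os.path.abspath(egg)))
    # Assumption: shipped files are available in the task's working directory,
    # so PYTHONPATH names the eggs by basename rather than by local path.
    pythonpath = ":".join(os.path.basename(egg) for egg in eggs)
    opts.append(("cmdenv", "PYTHONPATH=" + pythonpath))
    return opts
```

If the "-file" entry for the script is missing from the generated command line, the worker has nothing to import, which matches the "module ipcount not found" symptom.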

brainstorm commented on July 29, 2024

I removed every dumbo/typedbytes file/lib lying around in site-packages and re-installed the egg via "python setup.py install" (restoring the lines I had commented out), and it seems that the eggs and "ipcount.py" are passed to the job:

(...) -cmdenv 'PYTHONPATH=dumbo-0.21.30-py2.6.egg:typedbytes-0.3.6-py2.6.egg'
-file 'PATH_TO/ipcount.py'
-file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/dumbo-0.21.30-py2.6.egg'
-file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/typedbytes-0.3.6-py2.6.egg'

Same result though:

/usr/bin/python: module ipcount not found

I tried hardcoding the sh-bang to my virtualenv's python, as the post suggests:

PATH_TO/.virtualenvs/devel/bin/python

But same effect on the Hadoop job:

/usr/bin/python: module ipcount not found

:-(

Thanks for your support !

brainstorm commented on July 29, 2024

I've been trying to adjust sys.path inside ipcount.py, but it still cannot find the ipcount(.py) file when running:

2011-03-01 13:26:07,241 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python, -m, ipcount, red, 0, 262144000]

Any further ideas ?

stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
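
The exec line above runs the script as "python -m ipcount", which only succeeds if ipcount.py sits in a directory on the interpreter's module search path. That also means sys.path changes made inside ipcount.py come too late, because the module has to be found before any of its code runs. The following sketch reproduces the lookup locally. It is written for Python 3 (the thread uses Python 2.6), and the module name `ipcount_demo` and the helper `can_find` are made up for the demonstration.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def can_find(module_name, extra_path=None):
    """Return True if module_name can be located, optionally with one extra path entry."""
    saved = list(sys.path)
    try:
        if extra_path is not None:
            sys.path.insert(0, extra_path)
        # Make sure freshly written files are visible to the import system.
        importlib.invalidate_caches()
        return importlib.util.find_spec(module_name) is not None
    finally:
        # Restore the original search path so the check has no side effects.
        sys.path[:] = saved


# A throwaway directory standing in for the task's working directory.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "ipcount_demo.py"), "w") as handle:
    handle.write("print('mapper/reducer would run here')\n")

found_without = can_find("ipcount_demo")
found_with = can_find("ipcount_demo", extra_path=workdir)
print(found_without, found_with)
```

The module is only located once its directory is part of the search path, which is the condition the worker's interpreter has to meet before the script itself gets a chance to run.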

brainstorm commented on July 29, 2024

Now I tried to pass "-pypath" and "-python" flags explicitly to dumbo:

$ dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts -pypath '.:path/to/.virtualenvs/devel/lib/python2.6/site-packages' -python '/path/to/.virtualenvs/devel/bin/python'

The "." in pypath allows dumbo to find the ipcount "module".

But now the error refers to the python importer:

'import site' failed; use -v for traceback
Could not import runpy module

I added -v in dumbo/backends/common.py, but I couldn't see any clear clue as to why "site" does not get imported correctly...

Did you manage to get dumbo running under virtualenv in -hadoop mode? From your post, it seems that this has only been tested in local mode:

http://dumbotics.com/2009/05/24/virtual-pythonenvironments/

What am I doing wrong ? :-S

brainstorm commented on July 29, 2024

Moved the issue to the dumbo-user mailing list:

http://groups.google.com/group/dumbo-user/t/c9d368625daa2629

dgleich commented on July 29, 2024

I fixed this issue with the following patch:

--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -76,7 +76,7 @@ class StreamingIteration(Iteration):
         if modpath.endswith('.egg'):
             addedopts.add('libegg', modpath)
         else:
-            opts.add('file', modpath)
+            opts.add('file', 'file://' + modpath)
         opts.add('jobconf', 'stream.map.input=typedbytes')
         opts.add('jobconf', 'stream.reduce.input=typedbytes')
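
The thread does not say why the prefix helps. One plausible reading, offered here as an assumption, is that a path without a scheme may be resolved against the cluster's default filesystem instead of the local disk, and "file://" removes that ambiguity. The sketch below shows the same idea as a standalone helper. The name `as_local_file_uri` is hypothetical, the `os.path.abspath` call is an addition that the patch does not make, and POSIX-style paths are assumed.

```python
import os
from urllib.parse import urlparse


def as_local_file_uri(path):
    """Prefix a local path with file:// unless it already names a scheme.

    Hypothetical helper for illustration; assumes POSIX-style paths.
    """
    # Paths such as hdfs://... or file://... already carry a scheme.
    if urlparse(path).scheme:
        return path
    # Make the path absolute so the result is a well-formed file:// URI.
    return "file://" + os.path.abspath(path)
```

For example, `as_local_file_uri("/home/user/ipcount.py")` yields `file:///home/user/ipcount.py`, while a value that already starts with `hdfs://` is returned unchanged.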
