Comments (12)
According to the Hadoop job log, the last "Job not successful" failure is closely related to the following post:
http://www.curiousattemptbunny.com/2009/10/hadoop-streaming-javalangruntimeexcepti.html
It smells like virtualenv is causing trouble when running Dumbo (can the worker nodes not find the right version of Python?).
I've tried hardcoding the shebang as the post suggests, but it didn't help :-S
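One way to narrow this down is a throwaway diagnostic mapper that reports which interpreter and environment each task actually runs with; the job output then shows the per-node values. This is only a sketch (the wiring to Dumbo's entry point is omitted, and the printed records just illustrate what one task would emit):

```python
import os
import sys

def mapper(key, value):
    # Report the interpreter and environment the streaming task actually
    # uses; with many tasks, the job output reveals per-node differences.
    yield "executable", sys.executable
    yield "version", sys.version.split()[0]
    yield "PYTHONPATH", os.environ.get("PYTHONPATH", "(unset)")

# Locally, just print what a single task would emit.
for key, value in mapper(None, "dummy line"):
    print(key, value)
```

If the "executable" values differ from the virtualenv's Python, the tasks are not using the interpreter you expect.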
from dumbo.
The actual Hadoop exception is different from the one in the post:
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102242242_0018_r_000000" TASK_ATTEMPT_ID="attempt_201102242242_0018_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298749520679" HOSTNAME="$HOST" ERROR="java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:478)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:234)"
I am running this clustered Hadoop environment without root privileges (just as a regular user).
The problem really is the way in which you installed Dumbo: it has to be installed as an egg (one that hasn't been unzipped into a directory). Commenting out the fileopt code hides the symptoms somewhat, but it definitely won't fix anything; it actually makes things worse.
When you start a Dumbo job, Dumbo sends itself along with the job by internally adding the option "-file path_to_egg", which won't work when Dumbo isn't installed as an egg or when you disable the -file option (though the latter might indeed lead to less explicit errors, as you discovered).
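As a quick sanity check on the driver machine, one can walk up from a package's `__file__` to see whether it was loaded from a zipped `.egg` archive (the form that `-file path_to_egg` can ship) or from an unpacked directory. The helper name here is made up and assumes nothing about Dumbo's internals:

```python
import os

def egg_archive(module):
    """Walk up from module.__file__ looking for a '.egg' path component.

    Returns (path, is_zipped). is_zipped is True only when the .egg is an
    actual zip file -- shippable with "-file" -- rather than an egg that
    was unpacked into a directory during installation.
    """
    path = os.path.abspath(getattr(module, "__file__", "") or "")
    while os.path.dirname(path) != path:
        if path.endswith(".egg"):
            return path, os.path.isfile(path)
        path = os.path.dirname(path)
    return None, False

# A stdlib module is not installed as an egg at all:
import json
print(egg_archive(json))
```

Running `egg_archive` on the dumbo package itself would tell you whether your installation is in the zipped-egg form this requires.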
Thanks indeed! I just ran "python setup.py install" to generate an egg, and it works without commenting out the code, but it fails the same way on the Hadoop side:
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102262100_0002_r_000000" TASK_ATTEMPT_ID="attempt_201102262100_0002_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298885499654" HOSTNAME="$HOSTNAME" ERROR="java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:478)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:234)"
Any ideas why? Other Hadoop examples (the pi estimator) work fine :-S
Sounds like a bug in your Dumbo script. The Hadoop Java exceptions are rarely useful in that case; you need to check the stderr logs instead (in the web UI, click on the job ID -> the number of failed tasks -> "last 4KB" under "Logs").
Yes, here it is:
stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
I'm running dumbo as the tutorial states:
dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts
Must it be something with my Python virtual environment not being able to import ipcount.py and the Dumbo egg, then? Is ipcount.py supposed to be bundled into the Hadoop job somehow, like the Dumbo egg?
The ipcount.py script should be submitted along with the job as well (by adding "-file ipcount.py" under the hood). Are you sure you re-enabled all of the fileopt code?
I removed every dumbo/typedbytes file and library lying around in site-packages and re-installed the egg via "python setup.py install" (rolling back the commented-out lines), and it seems that the eggs and "ipcount.py" are passed to the job:
(...) -cmdenv 'PYTHONPATH=dumbo-0.21.30-py2.6.egg:typedbytes-0.3.6-py2.6.egg' -file 'PATH_TO/ipcount.py' -file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/dumbo-0.21.30-py2.6.egg' -file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/typedbytes-0.3.6-py2.6.egg'
Same result though:
/usr/bin/python: module ipcount not found
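For what it's worth, putting the egg archives themselves on PYTHONPATH works because Python's zipimport can import modules straight out of a zip file, so the eggs being listed there is not itself the problem. A small self-contained demo (the archive and module names are invented):

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip that stands in for a zipped egg.
tmpdir = tempfile.mkdtemp()
eggpath = os.path.join(tmpdir, "demo-0.1-py2.6.egg")
with zipfile.ZipFile(eggpath, "w") as z:
    z.writestr("demomod.py", "VALUE = 42\n")

# Putting the archive itself on sys.path is what
# PYTHONPATH=dumbo-...egg:typedbytes-...egg achieves inside the task:
# zipimport then makes the modules inside the archive importable.
sys.path.insert(0, eggpath)
import demomod
assert demomod.VALUE == 42
```

So the eggs on PYTHONPATH should be importable; the failing lookup is the ipcount module itself.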
I tried hardcoding the shebang to my virtualenv's Python, as the post suggests:
PATH_TO/.virtualenvs/devel/bin/python
But same effect on the Hadoop job:
/usr/bin/python: module ipcount not found
:-(
Thanks for your support !
I've been trying to adjust sys.path inside ipcount.py, but it still cannot find the ipcount(.py) module when running:
2011-03-01 13:26:07,241 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python, -m, ipcount, red, 0, 262144000]
Any further ideas ?
stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
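That exec line shows where the lookup happens: with "/usr/bin/python -m ipcount", the module has to be importable from the task's working directory or via PYTHONPATH. A small simulation of that launch, with throwaway paths and a made-up module name:

```python
import os
import subprocess
import sys
import tempfile

def try_dash_m(module_name, workdir, pythonpath=""):
    """Launch `<python> -m <module>` roughly the way Hadoop streaming
    does, from a given working directory with a given PYTHONPATH.
    Returns the subprocess exit code."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    proc = subprocess.run([sys.executable, "-m", module_name],
                          cwd=workdir, env=env, capture_output=True)
    return proc.returncode

# Hypothetical layout: the module lives outside the task's working dir.
libdir = tempfile.mkdtemp()
workdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "ipcount_demo.py"), "w") as f:
    f.write("print('found')\n")

# Neither in the cwd nor on PYTHONPATH: the import fails, much like
# "/usr/bin/python: module ipcount not found" in the task logs.
assert try_dash_m("ipcount_demo", workdir) != 0

# With the directory on PYTHONPATH, the same launch succeeds.
assert try_dash_m("ipcount_demo", workdir, pythonpath=libdir) == 0
```

This suggests the task either isn't running from the directory where ipcount.py was unpacked, or its PYTHONPATH doesn't cover it.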
Now I tried passing the "-pypath" and "-python" flags explicitly to dumbo:
$ dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts -pypath '.:path/to/.virtualenvs/devel/lib/python2.6/site-packages' -python '/path/to/.virtualenvs/devel/bin/python'
The "." in -pypath allows Dumbo to find the ipcount "module".
But now the error refers to the Python importer:
'import site' failed; use -v for traceback
Could not import runpy module
I added the -v flag in dumbo/backends/common.py, but I couldn't see any clear clues as to why "site" doesn't get imported correctly...
Did you manage to get Dumbo flying with virtualenv in -hadoop mode? From your post, it seems this has only been tested in local mode:
http://dumbotics.com/2009/05/24/virtual-pythonenvironments/
What am I doing wrong? :-S
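An "'import site' failed" at startup usually means the interpreter cannot locate its own standard library, which on a worker node would happen if the virtualenv's directory simply doesn't exist there. A hedged check one could run on each node (the function name is made up):

```python
import subprocess
import sys

def can_bootstrap(python):
    """Check that an interpreter can import what a streaming launch
    needs: 'site' at startup and 'runpy' for the -m invocation."""
    proc = subprocess.run([python, "-c", "import site, runpy"],
                          capture_output=True)
    return proc.returncode == 0

# The driver machine's own interpreter should pass; running the same
# check with the virtualenv's python path on each worker node would
# show whether that environment actually exists there.
assert can_bootstrap(sys.executable)
```

If the virtualenv interpreter fails this check on the workers, no amount of -pypath tuning will help until the environment is present on every node.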
I've moved the issue to the dumbo-user mailing list:
http://groups.google.com/group/dumbo-user/t/c9d368625daa2629
I fixed this issue with the following patch:
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -76,7 +76,7 @@ class StreamingIteration(Iteration):
if modpath.endswith('.egg'):
addedopts.add('libegg', modpath)
else:
- opts.add('file', modpath)
+ opts.add('file', 'file://' + modpath)
opts.add('jobconf', 'stream.map.input=typedbytes')
opts.add('jobconf', 'stream.reduce.input=typedbytes')