Comments (8)
Would it be more appropriate to call out to the ssh utility or to add a dependency on something like paramiko?
from mrjob.
I'm refactoring the S3 and SSH log fetcher functionality to subclass LogFetcher in a new submodule.
from mrjob.logfetch.ssh import SSHLogFetcher
# etc.
This will probably also involve breaking a lot of S3-related code out of EMRJobRunner, which probably isn't a bad thing since that class is currently a couple thousand lines long.
from mrjob.
Yup, that sounds good. Another good way to approach this is to start out by building a standalone utility (in mrjob.tools.emr) that fetches and analyzes logs, and then patch it into EMRJobRunner.
And please, use scp; don't add another library dependency. :)
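For illustration, calling out to scp without a new dependency could look roughly like the sketch below. It is not code from mrjob: the function names, the default `hadoop` user, and the example paths are assumptions made for the example. Only the standard library and the `scp` flags `-q` (quiet) and `-i` (identity file) are used.

```python
import subprocess


def build_scp_command(host, remote_path, local_path,
                      key_file=None, user="hadoop"):
    """Build the argument list for copying one remote file with scp.

    The names and the default user are hypothetical, not mrjob's API.
    """
    cmd = ["scp", "-q"]
    if key_file:
        # -i selects the private key used to authenticate
        cmd += ["-i", key_file]
    cmd += ["%s@%s:%s" % (user, host, remote_path), local_path]
    return cmd


def scp_get(host, remote_path, local_path, key_file=None, user="hadoop"):
    """Run scp and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(
        build_scp_command(host, remote_path, local_path, key_file, user))
```

Keeping the command construction separate from the call that runs it makes the SSH part easy to test without a live cluster, since only `scp_get` touches the network.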
from mrjob.
Can do. My current strategy is to copy any relevant functions (ls/get from S3, local, and SSH + dependencies) into instance methods and helpers for fetchers so that logfetch can be used independently. Then I will write a tool around it, verify by hand on various cases, add mocking for SSH + automated tests, and finally insert it into EMRJobRunner, removing redundant functions.
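A minimal sketch of the fetcher shape described here, with ls/get as instance methods on a common base class. Only the LogFetcher name comes from this thread; the method signatures and the LocalLogFetcher class are assumptions for illustration, and the S3 and SSH variants are omitted.

```python
import os
import shutil


class LogFetcher(object):
    """Hypothetical common interface for S3, SSH, and local fetchers."""

    def ls(self, path=""):
        raise NotImplementedError

    def get(self, path, dest):
        raise NotImplementedError


class LocalLogFetcher(LogFetcher):
    """Reads logs from a directory on the local filesystem."""

    def __init__(self, root):
        self.root = root

    def ls(self, path=""):
        # List every file under root/path, as paths relative to root
        found = []
        for dirpath, _dirs, filenames in os.walk(os.path.join(self.root, path)):
            for name in filenames:
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, self.root))
        return sorted(found)

    def get(self, path, dest):
        # Copy one log file (relative to root) to dest and return dest
        shutil.copy(os.path.join(self.root, path), dest)
        return dest
```

With a shared interface like this, a standalone tool can be written against LogFetcher and exercised locally before the SSH fetcher is mocked.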
from mrjob.
Sounds like a good plan.
from mrjob.
New info: logs have slightly different paths on S3 vs local. Here's a quickref I'll put in the comments:
S3 location       Local location
/daemons          / (root)
/jobs             /history
/node             <not present>
/steps            /steps
/task-attempts    /userlogs
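The quick reference above could be expressed as a lookup table plus a small translation helper, roughly as sketched here. The names S3_TO_LOCAL and s3_to_local_path are hypothetical, and paths are assumed to be relative to the log root on each side.

```python
# Top-level S3 log directory -> local directory name.
# '' means the local log root; None means there is no local equivalent.
S3_TO_LOCAL = {
    'daemons': '',
    'jobs': 'history',
    'node': None,
    'steps': 'steps',
    'task-attempts': 'userlogs',
}


def s3_to_local_path(s3_path):
    """Translate a log path relative to the S3 log root into the
    matching path under the local log root, or None if not present."""
    parts = s3_path.strip('/').split('/')
    head, rest = parts[0], parts[1:]
    if head not in S3_TO_LOCAL:
        raise ValueError('unknown log directory: %r' % head)
    local_head = S3_TO_LOCAL[head]
    if local_head is None:
        return None
    # Drop the empty component produced by the root mapping
    return '/' + '/'.join(p for p in [local_head] + rest if p)
```

For example, `s3_to_local_path('/jobs/job_1.xml')` gives `'/history/job_1.xml'`, and anything under `/node` gives `None`.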
from mrjob.
I believe this can be closed unless it also encompasses a log fetching/parsing refactor.
from mrjob.
Yup, thanks!
from mrjob.