Comments (4)
I'm running multiple jobs in AWS batch using metaflow. And recently I too have been noticing there are sporadic failures of jobs with exit code 137 in AWS batch.
I understand exit code 137 indicates memory issue, but we are seeing this error occasionally.
We tested by giving 16 GB memory and 128 GB memory for 2 jobs and passing same payload. It passed for the job with 16 GB once but it failed for 128 GB, so we are not sure if it's actually a memory issue.
Is there any chance that this is a metaflow related issue because the error we are seeing is:
Data store error:
No completed attempts of the task was found for task
I checked in task_datastory.py file of this repository and noticed this error is thrown if 'Done.lock' file is not created.
We tested this with various of metaflow including 2.9.11 and 2.10.7 and we are seeing the same error on all of these versions.
from metaflow.
@bjupreti are you able to replicate this error consistently?
from metaflow.
No, I'm not able to replicate it consistently. The same batch job with same compute environment, resources and payload passes sometimes and fails sometimes.
from metaflow.
In my case, I checked the logs of EC2 instances, as part of EC2 startup SSM scripts were running which resulted in restarting the docker daemon and the running job container gets stopped. ECS service comes back up, sees the container was stopped, and informs AWS that the job failed. Once the SSM scripts were not run on the instances, ECS agent service did not restart and I'm not seeing the above error anymore.
from metaflow.
Related Issues (20)
- Make default kubernetes resources configurable HOT 2
- `python hello.py batch step --help` is broken
- Invalid CloudWatch log group prefix HOT 6
- Allow namespacing of `ArgoEvent` when published from a step
- allow uploading metaflow packages that include "dots" in the path
- Suggest replacing `pull_request_target` in branches as well as `main`
- Bug: Passing an `S3PutObject` to `s3.put_files` treated as `tuple` of key path values.
- Allow passing of `trusted-host` parameter to `@pypi` decorator.
- Unable to utilise pytest unit test for metaflow HOT 1
- Update extras_require for tracing dependencies?
- R tests on Github fail with macos-latest runner
- Argo-workflows with name > 52 characters fail if run as CronWorkflows HOT 1
- Introduce support for private registry for AWS Batch
- Write metadata about log scrubbing events
- Language-agnostic API and/or other language SDKs for the client API HOT 1
- Performance Degradation When Running PyCaret Model with n_jobs > 1 Inside Metaflow
- Question on Executing Metaflow Workflow from Python Script Without 'run' Argument HOT 3
- Resume Doesn't Resume from the Last Failed Step
- Argo Events: trigger sensors are not deleted
- Allow option to set labels on pods using `kubernetes` decorator
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from metaflow.