Comments (6)
This could be related to moby/moby#34213.
from raster-vision.
This just happened for a job (job-id=9739ed8b-8125-4e75-9efb-fcf1d89f5c44) I ran today, which was still running 1.5 hours after I initiated termination via the console. The command was
run_script.sh, lf/train-ships, /opt/src/detection/scripts/train_ec2.sh --config-path /opt/src/detection/configs/ships/ssd_mobilenet_v1.config --train-id ships1 --dataset-id singapore_ships_chips_tiny --model-id ssd_mobilenet_v1_coco_11_06_2017
and the container was 279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest, running in the raster-vision-gpu queue.
The easiest workaround for now is to kill the container running on the host.
[ec2-user@ip-172-31-53-167 ~]$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c98db2c40493 279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest "bash run_script.sh l" 18 minutes ago Up 18 minutes ecs-raster-vision-gpu-3-default-90f9cfa59cb5d09bbe01
4d7350e07e59 amazon/amazon-ecs-agent:latest "/agent" 34 minutes ago Up 34 minutes ecs-agent
[ec2-user@ip-172-31-53-167 ~]$ sudo docker kill c98db2c40493
c98db2c40493
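The same workaround can be scripted. The following is only a sketch: the function name kill_containers_for_image is made up for illustration, and it assumes the job's container can be identified by its image through docker's ancestor filter. Prefix the docker calls with sudo if the host requires it, as in the transcript above.

```shell
#!/usr/bin/env bash
# Sketch: kill every running container that was started from a given image.
# kill_containers_for_image is a hypothetical helper, not part of any tool.

kill_containers_for_image() {
    local image=$1
    local ids

    # --quiet prints only container IDs; the ancestor filter matches
    # containers created from the named image.
    ids=$(docker ps --quiet --filter "ancestor=$image") || return 1

    if [ -z "$ids" ]; then
        return 0    # nothing running from that image
    fi

    # Word splitting is intended here: one container ID per word.
    # shellcheck disable=SC2086
    docker kill $ids
}
```

Called with the raster-vision-gpu image name, this leaves the ecs-agent container alone, since that one runs from a different image.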
If you terminate a job using the above approach, it will get retried if attempts is greater than 1. Since it's already a pain to terminate things, and resuming training is broken as described in #106, we should remember to submit jobs with --attempts 1 so there is no retry attempt.
I think part of the reason we aren't able to stop Batch jobs is that when the job receives a SIGINT signal from the Batch console, that signal isn't propagated to the background processes running in the container. Those processes keep running, so bash can't exit and end the job. I was able to kill Batch jobs from the console using the changes that I made to train_ec2.sh on my branch feature/tnation/batch-job-termination. However, once the job termination issue was fixed, I noticed that if any of the background processes (i.e. train.py and eval.py) exits without being killed, the Batch job still hangs until it's terminated manually. I think that's because the EXIT signal in the background process isn't being passed back up to the script.
It seems like managing both job control and model training with train_ec2.sh will require a lot of nonstandard scripting to make it work the way it should; it may be worth breaking this task up into multiple, dependent Batch jobs. One job could run train.py, and the other could run eval.py. If we use either EFS or S3 to store the necessary shared files, we'll be able to create a more robust process that keeps us from having to do process management and error checking with a shell script.
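As a rough sketch of what the dependent-jobs idea could look like with the AWS CLI: the function names, job names, and job definitions below are placeholders, and only the aws batch submit-job options (--depends-on, --retry-strategy, --query, --output) are real. Note that with a plain dependency the eval job starts only after the train job has succeeded.

```shell
#!/usr/bin/env bash
# Sketch: submit training and evaluation as two dependent AWS Batch jobs.
# submit_job and submit_train_then_eval are hypothetical helpers; the queue
# and job definition names passed in are assumptions for illustration.

# Submit one job with a single attempt and print its job ID.
submit_job() {
    local name=$1 queue=$2 definition=$3
    shift 3
    aws batch submit-job \
        --job-name "$name" \
        --job-queue "$queue" \
        --job-definition "$definition" \
        --retry-strategy attempts=1 \
        --query jobId --output text \
        "$@"
}

# Submit the train job, then an eval job that depends on it.
submit_train_then_eval() {
    local queue=$1 train_def=$2 eval_def=$3
    local train_id

    train_id=$(submit_job train "$queue" "$train_def") || return 1
    submit_job eval "$queue" "$eval_def" --depends-on "jobId=$train_id"
}
```

For example, submit_train_then_eval raster-vision-gpu train-def eval-def, where train-def and eval-def are made-up job definition names. Each job then runs a single foreground process, so Batch's own termination and retry handling replaces the process management in the shell script.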
Since we've identified the root cause for the job failures (and train_ec2.sh is being rewritten soon), I'm going to close this.