Comments (6)

hectcastro commented on May 23, 2024

This could be related to moby/moby#34213.

from raster-vision.

lewfish commented on May 23, 2024

This just happened for a job (job-id=9739ed8b-8125-4e75-9efb-fcf1d89f5c44) I ran today: it was still running 1.5 hours after I initiated termination via the console. The command was

run_script.sh, lf/train-ships, /opt/src/detection/scripts/train_ec2.sh --config-path /opt/src/detection/configs/ships/ssd_mobilenet_v1.config --train-id ships1 --dataset-id singapore_ships_chips_tiny --model-id ssd_mobilenet_v1_coco_11_06_2017

and the container was 279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest running in the raster-vision-gpu queue.


lewfish commented on May 23, 2024

The easiest workaround for now is to kill the container running on the host.

[ec2-user@ip-172-31-53-167 ~]$ sudo docker ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
c98db2c40493        279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest   "bash run_script.sh l"   18 minutes ago      Up 18 minutes                           ecs-raster-vision-gpu-3-default-90f9cfa59cb5d09bbe01
4d7350e07e59        amazon/amazon-ecs-agent:latest                                          "/agent"                 34 minutes ago      Up 34 minutes                           ecs-agent
[ec2-user@ip-172-31-53-167 ~]$ sudo docker kill c98db2c40493
c98db2c40493


lewfish commented on May 23, 2024

If you terminate a job using the above approach, it will get retried if attempts is greater than 1. Since it's already a pain to terminate things, and resuming training is broken as described in #106, we should remember to submit jobs with --attempts 1 so there is no retry attempt.


tnation14 commented on May 23, 2024

I think the reason that we aren't able to stop Batch Jobs is partly because when the Job receives a SIGINT signal from the Batch console, that signal isn't being propagated to the background processes running in the container. Those processes keep running, so bash can't exit and end the job. I was able to kill Batch jobs from the console using the changes that I made to train_ec2.sh on my branch feature/tnation/batch-job-termination. However, once the Job termination issue was fixed, I noticed that if any of the background processes (i.e. train.py and eval.py) exit without being killed, the Batch job will still hang until it's terminated manually. I think that's because the EXIT signal in the background process isn't being passed back up to the script.
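To illustrate the two problems described above (signals not reaching the background processes, and the script hanging after one of them exits on its own), here is a minimal bash sketch. It is not the code on the branch mentioned above; the two `sleep` commands are hypothetical stand-ins for train.py and eval.py, and `wait -n` assumes bash 4.3 or newer. It traps both SIGTERM and SIGINT, since the exact signal delivered to the container is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: forward termination signals to background children and
# stop everything as soon as any one child exits.

pids=()

# Send SIGTERM to every child we started; ignore ones that already exited.
stop_children() {
    for pid in "${pids[@]}"; do
        kill -TERM "$pid" 2>/dev/null || true
    done
}

# bash runs a trap promptly while it is blocked in the `wait` builtin, so
# running the children in the background and waiting on them lets the
# script react to the signal instead of ignoring it until a child ends.
trap 'stop_children; exit 143' TERM
trap 'stop_children; exit 130' INT

sleep 1 &            # hypothetical stand-in for train.py
pids+=("$!")
sleep 30 &           # hypothetical stand-in for eval.py
pids+=("$!")

# Return as soon as ANY child exits, so one finished process
# cannot leave the job hanging on the other.
wait -n
status=$?

stop_children        # terminate whatever is still running
wait                 # reap the remaining children

echo "first child exited with status $status"
```

With these stand-ins the script finishes after about one second rather than thirty, because the exit of the first process triggers termination of the second.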

It seems like managing both job control and model training with train_ec2.sh will require a lot of nonstandard scripting to make it work the way it should; it may be worth breaking this task up into multiple, dependent Batch jobs. One job could run train.py, and the other could run eval.py. If we use either EFS or S3 to store the necessary shared files, we'll be able to create a more robust process that will keep us from having to do process management and error checking with a shell script.
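A rough sketch of what chaining two dependent jobs might look like with the AWS CLI follows. The function names, job names, and job definition names are hypothetical, and the job definitions are assumed to already exist with the right commands baked in. It uses `--depends-on` so the second job only starts after the first succeeds, and `--retry-strategy attempts=1` so a killed job is not retried.

```shell
#!/usr/bin/env bash
# Sketch only: submit one Batch job and print its job ID.
# Arguments: job name, job queue, job definition, then any extra
# `aws batch submit-job` flags.
submit_job() {
    local name=$1 queue=$2 definition=$3
    shift 3
    aws batch submit-job \
        --job-name "$name" \
        --job-queue "$queue" \
        --job-definition "$definition" \
        --retry-strategy attempts=1 \
        --query jobId --output text \
        "$@"
}

# Submit a training job, then an evaluation job that depends on it.
# Prints the job ID of the evaluation job.
submit_train_then_eval() {
    local queue=$1 train_def=$2 eval_def=$3
    local train_id
    train_id=$(submit_job train "$queue" "$train_def") || return 1
    submit_job eval "$queue" "$eval_def" --depends-on "jobId=$train_id"
}
```

For example, `submit_train_then_eval raster-vision-gpu train-def eval-def` would queue both jobs on the raster-vision-gpu queue, where `train-def` and `eval-def` are placeholder job definition names.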


tnation14 commented on May 23, 2024

Since we've identified the root cause for the job failures (and train_ec2.sh is being rewritten soon), I'm going to close this.

