Comments (12)

alfpark commented on August 25, 2024

@andreadotti Can you confirm that the blobxfer docker container is running on the node where the task is stuck in the preparing state? You can use jobs listtasks to see which node the task is assigned to. If a task fails during the preparation phase, it can stay stuck in the preparing state when there is no node available to run it (which can happen if all nodes are busy or if the preparation task has failed on all nodes).

Also, it could appear "stuck" while computing MD5 for very large files (it's not actually stuck, it's just taking a long time). Can you try setting the blobxfer extra options property to skip the MD5 computation? E.g., "blobxfer_extra_options": "--no-computefilemd5"
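
For reference, a minimal sketch of where that option lives, assuming the JSON output_data block of the Batch Shipyard jobs configuration from that era (the storage settings name, container, and include filter below are placeholders):

"output_data": {
    "azure_storage": [
        {
            "storage_account_settings": "mystorageaccount",
            "container": "myoutputcontainer",
            "include": ["wd/*.tgz"],
            "blobxfer_extra_options": "--no-computefilemd5"
        }
    ]
}

Field names other than blobxfer_extra_options may differ between Batch Shipyard versions.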

andreadotti commented on August 25, 2024

Hi @alfpark ,
I don't think the problem was a slow MD5 computation, because the file is very small, O(100 kB). I ssh'd onto the machine and could see from top that no process was using a significant amount of CPU or memory.
I then killed the task and the jobs continued working correctly. The remaining tasks are now running.

One strange thing I've noticed that could be related: I set up a metric on the Azure portal to monitor the node count, and it dropped to zero (together with other numbers such as tasks completed/started) around the same time the issue with the stuck job appeared. From the Azure portal it looks as if I no longer have any jobs running and the pool has 0 cores deployed, yet I can see the jobs running and the VMs are actually up.

Finally, I noticed an earlier open issue regarding the Ubuntu SKU 16.04.0-LTS. I'm using the same image.

Thank you,
Andrea

alfpark commented on August 25, 2024

@andreadotti Glad to hear you've mitigated it on your end. A few more questions:

  1. Do you have the task retry count set to something positive?
  2. Did you happen to also look at the stderr.txt file for the task to see if anything was present in that file?

A possible explanation is that the task fails during upload and the failure output is logged to stderr.txt (so you may not have noticed an error). The task is then retried by the system while all nodes are busy (although I believe the task state should be reset back to active when it's retried).
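
For reference, the retry count from question 1 is set at the job level in the jobs configuration; a minimal sketch, assuming the max_task_retries property and omitting the other job settings (the job id is a placeholder and the value 3 is only illustrative):

"job_specifications": [
    {
        "id": "myjob",
        "max_task_retries": 3
    }
]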

Regarding your node count metric dropping to zero: did you happen to run pool list and pool listnodes locally to see what was going on outside of the Portal? This sounds like very strange behavior, and if it continues to occur (or is still ongoing), you will need to open a support ticket in the Azure Portal so it can be diagnosed further with the context of your account, etc.

andreadotti commented on August 25, 2024

@alfpark ,

  1. I did not set the task retry count. Thanks for the tip, I'll do that.
  2. The stderr.txt file was empty.

Regarding the metric: both pool list and pool listnodes give valid results and a list of nodes that are up and running. Even from the portal I can navigate through the blades and get the list of nodes up and running, and I can also ssh to them correctly. It is just the metric in the portal that is wrong. I will open a ticket, since it seems to be related only to that problem.

Andrea

alfpark commented on August 25, 2024

@andreadotti If you encounter the other issue again, you can also raise a support ticket with the Batch service noting your account name, region, approximate time (you can use the state transition time on the task), pool name, job and task names, and a description of the problem.

andreadotti commented on August 25, 2024

@alfpark, my tasks are done and I can now see, in a quiet environment, that I have two cases of such stuck jobs. The symptoms are exactly the same: the job is done, there is no error message from my side, the last line of stdout.txt shows that blobxfer is stuck, and stderr.txt is empty.

Do you have any test I could perform before starting my next round of jobs? I can ssh to the VMs and do some testing there, but I do not know what to look for...

Andrea

alfpark commented on August 25, 2024

@andreadotti If you do jobs listtasks and find the node id of the tasks that are stuck, please use pool ssh --nodeid <node id of stuck task>. While on the node, you can issue sudo docker ps -a.

If you could copy and paste the jobs listtasks output (redacting anything sensitive) and the sudo docker ps -a output for each stuck task, that would be helpful.

Additionally, you can also issue data listfiles --jobid <jobid> --taskid <taskid> and ensure that the files intended for upload actually exist for the task.
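
Put together, a diagnostic pass over one stuck task might look roughly like the following sketch (identifiers in angle brackets are placeholders, and exact flags may vary by Batch Shipyard version):

# find the node id that the stuck task is assigned to
shipyard jobs listtasks --configdir .

# ssh to that node, then check the state of the blobxfer container on it
shipyard pool ssh --nodeid <node id of stuck task> --configdir .
sudo docker ps -a

# back on the client: verify that the files intended for upload exist
shipyard data listfiles --jobid <jobid> --taskid <taskid> --configdir .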

andreadotti commented on August 25, 2024

docker ps -a gives back:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
3ad52f0fb834        alfpark/blobxfer    "blobxfer geant4da..."   26 hours ago        Up 26 hours                             competent_murdock

The output file does exist; I could copy it via scp to my local machine and it contains what I expect. I have it available if you want to see it.

Andrea

andreadotti commented on August 25, 2024

Sorry, forgot to paste this:

# shipyard data listfiles --jobid singleinteractions-QGSP_BERT-HadInel --taskid 9502 --configdir .
2017-03-02 20:05:39,813Z INFO convoy.batch:list_task_files:1580 task_id=9502 file=stdout.txt [job_id=singleinteractions-QGSP_BERT-HadInel lmt=2017-03-01 17:21:23.579452+00:00 bytes=118135]
2017-03-02 20:05:39,813Z INFO convoy.batch:list_task_files:1580 task_id=9502 file=wd/QGSP_BERT-HadInel-proton-1GeV-G4_Na.tgz [job_id=singleinteractions-QGSP_BERT-HadInel lmt=2017-03-01 17:21:22.775430+00:00 bytes=472857]
2017-03-02 20:05:39,813Z INFO convoy.batch:list_task_files:1580 task_id=9502 file=wd/.shipyard.envlist [job_id=singleinteractions-QGSP_BERT-HadInel lmt=2017-03-01 17:21:10.151088+00:00 bytes=1059]
2017-03-02 20:05:39,832Z INFO convoy.batch:list_task_files:1580 task_id=9502 file=stderr.txt [job_id=singleinteractions-QGSP_BERT-HadInel lmt=2017-03-01 17:21:10.099086+00:00 bytes=0]

The output file is the wd/*.tgz archive.

alfpark commented on August 25, 2024

@andreadotti Thanks for the output. This does appear to be an issue with blobxfer rather than Batch or Batch Shipyard. Outside of exec'ing into the container and attaching gdb to the running blobxfer process, I'm not sure what the exact cause of the hang would be. It could be that your destination storage container has lots of blobs in it (which could potentially cause this behavior).
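
If you do want to try that, a rough sketch of the approach (the container id comes from your docker ps -a output above; gdb is likely not present in the container image and would need to be installed first):

# attach a shell to the running blobxfer container
sudo docker exec -it <container id> /bin/sh

# inside the container: locate the blobxfer process and attach a debugger
# (install gdb or an equivalent tool in the container before doing this)
ps aux | grep blobxfer
gdb -p <pid>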

Can you try modifying your output_data block and setting this option: "blobxfer_extra_options": "--no-computefilemd5 --no-skiponmatch"

This will prevent blobxfer from downloading the list of blobs in the container as a pre-check. Even without the MD5 check, the integrity of the data on the wire is still protected at the per-request level through TLS (https).

You can also consider adding a --timeout parameter to the above configuration property.
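
Combined, the resulting property might look roughly like this (the 30-second timeout is only an illustrative value):

"blobxfer_extra_options": "--no-computefilemd5 --no-skiponmatch --timeout 30"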

andreadotti commented on August 25, 2024

@alfpark in my case the container is pretty large (~30k objects). I will try the option you suggested for the next round of tests. In any case, this is relatively rare: I got 4 cases in 30k tasks.

alfpark commented on August 25, 2024

Closing, please re-open if necessary.
