Comments (12)

philschmid commented on May 19, 2024

Hello @wry818,
Can you try restarting the job? This looks like an internal SageMaker issue.
Additionally, could you put the logs into a code block (```)? They are very hard to read this way.

from blog.

wry818 commented on May 19, 2024

@philschmid
Yes, I reran the job, but SageMaker threw the same error.
I've updated my post and added the code block.

Thanks for the quick reply.

philschmid commented on May 19, 2024

Is "[1,0]<stderr>:Using amp fp16 backend" the first line in the log? Could you please send the whole log? The MPI errors are often hidden far above the point where the job stops.

Additionally, could you add your hyperparameters?

wry818 commented on May 19, 2024

I used the same hyperparameters as in the blog post.
I've attached both the .ipynb file and the full output log for you.

.ipynb zip file:
hugging-face-training.zip

output log file:
log_output.txt

philschmid commented on May 19, 2024

Found something at line 9937. It seems p3.16xlarge has less memory, or needs more of it to distribute the data.

[1,0]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 15.78 GiB total capacity; 14.20 GiB already allocated; 9.75 MiB free; 14.56 GiB reserved in total by PyTorch)
2021-05-28 09:03:20,489 sagemaker-training-toolkit INFO     Orted process exited

Could you try decreasing the batch size?
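
Since errors like this one can sit thousands of lines above the end of the log, a small script can help locate them. The following is only a minimal sketch: `find_error_lines` and the marker list in `ERROR_PATTERN` are hypothetical helpers made up for illustration, not part of SageMaker or transformers.

```python
import re

# Assumed error markers; extend this list for your own logs.
ERROR_PATTERN = re.compile(r"RuntimeError|CUDA out of memory|Traceback|ClientError")


def find_error_lines(lines):
    """Return (line_number, text) pairs for every line matching ERROR_PATTERN.

    Line numbers are 1-based so they match what a text editor shows.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if ERROR_PATTERN.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits


# Example usage with the log file attached earlier in this thread:
# with open("log_output.txt", encoding="utf-8", errors="replace") as handle:
#     for number, text in find_error_lines(handle):
#         print(f"{number}: {text}")
```

The first hit is usually the most useful one, because later errors are often just consequences of the first failure.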

wry818 commented on May 19, 2024

Thanks for finding the problem.
About the batch size: do you mean per_device_train_batch_size in the hyperparameters? What about per_device_eval_batch_size? Should I decrease both?

I'm fairly new to this kind of job.
How much do you suggest decreasing the batch size? Or do I really need an 'ml.p3dn.24xlarge' instance to run this training job?

philschmid commented on May 19, 2024

No, it should work with p3.16xlarge. Try using 2 for both; if that works, you could try 3.
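
As a rough illustration of that change, the sketch below sets both batch sizes in a hyperparameters dictionary and computes the resulting global batch size. The starting values are placeholders (the blog's actual hyperparameters are not shown in this thread), the helper names are hypothetical, and the GPU count of 8 is an assumption that matches ml.p3.16xlarge.

```python
# Number of GPUs per instance; 8 is an assumption that matches ml.p3.16xlarge.
GPUS_PER_INSTANCE = 8


def with_batch_size(hyperparameters, batch_size):
    """Return a copy of the hyperparameters with both batch sizes set."""
    updated = dict(hyperparameters)
    updated["per_device_train_batch_size"] = batch_size
    updated["per_device_eval_batch_size"] = batch_size
    return updated


def global_train_batch_size(hyperparameters, instance_count=1):
    """Samples per training step across all GPUs, ignoring gradient accumulation."""
    per_device = hyperparameters["per_device_train_batch_size"]
    return per_device * GPUS_PER_INSTANCE * instance_count


# Placeholder starting values, not the ones from the blog.
hyperparameters = {"per_device_train_batch_size": 4, "per_device_eval_batch_size": 4}
hyperparameters = with_batch_size(hyperparameters, 2)

print(hyperparameters)
print(global_train_batch_size(hyperparameters))
```

The per-device value is what determines memory use on each GPU, so lowering it is what addresses the out-of-memory error; the global batch size shrinks along with it.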

wry818 commented on May 19, 2024

OK, one more question. After SageMaker has created the model, is it possible to deploy it to a SageMaker endpoint as well? If so, how can I do it? Could you point me in the right direction? This is my client's request.

wry818 commented on May 19, 2024

@philschmid I've adjusted the parameters and run the job again. It looks like the training completed, but it still threw an error when uploading the model to SageMaker.

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-05-31-01-39-31-463: Failed. Reason: ClientError: Artifact upload failed:Insufficient disk space

Any suggestions? How large is the trained model?

philschmid commented on May 19, 2024

Hey @wry818,
Sadly, there is currently no Hugging Face DLC for inference, but we are working hard on it. I'll keep you posted.
The error seems to appear at the end of your training: when transformers saves the model, the disk is full. You can increase the disk space by adjusting the volume_size parameter of the HuggingFace estimator in sagemaker.

volume_size (int) – Size in GB of the EBS volume to use for storing input data during training (default: 30). Must be large enough to store training data if File Mode is used (which is the default).

[REF]
Additionally, transformers saves a checkpoint every 500 steps by default:

save_steps (int, optional, defaults to 500) – Number of updates steps before two checkpoint saves if save_strategy="steps".

[REF]
You could increase this to save fewer checkpoints and use less storage.
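
To get a feel for the effect of save_steps, the sketch below counts how many checkpoints a run would write for two settings and collects the two adjusted values. Everything concrete here is an assumption for illustration: the total step count and the chosen values are made up, `checkpoints_written` is a hypothetical helper, and passing save_steps as a hyperparameter only works if the training script forwards it to the transformers training arguments.

```python
def checkpoints_written(total_steps, save_steps):
    """Number of intermediate checkpoints saved when save_strategy="steps"."""
    return total_steps // save_steps


TOTAL_STEPS = 10_000  # hypothetical length of the training run

for save_steps in (500, 2000):
    count = checkpoints_written(TOTAL_STEPS, save_steps)
    print(f"save_steps={save_steps}: {count} checkpoints")

# Example values only; choose them to fit your model size and budget.
estimator_kwargs = {"volume_size": 200}  # EBS volume size in GB
hyperparameters = {"save_steps": 2000}   # assumes the script accepts this argument

# These would then be passed to the estimator together with its other arguments:
# from sagemaker.huggingface import HuggingFace
# huggingface_estimator = HuggingFace(..., hyperparameters=hyperparameters, **estimator_kwargs)
```

Each checkpoint holds a full copy of the model weights plus optimizer state, so the checkpoint count multiplies directly into disk usage.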

wry818 commented on May 19, 2024

@philschmid Thanks. About these two parameters:
By how much do you suggest I increase volume_size and save_steps for the estimator so that the model can be uploaded successfully?

philschmid commented on May 19, 2024

This depends on your needs, but increasing volume_size is pretty easy, and 200 GB costs around $0.32 per hour.
