Error when training BART/T5 for summarization using transformers and Amazon SageMaker

huggingface commented on May 19, 2024

Error when training BART/T5 for summarization using transformers and Amazon SageMaker

from blog.

Comments (12)

philschmid commented on May 19, 2024

Hello @wry818,
Can you try restarting the job? This looks like an internal SageMaker issue.
Additionally, could you put the logs into a code block (```)? They are very hard to read this way.

wry818 commented on May 19, 2024

@philschmid
Yes, I reran the job, but SageMaker threw the same error.
I've updated my post and added the code block.

Thanks for the quick reply.

philschmid commented on May 19, 2024

Is [1,0]<stderr>:Using amp fp16 backend the first line in the log? Could you please send the whole log? MPI errors are often hidden far above the point where the job stops.

Additionally, could you add your hyperparameters?

wry818 commented on May 19, 2024

I used the same hyperparameters as in the blog.
I've attached both the .ipynb file and the full output log for you.

.ipynb zip file:
hugging-face-training.zip

output log file:
log_output.txt

philschmid commented on May 19, 2024

Found something at line 9937. It seems the p3.16xlarge has less GPU memory, or needs more of it to distribute the data.

[1,0]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 15.78 GiB total capacity; 14.20 GiB already allocated; 9.75 MiB free; 14.56 GiB reserved in total by PyTorch)
2021-05-28 09:03:20,489 sagemaker-training-toolkit INFO     Orted process exited

Could you try decreasing the batch size?
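
Because the root cause can sit thousands of lines above the point where the job stops, it can help to scan the saved log for error lines instead of reading it from the end. The following is only a minimal sketch, assuming the log has been saved locally as plain text. The function name find_error_lines and the pattern list are illustrative assumptions, not part of SageMaker or transformers.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns; extend this list for your own logs.
ERROR_PATTERN = re.compile(r"CUDA out of memory|RuntimeError|Traceback")


def find_error_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for each line matching ERROR_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


# Small in-memory example using lines quoted in this thread.
sample_log = [
    "[1,0]<stderr>:Using amp fp16 backend\n",
    "[1,0]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB\n",
    "2021-05-28 09:03:20,489 sagemaker-training-toolkit INFO     Orted process exited\n",
]
hits = find_error_lines(sample_log)
for number, text in hits:
    print(f"{number}: {text}")
```

For a real log, the same function could be given an open file handle, for example open("log_output.txt", encoding="utf-8", errors="replace").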

wry818 commented on May 19, 2024

Thanks for finding the problem.
About the batch size: do you mean per_device_train_batch_size in the hyperparameters? What about per_device_eval_batch_size? Should I decrease both?

I'm fairly new to this kind of work.
How much do you suggest decreasing the batch size? Or do I really need an 'ml.p3dn.24xlarge' instance to run this training job?

philschmid commented on May 19, 2024

No, it should work with p3.16xlarge. Try using 2 for both; if this works, you could try 3.
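
As a rough sketch of that change, both batch sizes can be overridden in the hyperparameters dictionary before it is passed to the estimator. The helper with_batch_size is a hypothetical name, and the starting values below are placeholders standing in for whatever the notebook currently uses.

```python
from typing import Any, Dict


def with_batch_size(hyperparameters: Dict[str, Any], batch_size: int) -> Dict[str, Any]:
    """Return a copy with both per-device batch sizes set to batch_size."""
    updated = dict(hyperparameters)
    updated["per_device_train_batch_size"] = batch_size
    updated["per_device_eval_batch_size"] = batch_size
    return updated


# Placeholder values (assumption): replace with the notebook's real hyperparameters.
current_hyperparameters = {
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
}

first_try = with_batch_size(current_hyperparameters, 2)   # start here
second_try = with_batch_size(current_hyperparameters, 3)  # only if 2 fits in memory
```

Returning a copy keeps the original dictionary untouched, so each attempt can be launched from the same starting point.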

wry818 commented on May 19, 2024

OK, one more question. After SageMaker has created the model, is it possible to deploy it to a SageMaker endpoint as well? If yes, how can I do it? Could you point me in the right direction? This is my client's request.

wry818 commented on May 19, 2024

@philschmid I've adjusted the parameters and run the job again. It looks like the training completed, but it still threw an error when uploading the model to SageMaker.

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-05-31-01-39-31-463: Failed. Reason: ClientError: Artifact upload failed:Insufficient disk space

Any suggestions? How large is the trained model?

philschmid commented on May 19, 2024

Hey @wry818,
Sadly, there is currently no Hugging Face DLC for inference, but we are working hard on it. I'll keep you posted.
The error seems to appear at the end of your training: when transformers saves the model, the disk is full. You can increase the disk space by adjusting the volume_size parameter of the HuggingFace estimator in sagemaker.

volume_size (int) – Size in GB of the EBS volume to use for storing input data during training (default: 30). Must be large enough to store training data if File Mode is used (which is the default).

[REF]
Additionally, transformers saves a checkpoint every 500 steps by default:

save_steps (int, optional, defaults to 500) – Number of updates steps before two checkpoint saves if save_strategy="steps".

[REF]
You could increase this to save fewer checkpoints and use less storage.
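
A minimal sketch of how the two settings could be grouped, assuming the training script forwards its hyperparameters to the transformers TrainingArguments (as the transformers example scripts do). The helper estimator_kwargs is a hypothetical name, and the values 200 and 2000 are illustrative only, not recommendations.

```python
from typing import Any, Dict


def estimator_kwargs(volume_size_gb: int, save_steps: int) -> Dict[str, Any]:
    """Build keyword arguments for a larger disk and less frequent checkpoints."""
    return {
        "volume_size": volume_size_gb,  # EBS volume size in GB (default: 30)
        "hyperparameters": {
            "save_steps": save_steps,  # transformers default: 500
        },
    }


kwargs = estimator_kwargs(volume_size_gb=200, save_steps=2000)
print(kwargs)

# With the sagemaker SDK installed and AWS credentials configured, volume_size
# would be passed to the HuggingFace estimator, and save_steps merged into the
# hyperparameters dictionary that the notebook already passes:
#
#   from sagemaker.huggingface import HuggingFace
#   estimator = HuggingFace(..., volume_size=kwargs["volume_size"],
#                           hyperparameters={**existing, **kwargs["hyperparameters"]})
#
# "existing" and the elided arguments stand for what the notebook already uses.
```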

wry818 commented on May 19, 2024

@philschmid Thanks. About these two parameters:
By how much do you suggest I increase volume_size and save_steps for the estimator so that the model can be uploaded successfully?

philschmid commented on May 19, 2024

This depends on your needs, but increasing volume_size is pretty easy, and 200 GB costs around $0.32 per hour.
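
One way to reason about the trade-off is a back-of-the-envelope estimate of how much disk the checkpoints alone might need. This is only a sketch: the total step count and the per-checkpoint size below are made-up placeholders, and the estimate ignores the input data and the final saved model.

```python
def checkpoints_written(total_steps: int, save_steps: int) -> int:
    """Number of checkpoints saved when one is written every save_steps steps."""
    return total_steps // save_steps


def checkpoint_disk_gb(total_steps: int, save_steps: int, checkpoint_gb: float) -> float:
    """Approximate disk used by checkpoints if none are deleted during training."""
    return checkpoints_written(total_steps, save_steps) * checkpoint_gb


# Hypothetical numbers: 10,000 training steps, 5 GB per checkpoint.
print(checkpoint_disk_gb(10_000, 500, 5.0))    # 20 checkpoints -> 100.0 GB
print(checkpoint_disk_gb(10_000, 2_000, 5.0))  # 5 checkpoints  -> 25.0 GB
```

Under these assumed numbers, the default save_steps of 500 would not fit on the default 30 GB volume, while either a larger volume_size or a larger save_steps would reduce the pressure.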
