Comments (12)
Hello @wry818,
Can you try restarting the job? This looks like an internal SageMaker issue.
Additionally, could you put the logs into a code block (```)? They are very hard to read this way.
from blog.
@philschmid
Yes, I reran the job, but SageMaker threw the same error.
I've updated my post and added the code block.
Thanks for the quick reply.
Is [1,0]<stderr>:Using amp fp16 backend
the first line in the log? Could you please send the whole log? The MPI errors are often hidden well above the point where it stops.
Additionally, could you add your hyperparameters?
I used the same hyperparameters written in the blog.
I've attached both the .ipynb file and the full output log for you.
.ipynb zip file:
hugging-face-training.zip
output log file:
log_output.txt
Found something at line 9937. It seems the p3.16xlarge
has too little memory / needs more to distribute the data.
[1,0]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 15.78 GiB total capacity; 14.20 GiB already allocated; 9.75 MiB free; 14.56 GiB reserved in total by PyTorch)
2021-05-28 09:03:20,489 sagemaker-training-toolkit INFO Orted process exited
Could you try to decrease the batch size?
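As a side note, the figures in that OOM message can be pulled out programmatically. This is just an illustrative Python sketch (the regex and helper are mine, not part of the blog or the SDK):

```python
import re

# Illustrative helper: extract the memory figures from the CUDA OOM
# message found at line 9937 of the log (values copied from that line).
oom = ("RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB "
       "(GPU 0; 15.78 GiB total capacity; 14.20 GiB already allocated; "
       "9.75 MiB free; 14.56 GiB reserved in total by PyTorch)")

pattern = r"([\d.]+) (GiB|MiB) (total capacity|already allocated|free|reserved)"

def to_gib(value: str, unit: str) -> float:
    """Normalize MiB values to GiB so the numbers are comparable."""
    return float(value) / 1024 if unit == "MiB" else float(value)

mem = {name: to_gib(v, u) for v, u, name in re.findall(pattern, oom)}
# mem shows ~14.2 GiB already allocated and under 10 MiB free on a ~16 GiB
# V100, which is why even a 28 MiB allocation fails.
```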
Thanks for finding the problem.
About the batch size: do you mean per_device_train_batch_size
in the hyperparameters? What about per_device_eval_batch_size
? Should I decrease them both?
I'm fairly new to this kind of job.
How much do you suggest I decrease the batch size? Or do I really need an ml.p3dn.24xlarge instance to run this training job?
No, it should work with p3.16xlarge.
Try using 2 for both; if this works, you could try 3.
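For illustration, assuming the hyperparameters dict from the blog setup, the change would look roughly like this (the other keys are placeholders, not prescriptions; the two batch-size keys map to the transformers Trainer arguments of the same name):

```python
# Hyperparameters passed to the HuggingFace estimator; only the two
# batch-size keys are the change being suggested here.
hyperparameters = {
    "epochs": 3,                              # illustrative value
    "model_name": "distilbert-base-uncased",  # illustrative; use the blog's model
    "per_device_train_batch_size": 2,         # reduced to avoid CUDA OOM on 16 GiB GPUs
    "per_device_eval_batch_size": 2,          # reduced alongside the train batch size
}
```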
Ok. One more question: after SageMaker has created the model, is it possible to deploy it to a SageMaker endpoint as well? If yes, how can I do it? Could you point me in the right direction? This is my client's request.
@philschmid I've adjusted the parameters and ran the job again. It looks like the training completed, but it still threw an error when uploading the model to SageMaker.
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-05-31-01-39-31-463: Failed. Reason: ClientError: Artifact upload failed:Insufficient disk space
Any suggestions? How large is the trained model?
Hey @wry818,
Sadly, there is currently no Hugging Face DLC for inference, but we are working hard on it. I'll keep you posted on this.
The error appears at the end of your training: when transformers
saves the model, the disk runs out of space. You can increase the disk size by adjusting the volume_size
parameter of the HuggingFace
estimator in sagemaker.
volume_size (int) – Size in GB of the EBS volume to use for storing input data during training (default: 30). Must be large enough to store training data if File Mode is used (which is the default).
[REF]
Additionally, transformers
saves a checkpoint every 500
steps:
save_steps (int, optional, defaults to 500) – Number of updates steps before two checkpoint saves if save_strategy="steps".
[REF]
You could increase this to save fewer checkpoints and use less storage.
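A rough, illustrative calculation (step count and checkpoint size are assumed, not measured from this job) of why raising save_steps saves disk space:

```python
# Illustrative only: estimate the disk consumed by intermediate
# checkpoints alone (the final saved model is excluded).
def checkpoint_storage_gb(total_steps: int, save_steps: int, ckpt_size_gb: float) -> float:
    """Number of periodic checkpoints times the size of each one."""
    return (total_steps // save_steps) * ckpt_size_gb

# Assumed: 10,000 optimizer steps, ~2.5 GB per checkpoint
# (model weights plus optimizer state).
default_usage = checkpoint_storage_gb(10_000, 500, 2.5)    # 20 checkpoints
sparser_usage = checkpoint_storage_gb(10_000, 2_000, 2.5)  # 5 checkpoints
```

With these assumptions, the default save_steps=500 would fill 50 GB with checkpoints alone, which is more than the estimator's 30 GB default volume_size; saving every 2,000 steps cuts that to 12.5 GB.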
@philschmid Thanks. About these two parameters: how much do you suggest I increase volume_size
and save_steps
for the estimator, so the model can be uploaded successfully?
This depends on your needs, but increasing volume_size
is pretty easy, and 200 GB costs around $0.32 per hour.