What saves every 4000 points is great, but what about the continuation after the failu

Checkpoint and Continuation of training about morpheus HOT 5 CLOSED

asigalov61 commented on May 28, 2024

Checkpoint and Continuation of training

from morpheus.

Comments (5)

Landers125 commented on May 28, 2024

On the free Colab K80 I set the batch size to 8; it trains for 3-4 hours, then the session breaks off. How would we continue?

asigalov61 commented on May 28, 2024

@Landers125 Great question!!! I was thinking about implementing state save so that training can be continued after the failure. It would be quite useful indeed.

Unfortunately, it is not very straightforward to do, so atm it is not possible AFAIK. Sorry :(

I will definitely post an update to the implementation if I ever do it...

For now, you can try using paperspace.com (they give like 6 hour free runtimes), use a smaller dataset/model size, and you can also find an inexpensive GPU plan or something like that...

Hope this answers your questions.

Alex

Landers125 commented on May 28, 2024

number_of_batches = 14

Thanks! I have started a training session on Kaggle.

You are doing a very useful thing!

asigalov61 commented on May 28, 2024

@Landers125 Thank you. I am happy that you enjoy my work. It means a lot to me :)

Yes, Kaggle and some other companies like paperspace offer GPU plans/free GPUs that are better than Google. I am happy you found a good solution for your needs.

Alex

asigalov61 commented on May 28, 2024

@Landers125 Btw, you can technically restart the training after failure by loading the last checkpoint and the original dataset.

You can even set the final learning rate in the training section of the code.

The problem is that it will start training from the beginning of the dataset, which will be kinda redundant and not very effective.

I will look into it some more soon, I hope, and I will add it to the implementation if it is possible.
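
One way to avoid restarting from the beginning of the dataset is to store the training position together with the checkpoint and skip the already-consumed batches on resume. The sketch below is only an illustration in plain Python, not the morpheus code: `save_checkpoint`, `load_checkpoint`, `train`, and the `model_state` field are hypothetical names, and the "training step" is a toy running sum standing in for a real model update. In a real run the model weights, optimizer state, and current learning rate would presumably be written to the same file.

```python
import json
import os
from itertools import islice


def save_checkpoint(path, step, model_state):
    """Store the training position together with the (toy) model state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "model_state": model_state}, f)
    # Rename only after the write finished, so a session that dies
    # mid-save does not leave a half-written checkpoint behind.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "model_state": 0}
    with open(path) as f:
        return json.load(f)


def train(batches, path, save_every=4000, stop_after=None):
    """Toy training loop that resumes from the last saved batch index.

    `batches` must come in the same order on every run (for example,
    shuffled once with a fixed seed), otherwise skipping by index
    would skip the wrong data.
    """
    state = load_checkpoint(path)
    step = state["step"]
    model_state = state["model_state"]

    # Skip the batches that were already consumed before the interruption.
    for batch in islice(batches, step, None):
        model_state += sum(batch)  # hypothetical stand-in for one training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(path, step, model_state)
        if stop_after is not None and step >= stop_after:
            return step, model_state  # simulate the session being cut off

    save_checkpoint(path, step, model_state)
    return step, model_state
```

With this layout, work done after the last save is lost and repeated on resume, so at most `save_every - 1` batches are redone instead of the whole dataset.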
