Comments (5)

Landers125 commented on May 28, 2024

On the free Colab K80 with a batch size of 8, training runs for 3-4 hours and then the session breaks off. How can we continue?

from morpheus.

asigalov61 commented on May 28, 2024

@Landers125 Great question!!! I was thinking about implementing state save so that training can be continued after the failure. It would be quite useful indeed.

Unfortunately, it is not very straightforward to do, so at the moment it is not possible, AFAIK. Sorry :(

I will definitely post an update to the implementation if I ever do it...

For now, you can try using paperspace.com (they give like 6 hour free runtimes), use a smaller dataset/model size, and you can also find an inexpensive GPU plan or something like that...

Hope this answers your questions.

Alex

Landers125 commented on May 28, 2024

number_of_batches = 14
2022-01-12_20-47-10
Thanks! Kaggle has launched a training session.

You are doing a very useful thing!

asigalov61 commented on May 28, 2024

@Landers125 Thank you. I am happy that you enjoy my work. It means a lot to me :)

Yes, Kaggle and some other companies like Paperspace offer GPU plans/free GPUs that are better than Google's. I am happy you found a good solution for your needs.

Alex

asigalov61 commented on May 28, 2024

@Landers125 Btw, you can technically restart the training after failure by loading the last checkpoint and the original dataset.

You can even set the final learning rate in the training section of the code.

The problem is that it will start training from the beginning of the dataset again, which will be kinda redundant and not very effective.

I will look into it some more soon, I hope, and I will add it to the implementation if it is possible.
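
One possible way around the "starts from the beginning of the dataset" problem is to store the index of the next batch together with the checkpoint and skip the already-seen batches on restart. The following is only a minimal, framework-free sketch of that idea, not the actual Morpheus code: `batches`, `save_state`, `load_state`, `train`, `save_every`, and `stop_after` are all hypothetical names, the "weights" are a single number standing in for a real model, and it assumes the dataset is iterated in the same fixed order on every run. In a PyTorch setup the saved dictionary would typically also hold the model's and optimizer's `state_dict()`.

```python
import json
import os
import tempfile


def batches(dataset, batch_size):
    """Yield (batch_index, batch) pairs in a fixed, repeatable order."""
    for start in range(0, len(dataset), batch_size):
        yield start // batch_size, dataset[start:start + batch_size]


def save_state(path, next_batch, weights):
    """Write the state to a temp file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_batch": next_batch, "weights": weights}, f)
    os.replace(tmp_path, path)


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_batch": 0, "weights": 0.0}


def train(dataset, batch_size, state_path, save_every=2, stop_after=None):
    """Train from the last saved position; return (batch indices seen, weights).

    stop_after is a hypothetical knob that simulates the session being cut off
    after that many batches.
    """
    state = load_state(state_path)
    weights = state["weights"]
    seen = []
    for index, batch in batches(dataset, batch_size):
        if index < state["next_batch"]:
            continue  # already covered by the checkpoint, skip it
        if stop_after is not None and len(seen) >= stop_after:
            break  # simulated runtime disconnect
        weights += sum(batch)  # stand-in for a real optimizer step
        seen.append(index)
        if (index + 1) % save_every == 0:
            save_state(state_path, index + 1, weights)
    return seen, weights


if __name__ == "__main__":
    data = list(range(16))
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = os.path.join(folder, "state.json")
        print(train(data, 2, checkpoint, stop_after=3))  # interrupted run
        print(train(data, 2, checkpoint))                # resumed run
```

Work done after the last save is lost on interruption, so the resumed run repeats at most `save_every - 1` batches instead of the whole dataset. Skipping by index only helps if the batch order is deterministic; with shuffling, the random seed would need to be saved as well.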
