Comments (8)
Hi, Artnoage. If you check the training log, you will see that I actually resumed the process twice and did not notice any memory error. I am not sure why that happens on your end.
Thanks for your suggestion about opening a Discord. I think I will open one soon.
Yes, I already knew that. I was just thinking it might have to do with the size of the network, like some missed parameter. When you did the first run, did you check the memory usage? If it is not a big issue, please leave the question open for a while, in case someone else tries it.
The memory usage is always 39 GB on my end.
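If anyone wants to compare numbers on their own machine, a simple way to watch per-GPU memory while training runs is standard NVIDIA tooling (nothing specific to this repo):

# poll per-GPU memory once per second during training
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv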
Hi! My training crashed, and I couldn't find the code to resume training from the last saved checkpoint. How can I resume my training? How do you handle this?
You can add --resume your_checkpoint.pth to your pretraining command to resume training.
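For example, a minimal sketch (the script name and checkpoint path below are placeholders; substitute your own launch command — only the --resume flag comes from the reply above):

# hypothetical resume invocation; adjust script and paths to your setup
python pretrain/tinyllama.py --resume out/iter-100000-ckpt.pth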
Thanks, I found that.
@artnoage, can you please post this project? I would like to try the same at home, training with two 4060s.
Related Issues (20)
- Should this line use args.seed instead of seed=42?
- Pretraining failing on IndexError: list index out of range in file packed_dataset.py HOT 1
- A question on learning rate decay schedule HOT 1
- Why FSDP and not DDP?
- On the visualization of Wandb in fine-tuning
- Is there any simple demo of fine-tuning TinyLlama HOT 4
- More intermediate checkpoints in < 240k steps
- The results under the FastChat framework are quite bizarre?
- Models and code are welcome to be published on the wisemodel.cn open-source community
- Encountered an issue while loading the model using transformers HOT 1
- A potential bug in multi-GPU training HOT 1
- Where is the pretraining example of llama-1.1b-chat
- Would it be possible to provide help with evaluation?
- On which will it run better
- Training Run - New Tokenizer HOT 1
- Llama 3 HOT 2
- Model structure HOT 1
- model.py HOT 1
- Clarify whether Chinese is supported in the README HOT 1
- Are there reports of other benchmarks for the chat model, e.g. mt-bench?