Comments (8)
Hi @ldh127 - can you please be more specific, share more about what you are trying to do and what errors you are hitting?
@ldh127, does the following help?
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#zero-checkpoint-fp32-weights-recovery
Yes. I use the transformers Trainer to call DeepSpeed. It saves a DeepSpeed checkpoint that contains per-GPU model and optimizer files. I want a single optim.pt file, which I use for selecting SFT data. My code can only load one global optim.pt, but the DeepSpeed checkpoint splits the optimizer and model state across multiple files. How can I merge the multiple optimizer files into one global file?
Yes, but I think this code is for ds2universal model parameters, not for merging multiple optimizer files into one file. Can it merge the DeepSpeed multi-GPU optimizer files into a single PyTorch optim.pt file?
@ldh127, why do you say the link is related to ds2universal? Did you try it? Can you clarify how your scenario is different from the use case below? Thanks!
Yes, I tried this code, and I did end up with a single .pth file. Here are the details.
I pointed your code at my DeepSpeed checkpoint folder, and it merged the shards and saved a single file.
When I print the names in the resulting state_dict, they look like this:
base_model.model.model.layers.38.self_attn.q_proj.lora_A.default.weight
base_model.model.model.layers.38.self_attn.q_proj.lora_B.default.weight
base_model.model.model.layers.38.self_attn.k_proj.lora_A.default.weight
base_model.model.model.layers.38.self_attn.k_proj.lora_B.default.weight
These look like model parameter names, not optimizer state names?
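One quick way to tell the two kinds of file apart is to look at the top-level keys. The sketch below assumes the loaded object is a plain dict: a standard PyTorch optimizer `state_dict()` has the top-level keys `state` and `param_groups`, while a model state_dict is a flat mapping from dotted parameter names to tensors. The helper name `classify_checkpoint` is hypothetical and not part of DeepSpeed or PyTorch.

```python
def classify_checkpoint(obj):
    """Guess whether a loaded checkpoint dict holds model weights or optimizer state."""
    if not isinstance(obj, dict):
        return "unknown"
    # A PyTorch optimizer state_dict has these two top-level keys.
    if {"state", "param_groups"}.issubset(obj):
        return "optimizer"
    # A model state_dict maps dotted parameter names to tensors.
    if obj and all(isinstance(key, str) and "." in key for key in obj):
        return "model"
    return "unknown"


# The names printed in this thread follow the model-weights pattern.
example = {
    "base_model.model.model.layers.38.self_attn.q_proj.lora_A.default.weight": None,
}
print(classify_checkpoint(example))
```

By this check, names such as `base_model.model.model.layers.38.self_attn.q_proj.lora_A.default.weight` indicate consolidated model weights rather than optimizer state.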
As the picture above shows, the final file is named demo_state_dict.pth. If it contains the optimizer parameters, how do I get the optimizer state_dict? If it were the merged optimizer file, I would expect to get the single optimizer dict with something like state_dict["optim_state"], but the dict has no optim_state key, so I do not know which of my steps is wrong.
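The optimizer state most likely still lives in the per-rank shard files inside the checkpoint folder, not in the consolidated weights file. Below is a minimal sketch for locating those shards. It assumes they match the glob pattern `*_optim_states.pt`, which is how the `zero_to_fp32.py` script appears to find them; the folder listing from this thread is not visible here, so verify the pattern against your own checkpoint. `find_optim_shards` and `natural_key` are hypothetical helper names.

```python
import glob
import os
import re


def natural_key(path):
    """Sort key that orders rank_2 before rank_10 by comparing digit runs as ints."""
    name = os.path.basename(path)
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def find_optim_shards(checkpoint_dir, pattern="*_optim_states.pt"):
    """Return the per-rank optimizer shard files in rank order.

    The default pattern is an assumption; adjust it to your checkpoint layout.
    """
    files = glob.glob(os.path.join(checkpoint_dir, pattern))
    return sorted(files, key=natural_key)
```

Each path returned could then be loaded with `torch.load(path, map_location="cpu")` and inspected key by key before attempting any merge.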
I also read the code at this URL: https://github.com/microsoft/DeepSpeed/blob/4c15ad9f8d51a1950842c69bbbc9d93c73afbcfc/deepspeed/utils/zero_to_fp32.py , but I do not know what code I would need to change. Can you give me more detailed help? Thanks.
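For orientation, the core idea that script applies to the fp32 weights under ZeRO stage 1/2 can be sketched as follows: concatenate each rank's slice of a flat buffer in rank order, then cut the result back into parameters using their shapes. Applying the same steps to optimizer buffers such as Adam's `exp_avg` and `exp_avg_sq` is an assumption that nobody in this thread has confirmed, and ZeRO stage 3 partitions each parameter differently. The sketch uses plain Python lists instead of tensors, and `merge_flat_partitions` is a hypothetical name.

```python
from math import prod


def merge_flat_partitions(rank_partitions, param_shapes):
    """Rebuild per-parameter flat values from per-rank slices of one flat buffer.

    rank_partitions: one flat list of numbers per rank, in rank order.
    param_shapes: dict of name -> shape tuple, in the order the parameters
                  were flattened (an assumption about the checkpoint layout).
    """
    # Step 1: stitch the per-rank slices back into one flat buffer.
    flat = [value for part in rank_partitions for value in part]

    needed = sum(prod(shape) for shape in param_shapes.values())
    if len(flat) < needed:
        raise ValueError(f"have {len(flat)} elements, need {needed}")

    # Step 2: slice the buffer by each parameter's element count.
    merged, offset = {}, 0
    for name, shape in param_shapes.items():
        numel = prod(shape)
        merged[name] = flat[offset:offset + numel]
        offset += numel

    # Elements past `needed` are treated as alignment padding and dropped.
    return merged
```

With tensors, step 1 would correspond to `torch.cat` and step 2 to slicing followed by a reshape to each parameter's shape.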