Comments (6)
Jason Davenport commented:
Hello! To make sure I can best assist you, may I please ask for more information? A copy of the code in question would be the best place to start, along with a detailed rundown of what environment you are running in.
Let me know! Thanks 🙂
from wandb.
Hi, I've updated my question. Could you take a look, or is there any more information I need to provide?
Jason Davenport commented:
It sounds like you're encountering issues with multiple sweep IDs being created when using wandb.sweep in a multi-GPU setup with PyTorch's Distributed Data Parallel (DDP). This typically happens because wandb.sweep and wandb.agent are called inside the distributed environment, so each process starts its own sweep. Here's how you can address this:
- Centralize Sweep Initialization: To ensure that only one sweep ID is generated and used across all GPUs, call wandb.sweep only once. You can control this by checking whether the process is the master process and creating the sweep only there. Here's a code snippet to illustrate this:

    import os
    import wandb
    import torch.distributed as dist

    def is_master():
        return dist.get_rank() == 0

    if __name__ == "__main__":
        # Assumes dist.init_process_group() has already been called, and that
        # get_wandb_config() and run() are defined elsewhere in your script.
        sweep_config = get_wandb_config()
        sweep_id = None
        if is_master():
            sweep_id = wandb.sweep(sweep_config, project=sweep_config['project_name'])
        dist.barrier()  # Make sure the master process has created the sweep before continuing
        # Share the master's sweep ID; other ranks would otherwise have none.
        id_holder = [sweep_id]
        dist.broadcast_object_list(id_holder, src=0)
        sweep_id = id_holder[0]
        wandb.agent(sweep_id, function=run, count=3)

In the above code:
is_master() checks if the current process is the master process.
dist.barrier() is used to synchronize all processes, ensuring the sweep is created before any agent starts.
dist.broadcast_object_list() passes the sweep ID created on the master process to the other processes, which would otherwise have no sweep_id to give to wandb.agent.
- Configuration and Initialization within run(): Ensure that wandb.init() is called in a way that all processes can correctly log their outputs to the same project and sweep without creating new sweeps. This can be done by correctly passing the sweep_id and making sure wandb.init() uses parameters that do not vary by process unless intended:

    def run():
        if is_master():
            wandb.init(project=sweep_config['project_name'], config=sweep_config, group="DDP")
        else:
            wandb.init(project=sweep_config['project_name'], config=sweep_config, group="DDP", reinit=True)

In the code snippet above:
reinit=True is used for non-master processes to ensure that wandb.init() can be called multiple times across different processes.
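The "only the master process creates the sweep" idea from the two steps above can be tried out without wandb or a GPU. The following is a minimal sketch under stated assumptions: the rank is read from the RANK environment variable that torchrun sets, and create_sweep is a hypothetical stand-in for a call such as wandb.sweep(...). The helper names rank_from_env and create_sweep_once, and the sweep ID string, are made up for illustration.

```python
import os
from typing import Callable, List, Mapping, Optional


def rank_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Return the global rank, assuming a torchrun-style RANK variable (0 if unset)."""
    return int(env.get("RANK", "0"))


def create_sweep_once(create_sweep: Callable[[], str], rank: int) -> Optional[str]:
    """Call create_sweep() only on rank 0; every other rank gets None.

    create_sweep is a hypothetical stand-in for something like
    wandb.sweep(sweep_config, project=...). Non-master ranks still need
    the ID delivered some other way, for example with a broadcast.
    """
    if rank == 0:
        return create_sweep()
    return None


# Simulate four processes and count how often a sweep would be created.
calls: List[int] = []


def fake_create_sweep() -> str:
    calls.append(1)
    return "sweep-abc123"  # made-up ID, for illustration only


ids = [create_sweep_once(fake_create_sweep, rank) for rank in range(4)]
print(ids)
print("sweeps created:", len(calls))
```

Only rank 0 ends up with an ID here, which is why the real script also needs a step that hands that ID to the remaining processes before they call wandb.agent.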
    sweep_config = {
        'method': 'bayes',  # Example method
        'metric': {
            'name': 'loss',
            'goal': 'minimize'
        },
        'parameters': {
            'learning_rate': {
                'min': 0.001,
                'max': 0.1
            },
            'optimizer': {
                'values': ['adam', 'sgd']
            }
        }
    }
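Before handing a configuration like this to wandb.sweep, it can help to sanity-check its shape locally. The sketch below is not part of wandb: check_sweep_config is a hypothetical helper that only looks at the keys used in the example above (method, metric, parameters, min/max, values) and does not cover the full sweep configuration schema.

```python
from typing import Any, Dict, List


def check_sweep_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a sweep config (empty list if none).

    Hypothetical helper: it checks only the keys that appear in the example.
    """
    problems: List[str] = []
    for key in ("method", "parameters"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    # Bayesian search needs a metric to optimize.
    if config.get("method") == "bayes" and "metric" not in config:
        problems.append("method 'bayes' requires a 'metric' entry")
    for name, spec in config.get("parameters", {}).items():
        has_range = "min" in spec and "max" in spec
        has_values = "values" in spec
        if not (has_range or has_values):
            problems.append(f"parameter '{name}' needs 'values' or both 'min' and 'max'")
        elif has_range and spec["min"] >= spec["max"]:
            problems.append(f"parameter '{name}' has min >= max")
    return problems


example_config = {
    "method": "bayes",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 0.001, "max": 0.1},
        "optimizer": {"values": ["adam", "sgd"]},
    },
}

print(check_sweep_config(example_config))
print(check_sweep_config({"method": "bayes", "parameters": {"lr": {"min": 1, "max": 0}}}))
```

Running the check once on the master process, before the sweep is created, keeps a malformed configuration from producing a sweep that no agent can use.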
By organizing your code to ensure the sweep is created and handled correctly across all GPU processes, you should be able to use Weights & Biases sweeps effectively in your multi-GPU training setup. This approach will minimize redundancy in sweep ID creation and make your hyperparameter tuning more efficient and easier to manage.
Hopefully, this helps! Let me know if there is anything else I can assist you with 🙂
Jason Davenport commented:
Hi Internal,
We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.
Best,
Weights & Biases
Jason Davenport commented:
Hi Internal, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
from wandb.
Sorry about that, I've been busy with other things lately. Once that's done, the first thing I'll do is try out what you've suggested, and I'll let you know right away about any further problems or successful runs. Thank you very much for your help!