
Comments (7)

mathildecaron31 commented on August 25, 2024

Hi @pelletierlab,
To train on 1 GPU, I run:
python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

For faster runs, you could use the --arch deit_tiny architecture instead of --arch deit_small.

from dino.

iperov commented on August 25, 2024

I got it running with 1 GPU on Windows.

In utils.py:

dist.init_process_group(
    backend="gloo",  # <-- change to gloo
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank,
)

args.gpu = 0  # force the GPU index, because the wrong index is detected by default

Delete the file used for --dist_url before launching:

del D:/somefile.asd

python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --dist_url "file://D:/somefile.asd" --arch deit_small --data_path "faces" --output_dir "dino_save"

If you run out of memory, decrease --batch_size_per_gpu.
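The delete-then-launch steps above could be scripted roughly as follows. This is a minimal sketch, not part of the DINO repository: the helper names `fresh_file_dist_url` and `build_launch_command` and the temporary file path are hypothetical, and the URL is formed by simple concatenation, as in the command above.

```python
import os
import tempfile


def fresh_file_dist_url(path: str) -> str:
    """Remove a leftover rendezvous file and return a file:// URL for it.

    Hypothetical helper that mirrors the manual `del` step before launching.
    """
    if os.path.exists(path):
        os.remove(path)  # delete a stale file left by an earlier run
    return "file://" + path


def build_launch_command(dist_url: str, data_path: str, output_dir: str) -> list:
    """Assemble the single-GPU launch command from the post as an argument list."""
    return [
        "python", "-m", "torch.distributed.launch", "--nproc_per_node=1",
        "main_dino.py",
        "--dist_url", dist_url,
        "--arch", "deit_small",
        "--data_path", data_path,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Placeholder path in the temp directory; substitute your own.
    rendezvous = os.path.join(tempfile.gettempdir(), "dino_rendezvous.tmp")
    url = fresh_file_dist_url(rendezvous)
    print(" ".join(build_launch_command(url, "faces", "dino_save")))
```

The argument list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with paths that contain spaces.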

woctezuma commented on August 25, 2024

This is going to take a long time.

According to the README, the vanilla run requires ~2 days with 8 GPUs:

Run DINO with DeiT-small network on a single node with 8 GPUs for 100 epochs with the following command.
Training time is 1.75 day and the resulting checkpoint should reach ~69.3% on k-NN eval and ~73.8% on linear eval.

and the boosted model requires ~3 days with 16 GPUs, i.e. 3x the epochs on 2x the GPUs:

You can improve the performance of the vanilla run by:
[...]
-> training for more epochs: --epochs 300
[...]
The resulting pretrained model should reach ~73.4% on k-NN eval and ~76.1% on linear eval.
Training time is 2.6 days with 16 GPUs.
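
To put the two README figures on a common footing, the arithmetic below converts them to GPU-days: 1.75 days on 8 GPUs is 14 GPU-days, and 2.6 days on 16 GPUs is 41.6 GPU-days. Reading those numbers as single-GPU wall-clock time assumes perfectly linear scaling with unchanged settings, which is an assumption made here for illustration and not something the README states.

```python
def gpu_days(wall_clock_days: float, num_gpus: int) -> float:
    """Total compute in GPU-days for a run of the given length and GPU count."""
    return wall_clock_days * num_gpus


# Figures quoted from the README above.
vanilla = gpu_days(1.75, 8)   # 100 epochs on 8 GPUs
boosted = gpu_days(2.6, 16)   # 300 epochs on 16 GPUs

if __name__ == "__main__":
    # Under the linear-scaling assumption, GPU-days equal days on one GPU.
    print(f"vanilla: {vanilla:.1f} GPU-days")
    print(f"boosted: {boosted:.1f} GPU-days")
```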

Salah856 commented on August 25, 2024

interested

mathildecaron31 commented on August 25, 2024

See 534f37f

You should now be able to run on 1 GPU directly with the following command:
python main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

However, I still recommend using torch.distributed.launch.
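
For context, torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to each worker, and a script started directly sees none of them. Below is a minimal sketch of falling back to single-process defaults when they are absent. The function name, the returned keys, and the default address and port are illustrative assumptions; this is not a copy of the change in 534f37f.

```python
import os
from typing import Mapping, Optional


def resolve_dist_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return rank/world-size settings, defaulting to one process if unset.

    Hypothetical helper: reads the variables a launcher would export and
    otherwise falls back to a single local process.
    """
    env = os.environ if environ is None else environ
    if "RANK" in env and "WORLD_SIZE" in env:
        # Started through a launcher: trust the values it exported.
        return {
            "rank": int(env["RANK"]),
            "world_size": int(env["WORLD_SIZE"]),
            "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
            "master_port": env.get("MASTER_PORT", "29500"),
            "launched": True,
        }
    # Started directly with `python script.py`: assume one local process.
    return {
        "rank": 0,
        "world_size": 1,
        "master_addr": "127.0.0.1",  # assumed default for illustration
        "master_port": "29500",      # assumed default for illustration
        "launched": False,
    }


if __name__ == "__main__":
    print(resolve_dist_settings())
```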

ramdhan1989 commented on August 25, 2024

I am using Windows with PyTorch 1.5.0 and have only 1 GPU. I tried the suggestions above but got the error below.
I ran this:
python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path C:/Users/Owner/shopee/product_detection/train/train --output_dir checkpoints

Traceback (most recent call last):
  File "main_dino.py", line 461, in <module>
    train_dino(args)
  File "main_dino.py", line 131, in train_dino
    utils.init_distributed_mode(args)
  File "D:\Ramdhan\SSL\dino-main\utils.py", line 456, in init_distributed_mode
    dist.init_process_group(
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 263, in <module>
    main()
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Owner\\Anaconda3\\envs\\nlp\\python.exe', '-u', 'main_dino.py', '--local_rank=0', '--data_path', 'C:/Users/Owner/shopee/product_detection/train/train', '--output_dir', 'checkpoints']' returned non-zero exit status 1.

Please advise: is there any way to run it on Windows with 1 GPU?
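
One possible cause, which this thread does not confirm, is a PyTorch build compiled without distributed support; in such a build torch.distributed can be imported but lacks functions such as init_process_group. The sketch below checks for that with torch.distributed.is_available(). The helper name and the returned strings are hypothetical, and the import is wrapped so the snippet also runs where PyTorch is not installed.

```python
import importlib


def distributed_support() -> str:
    """Report whether this Python environment has usable torch.distributed."""
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return "torch not installed"
    if dist.is_available():
        return "available"
    # The module imports, but the build has no distributed backend compiled in.
    return "not available in this build"


if __name__ == "__main__":
    print(distributed_support())
```

If the result is "not available in this build", changing the backend or the launch command is unlikely to help, and a PyTorch build with distributed support would be needed.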

LLL-YUE commented on August 25, 2024

Hi @pelletierlab To train on 1 GPU I run python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

To have faster runs, you could use --arch deit_tiny architecture instead of --arch deit_small

I tried this command but got the error: RuntimeError: No rendezvous handler for env://
Could you tell me how to solve this problem?
Thank you!
