
Comments (21)

z-x-yang commented on August 30, 2024

self.DIST_ENABLE is necessary for multi-GPU training.


bhack commented on August 30, 2024

I meant that I am trying to test on a single GPU.
In many places cfg.DIST_ENABLE is checked so that the non-distributed code path is taken safely.

But in many other places it is not:

self.train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset)
self.train_loader = DataLoader(train_dataset,
                               batch_size=int(cfg.TRAIN_BATCH_SIZE /
                                              cfg.TRAIN_GPUS),
                               shuffle=False,
                               num_workers=cfg.DATA_WORKERS,
                               pin_memory=True,
                               sampler=self.train_sampler,
                               drop_last=True,
                               prefetch_factor=4)
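
A minimal sketch of the kind of guard I have in mind (the shuffle fallback is my own suggestion, not code from the repository):

import torch
from torch.utils.data import DataLoader

# Only build a DistributedSampler when distributed training is enabled;
# otherwise fall back to the DataLoader's own shuffling.
if cfg.DIST_ENABLE:
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset)
    shuffle = False  # the sampler already shuffles per epoch
else:
    train_sampler = None
    shuffle = True

train_loader = DataLoader(train_dataset,
                          batch_size=int(cfg.TRAIN_BATCH_SIZE /
                                         cfg.TRAIN_GPUS),
                          shuffle=shuffle,
                          num_workers=cfg.DATA_WORKERS,
                          pin_memory=True,
                          sampler=train_sampler,
                          drop_last=True,
                          prefetch_factor=4)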


bhack commented on August 30, 2024

E.g. here, by contrast, the code does check cfg.DIST_ENABLE:

if cfg.DIST_ENABLE:
    dist.init_process_group(backend=cfg.DIST_BACKEND,
                            init_method=cfg.DIST_URL,
                            world_size=cfg.TRAIN_GPUS,
                            rank=rank,
                            timeout=datetime.timedelta(seconds=300))
    self.model.encoder = nn.SyncBatchNorm.convert_sync_batchnorm(
        self.model.encoder).cuda(self.gpu)
    self.dist_engine = torch.nn.parallel.DistributedDataParallel(
        self.engine,
        device_ids=[self.gpu],
        output_device=self.gpu,
        find_unused_parameters=True,
        broadcast_buffers=False)
else:
    self.dist_engine = self.engine

self.use_frozen_bn = False
if 'swin' in cfg.MODEL_ENCODER:
    self.print_log('Use LN in Encoder!')
elif not cfg.MODEL_FREEZE_BN:
    if cfg.DIST_ENABLE:
        self.print_log('Use Sync BN in Encoder!')
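
As an aside, another workaround for single-GPU runs (just a sketch of my own, not the repository's approach) would be to initialize a trivial one-process group so the distributed utilities keep working even without real distributed training:

import datetime
import torch.distributed as dist

# Sketch: a single-process group. The backend and the local TCP port
# here are assumptions, not values from the repository's config.
if not dist.is_initialized():
    dist.init_process_group(backend='gloo',
                            init_method='tcp://127.0.0.1:12701',
                            world_size=1,
                            rank=0,
                            timeout=datetime.timedelta(seconds=300))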


z-x-yang commented on August 30, 2024

The distributed sampler is useless and meaningless for evaluation, where the GPUs run asynchronously rather than synchronously; the video lengths always differ from one GPU to another.
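
As a minimal sketch of the idea (illustrative only; split_sequences, all_sequences, and evaluate_sequence are placeholder names, not the repository's code), each GPU can simply take its own slice of the videos and run at its own pace:

# Each rank evaluates every world_size-th sequence, end to end,
# with no synchronization against the other ranks.
def split_sequences(seq_list, rank, world_size):
    return seq_list[rank::world_size]

for seq in split_sequences(all_sequences, rank, world_size):
    evaluate_sequence(seq)  # placeholder per-video evaluation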


bhack commented on August 30, 2024

Isn't this in the trainer?
I am trying to run a single-GPU training job with cfg.DIST_ENABLE=False.


z-x-yang commented on August 30, 2024


bhack commented on August 30, 2024

Yes, that is what I meant. Aren't we going to have issues if we don't conditionally wrap torch.nn.parallel.DistributedDataParallel in the trainer?


z-x-yang commented on August 30, 2024


bhack commented on August 30, 2024

Isn't that one going to require init_process_group? But that call is conditionally guarded in the trainer:

if cfg.DIST_ENABLE:
    dist.init_process_group(backend=cfg.DIST_BACKEND,
                            init_method=cfg.DIST_URL,
                            world_size=cfg.TRAIN_GPUS,
                            rank=rank,
                            timeout=datetime.timedelta(seconds=300))


z-x-yang commented on August 30, 2024

Sorry, I don't understand what you mean.

bhack commented on August 30, 2024

E.g. with self.DIST_ENABLE = False in configs/default.py, we fail directly at:

self.train_sampler = torch.utils.data.distributed.DistributedSampler(

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Cause:
#36 (comment)

This is because the cfg.DIST_ENABLE safeguards, and the corresponding alternative code path, do not yet cover everything.
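
A minimal standalone repro of the failure (my own sketch):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.zeros(8, 3))

# Without a prior dist.init_process_group(...) call, DistributedSampler
# asks for the default process group's world size and raises:
# RuntimeError: Default process group has not been initialized, ...
sampler = DistributedSampler(dataset)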


bhack commented on August 30, 2024

Sorry, I don't understand what you mean.

Is it clear now?


z-x-yang commented on August 30, 2024

I have updated trainer.py, and it should now be OK to set self.DIST_ENABLE = False and train with a single GPU.


bhack commented on August 30, 2024

Thanks, I've checked your changes and they match what I had done locally over the past few days.

What do you think now about this (pytorch/pytorch#37444)?

# Use torch.multiprocessing.spawn to launch distributed processes
mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
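
For the single-GPU case it might be simplest to bypass the spawn entirely. Just a sketch, assuming main_worker(rank, cfg, enable_amp) keeps its current signature:

import torch.multiprocessing as mp

if cfg.DIST_ENABLE and cfg.TRAIN_GPUS > 1:
    # One process per GPU for distributed training.
    mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
else:
    # Single-GPU run: call the worker directly in this process and
    # avoid multiprocessing (and its spawn issues) altogether.
    main_worker(0, cfg, args.amp)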


bhack commented on August 30, 2024

Besides my previous comment, I think we still have an issue with the keys when DIST_ENABLE=False:

for key in boards['image'].keys():
    tmp = boards['image'][key].cpu().numpy()
    self.tblogger.add_image('S{}/' + key, tmp, step)
for key in boards['scalar'].keys():
    tmp = boards['scalar'][key].cpu().numpy()
    self.tblogger.add_scalar('S{}/' + key, tmp, step)

    for key in boards['image'].keys():
AttributeError: 'list' object has no attribute 'keys'
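
A defensive sketch of what I'd expect the logging loop to tolerate (assuming boards['image'] arrives as a list of dicts when DIST_ENABLE=False, which is my reading of the error, not something verified against the code; as_dicts is a placeholder helper):

def as_dicts(entry):
    # Accept either a single dict or a list of dicts, iterate uniformly.
    return entry if isinstance(entry, list) else [entry]

for board in as_dicts(boards['image']):
    for key, value in board.items():
        self.tblogger.add_image('S{}/' + key, value.cpu().numpy(), step)
for board in as_dicts(boards['scalar']):
    for key, value in board.items():
        self.tblogger.add_scalar('S{}/' + key, value.cpu().numpy(), step)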


z-x-yang commented on August 30, 2024

It should be ok now.


bhack commented on August 30, 2024

It seems that we have two issues:

  • The first is that the trainer seems to get "randomly" deadlocked on different runs of the same code.
  • The second is that the images in img_logs no longer seem correct: they are only the binary masks, whereas they used to be "composited", if I remember correctly.


bhack commented on August 30, 2024

For the first issue, this is the stack trace of one of the deadlocks, and it could be related to #36 (comment):

  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)


z-x-yang commented on August 30, 2024

I guess the first problem is due to torch.spawn, and I have modified the code related to it. Please give it a try; I hope this works for you.

As to the second issue, img_logs should be images in which the target objects are marked with colorful masks, referring to here.


bhack commented on August 30, 2024

I guess the first problem is due to torch.spawn, and I have modified the code related to it. Please give it a try; I hope this works for you.

Yes, it seems ok now.

As to the second issue, img_logs should be images in which the target objects are marked with colorful masks, referring to here.

Yes, sorry, this was a false positive related to the image value range in the compositing phase; thanks for the double check.
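
For reference, this is roughly the kind of compositing I was checking (a sketch only; overlay_mask is a placeholder, and the point is keeping the image in [0, 1] before blending):

import numpy as np

def overlay_mask(image, mask, color=(1.0, 0.0, 0.0), alpha=0.5):
    # image: HxWx3 float array; mask: HxW boolean array.
    # If the image arrives in [0, 255] instead of [0, 1], the blend
    # saturates and the output looks like a bare binary mask.
    if image.max() > 1.0:
        image = image / 255.0
    out = image.copy()
    out[mask] = (1 - alpha) * image[mask] + alpha * np.asarray(color)
    return out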


bhack commented on August 30, 2024

The same needs to be fixed in the PAOT branch.

