reminder's Issues

Problems about running with single GPU

Thanks for uploading the code. Recently I have been trying to run it. Below is my command:
python -m torch.distributed.launch --nproc_per_node=1 run.py --data_root data --batch_size 8 --dataset voc --name REMINDER --task 15-5s --step 0 --lr 0.01 --epochs 30 --method REMINDER
and I got the following error:

    epoch_loss = trainer.train(
  File "train.py", line 214, in train
    model.module.in_eval = False
  File "/home/ddd/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'IncrementalSegmentationModule' object has no attribute 'module'

The problem seems to be that I am using a single GPU. Your documentation notes that I need to replace 'key' with 'key[:7]', but I also notice that line 38 of 'segmentation_module.py' already contains 'key[:7]'.

Could you please give me more details about the required modification? Thank you very much!
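
For reference, the AttributeError happens because train.py accesses model.module, which only exists when the model is wrapped in (apex or torch) DistributedDataParallel; on a single GPU the model may stay unwrapped. A minimal sketch of a common workaround, not the repository's documented fix, is to resolve the underlying module defensively:

    import torch.nn as nn

    def unwrap(model):
        """Return the underlying module whether or not the model is wrapped in DDP."""
        return model.module if hasattr(model, "module") else model

    # Example: on a single GPU the model may be a plain nn.Module, so
    # `model.module.in_eval = False` raises AttributeError; this works in both cases.
    model = nn.Conv2d(3, 8, 3)          # stand-in for IncrementalSegmentationModule
    unwrap(model).in_eval = False       # sets the flag on the unwrapped module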

Some question about Class similarity weighted knowledge distillation?

Hi, great work! I have some questions about the class-similarity weighted knowledge distillation.
Regarding Eq. 8: (1) Is the weight assigned to the output when the label belongs to the new classes?
(2) When the pseudo label belongs to the old classes and the similarity s is larger than the threshold, is the output O_{i,v}^{t-1} reweighted?

How to address this issue?

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$ /bin/bash /media/tao/新加卷/Osman/REMINDER-main/train_voc_19-1.sh
voc_19-1_REMINDER On GPUs 0,1 Writing in results/2023-12-24_voc_19-1_REMINDER.csv
Begin training!
Begin training!
Learning for 1 with lrs=[0.01].
Learning for 1 with lrs=[0.01].
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use PyTorch AMP
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Traceback (most recent call last):
  File "run.py", line 587, in <module>
    main(opts)
  File "run.py", line 158, in main
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 299, in run_step
    model = DistributedDataParallel(model, delay_allreduce=True)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in __init__
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 65000

(Rank 1 prints the same traceback, interleaved with the one above, ending with: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 65000)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15679) of binary: /home/tao/anaconda3/envs/pytorch/bin/python3
Traceback (most recent call last):
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
[1]:
time : 2023-12-24_19:50:32
host : tao
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 15680)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-24_19:50:32
host : tao
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15679)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$
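
The "Duplicate GPU detected" message usually means both launched processes ended up on the same CUDA device. A hedged sketch of the usual remedy, generic torch.distributed/NCCL practice rather than anything specific to run.py: pin each process to its local rank before any NCCL collective runs, or restrict visible devices when launching.

    import os
    import torch
    import torch.distributed as dist

    # Hypothetical snippet: pin each spawned process to its own GPU.
    # torch.distributed.launch exposes the local rank via the LOCAL_RANK env var
    # (or the --local_rank argument on older versions).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)          # must happen before NCCL collectives
    dist.init_process_group(backend="nccl")    # MASTER_ADDR/PORT are set by the launcher
    device = torch.device("cuda", local_rank)  # move the model/tensors to this device

Setting CUDA_VISIBLE_DEVICES=0,1 when launching can also help make sure the two ranks see two distinct devices.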

Question about the background class.

Hello,
In the Pascal VOC dataset, class 0 is the background.
Does REMINDER build a prototype for class 0 as well?
Also, when calculating the similarity, are the new classes compared against class 0?
If not, I would like to know which part of your code excludes class 0.

Thank you~
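
For context, this is the kind of check being asked about: whether the background prototype is skipped when forming class-to-class similarities. A purely illustrative sketch (the names prototypes and new_feats and the index-0-is-background layout are assumptions, not the repository's API):

    import torch
    import torch.nn.functional as F

    def similarity_to_old(prototypes, new_feats, skip_background=True):
        """Cosine similarity between new-class prototypes and old-class prototypes,
        optionally skipping index 0 (background). Shapes are hypothetical."""
        old = prototypes[1:] if skip_background else prototypes   # drop class 0 if requested
        old_n = F.normalize(old, dim=1)
        new_n = F.normalize(new_feats, dim=1)
        return new_n @ old_n.t()                                  # [num_new, num_old]

    old_protos = torch.randn(16, 256)   # background at index 0 + 15 old classes (hypothetical)
    new_protos = torch.randn(5, 256)    # 5 new classes
    print(similarity_to_old(old_protos, new_protos).shape)  # torch.Size([5, 15])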

errata in /ade/reminder sh files

In the sh files for the ADE dataset (reminder_100_50.sh, reminder_50.sh, reminder_100-10.sh, reminder_100-10.sh),

METHOD is set to FT rather than REMINDER
(METHOD=FT).

Is this an erratum?

Issue Regarding Table 5

Hi @HieuPhan33, I have tried completely removing the additional KD loss, i.e. with CSW-KD removed. The results shown below are still higher than UNKD and Knowledge Distillation in Table 5. I therefore doubt the authenticity of the ablation in Table 5: it seems unlikely to get such low results, since the intermediate-layer KD and the final small-scale output KD proposed by PLOP are still in place.

                 0-15    16-20   all
Reported         68.30   27.23   58.52
Reproduced       66.37   27.03   57.00
CSW-KD removed   61.97   27.87   53.85

cannot reproduce VOC 15-1

I ran the script reminder_15-1.sh directly, but obtained lower results on the old classes. I can reproduce 15-5 and 19-1, though. Looking forward to your advice.

             0-15    16-20   all
Reported     68.30   27.23   58.52
Reproduced   66.37   27.03   57.00

Cannot reproduce ADE20k

Hi, I tried to reproduce the ADE20k results shown below. I reuse the step 0 checkpoint I trained with PLOP (mIoU 41.98), since I believe REMINDER shares the same step 0 training setup as PLOP. I attach the scripts reminder_100-50.sh and reminder_100-10.sh I used for your reference. I notice that you use the linear lr scheduling strategy when you scale down the batch size to 10 per GPU; for consistency, I adopted this strategy across all settings with lr 0.0008. Additionally, I trained with 2 GPUs, as PLOP does. Looking forward to your advice.

             100-50                    100-10                    50-50
             0-100   101-150  all      0-100   101-150  all      0-50    51-150  all
Reported     41.55   19.16    34.14    38.96   21.28    33.11    47.11   20.35   29.39
Reproduced   41.91   16.10    33.36    38.43   15.56    30.86    45.56   18.49   27.63

issue when compute_prototype

At line 26 of train.py, we can see:

    seg = seg.view(-1)
    features = features.transpose(1, 3).contiguous().view(B * H * W, -1)
    for c in classes:
        selected_features = features[seg == c]

Before these lines run, the shape of seg is [B, H, W] and features is [B, C, H, W]. After '.transpose(1, 3)', seg is flattened to [B*H*W], but features becomes [B*W*H, C]; is that correct? I think the selected features then no longer correspond to seg element by element, because the shape of features is [B, W, H, C] rather than [B, H, W, C] after '.transpose(1, 3)'. Should transpose(1, 3) be replaced by permute(0, 2, 3, 1)?
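
To make the concern concrete, here is a small self-contained check, independent of the repository's code, showing that transpose(1, 3) produces a [B, W, H, C] layout whose flattened rows do not follow seg.view(-1)'s pixel order, whereas permute(0, 2, 3, 1) gives the matching [B, H, W, C] layout:

    import torch

    B, C, H, W = 2, 4, 3, 5
    features = torch.randn(B, C, H, W)
    seg = torch.randint(0, 3, (B, H, W))

    # transpose(1, 3): [B, C, H, W] -> [B, W, H, C], i.e. H and W are swapped
    flat_t = features.transpose(1, 3).contiguous().view(B * H * W, -1)
    # permute(0, 2, 3, 1): [B, C, H, W] -> [B, H, W, C], matching seg's [B, H, W] order
    flat_p = features.permute(0, 2, 3, 1).contiguous().view(B * H * W, -1)

    # Pick pixel (b, h, w); its flattened index in seg.view(-1) is b*H*W + h*W + w
    b, h, w = 1, 1, 3
    idx = b * H * W + h * W + w
    print((seg.view(-1)[idx] == seg[b, h, w]).item())                  # True
    print(torch.allclose(flat_p[idx], features[b, :, h, w]))           # True: permute matches seg's ordering
    print(torch.allclose(flat_t[idx], features[b, :, h, w]))           # False: transpose(1, 3) reorders pixels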

Class similarity weighted KD makes no differences

Hi, I tried running without your proposed CSW loss by removing --csw_kd ${LOSS} --delta_csw 1.0. Strangely, I obtained almost the same results on ADE20k across different settings. Could you advise?

Got NaN in prototype

When I ran the 15-1 training code, I got the "NaN in prototype" error. For now I have no idea how to solve it. Has anyone else met this problem?
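
For what it's worth, a common source of NaN in a prototype computed as a mean of per-class features is a class that has no pixels in the current batch, which turns the mean into 0/0. A purely hypothetical guard (not taken from this repository) looks like this:

    import torch

    # features: [N, C] flattened pixel features, seg: [N] flattened labels (hypothetical shapes)
    def class_prototypes(features, seg, classes):
        protos = {}
        for c in classes:
            selected = features[seg == c]
            if selected.numel() == 0:
                # No pixels of class c in this batch: skip it rather than averaging an
                # empty tensor, which would produce NaN.
                continue
            protos[c] = selected.mean(dim=0)
        return protos

    feats = torch.randn(12, 256)
    labels = torch.randint(0, 3, (12,))
    print(class_prototypes(feats, labels, classes=[0, 1, 2, 7]).keys())  # class 7 is absent, so it is skipped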

RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [10, 512, 512, 3]

When I run REMINDER/scripts/ade/reminder_100-10.sh,

I get the error below:
Traceback (most recent call last):
  File "run.py", line 585, in <module>
    main(opts)
  File "run.py", line 156, in main
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 427, in run_step
    logger=logger
  File "/root/REMINDER/train.py", line 362, in train
    loss = criterion(outputs, labels)  # B x H x W
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 948, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 2220, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [10, 512, 512, 3]

When I checked the shapes, labels.shape = [10, 512, 512, 3] but outputs.shape = [10, 101, 512, 512].

How can I make the labels match the outputs?
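
For context, nll_loss2d expects integer class-index targets of shape [B, H, W], while [10, 512, 512, 3] looks like annotations that were loaded as 3-channel RGB images. A hedged sketch of the usual remedy (assuming the ADE20k annotations are single-channel index PNGs; the helper below is illustrative, not the repository's loader):

    from PIL import Image
    import numpy as np
    import torch

    # Hypothetical check/fix: targets must be [H, W] integer index maps, not [H, W, 3] RGB,
    # before they are batched and passed to cross_entropy.
    def load_target(path):
        mask = np.array(Image.open(path))        # ADE20k annotations are single-channel index PNGs
        if mask.ndim == 3:                       # the mask was (accidentally) loaded as RGB
            mask = mask[..., 0]                  # keep one channel, assuming all three are identical
        return torch.from_numpy(mask.astype(np.int64))  # [H, W]

If a dataset wrapper or transform is converting the annotation to RGB (for example a .convert('RGB') applied to both image and mask), removing that conversion for the mask is the more direct fix.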
