reminder's Issues

Problems about running with single GPU

Thanks for uploading the code. Recently I have been trying to run it. Below is my command:
python -m torch.distributed.launch --nproc_per_node=1 run.py --data_root data --batch_size 8 --dataset voc --name REMINDER --task 15-5s --step 0 --lr 0.01 --epochs 30 --method REMINDER
and I got the following error:

    epoch_loss = trainer.train(
  File "train.py", line 214, in train
    model.module.in_eval = False
  File "/home/ddd/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'IncrementalSegmentationModule' object has no attribute 'module'

The problem seems to be that I am using a single GPU. Your documentation notes that I need to replace 'key' with 'key[:7]', but I also notice that line 38 of 'segmentation_module.py' already contains 'key[:7]'.

Could you please give me more details about the required modification? Thank you very much!
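
For reference, the AttributeError happens because train.py accesses model.module, which only exists when the model is wrapped in (apex or torch) DistributedDataParallel; on a single GPU the model may stay unwrapped. A minimal sketch of a common workaround, not the repository's documented fix, is to resolve the underlying module defensively:

    import torch.nn as nn

    def unwrap(model):
        """Return the underlying module whether or not the model is wrapped in DDP."""
        return model.module if hasattr(model, "module") else model

    # Example: on a single GPU the model may be a plain nn.Module, so
    # `model.module.in_eval = False` raises AttributeError; this works in both cases.
    model = nn.Conv2d(3, 8, 3)          # stand-in for IncrementalSegmentationModule
    unwrap(model).in_eval = False       # sets the flag on the unwrapped module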

Some question about Class similarity weighted knowledge distillation?

Hi, great work! I have some questions about the class-similarity weighted knowledge distillation.
Regarding Eq. 8: (1) Is the weight assigned to the output when the label belongs to the new classes?
(2) When the pseudo label belongs to the old classes and the similarity s is larger than the threshold, is the output O_{i,v}^{t-1} reweighted?

How to address this issue?

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$ /bin/bash /media/tao/新加卷/Osman/REMINDER-main/train_voc_19-1.sh
voc_19-1_REMINDER On GPUs 0,1 Writing in results/2023-12-24_voc_19-1_REMINDER.csv
Begin training!
Begin training!
Learning for 1 with lrs=[0.01].
Learning for 1 with lrs=[0.01].
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use PyTorch AMP
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Traceback (most recent call last):
  File "run.py", line 587, in <module>
    main(opts)
  File "run.py", line 158, in main
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 299, in run_step
    model = DistributedDataParallel(model, delay_allreduce=True)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in __init__
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 65000

(Rank 1 prints the same traceback, interleaved with the one above, ending with: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 65000)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15679) of binary: /home/tao/anaconda3/envs/pytorch/bin/python3
Traceback (most recent call last):
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
[1]:
time : 2023-12-24_19:50:32
host : tao
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 15680)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-24_19:50:32
host : tao
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15679)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$
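
The "Duplicate GPU detected" message usually means both launched processes ended up on the same CUDA device. A hedged sketch of the usual remedy, generic torch.distributed/NCCL practice rather than anything specific to run.py: pin each process to its local rank before any NCCL collective runs, or restrict visible devices when launching.

    import os
    import torch
    import torch.distributed as dist

    # Hypothetical snippet: pin each spawned process to its own GPU.
    # torch.distributed.launch exposes the local rank via the LOCAL_RANK env var
    # (or the --local_rank argument on older versions).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)          # must happen before NCCL collectives
    dist.init_process_group(backend="nccl")    # MASTER_ADDR/PORT are set by the launcher
    device = torch.device("cuda", local_rank)  # move the model/tensors to this device

Setting CUDA_VISIBLE_DEVICES=0,1 when launching can also help make sure the two ranks see two distinct devices.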

Question about the background class.

Hello,
In the Pascal VOC dataset, class 0 is the background.
Does REMINDER build a prototype for class 0 as well?
Also, when calculating the similarity, are the new classes compared against class 0?
If not, I would like to know which part of your code excludes class 0.

Thank you~
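
For context, this is the kind of check being asked about: whether the background prototype is skipped when forming class-to-class similarities. A purely illustrative sketch (the names prototypes and new_feats and the index-0-is-background layout are assumptions, not the repository's API):

    import torch
    import torch.nn.functional as F

    def similarity_to_old(prototypes, new_feats, skip_background=True):
        """Cosine similarity between new-class prototypes and old-class prototypes,
        optionally skipping index 0 (background). Shapes are hypothetical."""
        old = prototypes[1:] if skip_background else prototypes   # drop class 0 if requested
        old_n = F.normalize(old, dim=1)
        new_n = F.normalize(new_feats, dim=1)
        return new_n @ old_n.t()                                  # [num_new, num_old]

    old_protos = torch.randn(16, 256)   # background at index 0 + 15 old classes (hypothetical)
    new_protos = torch.randn(5, 256)    # 5 new classes
    print(similarity_to_old(old_protos, new_protos).shape)  # torch.Size([5, 15])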

errata in /ade/reminder sh files

In the sh files for the ADE dataset (reminder_100_50.sh, reminder_50.sh, reminder_100-10.sh, reminder_100-10.sh),

METHOD is set to FT rather than REMINDER
(METHOD=FT).

Is this an erratum?

Issue Regarding Table 5

Hi @HieuPhan33, I have tried completely removing the additional KD loss, i.e. with CSW-KD removed. The results shown below are still higher than UNKD and Knowledge Distillation in Table 5. I therefore doubt the authenticity of the ablation in Table 5: it seems unlikely to get such low results, since the intermediate-layer KD and the final small-scale output KD proposed by PLOP are still in place.

                 0-15    16-20   all
Reported         68.30   27.23   58.52
Reproduced       66.37   27.03   57.00
CSW-KD removed   61.97   27.87   53.85

cannot reproduce VOC 15-1

I ran the script reminder_15-1.sh directly, but obtained lower results on the old classes. I can reproduce 15-5 and 19-1, though. Looking forward to your advice.

             0-15    16-20   all
Reported     68.30   27.23   58.52
Reproduced   66.37   27.03   57.00

Cannot reproduce ADE20k

Hi, I tried to reproduce the ADE20k results shown below. I reuse the step 0 checkpoint I trained with PLOP (mIoU 41.98), since I believe REMINDER shares the same step 0 training setup as PLOP. I attach the scripts reminder_100-50.sh and reminder_100-10.sh I used for your reference. I notice that you use the linear lr scheduling strategy when you scale down the batch size to 10 per GPU; for consistency, I adopted this strategy across all settings with lr 0.0008. Additionally, I trained with 2 GPUs, as PLOP does. Looking forward to your advice.

             100-50                    100-10                    50-50
             0-100   101-150  all      0-100   101-150  all      0-50    51-150  all
Reported     41.55   19.16    34.14    38.96   21.28    33.11    47.11   20.35   29.39
Reproduced   41.91   16.10    33.36    38.43   15.56    30.86    45.56   18.49   27.63

issue when compute_prototype

At line 26 of train.py, we can see:

    seg = seg.view(-1)
    features = features.transpose(1, 3).contiguous().view(B * H * W, -1)
    for c in classes:
        selected_features = features[seg == c]

Before these lines run, the shape of seg is [B, H, W] and features is [B, C, H, W]. After '.transpose(1, 3)', seg is flattened to [B*H*W], but features becomes [B*W*H, C]; is that correct? I think the selected features then no longer correspond to seg element by element, because the shape of features is [B, W, H, C] rather than [B, H, W, C] after '.transpose(1, 3)'. Should transpose(1, 3) be replaced by permute(0, 2, 3, 1)?
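
To make the concern concrete, here is a small self-contained check, independent of the repository's code, showing that transpose(1, 3) produces a [B, W, H, C] layout whose flattened rows do not follow seg.view(-1)'s pixel order, whereas permute(0, 2, 3, 1) gives the matching [B, H, W, C] layout:

    import torch

    B, C, H, W = 2, 4, 3, 5
    features = torch.randn(B, C, H, W)
    seg = torch.randint(0, 3, (B, H, W))

    # transpose(1, 3): [B, C, H, W] -> [B, W, H, C], i.e. H and W are swapped
    flat_t = features.transpose(1, 3).contiguous().view(B * H * W, -1)
    # permute(0, 2, 3, 1): [B, C, H, W] -> [B, H, W, C], matching seg's [B, H, W] order
    flat_p = features.permute(0, 2, 3, 1).contiguous().view(B * H * W, -1)

    # Pick pixel (b, h, w); its flattened index in seg.view(-1) is b*H*W + h*W + w
    b, h, w = 1, 1, 3
    idx = b * H * W + h * W + w
    print((seg.view(-1)[idx] == seg[b, h, w]).item())                  # True
    print(torch.allclose(flat_p[idx], features[b, :, h, w]))           # True: permute matches seg's ordering
    print(torch.allclose(flat_t[idx], features[b, :, h, w]))           # False: transpose(1, 3) reorders pixels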

Class similarity weighted KD makes no differences

Hi, I tried running without your proposed CSW loss by removing --csw_kd ${LOSS} --delta_csw 1.0. Strangely, I obtained almost the same results on ADE20k across different settings. Could you advise?

Got NaN in prototype

When I ran the 15-1 training code, I got the "NaN in prototype" error. For now I have no idea how to solve it. Has anyone else met this problem?
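
For what it's worth, a common source of NaN in a prototype computed as a mean of per-class features is a class that has no pixels in the current batch, which turns the mean into 0/0. A purely hypothetical guard (not taken from this repository) looks like this:

    import torch

    # features: [N, C] flattened pixel features, seg: [N] flattened labels (hypothetical shapes)
    def class_prototypes(features, seg, classes):
        protos = {}
        for c in classes:
            selected = features[seg == c]
            if selected.numel() == 0:
                # No pixels of class c in this batch: skip it rather than averaging an
                # empty tensor, which would produce NaN.
                continue
            protos[c] = selected.mean(dim=0)
        return protos

    feats = torch.randn(12, 256)
    labels = torch.randint(0, 3, (12,))
    print(class_prototypes(feats, labels, classes=[0, 1, 2, 7]).keys())  # class 7 is absent, so it is skipped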

RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [10, 512, 512, 3]

When I run REMINDER/scripts/ade/reminder_100-10.sh,

I get the error below:
Traceback (most recent call last):
  File "run.py", line 585, in <module>
    main(opts)
  File "run.py", line 156, in main
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 427, in run_step
    logger=logger
  File "/root/REMINDER/train.py", line 362, in train
    loss = criterion(outputs, labels)  # B x H x W
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 948, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 2220, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [10, 512, 512, 3]

When I checked the shapes, labels.shape = [10, 512, 512, 3] but outputs.shape = [10, 101, 512, 512].

How can I make the labels match the outputs?
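
For context, nll_loss2d expects integer class-index targets of shape [B, H, W], while [10, 512, 512, 3] looks like annotations that were loaded as 3-channel RGB images. A hedged sketch of the usual remedy (assuming the ADE20k annotations are single-channel index PNGs; the helper below is illustrative, not the repository's loader):

    from PIL import Image
    import numpy as np
    import torch

    # Hypothetical check/fix: targets must be [H, W] integer index maps, not [H, W, 3] RGB,
    # before they are batched and passed to cross_entropy.
    def load_target(path):
        mask = np.array(Image.open(path))        # ADE20k annotations are single-channel index PNGs
        if mask.ndim == 3:                       # the mask was (accidentally) loaded as RGB
            mask = mask[..., 0]                  # keep one channel, assuming all three are identical
        return torch.from_numpy(mask.astype(np.int64))  # [H, W]

If a dataset wrapper or transform is converting the annotation to RGB (for example a .convert('RGB') applied to both image and mask), removing that conversion for the mask is the more direct fix.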
