zhenzhao / augseg

[CVPR'23] Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation

Home Page: https://arxiv.org/abs/2212.04976

Python 96.88% Shell 3.12%

semi-supervised-learning semi-supervised-segmentation

augseg's Issues

Device-side assert problem

Hello, I ran into the following problem when running the code:
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
How should I resolve this?
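
One frequent cause of this error in segmentation training is a label value outside the range the loss expects, for example a class id greater than or equal to num_classes that is not the ignore value. That is only a guess for this report, since the issue does not show the failing line. Running once with the environment variable CUDA_LAUNCH_BLOCKING=1, or on CPU, usually gives a clearer traceback. The sketch below is a hypothetical helper (find_invalid_labels is not part of the repository) that lists offending values; 21 and 255 are the num_classes and ignore_label values from the VOC config printed in a log further down this page.

```python
def find_invalid_labels(label_values, num_classes, ignore_label=255):
    """Return the sorted label ids that fall outside [0, num_classes)."""
    invalid = set()
    for value in label_values:
        value = int(value)
        if value == ignore_label:
            continue  # pixels with the ignore value are skipped by the loss
        if value < 0 or value >= num_classes:
            invalid.add(value)
    return sorted(invalid)


# With a PyTorch label tensor, label.unique().tolist() can be passed in.
print(find_invalid_labels([0, 1, 20, 21, 254, 255], num_classes=21))
```

An empty result on every label file would point away from the labels and toward something else, such as a num_classes setting that does not match the dataset.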

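Training time

Hello, I ran the VOC experiments on two A6000 GPUs (each using 24 GB of memory), but training takes more than a day. Why does it take so long? Your logs show only two to three hours. My output log also has more Epoch/Iter entries than yours, and the per-class test at each epoch runs one extra round. What is the reason? My output file is attached below.
r50662.log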

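Hello, thank you very much for your work.
In the paper you emphasize that the purpose of strong augmentation is to produce prediction disagreement, but you do not explain much about why prediction disagreement improves performance.
I am not sure whether my understanding is right: as in unsupervised contrastive learning, removing the student-teacher (S-T) inconsistency under strong augmentation forces the student network to filter out the low-level information corrupted by the augmentation (such as color and texture) and to focus on extracting semantic information.
I hope you can answer this. Thanks!

BTW, in Equations 4 and 5 of the arXiv version of the paper, theta_s and theta_t seem to be swapped?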

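About reproducing the results

Hello! Thank you for your excellent work!
When reproducing with the source code, I found that the ResNet-101 VOC fine configuration with 92 labeled images only reaches 63.5 mIoU, significantly lower than the 71.09 reported in the paper. The config I used is the yaml file of the same experiment under training log. Did I do something wrong? Also, the 183-labeled setting can be reproduced in the same way.
Thanks!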

Adding a LICENSE

Hi, this is great work, and I'm excited to try it out! Would it be possible to add a LICENSE to this codebase?

Can't Run At All

Yo, I tried running your project. I am sad to tell you you have been diagnosed with extreme noobiosis. Yeah, now don't go about checking your little english to chinese dictionary to find this word, you won't get anything! Such NOOBS! Absolutely outrageous, man.

Tell me, you people ever heard of something called a requirements.txt file? Huh??? Ever? I had to install every single dependency waiting on errors for Module not Found... and even after doing all of that, it throws an error. Wait, I'll show you:

$ sh ./single_run.sh

./single_run.sh: 4: source: not found
/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[2024-03-14 19:49:39,700] torch.distributed.run: [WARNING] 
[2024-03-14 19:49:39,700] torch.distributed.run: [WARNING] *****************************************
[2024-03-14 19:49:39,700] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-14 19:49:39,700] torch.distributed.run: [WARNING] *****************************************
2024-03-14 19:49:41.386765: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-14 19:49:41.435739: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-14 19:49:41.476173: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-14 19:49:41.512059: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-14 19:49:41.523980: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-14 19:49:41.561311: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-14 19:49:41.573819: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-14 19:49:41.623285: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-14 19:49:42.237202: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-14 19:49:42.326103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-14 19:49:42.379277: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-14 19:49:42.600367: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
usage: train_semi.py [-h] [--config CONFIG] [--local_rank LOCAL_RANK] [--seed SEED] [--port PORT]
train_semi.py: error: unrecognized arguments: --local-rank=0
usage: train_semi.py [-h] [--config CONFIG] [--local_rank LOCAL_RANK] [--seed SEED] [--port PORT]
train_semi.py: error: unrecognized arguments: --local-rank=2
usage: train_semi.py [-h] [--config CONFIG] [--local_rank LOCAL_RANK] [--seed SEED] [--port PORT]
train_semi.py: error: unrecognized arguments: --local-rank=1
usage: train_semi.py [-h] [--config CONFIG] [--local_rank LOCAL_RANK] [--seed SEED] [--port PORT]
train_semi.py: error: unrecognized arguments: --local-rank=3
[2024-03-14 19:49:49,716] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 3248971) of binary: /DATA2/dse316/grp_007/.venv/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/DATA2/dse316/grp_007/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-14_19:49:49
  host      : pragyan
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 3248972)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-14_19:49:49
  host      : pragyan
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 3248973)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-14_19:49:49
  host      : pragyan
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 3248974)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_19:49:49
  host      : pragyan
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 3248971)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Not entirely unintelligible gibberish or gobbledygook. You, being such fantastic researchers might have faced this issue millions of times! Well, I hope you have, but not very imminent, as the sorry state of affairs I observe in your repository tells me. Is this code or burnt 5 day old mix spaghetti?

Now, if a single one of you have any brain cells left in your sorry little cranium, would you do as much as help me run your project?

Give me a line by line guide on how to run it. Line. By. Line.

Bro look I have a project to submit by the end of this month and I seriously don't get why you would provide such obscure documentation and code for this paper. Sure, your paper might be good, but were you able to implement it on your own machine first of all? That aside, help from your side would be highly appreciated.

Cheers
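
The run in this log stops at `unrecognized arguments: --local-rank=0`: the launcher in that environment passes `--local-rank` with a hyphen, while the usage line shows that train_semi.py only declares `--local_rank`. Below is a minimal sketch of one possible workaround, assuming the argument parser can be edited: register both spellings and fall back to the LOCAL_RANK environment variable, which the FutureWarning in the log recommends. The function name build_parser and all default values are illustrative, not the repository's code. Separately, `source: not found` suggests the script was started with `sh`, which on many systems is not bash; starting it with `bash ./single_run.sh` may avoid that message.

```python
import argparse
import os


def build_parser():
    """Parser that accepts both --local_rank and --local-rank."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default="config.yaml")
    # Both spellings map to args.local_rank; the default comes from the
    # LOCAL_RANK environment variable set by torchrun-style launchers.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank", type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--port", type=int, default=None)
    return parser


args = build_parser().parse_args(["--local-rank=2"])
print(args.local_rank)
```

Installing the PyTorch version the repository was developed with would be another route, since older launchers pass the underscore spelling.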

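About reaching the highest IoU too early

Hello, thank you very much for your work. I noticed that your iters seem to be more than in other works. Does that make training converge faster? With other code I need roughly 40 to 50 epochs to converge, while yours converges in a little over 20 epochs. Is this because there are more iters?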
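
Part of the answer may simply be bookkeeping: how many iterations make up one "epoch" depends on which data loader defines the epoch. The sketch below works through the arithmetic with the sample counts 662 and 9920 and the batch size 8 that appear in a training log further down this page. Two GPUs and dropping the last incomplete batch are assumptions, and which convention AugSeg or any other codebase follows is not stated on this page.

```python
def iters_per_epoch(num_samples, batch_size_per_gpu, num_gpus):
    """Iterations for one pass over num_samples (incomplete batch dropped)."""
    return num_samples // (batch_size_per_gpu * num_gpus)


# One pass over the labeled split versus one pass over the unlabeled split.
labeled = iters_per_epoch(662, batch_size_per_gpu=8, num_gpus=2)
unlabeled = iters_per_epoch(9920, batch_size_per_gpu=8, num_gpus=2)
print(labeled, unlabeled)
```

Under these assumptions an epoch counted over the unlabeled split performs roughly fifteen times as many updates as one counted over the labeled split, so epoch counts from different codebases are not directly comparable.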

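Performance improves, but the unsupervised loss does not converge

Hello, after using Adaptive CutMix the performance improved, but the unsupervised loss keeps oscillating. How did you avoid this?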

Pretrained Checkpoints are missing.

Hi Zhen,

Thanks for your work and the code.

The links to the pretrained checkpoints (for both resnet50 and resnet101) are incorrect. Could you please have a look?
I've tried the checkpoints from CPS (and U2PL), but I got some unexpected_keys that don't seem to appear in your training log.

missing_keys:  [] 
unexpected_keys:  ['fc.weight', 'fc.bias']

Cheers,
Yuyuan
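
`unexpected_keys: ['fc.weight', 'fc.bias']` is what one would expect when an ImageNet classification checkpoint is loaded into a backbone that has no classification head; the same message appears in another training log on this page right after the pretrain is loaded. A minimal sketch of dropping those entries before loading follows, with a plain dict standing in for a state dict. The helper name drop_classifier_keys is hypothetical.

```python
def drop_classifier_keys(state_dict, prefixes=("fc.",)):
    """Return a copy of state_dict without keys that start with a prefix."""
    return {
        key: value
        for key, value in state_dict.items()
        if not key.startswith(tuple(prefixes))
    }


# Plain-dict stand-in for a loaded checkpoint; real values would be tensors.
checkpoint = {
    "conv1.weight": 1,
    "layer1.0.bn1.running_mean": 2,
    "fc.weight": 3,
    "fc.bias": 4,
}
backbone_weights = drop_classifier_keys(checkpoint)
print(sorted(backbone_weights))
```

With PyTorch, the filtered dict would then be passed to `model.load_state_dict(...)`; passing `strict=False` there tolerates the extra keys in a similar way.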

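Validation function runs slowly

I am reproducing your code and found that the validation function validate_citys runs extremely slowly: training one epoch takes only ten minutes, but validation takes an hour. Was it the same for you? Could you suggest what could be changed?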
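
Another issue on this page notes that Cityscapes evaluation uses a sliding window, which could explain part of the cost: every validation image is split into overlapping crops and each crop needs its own forward pass. The sketch below only counts the crops for one image. The crop size and stride are hypothetical values chosen for illustration, not the repository's settings.

```python
import math


def window_starts(size, crop, stride):
    """Start offsets of windows of length crop that cover [0, size)."""
    if size <= crop:
        return [0]
    steps = math.ceil((size - crop) / stride) + 1
    # Clamp the last window so that it ends exactly at the image border.
    return [min(i * stride, size - crop) for i in range(steps)]


def sliding_windows(height, width, crop, stride):
    """Top-left corners (row, col) of every crop evaluated for one image."""
    return [
        (top, left)
        for top in window_starts(height, crop, stride)
        for left in window_starts(width, crop, stride)
    ]


# 1024 x 2048 is the Cityscapes image size; crop and stride are hypothetical.
windows = sliding_windows(1024, 2048, crop=512, stride=256)
print(len(windows))
```

A larger stride or a larger crop reduces the number of forward passes per image, usually at some cost in accuracy or memory.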

Adapting to object detection

I found the idea of this paper very interesting. Could the method be adapted to semi-supervised object detection?

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

$ sh ./single_run.sh >> "error.txt"

./single_run.sh: 4: source: not found
/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
        main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317586 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317587 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3317584) of binary: /home/dse316/miniconda3/envs/grp_007/bin/python
Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-19_17:20:25
  host      : pragyan
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3317585)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-19_17:20:25
  host      : pragyan
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3317584)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Ran steps as mentioned in issue #23
Downloaded gtFine dataset into ./data/cityscapes
Set the paths in augseg/exps/zrun_citys/citys_semi744/config_semi.yaml
Downloaded resnet50.pth in ./pretrained
Ran ./single_run.sh
Got this error. Please help 😢

SegmentationClassAug

Excellent work!

Could you also provide a link for downloading SegmentationClassAug?

Thank you so much

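Updating the buffers in the network

Hello, thank you for your excellent work. While reading your code I found the following lines, and I do not quite understand why the teacher network's buffers need to be updated:

# update bn
for buffer_train, buffer_eval in zip(model.buffers(), model_teacher.buffers()):
    buffer_eval.data = buffer_eval.data * ema_decay + buffer_train.data * (1 - ema_decay)

I hope to hear back from you. Thanks.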
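
One possible reading, offered as an assumption rather than the author's answer: batch-norm running statistics are stored as buffers, not parameters, so an EMA that only visits parameters would leave the teacher's running mean and variance at their initial values, and a teacher evaluated in eval mode normalizes with exactly those statistics. The toy sketch below uses plain floats in place of tensors; ema_update and the key names are hypothetical.

```python
def ema_update(teacher, student, decay):
    """In place: teacher[k] <- decay * teacher[k] + (1 - decay) * student[k]."""
    for name, value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * value


# Toy stand-ins: one learnable weight and one batch-norm running mean.
student_params = {"conv.weight": 0.8}
student_buffers = {"bn.running_mean": 2.0}
teacher_params = {"conv.weight": 0.5}
teacher_buffers = {"bn.running_mean": 0.0}  # initial running mean

ema_update(teacher_params, student_params, decay=0.9)
# Without this second call the teacher's running mean would stay at 0.0.
ema_update(teacher_buffers, student_buffers, decay=0.9)
print(teacher_params, teacher_buffers)
```

The second call mirrors the quoted loop over `model.buffers()`: it moves the teacher's statistics toward the student's at the same rate as the weights.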

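About the final results and model saving

The final reported result is the mIoU validated with the teacher. Does that mean the saved checkpoint should be the teacher rather than the student?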

Test code

Hello! Could you please provide your test code? I would like to test on my own dataset.

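Question about cut_mix_label_adaptive

Hello, I ran your code on my own dataset and the results are very good. I have one question, though: why is CutMix applied twice in the cut_mix_label_adaptive function? Isn't it essentially just transferring a smaller labeled region into the unlabeled image? It seems a single CutMix would suffice. I do not quite understand this and hope you can explain. Thanks!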

CPS results

Hello, why don't you compare against the best CPS results on Cityscapes?

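Question about the third step of Adaptive Label-aided CutMix

Hello, the Adaptive Label-aided CutMix in the paper has three steps. The second step already performs one CutMix using pi, so why is CutMix applied again in the third step? What is the idea behind this design?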

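Configuration problem on the Cityscapes dataset

Hello, I have recently been modifying and learning from your project. So far the results on PASCAL VOC 2012 are good, but the results on Cityscapes are very poor. I have carefully studied the config files and experiment logs you provide, but the problem is still unresolved, so I would like to ask what needs attention on this dataset. First, on this dataset the loss is the OHEM loss, without the auxiliary loss or class weights. I also noticed that, for the same batch size × number of GPUs, the learning rate on this dataset is 10 times larger, the number of training epochs is set to 240, and the number of classes is 19. In addition, sliding-window evaluation is used during evaluation. Even after taking all of this into account, my model still performs poorly on Cityscapes. I am quite puzzled and do not know how to solve it. Could you give me some suggestions? I would really appreciate it!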

Problem with Input Data

Hello there!

I am trying to reproduce the results of your publication for my course project. However, I think there is some issue with the "Pascal: JPEGImages | SegmentationClass" data set. It keeps on giving the error "File not found". The complete error has been provided below:

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

 warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2024-04-13 19:40:45,183][INFO] - {'criterion': {'kwargs': {'use_weight': False}, 'type': 'CELoss'},
 'dataset': {'ignore_label': 255,
             'mean': [0.485, 0.456, 0.406],
             'n_sup': 662,
             'std': [0.229, 0.224, 0.225],
             'train': {'batch_size': 8,
                       'crop': {'size': [513, 513], 'type': 'rand'},
                       'data_list': './data/splitsall/pascal_u2pl/662/labeled.txt',
                       'data_root': './data/VOC2012',
                       'flip': True,
                       'rand_resize': [0.5, 2.0],
                       'resize_base_size': 500,
                       'strong_aug': {'flag_use_random_num_sampling': True,
                                      'num_augs': 3}},
             'type': 'pascal_semi',
             'val': {'batch_size': 1,
                     'data_list': './data/splitsall/pascal_u2pl/val.txt',
                     'data_root': './data/VOC2012'},
             'workers': 4},
 'exp_path': './exps/zrun_vocs_u2pl/voc_semi662',
 'log_path': './exps/zrun_vocs_u2pl/voc_semi662/log',
 'net': {'decoder': {'kwargs': {'dilations': [6, 12, 18],
                                'inner_planes': 256,
                                'low_conv_planes': 48},
                     'type': 'augseg.models.decoder.dec_deeplabv3_plus'},
         'ema_decay': 0.999,
         'encoder': {'kwargs': {'multi_grid': True,
                                'replace_stride_with_dilation': [False,
                                                                 False,
                                                                 True],
                                'zero_init_residual': True},
                     'pretrain': './pretrained/resnet101.pth',
                     'type': 'augseg.models.resnet.resnet101'},
         'num_classes': 21,
         'sync_bn': True},
 'save_path': './exps/zrun_vocs_u2pl/voc_semi662/checkpoints',
 'saver': {'pretrain': '', 'snapshot_dir': 'checkpoints', 'use_tb': False},
 'trainer': {'epochs': 80,
             'evaluate_student': True,
             'lr_scheduler': {'kwargs': {'power': 0.9}, 'mode': 'poly'},
             'optimizer': {'kwargs': {'lr': 0.001,
                                      'momentum': 0.9,
                                      'weight_decay': 0.0001},
                           'type': 'SGD'},
             'sup_only_epoch': 0,
             'unsupervised': {'flag_extra_weak': False,
                              'loss_weight': 1.0,
                              'threshold': 0.95,
                              'use_cutmix': True,
                              'use_cutmix_adaptive': True,
                              'use_cutmix_trigger_prob': 1.0}}}
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth' 
missing_keys:  [] 
unexpected_keys:  ['fc.weight', 'fc.bias']
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth' 
missing_keys:  [] 
unexpected_keys:  ['fc.weight', 'fc.bias']
[2024-04-13 19:40:55,377][INFO] - # samples: 662
[2024-04-13 19:40:55,390][INFO] - # samples: 9920
[2024-04-13 19:40:55,396][INFO] - # samples: 1449
[2024-04-13 19:40:55,396][INFO] - Get loader Done...
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth' 
missing_keys:  [] 
unexpected_keys:  ['fc.weight', 'fc.bias']
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth' 
missing_keys:  [] 
unexpected_keys:  ['fc.weight', 'fc.bias']
[2024-04-13 19:40:58,584][INFO] - -------------------------- start training --------------------------
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 172, in main
    res_loss_sup, res_loss_unsup = train(
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 301, in train
    _, image_u_weak, image_u_aug, _ = loader_u_iter.next()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/DATA2/dse316/grp_007/augseg/augseg/dataset/pascal_voc.py", line 63, in __getitem__
    label = self.img_loader(label_path, "L")
  File "/DATA2/dse316/grp_007/augseg/augseg/dataset/base.py", line 44, in img_loader
    with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/VOC2012/SegmentationClassAug/2008_006330.png'

Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 172, in main
    res_loss_sup, res_loss_unsup = train(
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 301, in train
    _, image_u_weak, image_u_aug, _ = loader_u_iter.next()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/DATA2/dse316/grp_007/augseg/augseg/dataset/pascal_voc.py", line 63, in __getitem__
    label = self.img_loader(label_path, "L")
  File "/DATA2/dse316/grp_007/augseg/augseg/dataset/base.py", line 44, in img_loader
    with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/VOC2012/SegmentationClassAug/2008_000085.png'

Exception in thread Thread-1 (_pin_memory_loop):
Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
    fd = df.detach()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 686256) of binary: /home/dse316/miniconda3/envs/grp_007/bin/python
Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-13_19:41:03
  host      : pragyan
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 686257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-13_19:41:03
  host      : pragyan
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 686256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Anticipating a positive response. Please cross check the data source file contains all the files and the link on the GitHub repository is correct.
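
The traceback reports label files such as ./data/VOC2012/SegmentationClassAug/2008_006330.png as missing, which suggests that the SegmentationClassAug folder was not downloaded or sits somewhere else. Below is a hypothetical helper that lists every missing file before training starts. It assumes each non-empty line of the split file holds one or more whitespace-separated paths relative to data_root; that format is an assumption and should be checked against the actual split files.

```python
from pathlib import Path


def missing_files(data_root, data_list):
    """Return the paths named in data_list that do not exist under data_root."""
    root = Path(data_root)
    missing = []
    for line in Path(data_list).read_text().splitlines():
        for relative in line.split():
            if not (root / relative).is_file():
                missing.append(relative)
    return missing


if __name__ == "__main__":
    # Paths copied from the config printed in the log above.
    root = "./data/VOC2012"
    split = "./data/splitsall/pascal_u2pl/662/labeled.txt"
    if Path(split).is_file():
        for path in missing_files(root, split):
            print("missing:", path)
```

If every listed label is reported missing, the folder itself is absent; if only a few are, the download is probably incomplete.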

Backbone model weights

Hello, were the resnet50 and resnet101 weights provided in the project trained on ImageNet by yourselves?

Question related to supervised learning code.

I appreciate your excellent work.

I want to replicate the experiments described in the paper now. I have already tried the semi-supervised learning experiment since the code exists, but I haven't been able to attempt the supervised learning part as the code is not available. I'm curious about the differences you made between the semi-supervised and supervised learning experiments. Also, I would like to receive your feedback on writing code for supervised learning. Thank you.

Question about the reproduced U2PL

Hi! I just read the AugSeg paper and found the statement "Since U2PL prioritizes selecting high-quality labels from classic VOCs for testing on blender VOC, we reproduce the supervised baseline and its performance on ResNet-50 for fair comparisons". Could I ask what this means? Where is it indicated that U2PL prioritizes selecting high-quality labels? Thanks a lot!