chenyuntc / pytorch-best-practice
A Guidance on PyTorch Coding Style Based on Kaggle Dogs vs. Cats
Why is model.train() also called inside val?
I am running on Python 3, so I had to make a few small changes.
Environment: Windows 10 64-bit, CPU only.
1. utils/visualize.py, line 44: win=unicode(name) --> win=str(name)
2. main.py, line 22: add import config
3. main.py, line 108: loss_meter.add(loss.data[0]) --> loss_meter.add(loss.item())
4. config.py, line 10: load_model_path = 'checkpoints/model.pth' --> load_model_path = None
5. config.py, line 12: batch_size = 128 --> batch_size = 8
6. config.py, line 21: lr = 0.1 --> lr = 0.001
7. config.py, line 31: for k,v in kwargs.iteritems() --> for k,v in kwargs.items()
8. I did not run python -m visdom.server; after setting the paths I ran python main.py train directly.
The loss is printed in the following format, and it keeps fluctuating between 0.6 and 1.5:
loss: tensor(0.7035, grad_fn=)
I also see what others have reported: accuracy stays at around 50%, which means the model is learning nothing.
$ CUDA_VISIBLE_DEVICES='2,3' python main.py train --train-data-root=data/train/ --lr=0.005 --batch-size=32 --model='ResNet34' --max-epoch = 20 --use-gpu --env=classifier
TypeError: 'str' object cannot be interpreted as an integer
user config:
env classifier
vis_port 8097
model ResNet34
train_data_root data/train/
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch =
lr 0.005
lr_decay 0.5
weight_decay 0.0
WARNING:root:Setting up a new session...
WARNING:visdom:Without the incoming socket you cannot receive events from the server or register event handlers to your Visdom client.
Traceback (most recent call last):
File "main.py", line 168, in <module>
fire.Fire()
File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 79, in train
for epoch in range(opt.max_epoch):
TypeError: 'str' object cannot be interpreted as an integer
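A possible explanation, judging from the command and the `max_epoch =` line in the config dump above: the spaces around `=` in `--max-epoch = 20` make the shell pass three separate tokens, so `max_epoch` appears to receive the string `=` rather than the integer 20. The sketch below only illustrates the token splitting and the resulting `range()` failure with the standard library; it does not call fire itself, and the variable names are illustrative.

```python
import shlex

# The flag portion of the reported command, with spaces around "=".
reported = "--max-epoch = 20"
# The same flag written without spaces.
fixed = "--max-epoch=20"

# The shell hands each whitespace-separated token to the program separately.
reported_tokens = shlex.split(reported)
fixed_tokens = shlex.split(fixed)
print(reported_tokens)  # ['--max-epoch', '=', '20']
print(fixed_tokens)     # ['--max-epoch=20']

# If max_epoch ends up as the string "=", range() fails as in the traceback.
try:
    range("=")
    message = ""
except TypeError as err:
    message = str(err)
print(message)
```

Writing the flag as `--max-epoch=20` (no spaces) keeps the name and value in a single token.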
The following appears while the program is running:
"please use transforms.Resize instead.")
/usr/local/lib/python2.7/dist-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
"please use transforms.RandomResizedCrop instead.")
1%| | 137/17500 [01:50<3:33:34, 1.35it/s]
1%| | 137/17500 [01:49<3:34:13, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:34:31, 1.35it/s]
1%| | 137/17500 [01:49<3:33:46, 1.35it/s]
1%| | 137/17500 [01:49<3:33:40, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:32:45, 1.36it/s]
1%| | 137/17500 [01:49<3:32:46, 1.36it/s]
1%| | 137/17500 [01:49<3:32:01, 1.36it/s]
*** Error in `python': munmap_chunk(): invalid pointer: 0x0000000002a22030 ***
======= Backtrace: =========
(a lot more output follows)
7f17a776c000-7f17a796b000 ---p 0021b000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f17a796b000-7f17a7987000 r--p 0021a000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Aborted (core dumped)
How can I solve this problem?
No such file or directory: 'checkpoints/model.pth'
File "main.py", line 171, in <module>
fire.Fire()
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 49, in train
opt.parse(kwargs)
File "/home/thinkjoy/PycharmProjects/pytorch-best-practice/config.py", line 30, in parse
for k,v in kwargs.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'
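The traceback above ends in `kwargs.iteritems()`, which exists only on Python 2 dictionaries. A minimal sketch of a Python 3 compatible `parse` follows. The class body is a hypothetical stand-in for the repository's config.py (only three illustrative fields are shown), and the warning text is an assumption rather than the project's exact wording.

```python
import warnings


class DefaultConfig:
    # Illustrative defaults only; the real config.py defines more fields.
    lr = 0.001
    batch_size = 8
    load_model_path = None

    def parse(self, kwargs):
        """Override attributes from a dict of keyword arguments."""
        # dict.items() is available on Python 3; iteritems() is Python 2 only.
        for k, v in kwargs.items():
            if not hasattr(self, k):
                warnings.warn("opt has no attribute %s" % k)
            setattr(self, k, v)


opt = DefaultConfig()
opt.parse({"lr": 0.005, "batch_size": 32})
print(opt.lr, opt.batch_size)
```

Setting `load_model_path = None` by default, as in this sketch, also avoids the attempt to load a checkpoint file that does not exist yet.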
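@chenyuntc Hi, I followed the tutorial code and tried it myself. During training, val_accuracy in visdom stays at around 50%, and the validation confusion matrix has values in essentially only one class. I assumed I had made a mistake somewhere, so I ran the original code as well and saw the same behavior. The visualization from training is shown in the figure below. val_accuracy should rise as training proceeds, so I don't know what is wrong. If anyone has run into a similar problem, I'd appreciate any pointers. Thanks in advance!
A numeric overflow occurred during execution. Here is the output: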
python main.py train --train-data-root=/home/linux_fhb/data/cat_vs_dog/train --use-gpu --env=classifier
user config:
env classifier
model ResNet34
train_data_root /home/linux_fhb/data/cat_vs_dog/train
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch 10
lr 0.1
lr_decay 0.95
weight_decay 0.0001
parse <bound method parse of <config.DefaultConfig object at 0x7f3e4a85b400>>
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
"please use transforms.Resize instead.")
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
"please use transforms.RandomResizedCrop instead.")
0%| | 0/17500 [00:00<?, ?it/s]main.py:99: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
loss_meter.add(loss.data[0])
3%|█▏ | 547/17500 [02:09<1:05:07, 4.34it/s]
main.py:138: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
val_input = Variable(input, volatile=True)
main.py:139: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
val_label = Variable(label.type(t.LongTensor), volatile=True)
Traceback (most recent call last):
File "main.py", line 171, in <module>
fire.Fire()
File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 121, in train
if loss_meter.value()[0] > previous_loss:
RuntimeError: value cannot be converted to type float without overflow: 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104.000000
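One way to read this error, offered as a hypothesis rather than a confirmed diagnosis: the number in the message is the exact double-precision value of 1e100, which is far beyond what a 32-bit float can hold. The earlier warning shows that `loss.data[0]` yields a 0-dim tensor, so `loss_meter.value()[0]` may be a float32 tensor being compared with a Python float of that size. That `previous_loss` starts at 1e100 is an assumption, since the log does not show it. The sketch below uses only the standard library to check the magnitudes; the helper name `fits_in_float32` is illustrative.

```python
import struct

# The number quoted in the RuntimeError, copied from the message above.
reported = float(
    "10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104"
)
# True if the reported value is the double nearest to 1e100.
is_1e100 = reported == 1e100

# Largest finite float32 value, decoded from its bit pattern.
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


def fits_in_float32(x):
    """Return True if x can be packed as a 32-bit float without overflow."""
    try:
        struct.pack("<f", x)
        return True
    except OverflowError:
        return False


print(is_1e100)
print(FLOAT32_MAX)
print(fits_in_float32(0.7035), fits_in_float32(1e100))
```

If this reading is right, storing a plain Python number in the meter, as in the `loss_meter.add(loss.item())` change listed earlier in this thread, would keep the comparison in Python floats.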
Environment versions:
Python 3.6.5 :: Anaconda, Inc.
fire 0.1.3
numpy 1.14.3
numpydoc 0.8.0
torch 0.4.1
torchfile 0.1.0
torchnet 0.0.4
torchvision 0.2.1
visdom 0.1.8.5
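GPU: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1), 11 GB of memory.
Has anyone run into the same problem? How did you solve it?
I have two 8 GB GPUs and get this error. What is the cause? Thanks.
Running python main.py train produces the problem below. My environment is Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. After searching on Baidu, I found that the cause may be insufficient CUDA compute capability: cuDNN requires compute capability 3.0, but CUDA 9.0's compute capability is 2.1, which is not enough. However, many online setup tutorials use exactly Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. Is this really a compute capability problem, or is it something else?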
python main.py train --data-root=./data/train --use-gpu=True --env=classifier
Traceback (most recent call last):
File "main.py", line 170, in <module>
import fire
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 366, in _Fire
component, remaining_args)
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 48, in train
def train(**kwargs):
NameError: name 'opt' is not defined