When I change setup to "A+H" and mode to "PA+SL" to run the LSTM+CRF+PA+SL model, the cpu

Out of memory issue about dsner-pytorch (closed)

nooralahzadeh commented on July 19, 2024

Out of memory issue

from dsner-pytorch.

Comments (5)

nooralahzadeh commented on July 19, 2024

Hi,
Maybe the partial-CRF part is causing this issue.
Did you try a smaller batch size?
I did not have this problem with a 16G GPU!

wangdsh commented on July 19, 2024

How can I run the code in terminal?

I ran it from the "src" directory and from other directories, but got the error "ModuleNotFoundError: No module named 'src'". Did you run it in a terminal or in an IDE?

wangdsh commented on July 19, 2024

The CPU memory ran out, not the GPU memory. Below is the output:

$ python dsner.py 
PA+SL
100%|███████████████████████████████████████████████████████████████████████████| 1097/1097 [00:00<00:00, 146762.51it/s]
100%|███████████████████████████████████████████████████████████████████████████| 1097/1097 [00:00<00:00, 164115.83it/s]
100%|███████████████████████████████████████████████████████████████████████████| 1097/1097 [00:00<00:00, 857623.76it/s]
[2019-11-27 21:57:07,179] DEBUG:__main__:==> Size of train data   : 1097 
100%|█████████████████████████████████████████████████████████████████████████████| 798/798 [00:00<00:00, 773885.45it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 798/798 [00:00<00:00, 895173.73it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 798/798 [00:00<00:00, 902171.05it/s]
[2019-11-27 21:57:07,281] DEBUG:__main__:==> Size of test data    : 798 
100%|█████████████████████████████████████████████████████████████████████████████| 400/400 [00:00<00:00, 762947.52it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 400/400 [00:00<00:00, 835518.73it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 400/400 [00:00<00:00, 755730.45it/s]
[2019-11-27 21:57:07,339] DEBUG:__main__:==> Size of dev data    : 400 
100%|███████████████████████████████████████████████████████████████████████████| 2560/2560 [00:00<00:00, 782040.66it/s]
100%|███████████████████████████████████████████████████████████████████████████| 2560/2560 [00:00<00:00, 887756.78it/s]
100%|███████████████████████████████████████████████████████████████████████████| 2560/2560 [00:00<00:00, 924125.85it/s]
[2019-11-27 21:57:07,776] DEBUG:__main__:==> Size of ds pa data    : 2560 
[2019-11-27 21:57:07,968] DEBUG:__main__:==> Size of merge  data : 3657 
Training epoch  0:   0%|▎                                                             | 16/3657 [00:00<05:45, 10.54it/s]/pytorch/aten/src/ATen/native/cuda/LegacyDefinitions.cpp:19: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
/pytorch/aten/src/ATen/native/cuda/LegacyDefinitions.cpp:19: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
......
......
......
Training epoch  0:  66%|███████████████████████████████████████▊                    | 2430/3657 [01:30<03:05,  6.63it/s]Traceback (most recent call last):
  File "dsner.py", line 387, in <module>
    main()
  File "dsner.py", line 294, in main
    train_loss = trainer.train(dataset_setup, epoch)
  File "/data/wangdsh/temp/DSNER-pytorch/src/trainer.py", line 76, in train
    sent, tags, tags_iobes, sign, s_length, y_one_hot, y_iobes_one_hot = dataset[indices[start_index]]
  File "/data/wangdsh/temp/DSNER-pytorch/src/dataset.py", line 64, in __getitem__
    tags_iobes_one_hots=deepcopy(self.tags_iobes_one_hot[index])
  File "/data/Anaconda/Anaconda3/lib/python3.6/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/data/Anaconda/Anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 33, in __deepcopy__
    new_storage = self.storage().__deepcopy__(memo)
  File "/data/Anaconda/Anaconda3/lib/python3.6/site-packages/torch/storage.py", line 28, in __deepcopy__
    new_storage = self.clone()
  File "/data/Anaconda/Anaconda3/lib/python3.6/site-packages/torch/storage.py", line 44, in clone
    return type(self)(self.size()).copy_(self)
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 48272400 bytes. Error code 12 (Cannot allocate memory)

I changed batch_size to 16, but still got the same error. I think the reason is that the variable "y_one_hot_all" consumes too much CPU memory.
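
To make the suspicion about "y_one_hot_all" concrete, here is a rough, framework-agnostic sketch of why keeping a dense one-hot array for every sentence can exhaust CPU RAM, and of one possible workaround: store only tag indices and build the one-hot rows on demand. Everything here is an assumption for illustration. The names `dense_one_hot_bytes`, `index_bytes`, `one_hot_row`, and `LazyOneHotDataset` are hypothetical and not part of DSNER-pytorch, and the `max_len` and `num_tags` values are made up. Only the sentence count 3657 comes from the log above.

```python
def dense_one_hot_bytes(num_sentences, max_len, num_tags, itemsize=4):
    """Bytes needed to hold a padded float32 one-hot array for every sentence."""
    return num_sentences * max_len * num_tags * itemsize


def index_bytes(num_sentences, max_len, itemsize=8):
    """Bytes needed to hold only one int64 tag index per token."""
    return num_sentences * max_len * itemsize


def one_hot_row(tag_ids, num_tags):
    """Build the one-hot rows for a single sentence on demand."""
    return [[1.0 if j == t else 0.0 for j in range(num_tags)] for t in tag_ids]


class LazyOneHotDataset:
    """Hypothetical dataset that stores tag indices and expands them lazily."""

    def __init__(self, tag_id_sequences, num_tags):
        self.tag_id_sequences = tag_id_sequences
        self.num_tags = num_tags

    def __len__(self):
        return len(self.tag_id_sequences)

    def __getitem__(self, index):
        tag_ids = self.tag_id_sequences[index]
        # Only one sentence's one-hot rows exist at a time; nothing is
        # precomputed or deep-copied for the whole corpus.
        return tag_ids, one_hot_row(tag_ids, self.num_tags)


if __name__ == "__main__":
    # 3657 is the merged data size from the log; 100 and 20 are assumed values.
    print(dense_one_hot_bytes(3657, max_len=100, num_tags=20))
    print(index_bytes(3657, max_len=100))
```

Under these assumed sizes the dense layout costs `num_tags * 4 / 8` times more than the index layout, so the gap grows with the tag set. Whether this matches the actual allocation pattern in `dataset.py` would need to be checked against the real tensor shapes.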

nooralahzadeh commented on July 19, 2024

That is weird. I ran it again and I don't have this problem. However, I am using an earlier version of PyTorch (1.0.1)!

wangdsh commented on July 19, 2024

Thanks for your response. I installed PyTorch 1.0.1 with conda and ran the code again, but I encountered the same problem. I don't think it is a PyTorch version issue.
My test environment:
python: Python 3.6.9 :: Anaconda, Inc.
cuda: CUDA Version 9.2.148
pytorch: 1.0.1

Besides, I changed the args "--setup" to "A+H" and "--mode" to "PA+SL" in dsner.py. Everything else is unchanged.
