Comments (13)

ruanchaves commented on May 31, 2024

I got around the error by reducing the batch_size_for_train from 32 down to 8. Then I was able to run ./src/train.py as expected on my setup.

Maybe it would be a good idea to put the following in README.md:
python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1

and, similarly,

CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1

As it is said, ~3 GB of CPU memory and ~1.1 GB of GPU memory are necessary for running the script.
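
As a quick sanity check, this plain PyTorch snippet (independent of this repository) shows which GPUs a process actually sees after CUDA_VISIBLE_DEVICES masking:

import torch

# number of GPUs visible to this process after CUDA_VISIBLE_DEVICES masking
print(torch.cuda.device_count())
# names of the visible devices, indexed from 0 regardless of their physical ids
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])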

doudouzqx commented on May 31, 2024

I have the same problem. It seems the encoders use a single-GPU training mode,
e.g. self.cuda_device = 0 and batch = nn_util.move_to_device(batch, self.cuda_device).
So even with CUDA_VISIBLE_DEVICES=0,1, in fact only a single GPU is used.
I want to know how to use allennlp with multiple GPUs. Can you please help?

ruanchaves commented on May 31, 2024

Only BiEncoderTopXRetriever in utils.py uses a single GPU.
train.py calls Trainer which uses all available GPUs by default.
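
To illustrate the difference, here is a standalone PyTorch sketch (not the repository's exact code; Trainer's multi-GPU path is approximated with DataParallel):

import torch
import torch.nn as nn

model = nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # multi-GPU path, roughly what Trainer does with cuda_devices 0,1:
    # replicate the model and scatter each batch across the visible GPUs
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
elif torch.cuda.is_available():
    # single-GPU pattern, as in BiEncoderTopXRetriever: pin everything to device 0
    model = model.cuda(0)

batch = torch.randn(8, 16)
if torch.cuda.is_available():
    batch = batch.cuda(0)  # analogous to nn_util.move_to_device(batch, 0)
print(model(batch).shape)  # torch.Size([8, 4])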

I executed the command CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1 and then I had no problem training the model on multiple GPUs.

Did you forget to add -cuda_devices 0,1 at the end of your command?

doudouzqx commented on May 31, 2024

I used this command:
CUDA_VISIBLE_DEVICES=3,4,5 python3 train.py -num_epochs 1 -batch_size_for_train 8 -batch_size_for_eval 8 -cuda_devices 3,4,5

and the problem arises at the "Encoding all entites from title and description" step:
experiment_logdir: ../src/experiment_logdir/201120_125315/
World american_football is now being loaded...
======Encoding all entites from title and description=====
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 83, in main
    hardNegativeSearcher.hardNegativesSearcherandSetter()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 41, in hardNegativesSearcherandSetter
    dui2encoded_emb, duidx2encoded_emb = self.dui2EncoderEntityEmbReturner()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 76, in dui2EncoderEntityEmbReturner
    duidx2encoded_emb = self.encodeAllEntitiesEncoder.encoding_all_entities()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 129, in encoding_all_entities
    duidxs, embs = self._extract_cuidx_and_its_encoded_emb(batch)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 141, in _extract_cuidx_and_its_encoded_emb
    out_dict = self.entity_encoder_wrapping_model(**batch)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/model.py", line 108, in forward
    encoded_entites = self.entity_encoder(title_and_desc_concatnated_text=title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 46, in forward
    entity_emb = self.word_embedder(title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 26, in forward
    return self.transformer_model(token_ids)[0]
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 715, in forward
    head_mask=head_mask)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 437, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 417, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 389, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 142, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 3.72 GiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 143.12 MiB free; 9.77 GiB reserved in total by PyTorch)
16%|#5 | 4999/31929 [00:13<01:14, 363.16it/s]

The code in encoders.py seems to use a single GPU (the error itself reports GPU 0).

ruanchaves commented on May 31, 2024

There are two workarounds:

  1. Use Docker containers. Execute docker run with the flag --gpus '"device=3,4,5"'. This way, GPUs 3, 4 and 5 will be mapped to 0, 1 and 2 inside your container (see the example command after this list). More information here.

  2. If Docker containers are not available on your machine or if you are not familiar with Docker, you can simply do:

  • Replace self.cuda_device = 0 on line 201 of utils.py with self.cuda_device = 3

  • Replace self.cuda_device = 0 on line 107 of encoders.py with self.cuda_device = 3
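
For example, a run under the first workaround might look like this (the image name zsel-image is a placeholder; adjust volumes and paths to your setup):

docker run --rm -it --gpus '"device=3,4,5"' zsel-image python3 ./src/train.py -num_epochs 1 -batch_size_for_train 8 -batch_size_for_eval 8 -cuda_devices 0,1,2

Note that -cuda_devices 0,1,2 is passed rather than 3,4,5, because the container renumbers the three visible GPUs from 0.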

doudouzqx commented on May 31, 2024

Thank you for your help

DRosemei commented on May 31, 2024

100%|##########| 440473133/440473133 [01:12<00:00, 6085356.67B/s]

Hello, I want to know why your download speed is so fast. Mine is shown below.
===PARAMETERS===
debug False
bert_name bert-base-uncased
word_embedding_dropout 0.05
cuda_devices 0
allen_lazyload True
batch_size_for_train 32
batch_size_for_eval 8
hard_negatives_num 10
num_epochs 1
lr 1e-05
weight_decay 0
beta1 0.9
beta2 0.999
epsilon 1e-08
amsgrad False
max_title_len 12
max_desc_len 50
max_context_len_after_tokenize 100
add_mse_for_biencoder False
search_method indexflatip
add_hard_negatives True
metionPooling CLS
entityPooling CLS
dimentionReduction False
dimentionReductionToThisDim 300
extracted_first_token_for_description 100
extracted_first_token_for_title 16
dataset_dir ./data/
documents_dir ./data/documents/
mentions_dir ./data/mentions/
mentions_splitbyworld_dir ./data/mentions_split_by_world/
mention_leftandright_tokenwindowwidth 40
debugSampleNum 100000000
dir_for_each_world ./data/worlds/
experiment_logdir ./src/experiment_logdir/
===PARAMETERS END===

experiment_logdir: ./src/experiment_logdir/201217_102331/
61%|##########################################3 | 266586112/440473133 [12:25<02:27, 1178137.12B/s]

Does it depend on the CPU?

ruanchaves commented on May 31, 2024

I ran these experiments on two Tesla V100 GPUs on an NVIDIA DGX-1 32GB server.
So yes, it depends on your setup.

By the way, @DRosemei and @doudouzqx, please let me know if you succeed in your experiments with the code in this repository or the BLINK repository. Although I was able to run the code and train the model, I couldn't achieve the results I was looking for.

DRosemei commented on May 31, 2024

@ruanchaves I have run into a problem. I downloaded the model named "bert-base-uncased", but I don't know where to put it.
Errors are shown below:
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "./src/train.py", line 131, in <module>
    main()
  File "./src/train.py", line 44, in main
    mention_encoder = Pooler_for_mention(args=opts, word_embedder=textfieldEmbedder)
  File "/media/rose/Doc/projects/xiaofan/Zero-Shot-Entity-Linking/src/encoders.py", line 63, in __init__
    self.bertpooler_sec2vec = BertPooler(pretrained_model=self.bert_weight_filepath)
  File "/home/rose/anaconda3/envs/el/lib/python3.7/site-packages/allennlp/modules/seq2vec_encoders/bert_pooler.py", line 51, in __init__
    self.pooler = model.pooler
AttributeError: 'NoneType' object has no attribute 'pooler'

ruanchaves commented on May 31, 2024

@DRosemei Can you post the command you are trying to run? What are your arguments to python3 ./src/train.py ?

DRosemei commented on May 31, 2024

@ruanchaves Yes, I used python3 ./src/train.py -num_epochs 1, and I can train now after putting "bert-base-uncased" into ./src/.
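
For reference, pytorch_transformers can also load the model from an arbitrary local directory instead of ./src/, as long as the directory contains the extracted config.json, pytorch_model.bin and vocab.txt. A rough sketch (the path below is a placeholder):

from pytorch_transformers import BertModel, BertTokenizer

# placeholder: any directory holding the files extracted from the downloaded archive
local_dir = "./src/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertModel.from_pretrained(local_dir)
print(model.config.hidden_size)  # 768 for bert-base-uncased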

DRosemei commented on May 31, 2024

@ruanchaves I have completed 1 epoch, and the final results are below:
{
"entire_h1_percent": 20.28,
"entire_h10_percent": 42.88,
"entire_h50_percent": 54.42,
"entire_h64_percent": 55.96,
"entire_h100_percent": 59.440000000000005,
"entire_h500_percent": 71.00999999999999
}
The results are not so good. Have you ever trained more than 1 epoch?

ruanchaves commented on May 31, 2024

Yes, I have already trained for some epochs. I couldn't achieve acceptable results.
