Comments (13)

ruanchaves commented on May 31, 2024

I got around the error by reducing the batch_size_for_train from 32 down to 8. Then I was able to run ./src/train.py as expected on my setup.

Maybe it would be a good idea to put the following in README.md:
python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1

and, similarly,

CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1

As it is said, ~3 GB of CPU memory and ~1.1 GB of GPU memory are necessary for running the script.
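
As a quick sanity check, this plain PyTorch snippet (independent of this repository) shows which GPUs a process actually sees after CUDA_VISIBLE_DEVICES masking:

import torch

# number of GPUs visible to this process after CUDA_VISIBLE_DEVICES masking
print(torch.cuda.device_count())
# names of the visible devices, indexed from 0 regardless of their physical ids
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])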

doudouzqx commented on May 31, 2024

I have the same problem. It seems the encoders use a single-GPU training mode,
e.g. self.cuda_device = 0 and batch = nn_util.move_to_device(batch, self.cuda_device).
So even with CUDA_VISIBLE_DEVICES=0,1, in fact only a single GPU is used.
I want to know how to use allennlp with multiple GPUs. Can you please help?

ruanchaves commented on May 31, 2024

Only BiEncoderTopXRetriever in utils.py uses a single GPU.
train.py calls Trainer which uses all available GPUs by default.
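
To illustrate the difference, here is a standalone PyTorch sketch (not the repository's exact code; Trainer's multi-GPU path is approximated with DataParallel):

import torch
import torch.nn as nn

model = nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # multi-GPU path, roughly what Trainer does with cuda_devices 0,1:
    # replicate the model and scatter each batch across the visible GPUs
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
elif torch.cuda.is_available():
    # single-GPU pattern, as in BiEncoderTopXRetriever: pin everything to device 0
    model = model.cuda(0)

batch = torch.randn(8, 16)
if torch.cuda.is_available():
    batch = batch.cuda(0)  # analogous to nn_util.move_to_device(batch, 0)
print(model(batch).shape)  # torch.Size([8, 4])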

I executed the command CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1 and then I had no problem training the model on multiple GPUs.

Did you forget to add -cuda_devices 0,1 at the end of your command?

doudouzqx commented on May 31, 2024

I used this command:
CUDA_VISIBLE_DEVICES=3,4,5 python3 train.py -num_epochs 1 -batch_size_for_train 8 -batch_size_for_eval 8 -cuda_devices 3,4,5

and the problem arises at the "Encoding all entites from title and description" step:
experiment_logdir: ../src/experiment_logdir/201120_125315/
World american_football is now being loaded...
======Encoding all entites from title and description=====
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 83, in main
    hardNegativeSearcher.hardNegativesSearcherandSetter()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 41, in hardNegativesSearcherandSetter
    dui2encoded_emb, duidx2encoded_emb = self.dui2EncoderEntityEmbReturner()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 76, in dui2EncoderEntityEmbReturner
    duidx2encoded_emb = self.encodeAllEntitiesEncoder.encoding_all_entities()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 129, in encoding_all_entities
    duidxs, embs = self._extract_cuidx_and_its_encoded_emb(batch)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 141, in _extract_cuidx_and_its_encoded_emb
    out_dict = self.entity_encoder_wrapping_model(**batch)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/model.py", line 108, in forward
    encoded_entites = self.entity_encoder(title_and_desc_concatnated_text=title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 46, in forward
    entity_emb = self.word_embedder(title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 26, in forward
    return self.transformer_model(token_ids)[0]
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 715, in forward
    head_mask=head_mask)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 437, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 417, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 389, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 142, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 3.72 GiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 143.12 MiB free; 9.77 GiB reserved in total by PyTorch)
16%|#5 | 4999/31929 [00:13<01:14, 363.16it/s]

The code in encoders.py seems to use a single GPU (the error itself reports GPU 0).

ruanchaves commented on May 31, 2024

There are two workarounds:

  1. Use Docker containers. Execute docker run with the flag --gpus '"device=3,4,5"'. This way, GPUs 3, 4 and 5 will be mapped to 0, 1 and 2 inside your container (see the example command after this list). More information here.

  2. If Docker containers are not available on your machine or if you are not familiar with Docker, you can simply do:

  • Replace self.cuda_device = 0 on line 201 of utils.py with self.cuda_device = 3

  • Replace self.cuda_device = 0 on line 107 of encoders.py with self.cuda_device = 3
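
For example, a run under the first workaround might look like this (the image name zsel-image is a placeholder; adjust volumes and paths to your setup):

docker run --rm -it --gpus '"device=3,4,5"' zsel-image python3 ./src/train.py -num_epochs 1 -batch_size_for_train 8 -batch_size_for_eval 8 -cuda_devices 0,1,2

Note that -cuda_devices 0,1,2 is passed rather than 3,4,5, because the container renumbers the three visible GPUs from 0.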

doudouzqx commented on May 31, 2024

Thank you for your help

DRosemei commented on May 31, 2024

100%|##########| 440473133/440473133 [01:12<00:00, 6085356.67B/s]

Hello, I want to know why your download speed is so fast. Mine is shown below.
===PARAMETERS===
debug False
bert_name bert-base-uncased
word_embedding_dropout 0.05
cuda_devices 0
allen_lazyload True
batch_size_for_train 32
batch_size_for_eval 8
hard_negatives_num 10
num_epochs 1
lr 1e-05
weight_decay 0
beta1 0.9
beta2 0.999
epsilon 1e-08
amsgrad False
max_title_len 12
max_desc_len 50
max_context_len_after_tokenize 100
add_mse_for_biencoder False
search_method indexflatip
add_hard_negatives True
metionPooling CLS
entityPooling CLS
dimentionReduction False
dimentionReductionToThisDim 300
extracted_first_token_for_description 100
extracted_first_token_for_title 16
dataset_dir ./data/
documents_dir ./data/documents/
mentions_dir ./data/mentions/
mentions_splitbyworld_dir ./data/mentions_split_by_world/
mention_leftandright_tokenwindowwidth 40
debugSampleNum 100000000
dir_for_each_world ./data/worlds/
experiment_logdir ./src/experiment_logdir/
===PARAMETERS END===

experiment_logdir: ./src/experiment_logdir/201217_102331/
61%|##########################################3 | 266586112/440473133 [12:25<02:27, 1178137.12B/s]

Does it depend on the CPU?

ruanchaves commented on May 31, 2024

I ran these experiments on two Tesla V100 GPUs on an NVIDIA DGX-1 32GB server.
So yes, it depends on your setup.

By the way, @DRosemei and @doudouzqx, please let me know if you succeed in your experiments with the code in this repository or the BLINK repository. Although I was able to run the code and train the model, I couldn't achieve the results I was looking for.

DRosemei commented on May 31, 2024

@ruanchaves I have run into a problem. I downloaded the model named "bert-base-uncased", but I don't know where to put it.
Errors are shown below:
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "./src/train.py", line 131, in <module>
    main()
  File "./src/train.py", line 44, in main
    mention_encoder = Pooler_for_mention(args=opts, word_embedder=textfieldEmbedder)
  File "/media/rose/Doc/projects/xiaofan/Zero-Shot-Entity-Linking/src/encoders.py", line 63, in __init__
    self.bertpooler_sec2vec = BertPooler(pretrained_model=self.bert_weight_filepath)
  File "/home/rose/anaconda3/envs/el/lib/python3.7/site-packages/allennlp/modules/seq2vec_encoders/bert_pooler.py", line 51, in __init__
    self.pooler = model.pooler
AttributeError: 'NoneType' object has no attribute 'pooler'

ruanchaves commented on May 31, 2024

@DRosemei Can you post the command you are trying to run? What are your arguments to python3 ./src/train.py ?

DRosemei commented on May 31, 2024

@ruanchaves Yes, I used python3 ./src/train.py -num_epochs 1, and I can train now after putting "bert-base-uncased" into ./src/.
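
For reference, pytorch_transformers can also load the model from an arbitrary local directory instead of ./src/, as long as the directory contains the extracted config.json, pytorch_model.bin and vocab.txt. A rough sketch (the path below is a placeholder):

from pytorch_transformers import BertModel, BertTokenizer

# placeholder: any directory holding the files extracted from the downloaded archive
local_dir = "./src/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertModel.from_pretrained(local_dir)
print(model.config.hidden_size)  # 768 for bert-base-uncased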

DRosemei commented on May 31, 2024

@ruanchaves I have completed 1 epoch, and the final results are below:
{
"entire_h1_percent": 20.28,
"entire_h10_percent": 42.88,
"entire_h50_percent": 54.42,
"entire_h64_percent": 55.96,
"entire_h100_percent": 59.440000000000005,
"entire_h500_percent": 71.00999999999999
}
The results are not so good. Have you ever trained more than 1 epoch?

ruanchaves commented on May 31, 2024

Yes, I have already trained for some epochs. I couldn't achieve acceptable results.
