Comments (22)
The error probably happened while pip was searching for the right spacy or scikit-learn version and failed on one of those versions. I assume that after these error lines it found a correct version to install, which is why the final installation looks correct. I'm not sure whether it's normal behavior for pip to try all versions and print out the errors, though.
from capreolus.
@larryli1999 oh, that line alone could probably be solved by changing benchmark.qrels[qid] to benchmark.qrels.get(qid, {}). Sorry, I haven't tested the other benchmarks on this branch.
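For reference, a minimal sketch of why the .get change helps, assuming qrels follows the usual {qid: {docid: label}} layout (the ids below are made up):

```python
# Hypothetical qrels mapping: query id -> {doc id: relevance label}
qrels = {"301": {"FBIS3-1": 1, "FBIS3-2": 0}}

# qrels["999"] raises KeyError when a query has no judgments;
# qrels.get(qid, {}) returns an empty dict instead, so the caller
# can treat the query as simply having no judged documents.
print(qrels.get("999", {}))  # {}
print(qrels.get("301", {}))  # {'FBIS3-1': 1, 'FBIS3-2': 0}
```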
I'll run python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1 on my end in case there is any other issue. I'll update you once I've confirmed.
@larryli1999 hey, sorry for seeing this late. I tried the command and encountered the same error. The warning shouldn't be the issue, though, since it shows up at an early stage (before training?).
I'm checking the possible reasons now.
However, the import error does not seem to affect the final installation of Capreolus. Running the setup check returns the following:
Hi @crystina-z, I also ran into some issues when I tried to reproduce the results from Reranking robust04 with PARADE. The error message is:
cdr2602% python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1
Traceback (most recent call last):
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/larry/capreolus/capreolus/run.py", line 96, in <module>
    task, task_entry_function = prepare_task(arguments["COMMAND"], config)
  File "/scratch/larry/capreolus/capreolus/run.py", line 34, in prepare_task
    task = Task.create(taskstr, config)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 279, in __init__
    self.build()
  File "/scratch/larry/capreolus/capreolus/tokenizer/bert.py", line 15, in build
    self.bert_tokenizer = AutoTokenizer.from_pretrained(self.config["pretrained"], use_fast=True)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 218, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1572, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 639, in __init__
    wordpieces_prefix=wordpieces_prefix,
TypeError: __init__() got an unexpected keyword argument 'vocab_file'
Do you think this is caused by a version mismatch between the transformers and tokenizers packages?
Yeah, this looks like a version mismatch to me. The transformers API changes often.
Also, I think it's better to use this config: https://github.com/capreolus-ir/capreolus/blob/master/docs/reproduction/config_parade_long-robust04_title.txt
We've been noticing a lot of variance with the small one, so I don't think trying to reproduce it is very useful. I plan to update the repro docs to remove the small config once someone has confirmed that that's the issue and this one works.
@andrewyates, another import error I found is ImportError: cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'. I have tensorflow 2.3.0 installed, which only provides parameter_server_strategy in that module. I was wondering what the difference is between v2 and the base version? Should I upgrade or downgrade tensorflow in order to support that import?
I think this is also a version issue, and you need tensorflow 2.4.x. TensorFlow v1 has a completely different API, but I'm not sure whether CC uses that as the base version.
You can find the versions of all required packages in requirements.txt.
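One quick way to compare what's actually installed against requirements.txt, without importing tensorflow itself, is to read the package metadata. A diagnostic sketch, assuming Python 3.8+ for importlib.metadata (on 3.7 the importlib_metadata backport has the same API):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version_string):
    """Split '2.4.1' into (2, 4) for a coarse compatibility check."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


# Packages relevant to the errors discussed in this thread.
for pkg in ("tensorflow", "tensorflow-estimator", "transformers", "tokenizers"):
    print(pkg, "->", installed_version(pkg))

print(major_minor("2.4.1"))  # (2, 4)
```

Comparing these against the pins in requirements.txt should show quickly whether pip resolved something unexpected.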
Looks like I still encounter the version mismatch problem when running with config_parade_long-robust04_title.txt.
Error message:
File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 639, in __init__
    wordpieces_prefix=wordpieces_prefix,
TypeError: __init__() got an unexpected keyword argument 'vocab_file'
I think this is due to the wrong version of the transformers library. For some reason you don't seem to have the versions specified in requirements.txt. @crystina-z might be able to help if this is an issue with CC not installing the right ones.
@larryli1999 sorry for the inconvenience. I just tried the script on cedar, and those two errors should be solved by changing the tokenizers and tensorflow-estimator packages to the following versions:
tokenizers 0.8.1
tensorflow-estimator 2.3.0
@crystina-z, thanks for your response; the mismatch issue seems to be resolved. However, I have encountered a new issue:
Have you encountered this error before?
@crystina-z on cedar, I am able to train the model in 3.5 hours on a V100 GPU for 1 epoch. However, python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1 requires training for 4 epochs. Is the training time in the documentation for 1 epoch or for all 4 epochs?
@larryli1999 I guess it's for 4 epochs. Is your 3.5 hours solely the first epoch's training, or does it also include the data preprocessing time? IIRC CC is relatively slow in disk access; maybe that could be the reason?
The progress bar returned by tqdm shows about 3 hours of training time for the first epoch (256 iteration size). The data preprocessing I observed before training takes about 5 minutes at most. Do you think that slow disk access might also slow down the training speed? I tried training with 4 V100 GPUs, but the speed-up is not significant (it takes 2.5 hours for 1 epoch).
@larryli1999 In that case, no, I don't think it would affect the training speed. But the number of CPUs requested might? In my previous run logs, with 8 CPUs it took ~16 minutes to run 3k training steps on the MS MARCO dataset.
@crystina-z I think I know why the training takes so long. After I request a GPU using slurm, I can see the GPU details by running nvidia-smi. However, when I check the available GPU devices in tensorflow, it returns an empty list, which means training only takes place on the CPU. Is the installed tensorflow 2.3.0 CPU-only?
@larryli1999 I doubt it's because of the tensorflow version, though. I checked the package on my side:
>>> import tensorflow as tf
2021-05-13 10:25:48.738515: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__file__
'/home/czhang/miniconda3/envs/capreolus/lib/python3.7/site-packages/tensorflow/__init__.py'
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-05-13 11:34:09.519522: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-13 11:34:09.536950: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2095130000 Hz
2021-05-13 11:34:09.537372: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555937482620 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-13 11:34:09.537565: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-05-13 11:34:09.548074: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-05-13 11:34:09.708821: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555937492420 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-05-13 11:34:09.708880: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-05-13 11:34:09.710115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-13 11:34:09.710182: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-13 11:34:09.725928: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-05-13 11:34:09.746139: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-05-13 11:34:09.751804: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-05-13 11:34:09.785194: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-05-13 11:34:09.791210: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-05-13 11:34:09.821331: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-05-13 11:34:09.823868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-05-13 11:34:09.824048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-13 11:34:10.380221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-13 11:34:10.380410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-05-13 11:34:10.380508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-05-13 11:34:10.382599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 11121 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
True
Also, according to the docs, it seems the CPU and GPU tensorflow packages are separate only "for releases 1.15 and older".
I wonder if there are any other error or warning messages in the log file, though? Say, about loading CUDA, etc.
@crystina-z I think my CUDA is not loaded for some reason when I import tensorflow, compared to your output. However, I do have cudatoolkit 10.1.243 installed in the conda environment. I will try reinstalling CUDA to see if that fixes the problem.
Another interesting finding is that PyTorch can somehow detect the GPU.
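Before reinstalling, it may be worth checking whether the dynamic loader can see the CUDA libraries at all, since tensorflow dlopens them at import time (the "Successfully opened dynamic library" lines in the output above). A diagnostic sketch using only the standard library; the library names are taken from those logs:

```python
from ctypes.util import find_library

# TF 2.3/2.4 tries to load these shared libraries when it starts.
for lib in ("cuda", "cudart", "cublas", "cudnn"):
    path = find_library(lib)
    # find_library returns a soname/path if the loader can locate the
    # library, or None if it cannot. None here usually means the CUDA
    # toolkit directories are missing from LD_LIBRARY_PATH.
    print(lib, "->", path)
```

If these come back None inside the slurm job while nvidia-smi still works, the job environment likely isn't exporting the CUDA library paths.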
@crystina-z I noticed that I installed tensorflow 2.4.1 instead of 2.3.0 to resolve the ImportError: cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'. Therefore, the CUDA version does not match the tensorflow version, which results in the GPU not being detected. I was wondering which versions of tensorflow, CUDA, and cudnn you installed? And is there any other way to resolve that ImportError besides upgrading the tensorflow version?
@crystina-z, I was able to fix the GPU issue by installing tensorflow-gpu and updating cudnn. However, the metrics I got from running with config_parade_small.txt are quite a bit lower than the documentation:
INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 dev metrics: MRR@10=0.980 P_1=0.204 P_10=0.204 P_20=0.213 P_5=0.176 judged_10=0.986 judged_20=0.982 judged_200=0.930 map=0.115 ndcg_cut_10=0.187 ndcg_cut_20=0.203 ndcg_cut_5=0.164 recall_100=0.444 recall_1000=0.444 recip_rank=0.343
INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 test metrics: MRR@10=1.000 P_1=0.220 P_10=0.200 P_20=0.213 P_5=0.172 judged_10=0.972 judged_20=0.968 judged_200=0.918 map=0.132 ndcg_cut_10=0.182 ndcg_cut_20=0.206 ndcg_cut_5=0.163 recall_100=0.461 recall_1000=0.461 recip_rank=0.363
I also have some warning messages like WARNING - capreolus.evaluator._eval_runs - Queries mismatch in qrels and runs: Number of queries in qrels: 50; Number of queries in runs: 49; Number of overlap queries: 49. and WARNING - capreolus.evaluator._eval_runs - Queries mismatch in qrels and runs: Number of queries in qrels: 200; Number of queries in runs: 250; Number of overlap queries: 200.
Do you think those might be the reason for the lower metrics?
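For what it's worth, the warning itself only reports that the evaluator intersects the two query-id sets before scoring. A small sketch of that check (illustrative names and data, not capreolus internals):

```python
def describe_overlap(qrels, runs):
    """Mimic the evaluator's mismatch warning: compare the query id sets
    in the judgments (qrels) and in the run file (runs)."""
    qrels_qids = set(qrels)
    run_qids = set(runs)
    overlap = qrels_qids & run_qids
    return len(qrels_qids), len(run_qids), len(overlap)


# 50 judged queries but one query missing from the run file,
# matching the first warning above.
qrels = {str(qid): {"d1": 1} for qid in range(50)}
runs = {str(qid): {"d1": 0.5} for qid in range(49)}

print(describe_overlap(qrels, runs))  # (50, 49, 49)
```

If only the overlapping queries are scored, a query missing from the run is excluded rather than counted as zero, so the mismatch alone may not explain the lower averages.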
The PARADE training should achieve the expected scores now, as described here. Closing this issue for now. Feel free to reopen if it's still problematic :)