Comments (22)

crystina-z commented on June 24, 2024

The error probably occurred while pip was searching for compatible spacy or scikit-learn versions and failed on one of the candidates. I assume that after these error lines it found a version it could install, which is why the final installation looks correct. I'm not sure whether it's normal for pip to try every version and print the errors, though.

from capreolus.

crystina-z commented on June 24, 2024

@larryli1999 oh, that line alone could probably be fixed by changing benchmark.qrels[qid] to benchmark.qrels.get(qid, {}). Sorry, I haven't tested the other benchmarks on this branch.

I'll run python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1 on my end in case there are any other issues, and will update you once I've confirmed.
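
To illustrate the suggested change: indexing a dict with a missing key raises KeyError, while .get with a default returns an empty mapping. This is only a minimal sketch. It assumes qrels behaves like a plain dict mapping query ids to {doc id: label} dicts; the helper name and sample ids are made up, and it is not the actual Capreolus code.

```python
# Hypothetical qrels: query id -> {doc id: relevance label}.
qrels = {"301": {"FBIS3-10082": 1, "FBIS3-10169": 0}}


def judged_docs(qrels, qid):
    """Return the judgments for qid, or an empty dict if qid has none."""
    # qrels[qid] would raise KeyError for an unjudged query; .get does not.
    return qrels.get(qid, {})


print(judged_docs(qrels, "301"))  # judgments for a query that has qrels
print(judged_docs(qrels, "999"))  # empty dict instead of a KeyError
```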

crystina-z commented on June 24, 2024

@larryli1999 hey, sorry for seeing this late. I tried the command and encountered the same error. The warning shouldn't be the issue, though, if it shows up at an early stage (before training?).
I'm checking the possible causes now.

larryli1999 commented on June 24, 2024

However, the import error does not seem to affect the final installation of Capreolus. Running the set-up check returns the following:

[Screenshot: Capture5]

larryli1999 commented on June 24, 2024

Hi @crystina-z, I also ran into some issues when I tried to reproduce the results from "Reranking robust04 with PARADE". The error message is:
cdr2602% python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1
Traceback (most recent call last):
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/larry/capreolus/capreolus/run.py", line 96, in <module>
    task, task_entry_function = prepare_task(arguments["COMMAND"], config)
  File "/scratch/larry/capreolus/capreolus/run.py", line 34, in prepare_task
    task = Task.create(taskstr, config)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 274, in __init__
    self._instantiate_dependencies(self.config, provide, share_dependency_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 315, in _instantiate_dependencies
    dependency_name, dependency_config, provide=provide, share_objects=share_objects
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 222, in create
    module_obj = module_cls(config, provide, share_dependency_objects=share_objects)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/profane/base.py", line 279, in __init__
    self.build()
  File "/scratch/larry/capreolus/capreolus/tokenizer/bert.py", line 15, in build
    self.bert_tokenizer = AutoTokenizer.from_pretrained(self.config["pretrained"], use_fast=True)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 218, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1425, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1572, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 639, in __init__
    wordpieces_prefix=wordpieces_prefix,
TypeError: __init__() got an unexpected keyword argument 'vocab_file'

Do you think this is caused by a version mismatch between transformers and tokenizers?

andrewyates commented on June 24, 2024

Yeah, this looks like a version mismatch to me. The transformers API changes often.

Also, I think it's better to use this config: https://github.com/capreolus-ir/capreolus/blob/master/docs/reproduction/config_parade_long-robust04_title.txt
We've been noticing a lot of variance with the small one, so I don't think trying to reproduce it is very useful. I plan on updating the repro docs to remove the small config once someone has confirmed that's the issue and this one works.

larryli1999 commented on June 24, 2024

@andrewyates, another import error I found is ImportError: cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'. I have tensorflow 2.3.0 installed, which only provides parameter_server_strategy. What is the difference between v2 and the base version? Should I upgrade or downgrade tensorflow to support that import?

andrewyates commented on June 24, 2024

I think this is also a version issue and you need tensorflow 2.4.x. Tensorflow v1 has a completely different API, but I'm not sure if CC uses that as the base version.

You can find versions for all required packages in requirements.txt.

larryli1999 commented on June 24, 2024

Looks like I still encounter the version mismatch problem when running with config_parade_long-robust04_title.txt.
Error message:
  File "/home/larry/anaconda3/envs/capreolus/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 639, in __init__
    wordpieces_prefix=wordpieces_prefix,
TypeError: __init__() got an unexpected keyword argument 'vocab_file'

andrewyates commented on June 24, 2024

I think this is due to the wrong version of the transformers library. For some reason you don't seem to have the versions specified in requirements.txt. @crystina-z might be able to help if this is some issue with CC not installing the right ones

crystina-z commented on June 24, 2024

@larryli1999 sorry for the inconvenience. I just tried the script on cedar, and those two errors should be solved by changing the tokenizers and tensorflow-estimator packages to the following versions:

tokenizers                    0.8.1
tensorflow-estimator          2.3.0
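
One way to confirm that an environment actually has these pins is to compare them against the installed package metadata. The following is a rough sketch using the standard library's importlib.metadata (Python 3.8+; on the Python 3.7 environment in this thread the importlib_metadata backport would be needed). The PINNED dict and the check_pins helper are illustrative names, not part of Capreolus.

```python
from importlib import metadata  # Python 3.8+; importlib_metadata backport on 3.7

# Pins taken from the comment above.
PINNED = {"tokenizers": "0.8.1", "tensorflow-estimator": "2.3.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (installed_or_None, wanted)} for each package that does not match its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


for name, (installed, wanted) in check_pins(PINNED).items():
    print(f"{name}: installed={installed}, expected={wanted}")
```

An empty result means both packages match the versions listed above. The same helper could be fed the pins from requirements.txt.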

larryli1999 commented on June 24, 2024

@crystina-z, thanks for your response; the mismatch issue seems to be resolved. However, I have encountered a new issue:
[Screenshot: Capture6]

Have you encountered this error before?

larryli1999 commented on June 24, 2024

@crystina-z on cedar, I am able to train the model for 1 epoch in 3.5 hours with a V100 GPU. However, python -m capreolus.run rerank.traineval with file=docs/reproduction/config_parade_small.txt fold=s1 trains for 4 epochs. Is the training time in the documentation for 1 epoch or for all 4 epochs?

crystina-z commented on June 24, 2024

@larryli1999 I guess it's for 4 epochs. Is your 3.5 hours solely for the first training epoch, or does it also include the data preprocessing time? iirc disk access on CC is relatively slow, so maybe that could be the reason?

larryli1999 commented on June 24, 2024

The tqdm progress bar reports about 3 hours of training time for the first epoch (256 iteration size). The data preprocessing I observed before training takes about 5 minutes at most. Do you think slow disk access might also slow down training? I tried training with 4 V100 GPUs, but the speed-up is not significant (2.5 hours for 1 epoch).

Something like this:
[Screenshot: Capture7]
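
For a sense of scale, the figures above can be turned into a speed-up and a parallel efficiency. This is just arithmetic on the reported numbers (about 3 hours per epoch on one GPU, 2.5 hours on four); the function name is illustrative.

```python
def speedup_and_efficiency(single_gpu_hours, multi_gpu_hours, num_gpus):
    """Return (speed-up, parallel efficiency) of a multi-GPU run relative to one GPU."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup, speedup / num_gpus


# Figures reported above: ~3 h per epoch on 1 GPU, 2.5 h per epoch on 4 GPUs.
speedup, efficiency = speedup_and_efficiency(3.0, 2.5, 4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 1.20x, efficiency: 30%"
```

Ideal scaling on four GPUs would be close to 4x, so a 1.2x speed-up hints that something other than GPU compute may be limiting the run.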

crystina-z commented on June 24, 2024

@larryli1999 In that case, no, I don't think it would have an effect. But the number of CPUs requested might? In a previous run log, I found that with 8 CPUs it takes ~16 minutes to run 3k training steps on the MS MARCO dataset.

larryli1999 commented on June 24, 2024

@crystina-z I think I know why the training takes so long. After I request a GPU through slurm, I can see the GPU details by running nvidia-smi. However, when I check the available GPU devices in tensorflow, it returns an empty list, which means training only takes place on the CPU. Is the installed tensorflow 2.3.0 CPU-only?
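
The check described here can be written as a small script. This is a sketch that assumes TensorFlow 2.x, where tf.config.list_physical_devices is available; the visible_gpus helper is an illustrative name, and the import guard is only there so the script also runs where TensorFlow is missing.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow sees no GPU; training would run on the CPU")
else:
    print("TensorFlow sees:", gpus)
```

If nvidia-smi lists a device but this prints an empty result, the usual suspects are the CUDA/cuDNN libraries TensorFlow tries to load rather than the driver itself.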

crystina-z commented on June 24, 2024

@larryli1999 I doubt it's because of the tensorflow version, though. I checked the package on my side:

>>> import tensorflow as tf
2021-05-13 10:25:48.738515: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__file__
'/home/czhang/miniconda3/envs/capreolus/lib/python3.7/site-packages/tensorflow/__init__.py'
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-05-13 11:34:09.519522: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-13 11:34:09.536950: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2095130000 Hz
2021-05-13 11:34:09.537372: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555937482620 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-13 11:34:09.537565: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-05-13 11:34:09.548074: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-05-13 11:34:09.708821: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555937492420 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-05-13 11:34:09.708880: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-05-13 11:34:09.710115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-13 11:34:09.710182: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-13 11:34:09.725928: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-05-13 11:34:09.746139: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-05-13 11:34:09.751804: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-05-13 11:34:09.785194: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-05-13 11:34:09.791210: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-05-13 11:34:09.821331: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-05-13 11:34:09.823868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-05-13 11:34:09.824048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-13 11:34:10.380221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-13 11:34:10.380410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-05-13 11:34:10.380508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-05-13 11:34:10.382599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 11121 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
True

Also, according to the doc, it seems the CPU and GPU tensorflow packages are only separate "for releases 1.15 and older".

I wonder if there are other error or warning messages in the log file, though? Say, about loading CUDA, etc.
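
One way to compare a log against the working output above is to collect the "Successfully opened dynamic library" lines and see which libraries never show up. This is only a sketch: the expected library names are copied from the log above (a TensorFlow 2.3 / CUDA 10.1 setup) and would differ for other versions, and missing_cuda_libs is an illustrative helper name.

```python
import re

# Libraries that appear in "Successfully opened dynamic library" lines of the log above.
EXPECTED_LIBS = {
    "libcudart.so.10.1", "libcuda.so.1", "libcublas.so.10", "libcufft.so.10",
    "libcurand.so.10", "libcusolver.so.10", "libcusparse.so.10", "libcudnn.so.7",
}

OPENED = re.compile(r"Successfully opened dynamic library (\S+)")


def missing_cuda_libs(log_text, expected=EXPECTED_LIBS):
    """Return the expected libraries that have no 'Successfully opened' line in log_text."""
    opened = set(OPENED.findall(log_text))
    return sorted(set(expected) - opened)


# Two sample lines in the same format as the log above.
sample = (
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:48] "
    "Successfully opened dynamic library libcudart.so.10.1\n"
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:48] "
    "Successfully opened dynamic library libcuda.so.1\n"
)
print(missing_cuda_libs(sample))
```

Any name in the returned list is a library that the working environment loaded but the inspected log does not mention, which narrows down where to look.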

larryli1999 commented on June 24, 2024

@crystina-z Compared to your output, I think CUDA is not being loaded for some reason when I import tensorflow. However, I do have cudatoolkit 10.1.243 installed in the conda environment. I will try reinstalling CUDA to see if that fixes the problem.

Another interesting finding is that PyTorch can somehow detect the GPU.

larryli1999 commented on June 24, 2024

@crystina-z I noticed that I had installed tensorflow 2.4.1 instead of 2.3.0 to resolve the ImportError: cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'. As a result, the CUDA version does not match the tensorflow version, which is why the GPU is not detected. Which versions of tensorflow, CUDA, and cuDNN did you install? And is there any way to resolve that ImportError other than upgrading tensorflow?

larryli1999 commented on June 24, 2024

@crystina-z, I was able to fix the GPU issue by installing tensorflow-gpu and updating cuDNN. However, the metrics I got from running with config_parade_small.txt are considerably lower than those in the documentation:
INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 dev metrics: MRR@10=0.980 P_1=0.204 P_10=0.204 P_20=0.213 P_5=0.176 judged_10=0.986 judged_20=0.982 judged_200=0.930 map=0.115 ndcg_cut_10=0.187 ndcg_cut_20=0.203 ndcg_cut_5=0.164 recall_100=0.444 recall_1000=0.444 recip_rank=0.343

INFO - capreolus.task.rerank.evaluate - rerank: fold=s1 test metrics: MRR@10=1.000 P_1=0.220 P_10=0.200 P_20=0.213 P_5=0.172 judged_10=0.972 judged_20=0.968 judged_200=0.918 map=0.132 ndcg_cut_10=0.182 ndcg_cut_20=0.206 ndcg_cut_5=0.163 recall_100=0.461 recall_1000=0.461 recip_rank=0.363

I also get warning messages like WARNING - capreolus.evaluator._eval_runs - Queries mismatch in qrels and runs: Number of queries in qrels: 50; Number of queries in runs: 49; Number of overlap queries: 49. and WARNING - capreolus.evaluator._eval_runs - Queries mismatch in qrels and runs: Number of queries in qrels: 200; Number of queries in runs: 250; Number of overlap queries: 200.

Do you think those might be the reason for the lower metrics?
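
The counts in those warnings correspond to a set comparison between the query ids in the qrels and those in the runs. The sketch below shows that comparison on toy data. It assumes both are dicts keyed by query id, the query_overlap helper is a made-up name, and it is not the actual capreolus.evaluator code.

```python
def query_overlap(qrels, runs):
    """Count the queries in qrels, in runs, and in both (both dicts are keyed by query id)."""
    qrel_ids, run_ids = set(qrels), set(runs)
    return {
        "qrels": len(qrel_ids),
        "runs": len(run_ids),
        "overlap": len(qrel_ids & run_ids),
        "missing_from_runs": sorted(qrel_ids - run_ids),
    }


# Toy data: three judged queries, one of which has no ranking in the run.
qrels = {"q1": {"d1": 1}, "q2": {"d2": 0}, "q3": {"d3": 1}}
runs = {"q1": {"d1": 2.5}, "q2": {"d2": 1.1}}
print(query_overlap(qrels, runs))
```

Listing the ids under missing_from_runs shows which judged queries received no ranking, which is the first thing to inspect when the overlap is smaller than the number of queries in the qrels.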

crystina-z commented on June 24, 2024

The PARADE training should now achieve the expected scores, as shown here. Closing this issue for now. Feel free to reopen if it's still problematic :)
