microsoft / ance

A novel embedding training algorithm that leverages ANN search and achieves SOTA retrieval on the TREC DL 2019 and OpenQA benchmarks

License: MIT License

Shell 3.67% Python 92.44% Jupyter Notebook 3.88%

ance's Issues

Can't reproduce the performance of warmup(60k)

Hello, I used run_train_warmup.sh to train the warmup model, but my model does not reach the performance of your released checkpoint (the pretrained BM25 warmup checkpoint has an MRR@10 of 0.311), even after training for 300k steps (MRR@10 of 0.2979). All training hyperparameters are as given in run_train_warmup.sh. How can I resolve this?
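For reference when comparing the two numbers, MRR@10 averages, over queries, the reciprocal rank of the first relevant passage within the top 10 results, counting 0 when none appears. A minimal sketch, assuming hypothetical in-memory dictionaries `rankings` and `qrels` and averaging over the queries present in `rankings`; it is not the repository's own evaluation code:

```python
def mrr_at_k(rankings, qrels, k=10):
    """Mean reciprocal rank of the first relevant passage in the top k.

    rankings: dict of query id -> list of passage ids, best first (hypothetical layout).
    qrels: dict of query id -> set of relevant passage ids (hypothetical layout).
    """
    total = 0.0
    for qid, ranked_pids in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


if __name__ == "__main__":
    rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
    qrels = {"q1": {"p7"}, "q2": {"p5"}}
    # q1 contributes 1/2 (first relevant hit at rank 2), q2 contributes 0.
    print(mrr_at_k(rankings, qrels))
```

Differences in how unanswered queries are counted, or in the cutoff `k`, can shift the reported value, so it is worth checking that both numbers come from the same evaluation script.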

How long does inference take?

Hello developers,
I followed the guidelines in your README to generate the dense representations for MS MARCO Document Ranking, using the MaxP checkpoint that you provide. My process has been running for more than 80 hours on a server with a Tesla T4 GPU and an Intel Xeon Platinum CPU (looking at htop, I can see that it is running on a single thread). Is such a long inference time normal? Am I missing something that would speed up this process?

some confusion about msmarco_data.py

It seems that the offset files for dev_query and train_query are written to the same file, so the one written first will be overwritten?

ANCE/data/msmarco_data.py

Lines 77 to 80 in 936ec3e

    qid2offset_path = os.path.join(
        args.out_data_dir,
        "qid2offset.pickle",
    )

@microsoftopensource
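One way to avoid the collision described above would be to derive the pickle name from the split prefix. A minimal sketch, assuming a hypothetical helper `qid2offset_path_for` and the split prefixes `train-query` and `dev-query` used by the preprocessing calls; this is an illustration, not the repository's actual fix:

```python
import os
import pickle
import tempfile


def qid2offset_path_for(out_data_dir, split_prefix):
    # Hypothetical naming scheme: one pickle per split instead of a shared
    # "qid2offset.pickle".
    return os.path.join(out_data_dir, "{}_qid2offset.pickle".format(split_prefix))


def save_qid2offset(out_data_dir, split_prefix, qid2offset):
    # Write the mapping for one split and return the path that was used.
    path = qid2offset_path_for(out_data_dir, split_prefix)
    with open(path, "wb") as handle:
        pickle.dump(qid2offset, handle, protocol=4)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        train_path = save_qid2offset(out_dir, "train-query", {101: 0, 102: 1})
        dev_path = save_qid2offset(out_dir, "dev-query", {201: 0})
        # The two splits no longer share a file.
        print(train_path != dev_path)
```

Any code that later reads the mapping would need to use the same per-split name.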

Using testset to test NDCG while training

Hi,

I found that, for the TREC DL document task, the code in msmarco_data.py processes "msmarco-test2019-queries.tsv" as the dev-query file.

ANCE/data/msmarco_data.py

Line 190 in 936ec3e

if args.data_type == 0:

    if args.data_type == 0:
        write_query_rel(
            args,
            pid2offset,
            "msmarco-doctrain-queries.tsv",
            "msmarco-doctrain-qrels.tsv",
            "train-query",
            "train-qrel.tsv")
        write_query_rel(
            args,
            pid2offset,
            "msmarco-test2019-queries.tsv",
            "2019qrels-docs.txt",
            "dev-query",
            "dev-qrel.tsv")

If I want to reproduce your work, is it okay to use "msmarco-docdev-queries.tsv" as the dev set to select the best checkpoint?

Provide pretrained BM25 checkpoint for TREC Document

Hi,

This is great work. Many thanks for the code and data.

I'll be very grateful if you can also provide the pretrained BM25 checkpoint for TREC Document.

module 'transformers' has no attribute 'TFRobertaDot_NLL_LN'

Hi,
I ran into a problem when running run_train_warmup.sh. The following error appeared:

Traceback (most recent call last):
  File "../drivers/run_warmup.py", line 758, in <module>
    main()
  File "../drivers/run_warmup.py", line 733, in main
    config, tokenizer, model, configObj = load_stuff(
  File "../drivers/run_warmup.py", line 312, in load_stuff
    model = configObj.model_class.from_pretrained(
  File "/home/coseven/anaconda3/lib/python3.8/site-packages/transformers-2.3.0-py3.8.egg/transformers/modeling_utils.py", line 432, in from_pretrained
    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)
  File "/home/coseven/anaconda3/lib/python3.8/site-packages/transformers-2.3.0-py3.8.egg/transformers/modeling_tf_pytorch_utils.py", line 205, in load_tf2_checkpoint_in_pytorch_model
    tf_model_class = getattr(transformers, tf_model_class_name)
AttributeError: module 'transformers' has no attribute 'TFRobertaDot_NLL_LN'

My transformers version is 2.3.0, though.
Can you help me with this? I don't know what to do.
Looking forward to your reply.

data preprocess and inference

Hi,

I downloaded collectionandqueries.tar.gz, extracted the files to data/msmarco/, and ran

python data/msmarco_data.py --data_dir data/msmarco/ --out_data_dir data/msmarco_preprocessed --model_type rdot_nll --model_name_or_path roberta-base --max_seq_length 512 --data_type 1

The following was returned:

...
Process Process-55:
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 339, in tokenize_to_file
    with open(in_path, 'r', encoding='utf-8') if in_path[-2:] != "gz" else gzip.open(in_path, 'rt', encoding='utf8') as in_f,\
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco/queries.train.shuf.tsv'
start merging splits
Traceback (most recent call last):
  File "data/msmarco_data.py", line 436, in <module>
    main()
  File "data/msmarco_data.py", line 432, in main
    preprocess(args)
  File "data/msmarco_data.py", line 212, in preprocess
    "train-qrel.tsv")
  File "data/msmarco_data.py", line 66, in write_query_rel
    out_query_path, 32, 8 + 4 + args.max_query_length * 4):
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 246, in numbered_byte_file_generator
    with open('{}_split{}'.format(base_path, i), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco_preprocessed/train-query_split0'

Then I downloaded the passage_ance_firstP_checkpoint, changed the path in run_ann_data_gen.sh, and ran

sh run_ann_data_gen.sh

and got this:

07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 21:56:35 - INFO - __main__ -   starting output number 0
07/10/2020 21:56:35 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/msmarco/OSPass512/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data/', '--cache_dir', '../data/msmarco/OSPass512/ann_data/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20']' returned non-zero exit status 1.

Then I moved qrels.train.tsv from data/msmarco/ to the preprocessed folder and renamed it to train-qrel.tsv, but that didn't help. Finally, I tried

sh run_inference.sh

and got this:

07/10/2020 22:07:51 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 22:07:51 - INFO - __main__ -   starting output number 0
07/10/2020 22:07:51 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data_inf/', '--cache_dir', '../data/msmarco/OSPass512/ann_data_inf/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20', '--end_output_num', '0', '--inference']' returned non-zero exit status 1.

Could you help me with this? Thank you:)
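The paths in the error messages above begin with a literal `$` (`$../data/msmarco_preprocessed/`), which looks like a leftover from a shell variable reference in the edited script. A small pre-flight check can surface this kind of problem before the distributed job starts. This is only a sketch: `check_data_dir` is a hypothetical helper, and `train-qrel.tsv` is simply the file named in the traceback.

```python
import os
import tempfile


def check_data_dir(data_dir, required_files=("train-qrel.tsv",)):
    """Return a list of human-readable problems found with data_dir (empty if none)."""
    problems = []
    if "$" in data_dir:
        # A literal '$' usually means a shell variable was not expanded.
        problems.append("path contains a literal '$': {!r}".format(data_dir))
    if not os.path.isdir(data_dir):
        problems.append("directory does not exist: {!r}".format(data_dir))
        return problems
    for name in required_files:
        if not os.path.isfile(os.path.join(data_dir, name)):
            problems.append("missing file: {!r}".format(name))
    return problems


if __name__ == "__main__":
    for problem in check_data_dir("$../data/msmarco_preprocessed/"):
        print(problem)
```

Running a check like this on the value passed as `--data_dir` would separate a malformed path from a preprocessing step that never produced its output.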

where is bm25 introduced?

Hi,

For the warm-up step, I see a regular dense retrieval model being trained on the triples.small data provided by MS MARCO.

But I can't find any code that introduces a BM25 index or BM25 sampling.
I guess you are treating the negatives in the triples.small data as BM25 negatives already?

What does BM25 warm-up mean? How is it introduced?

Thanks
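For context, each line of the MS MARCO training triples file holds a query, a positive passage, and a negative passage separated by tabs, so the negatives arrive with the data rather than being sampled by the training code. A minimal sketch of reading that layout, assuming a hypothetical `read_triples` helper and an in-memory sample instead of the real file:

```python
import io


def read_triples(lines):
    """Yield (query, positive_passage, negative_passage) from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError("expected 3 tab-separated fields, got {}".format(len(parts)))
        yield tuple(parts)


if __name__ == "__main__":
    # Made-up sample line; the real file is far larger.
    sample = io.StringIO("what is ance\ta relevant passage\ta non-relevant passage\n")
    for query, positive, negative in read_triples(sample):
        print(query, "|", positive, "|", negative)
```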

ANCE encoders

Hello, I have a question about the BERT encoders. In the paper, it is said that "ANCE can be used to train any dense retrieval model. For simplicity, we use a simple set up in recent research (Luan et al., 2020) with BERT Siamese/Dual Encoder (shared between q and d), dot product similarity, and negative log likelihood (NLL) loss." So actually, only one encoder is used to encode queries and documents separately. However, in the "model.py", the "BiEncoder" is as follows:

class BiEncoder(nn.Module):
    """ Bi-Encoder model component. Encapsulates query/question and context/passage encoders.
    """
    def __init__(self, args):
        super(BiEncoder, self).__init__()
        self.question_model = HFBertEncoder.init_encoder(args)
        self.ctx_model = HFBertEncoder.init_encoder(args)

Two encoders are defined here.
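To make the distinction concrete, the sketch below contrasts a shared (Siamese) setup with two independent encoders using a toy stand-in class. `ToyEncoder` and `ToyBiEncoder` are hypothetical and say nothing about how the repository's `HFBertEncoder` or `BiEncoder` actually behave.

```python
class ToyEncoder:
    """Stand-in for a text encoder: one weight per vocabulary word (hypothetical)."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def encode(self, text):
        # Sum the weights of known words; unknown words contribute 0.
        return sum(self.weights.get(token, 0.0) for token in text.split())


class ToyBiEncoder:
    def __init__(self, weights, shared):
        self.question_model = ToyEncoder(weights)
        # Siamese: reuse the same object; otherwise build an independent copy.
        self.ctx_model = self.question_model if shared else ToyEncoder(weights)


if __name__ == "__main__":
    siamese = ToyBiEncoder({"ance": 1.0}, shared=True)
    separate = ToyBiEncoder({"ance": 1.0}, shared=False)
    siamese.question_model.weights["ance"] = 2.0   # update is seen by both sides
    separate.question_model.weights["ance"] = 2.0  # update is seen by the question side only
    print(siamese.ctx_model.encode("ance"), separate.ctx_model.encode("ance"))
```

With a shared encoder, an update made through the query side also changes the document side; with two separate objects the two sides can drift apart during training.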

Cannot download ANCE(FirstP) checkpoint

Hey, the link provided for ANCE(FirstP) checkpoint is not working.
https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip

Kindly provide an alternative link if one is available. Thanks.

Download link not working

Hi, the TREC eval set query embeddings and their ids linked in the README cannot be downloaded. Can you re-share them? Thank you!

This XML file does not appear to have any style information associated with it. The document tree is shown below.

ResourceNotFound
The specified resource does not exist. RequestId:7ad0d3db-f01e-000b-3423-611dbb000000 Time:2023-03-28T03:14:57.8906883Z

PYTHONPATH should be set explicitly

Hello,

I get this error when I follow the instructions to run the code:

  File "drivers/run_warmup.py", line 14, in <module>
    from utils.eval_mrr import passage_dist_eval
ModuleNotFoundError: No module named 'utils'

Users need to run this command first to eliminate that error:

export PYTHONPATH=${PYTHONPATH}:`pwd`

I suggest adding that command explicitly in README.md.
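A similar effect can be obtained from inside Python by adding the repository root to `sys.path` before the project imports run. A minimal sketch, assuming a hypothetical `add_repo_root` helper that is given the directory containing `utils/`:

```python
import os
import sys


def add_repo_root(repo_root):
    """Append repo_root to sys.path (if missing) so `import utils...` can resolve."""
    repo_root = os.path.abspath(repo_root)
    if repo_root not in sys.path:
        sys.path.append(repo_root)
    return repo_root


if __name__ == "__main__":
    # Roughly mirrors the export command above when run from the repository root.
    add_repo_root(os.getcwd())
```

Unlike the environment variable, this only affects the current interpreter, so each entry-point script would need the call before its first project import.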

Issue during downloading "roberta-base-config.json"

Hey,

I have a problem with running run_warmup.py.

`During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../drivers/run_warmup.py", line 756, in <module>
    main()
  File "../drivers/run_warmup.py", line 732, in main
    args.train_model_type, args)
  File "../drivers/run_warmup.py", line 304, in load_stuff
    cache_dir=args.cache_dir if args.cache_dir else None,
  File "/scratch/h2amer/ahamsala/torch_DPR/lib/python3.6/site-packages/transformers/configuration_utils.py", line 176, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/scratch/h2amer/ahamsala/torch_DPR/lib/python3.6/site-packages/transformers/configuration_utils.py", line 243, in get_config_dict
    raise EnvironmentError(msg)
OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json' to download pretrained model configuration file.`

CUDA nccl library issue

Hello,

I cloned this repository because I am interested in running the run_inference.sh script. I followed the steps listed in the README. However, when I ran run_inference.sh, I got the following error:

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

My system has NCCL v2.7.8 correctly installed with the corresponding CUDA toolkit.

What am I missing here?

Thanks in advance for the help.

best,

Franco Maria

DPR checkpoint on MSMARCO?

Hello,

I recently read your wonderful ANCE paper. To the best of my knowledge, this is the only paper which included results of DPR trained on MS MARCO passage retrieval dataset. But I can only find your ANCE checkpoint in the repo.

Would you mind sharing the DPR checkpoint as well? Really appreciate your help!
