
SimXNS is a research project for information retrieval. This repo contains official implementations by MSRA NLC team.

Home Page: https://aka.ms/simxns

License: MIT License

Languages: Python 97.40%, Shell 2.41%, Jupyter Notebook 0.19%
Topics: dense-retrieval, information-retrieval, natural-language-processing, negative-sampling, knowledge-distillation, pretraining-for-ir, decision-time-planning


SimXNS

✨Updates | 📜Citation | 🤘Furthermore | ❤️Contributing | 📚Trademarks

SimXNS is a research project for information retrieval by MSRA NLC team. Some of the techniques are actively used in Microsoft Bing. This repo provides the official code implementations.

Currently, this repo contains several methods designed for or related to information retrieval. The brief descriptions below summarize the characteristics of each work:

  • SimANS is a simple, general, and flexible ambiguous-negative sampling method for dense text retrieval. It can be easily applied to various dense retrieval methods such as AR2, and it is also used in the Bing search engine, where it has proven effective.
  • MASTER is a multi-task pre-trained model that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture.
  • PROD is a novel distillation framework for dense retrieval, which consists of teacher progressive distillation and data progressive distillation to gradually improve the student.
  • CAPSTONE is a curriculum sampling strategy for dense retrieval with document expansion that bridges the gap between training and inference for the dual-cross-encoder.
  • ALLIES leverages LLMs to iteratively generate new queries related to the original query, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
  • LEAD aligns the layer features of the student and teacher, emphasizing the more informative layers through re-weighting.
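
As a rough, hypothetical sketch of the SimANS idea (not the repository's implementation; the function names and the density parameters `a` and `b` here are assumptions for illustration), ambiguous negatives can be drawn with probability peaked where a negative's retrieval score is close to the positive's:

```python
import math
import random

def simans_weights(pos_score, neg_scores, a=1.0, b=0.0):
    """Weight each negative by exp(-a * (s_neg - s_pos - b)^2), so negatives
    scored near the positive (the 'ambiguous' region) are weighted highest."""
    return [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]

def sample_ambiguous_negatives(pos_score, neg_scores, k, a=1.0, b=0.0):
    """Sample k negative indices according to the weights above."""
    weights = simans_weights(pos_score, neg_scores, a, b)
    return random.choices(range(len(neg_scores)), weights=weights, k=k)
```

The intuition: negatives scored far below the positive tend to be uninformative, while those scored above it risk being false negatives, so the sampling weight is largest in between.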

Updates

  • 2023/10/29: release the official code of CAPSTONE.
  • 2023/10/18: release the official code of ALLIES.
  • 2023/07/03: upload the pretrained MASTER checkpoints for MARCO and Wikipedia to huggingface model hub.
  • 2023/07/03: update approaches for downloading resources.
  • 2023/05/29: release the official code of LEAD.
  • 2023/02/16: refine the SimANS resources by uploading files separately and providing a file list.
  • 2023/02/02: release the official code of PROD.
  • 2022/12/16: release the official code of MASTER.
  • 2022/11/17: release the official code of SimANS.

Citation

If you extend or use this work, please cite the paper in which it was introduced:

  • SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen. EMNLP Industry Track 2022. Code, Paper.
  • MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers. Kun Zhou, Xiao Liu, Yeyun Gong, Wayne Xin Zhao, Daxin Jiang, Nan Duan, Ji-Rong Wen. ECML-PKDD 2023. Code, Paper.
  • PROD: Progressive Distillation for Dense Retrieval. Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan. WWW 2023. Code, Paper.
  • CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. Xingwei He, Yeyun Gong, A-Long Jin, Hang Zhang, Anlei Dong, Jian Jiao, Siu Ming Yiu, Nan Duan. EMNLP 2023. Code, Paper.
  • Allies: Prompting Large Language Model with Beam Search. Hao Sun, Xiao Liu, Yeyun Gong, Yan Zhang, Daxin Jiang, Linjun Yang, Nan Duan. Findings of EMNLP 2023. Code, Paper.
  • LEAD: Liberal Feature-based Distillation for Dense Retrieval. Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jian Jiao, Jingwen Lu, Yan Zhang, Daxin Jiang, Linjun Yang, Rangan Majumder, Nan Duan. WSDM 2024. Code, Paper.
@inproceedings{zhou2022simans,
   title     = {{SimANS:} Simple Ambiguous Negatives Sampling for Dense Text Retrieval},
   author    = {Kun Zhou and Yeyun Gong and Xiao Liu and Wayne Xin Zhao and Yelong Shen and Anlei Dong and Jingwen Lu and Rangan Majumder and Ji-Rong Wen and Nan Duan and Weizhu Chen},
   booktitle = {{EMNLP Industry Track}},
   year      = {2022}
}
@inproceedings{zhou2023master,
   title     = {{MASTER:} Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers},
   author    = {Kun Zhou and Xiao Liu and Yeyun Gong and Wayne Xin Zhao and Daxin Jiang and Nan Duan and Ji-Rong Wen},
   booktitle = {{ECML-PKDD}},
   year      = {2023}
}
@inproceedings{lin2023prod,
   title     = {{PROD:} Progressive Distillation for Dense Retrieval},
   author    = {Zhenghao Lin and Yeyun Gong and Xiao Liu and Hang Zhang and Chen Lin and Anlei Dong and Jian Jiao and Jingwen Lu and Daxin Jiang and Rangan Majumder and Nan Duan},
   booktitle = {{WWW}},
   year      = {2023}
}
@inproceedings{he2023capstone,
   title     = {{CAPSTONE:} Curriculum Sampling for Dense Retrieval with Document Expansion},
   author    = {Xingwei He and Yeyun Gong and A-Long Jin and Hang Zhang and Anlei Dong and Jian Jiao and Siu Ming Yiu and Nan Duan},
   booktitle = {{EMNLP}},
   year      = {2023}
}
@inproceedings{sun2023allies,
   title     = {{Allies:} Prompting Large Language Model with Beam Search},
   author    = {Hao Sun and Xiao Liu and Yeyun Gong and Yan Zhang and Daxin Jiang and Linjun Yang and Nan Duan},
   booktitle = {{Findings of EMNLP}},
   year      = {2023}
}
@inproceedings{sun2024lead,
   title     = {{LEAD:} Liberal Feature-based Distillation for Dense Retrieval},
   author    = {Hao Sun and Xiao Liu and Yeyun Gong and Anlei Dong and Jian Jiao and Jingwen Lu and Yan Zhang and Daxin Jiang and Linjun Yang and Rangan Majumder and Nan Duan},
   booktitle = {{WSDM}},
   year      = {2024}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

simxns's People

Contributors: lancelot39, lx865712528, nlpcode, sunhaonlp


simxns's Issues

qrels.train.addition.tsv not found

Traceback (most recent call last):
  File "co_training/co_training_marco_generate.py", line 401, in <module>
    main()
  File "co_training/co_training_marco_generate.py", line 392, in main
    get_new_dataset(args, model, global_step, renew_tools)
  File "co_training/co_training_marco_generate.py", line 354, in get_new_dataset
    renew_tools.get_question_topk(train_q, train_q_embed, train_q_embed2id, ground_truth_path,
  File "/SimXNS/SimANS/co_training/co_training_generate.py", line 439, in get_question_topk
    train_pos_qp, train_pos_qp_add = load_pos_examples(mode)
  File "/SimXNS/SimANS/co_training/co_training_generate.py", line 141, in load_pos_examples
    with open(file_add) as inp:
FileNotFoundError: [Errno 2] No such file or directory: 'data/MS-Pas/qrels.train.addition.tsv'

@lx865712528

SimANS key code

Why is there no definition of SimANS(pos_pair, neg_pairs_list) in the code?

RuntimeError: CUDA error: invalid device ordinal

torch.cuda.set_device(args.local_rank)
  File "/home/socialab/miniconda3/envs/IR_SimANS-env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

Question about plotting Figure 1

Hi, a quick question.

For the gradient mean in Fig. 1, is the statistic the L2 norm of the gradient vector of the overall loss with respect to the model?
For the gradient variance in Fig. 1, is the statistic the variance of the vector [L2 norm of the gradient of the positive document's loss, L2 norm of the gradient of each negative document's loss]?

If possible, could you provide simple plotting code? Thanks!

'ckpt/MS-Doc/adore-star'. If you were trying to load it from 'https://huggingface.co/models'

OSError: Can't load the configuration of 'ckpt/MS-Doc/adore-star'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'ckpt/MS-Doc/adore-star' is the correct path to a directory containing a config.json file
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18826) of binary: /home/socialab/miniconda3/envs/IR_SimANS-env/bin/python

Problem with Downloading Checkpoints

I am writing to inform you that I have been facing an issue with downloading the checkpoints for the past 4 days. The download stops when it reaches 6-7 GB in size, and I am not sure if others are facing the same problem.

Could you please look into this and let me know if there is a solution or if anyone else is encountering a similar issue?

Problem with MASTER checkpoints

Hello, this is really impressive work.
I've run into some problems with the checkpoint.

You said that you initialized the model with BERT, and I'm confused about which two layers you used to initialize the other heads, like c_head and query_head.
