taste's Introduction

OpenMatch v2

An all-in-one toolkit for information retrieval. Under active development.

Install

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

-e means editable, i.e. you can change the code directly in your directory.

We do not include all the requirements in the package. You may need to install torch and tensorboard manually.

You may also need faiss for dense retrieval. Install either faiss-cpu or faiss-gpu, depending on your environment. To search on GPUs, you need a faiss-gpu build that is compatible with your CUDA version. In some cases (usually CUDA >= 11.0), pip installs the wrong version. If you encounter errors during GPU search, try installing faiss from conda.
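Since these packages are not pulled in by `pip install -e .`, a quick check of what is importable can save a failed run later. The following is a minimal sketch, not part of OpenMatch: the helper name `find_missing` is made up for illustration, and it only tests whether each package can be found, not whether its version matches your CUDA setup.

```python
import importlib.util

# Packages the README says are not installed automatically.
# faiss-cpu and faiss-gpu are both imported under the name "faiss".
OPTIONAL_PACKAGES = ("torch", "tensorboard", "faiss")


def find_missing(packages=OPTIONAL_PACKAGES):
    """Return the names in `packages` that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing optional packages:", ", ".join(missing))
    else:
        print("torch, tensorboard and faiss are all importable.")
```

An empty result only means the packages are present; GPU search can still fail if the installed faiss build does not match the local CUDA version.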

Features

  • Human-friendly interface for dense retriever and re-ranker training and testing
  • Various PLMs supported (BERT, RoBERTa, T5...)
  • Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
  • Deep integration with Huggingface Transformers and Datasets
  • Efficient training and inference via stream-style data loading

Docs

We are actively working on the docs.

Project Organizers

  • Zhiyuan Liu
  • Zhenghao Liu
  • Chenyan Xiong
  • Maosong Sun

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contact

Please email [email protected].

taste's People

Contributors

mssssss123, pab1x

taste's Issues

Pretrained weights

Hello,

Is it possible for me to use some other pretrained model in place of the T5 model to generate the embeddings? Or is the code written in such a way that the pretrained model has to be the T5 model?

Thank you!

A bug I hit when reproducing your work

When I try to reproduce your work, I hit a bug in transformers' trainer.py.
(The error was attached as a screenshot, which is not reproduced here.)
The error means that 'inputs' should be a dict, but it is a list at runtime.
I traced the source of the inputs and found that they come from eval_dataloader.
I have no idea what causes the bug. Maybe the transformers version?
Can you suggest how to solve it?
Thank you very much.

LoRA with OpenDelta

Hi,

I am interested in applying LoRA to this model architecture. I see OpenDelta mentioned in the code, and I am able to pass a parameter-efficient method as an argument in the train.sh scripts. However, OpenDelta appears to have trouble adding LoRA to the custom TASTEModel class. Have you had any success adding LoRA to TASTE?

Thank you!

Error when running model on beauty dataset

Hello,

I'm trying to run the model on the beauty dataset (using the configurations given in train_beauty.sh in the reproduce folder) but I seem to be getting the following error:

Traceback (most recent call last):
  File "C:\Users\anvasudev\Desktop\service contracts\OpenMatch\OpenMatch\src\TASTE\train.py", line 129, in
    main()
  File "C:\Users\anvasudev\Desktop\service contracts\OpenMatch\OpenMatch\src\TASTE\train.py", line 121, in main
    trainer.train()
  File "C:\Users\anvasudev\Anaconda3\lib\site-packages\transformers\trainer.py", line 1521, in train
    return inner_training_loop(
  File "C:\Users\anvasudev\Anaconda3\lib\site-packages\transformers\trainer.py", line 1837, in _inner_training_loop
    self.state.epoch = epoch + (step + 1) / steps_in_epoch
ZeroDivisionError: division by zero

Judging by the training log below, I think this happens because the number of training examples is 0, but I'm not sure why, since the data appears to be loaded and parsed correctly:

***** Running training *****
  Num examples = 0
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 30

Do you know why this might be happening?
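The traceback divides by `steps_in_epoch`, and with `Num examples = 0` there are no batches in an epoch, so that divisor is zero. A small pre-flight check on the training file can surface the empty dataset before the trainer starts. The sketch below rests on assumptions that are not confirmed by the repository: that the training file is plain text with one example per line and an optional header line, and that the last partial batch is kept. All function names are made up for illustration.

```python
import math
from pathlib import Path


def count_examples(path, has_header=True):
    """Count non-empty data lines in a text file (assumed one example per line)."""
    with Path(path).open(encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    return max(len(lines) - (1 if has_header else 0), 0)


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def check_train_file(path, batch_size=8, has_header=True):
    """Raise a readable error instead of a ZeroDivisionError deep in training."""
    n = count_examples(path, has_header)
    if steps_per_epoch(n, batch_size) == 0:
        raise ValueError(f"{path} contains no training examples")
    return n
```

If the file itself is non-empty but the trainer still reports zero examples, the loss is happening later, for example in the path passed to the script or in the dataset's own filtering, which this check cannot see.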

Data preprocessing

Hello,

I want to run the model on one of my own datasets. What code could I use to preprocess it into the appropriate format? I tried referring to gen_dataset_example.py, but I'm not sure what arguments I should pass to the run_recbole function.

Any help would be appreciated!

a running error

Hi, when I try to reproduce the results of TASTE following the instructions in the README, an error occurs after a few training steps:

  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 2968, in evaluate
    output = eval_loop(
  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 3157, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 3333, in prediction_step
    return_loss = inputs.get("return_loss", None)
AttributeError: 'list' object has no attribute 'get'

I use train_toys.sh from reproduce/train/toys/ without additional modifications, and this error also occurs when running other datasets. It seems "inputs" has the wrong type (list) in some cases. Can you help me?
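The last frame of the traceback calls `inputs.get(...)`, which works on a dict but not on a list, so the evaluation batches are evidently reaching `prediction_step` as a list. One possible workaround, sketched below, is to wrap the collate function so that it returns a dict. This is not the authors' fix: the wrapper `dict_collator`, the stand-in `list_collator`, and the key names `query` and `passage` are all hypothetical, and the model's forward method would have to accept whatever keys are chosen.

```python
def dict_collator(collate_fn, keys):
    """Wrap `collate_fn` so that list/tuple batches come back as a dict.

    `keys` names the positional fields; they are placeholders here and would
    have to match the keyword arguments of the model's forward().
    """
    def wrapped(features):
        batch = collate_fn(features)
        if isinstance(batch, dict):
            return batch  # already dict-like, nothing to do
        if len(batch) != len(keys):
            raise ValueError(f"expected {len(keys)} fields, got {len(batch)}")
        return dict(zip(keys, batch))
    return wrapped


# Stand-in for a collator that returns a list of fields (hypothetical).
def list_collator(features):
    queries = [f[0] for f in features]
    passages = [f[1] for f in features]
    return [queries, passages]


collate = dict_collator(list_collator, keys=("query", "passage"))
batch = collate([("q1", "p1"), ("q2", "p2")])
# A dict supports .get(); calling .get() on the original list would raise
# the AttributeError shown in the traceback.
print(batch.get("return_loss", None))
```

Another route is to install the transformers version the repository was developed against, if the project states one, since the failing line may not exist in that version.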

Data clarification

Hello,
For the train.txt, valid.txt, and test.txt files for a dataset, does the 'target' column contain the ID of the most recent item purchased by a user?

Also, what is the item IDs JSON file supposed to contain? For the beauty dataset this file is named "item_name.jsonl". Is there any code to create this file from my own custom dataset?

Dataset processing problem

Hi,
I am trying to reproduce the data files you provided on Google Drive (train.txt, valid.txt, etc.) from scratch, starting from the .inter and .item files. For the Amazon Toys and Games dataset, I followed your instructions: I went to the DIF-SR project, swapped in the run_recbole function you provided in gen_dataset_example.py, and then ran DIF-SR.sh. I hit an error when converting the category id (cid) around line 90 of this file:

 File "/DIF-SR/recbole/quick_start/quick_start.py", line 91, in run_recbole
      cid = int(cid)

TypeError: only size-1 arrays can be converted to Python scalars

I tried removing this integer conversion line. However, the output of the item.txt is a bit different than the one you have in your google drive. The one I generated has a 0th item and a lot of padding everywhere:

Item.txt from Google Drive:

item_id	item_name	categories	brand	price	sales_rank
1	Little Red Tool Box: Magnetic Tabletop Learning Easel	'Toys & Games', 'Pretend Play', 'Dress Up & Pretend Play', 'Magnet & Felt Playboards'	Scholastic	16.19	369185
2	Dover Publications-Decorative Tile Designs Coloring Book	'Toys & Games', 'Arts & Crafts', 'Drawing & Sketch Pads', 'Drawing & Painting Supplies'	Dover Pubns	1.44	1009

My item.txt:

item_id	item_name	categories	brand	price	sales_rank
0	[PAD]	['[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]']	[PAD]	30.08	369185
1	Little Red Tool Box: Magnetic Tabletop Learning Easel	["'Toys" '&' "Games'," "'Pretend" "Play'," "'Dress" 'Up' '&' 'Pretend'
 "Play'," "'Magnet" '&' 'Felt' "Playboards'" '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]']	Scholastic	16.19	369185
2	Dover Publications-Decorative Tile Designs Coloring Book	["'Toys" '&' "Games'," "'Arts" '&' "Crafts'," "'Drawing" '&' 'Sketch'
 "Pads'," "'Drawing" '&' 'Painting' "Supplies'" '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]']	Dover Pubns	1.44	1009

Is there some step that I potentially overlooked? I suspect I did not handle the cid = int(cid) line correctly.

I would super appreciate your input
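The `TypeError` is what NumPy raises when `int()` is applied to an array holding more than one element, which suggests that `cid` here is a padded sequence of category tokens rather than a single id. The generated item.txt above is consistent with that reading: each category field is a token array padded with `[PAD]`. As a minimal sketch of a post-processing step, assuming the tokens are available as a list of strings, dropping the padding and joining on spaces gives back a string in the same shape as the Google Drive file. The helper names are made up for illustration, and this is not confirmed to be how the original files were produced.

```python
PAD = "[PAD]"


def join_tokens(tokens, pad=PAD):
    """Drop padding tokens and join the rest back into one string."""
    return " ".join(t for t in tokens if t != pad)


def is_padding_row(item_name, pad=PAD):
    """True for the placeholder row (item_id 0 in the output above)."""
    return item_name == pad


# Category tokens for item 1, copied from the generated item.txt above
# (padding shortened to three entries for readability).
tokens = ["'Toys", "&", "Games',", "'Pretend", "Play',", "'Dress", "Up", "&",
          "Pretend", "Play',", "'Magnet", "&", "Felt", "Playboards'",
          "[PAD]", "[PAD]", "[PAD]"]
print(join_tokens(tokens))
```

Skipping rows for which `is_padding_row` is true would also remove the extra item 0, which does not appear in the reference file.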

Need instruction on running the project

Hi,

Please provide instructions on how to run the project. For someone getting started with OpenMatch, it is not easy to understand how to run it. I went through the README file but could not find anything about running it.

About evaluation during training

Evaluation is set to run every 5000 steps. When I follow your steps, I run into a problem during validation in training. I read your code and found that your trainer does not define a prediction_step function, which may cause the list error (prediction_step requires a dictionary, but the inputs to your model are a list). Should I set eval_steps to 0 so that no validation happens during training, and then find the best checkpoint with the evaluation bash script?
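If validation during training is skipped, the checkpoints have to be compared afterwards. A minimal sketch of the first step, assuming the checkpoints are saved by the Hugging Face Trainer under its default `checkpoint-<step>` directory names, is to list them in step order so each one can be passed to the evaluation script. The function name `list_checkpoints` is made up for illustration, and how the evaluation script takes a checkpoint path is not shown in this page.

```python
import re
from pathlib import Path

# Default directory naming used by the Hugging Face Trainer when saving.
_CKPT = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(output_dir):
    """Return (step, path) pairs for checkpoint-<step> directories, sorted by step."""
    found = []
    for child in Path(output_dir).iterdir():
        match = _CKPT.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])
```

Each returned path can then be evaluated on the validation set, and the checkpoint with the best validation metric used for the final test run.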
