openmatch / taste

[CIKM 2023] This is the code repo for our CIKM'23 paper "Text Matching Improves Sequential Recommendation by Reducing Popularity Biases".

License: MIT License

Python 88.90% Shell 11.10%

information-retrieval nlp recommender-system

taste's Introduction

OpenMatch v2

An all-in-one toolkit for information retrieval. Under active development.

Install

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

-e means editable, i.e. you can change the code directly in your directory.

The package does not declare all of its requirements. You may need to install torch and tensorboard manually.

You may also need faiss for dense retrieval. Install either faiss-cpu or faiss-gpu, depending on your environment. To search on GPUs, install a faiss-gpu build that is compatible with your CUDA version. In some cases (usually CUDA >= 11.0) pip installs the wrong version; if you hit errors during GPU search, try installing faiss from conda.
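A quick way to see which faiss build is active is to import it and ask how many GPUs it can see. The snippet below is a minimal sketch, not part of OpenMatch: the helper name describe_faiss is hypothetical, and it assumes the installed faiss module exposes get_num_gpus (it falls back to 0 if that function is missing).

```python
import importlib


def describe_faiss():
    """Return a one-line summary of the installed faiss build."""
    try:
        faiss = importlib.import_module("faiss")
    except ImportError:
        return "faiss is not installed"

    # Assumption: GPU-enabled builds report a positive GPU count here.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    num_gpus = get_num_gpus() if callable(get_num_gpus) else 0
    version = getattr(faiss, "__version__", "unknown")
    return f"faiss {version} sees {num_gpus} GPU(s)"


print(describe_faiss())
```

If this reports 0 GPUs on a machine that has one, the CPU build or a CUDA-incompatible build is probably installed, which is the case where reinstalling from conda may help.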

Features

Human-friendly interface for dense retriever and re-ranker training and testing
Various PLMs supported (BERT, RoBERTa, T5...)
Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
Deep integration with Huggingface Transformers and Datasets
Efficient training and inference via stream-style data loading

Docs

We are actively working on the docs.

Project Organizers

Zhiyuan Liu
- Tsinghua University
Zhenghao Liu
- Northeastern University
Chenyan Xiong
- Microsoft Research AI
Maosong Sun
- Tsinghua University

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contact

Please email to [email protected].

taste's People

ssusantachary mssssss123 lihuibng njuhugn

taste's Issues

Pretrained weights

Hello,

Is it possible for me to use some other pretrained model in place of the T5 model to generate the embeddings? Or is the code written in such a way that the pretrained model has to be the T5 model?

Thank you!

One bug I meet when I reproduce your work

While reproducing your work, I hit a bug in transformers/trainer.py.
The error indicates that 'inputs' should be a dict, but it is a list at runtime.
I traced the inputs back and found that they come from eval_dataloader.
I have no idea what causes the bug. Could it be the transformers version?
Can you suggest how to fix it?
Thank you very much.

LoRA with OpenDelta

Hi,

I am interested in applying LoRA to this model architecture. I see OpenDelta mentioned in the code, and I can pass a parameter-efficient method as an argument in the train.sh scripts. However, OpenDelta appears to have trouble adding LoRA to the custom TASTEModel class. Have you had any success adding LoRA to TASTE?

Thank you!

Error when running model on beauty dataset

Hello,

I'm trying to run the model on the beauty dataset (using the configurations given in train_beauty.sh in the reproduce folder) but I seem to be getting the following error:

Traceback (most recent call last):
  File "C:\Users\anvasudev\Desktop\service contracts\OpenMatch\OpenMatch\src\TASTE\train.py", line 129, in <module>
    main()
  File "C:\Users\anvasudev\Desktop\service contracts\OpenMatch\OpenMatch\src\TASTE\train.py", line 121, in main
    trainer.train()
  File "C:\Users\anvasudev\Anaconda3\lib\site-packages\transformers\trainer.py", line 1521, in train
    return inner_training_loop(
  File "C:\Users\anvasudev\Anaconda3\lib\site-packages\transformers\trainer.py", line 1837, in _inner_training_loop
    self.state.epoch = epoch + (step + 1) / steps_in_epoch
ZeroDivisionError: division by zero

Judging from the training log below, I think this happens because the number of training examples is 0, but I'm not sure why, since the data seems to be loaded and parsed correctly:

***** Running training *****
  Num examples = 0
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 30

Do you know why this might be happening?
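The traceback divides by steps_in_epoch, so an empty training set fails only once the loop starts. One way to surface the problem earlier is to count the usable lines in the training file before calling trainer.train(). This is a minimal sketch, not code from the repository: the function names are hypothetical, and whether train.txt has a header row is an assumption controlled by the has_header parameter.

```python
from pathlib import Path


def count_examples(path, has_header=True):
    """Count non-empty data lines in a text file."""
    with Path(path).open(encoding="utf-8") as handle:
        lines = [line for line in handle if line.strip()]
    # Assumption: the first non-empty line is a column header.
    if has_header and lines:
        lines = lines[1:]
    return len(lines)


def require_examples(path, has_header=True):
    """Raise a readable error instead of a later ZeroDivisionError."""
    count = count_examples(path, has_header=has_header)
    if count == 0:
        raise ValueError(f"No training examples found in {path}")
    return count
```

Calling require_examples on the training file right before training turns "Num examples = 0" into an immediate, descriptive failure. It does not explain why the file is empty or unreadable; a wrong path or a mismatched file format would be the first things to check.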

Data preprocessing

Hello,

I want to run the model on one of my own datasets. What code could I use to preprocess it into the appropriate format? I tried referring to gen_dataset_example.py, but I'm not sure what arguments I should pass to the run_recbole function.

Any help would be appreciated!

computational resources and training time of TASTE

Hi, I'm interested in the computational resources required to train TASTE, as well as the time needed for a single training run. Could you please provide some insights on this? Thanks.

a running error

Hi, when I try to reproduce the results of TASTE following the instructions in the README, an error occurs after a few training steps:

  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 2968, in evaluate
    output = eval_loop(
  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 3157, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "D:\Anaconda\envs\torch1.13\lib\site-packages\transformers\trainer.py", line 3333, in prediction_step
    return_loss = inputs.get("return_loss", None)
AttributeError: 'list' object has no attribute 'get'

I use train_toys.sh from reproduce/train/toys/ without additional modifications, and this error also occurs when running other datasets. It seems "inputs" has the wrong type (list) in some cases. Can you help me?
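The traceback shows prediction_step calling inputs.get(...), which only works when the evaluation collator returns a dict. One possible workaround, sketched below, is to wrap the collator so that a list or tuple batch is converted to a dict before it reaches the trainer. This is an untested assumption rather than the repository's fix: DictCollator and the key names are hypothetical, and the model's forward method would have to accept those names as keyword arguments.

```python
class DictCollator:
    """Wrap a collator so list/tuple batches are returned as dicts."""

    def __init__(self, base_collator, keys):
        self.base_collator = base_collator
        self.keys = list(keys)  # hypothetical argument names, one per batch element

    def __call__(self, features):
        batch = self.base_collator(features)
        if isinstance(batch, dict):
            return batch  # already in the expected shape
        if len(batch) != len(self.keys):
            raise ValueError(
                f"Expected {len(self.keys)} batch elements, got {len(batch)}"
            )
        return dict(zip(self.keys, batch))


# Stand-in collator that returns a list, as in the failing case.
def list_collator(features):
    return [[f["q"] for f in features], [f["p"] for f in features]]


collate = DictCollator(list_collator, keys=["query", "passage"])
batch = collate([{"q": 1, "p": 2}, {"q": 3, "p": 4}])
print(batch.get("return_loss", None))
```

The other route mentioned in these issues is to override prediction_step in the custom trainer, or to skip in-training evaluation and evaluate checkpoints separately.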

Data clarification

Hello,
For the train.txt, valid.txt, and test.txt files for a dataset, does the 'target' column contain the ID of the most recent item purchased by a user?

Also, what is the item IDs json file supposed to contain? For the beauty dataset this file is titled "item_name.jsonl". Is there any code to create this file from my own custom dataset?

Dataset processing problem

Hi,
I am looking to reproduce your data files that you provided in your google drive (train.txt, valid.txt etc) from scratch starting from the .inter and .item files. For Amazon Toys and Games dataset, I followed your instructions by going to the DIF-SR project and swapping in the run_recbole function you provided in the gen_dataset_example.py file then ran DIF-SR.sh. I encountered an error when converting the category id (cid) around line 90 of this file where it says:

 File "/DIF-SR/recbole/quick_start/quick_start.py", line 91, in run_recbole
      cid = int(cid)

TypeError: only size-1 arrays can be converted to Python scalars

I tried removing this integer conversion line. However, the output of the item.txt is a bit different than the one you have in your google drive. The one I generated has a 0th item and a lot of padding everywhere:

Item.txt from Google Drive:

item_id	item_name	categories	brand	price	sales_rank
1	Little Red Tool Box: Magnetic Tabletop Learning Easel	'Toys & Games', 'Pretend Play', 'Dress Up & Pretend Play', 'Magnet & Felt Playboards'	Scholastic	16.19	369185
2	Dover Publications-Decorative Tile Designs Coloring Book	'Toys & Games', 'Arts & Crafts', 'Drawing & Sketch Pads', 'Drawing & Painting Supplies'	Dover Pubns	1.44	1009

My item.txt:

item_id	item_name	categories	brand	price	sales_rank
0	[PAD]	['[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]']	[PAD]	30.08	369185
1	Little Red Tool Box: Magnetic Tabletop Learning Easel	["'Toys" '&' "Games'," "'Pretend" "Play'," "'Dress" 'Up' '&' 'Pretend'
 "Play'," "'Magnet" '&' 'Felt' "Playboards'" '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]']	Scholastic	16.19	369185
2	Dover Publications-Decorative Tile Designs Coloring Book	["'Toys" '&' "Games'," "'Arts" '&' "Crafts'," "'Drawing" '&' 'Sketch'
 "Pads'," "'Drawing" '&' 'Painting' "Supplies'" '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]']	Dover Pubns	1.44	1009

Is there some step that I overlooked? I suspect I did not handle the cid = int(cid) line correctly.

I would super appreciate your input
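The generated output suggests the categories field arrived as a padded array of tokens rather than a single id, which is also the situation in which calling int() on a multi-element NumPy array raises the TypeError quoted above. As a minimal sketch of one way to get back to the Google Drive layout, the helpers below drop the '[PAD]' tokens, re-join the rest, and flag the all-padding row 0. The function names are hypothetical, and this post-processes the output rather than fixing the cause inside run_recbole.

```python
PAD = "[PAD]"


def clean_categories(tokens):
    """Drop padding tokens and join the remaining ones with single spaces."""
    return " ".join(token for token in tokens if token != PAD)


def is_padding_row(item_name):
    """True for the placeholder item whose name is the padding token."""
    return item_name == PAD


# Token list copied from item 1 in the generated item.txt above (padding shortened).
tokens = [
    "'Toys", "&", "Games',", "'Pretend", "Play',", "'Dress", "Up", "&",
    "Pretend", "Play',", "'Magnet", "&", "Felt", "Playboards'",
    "[PAD]", "[PAD]", "[PAD]",
]
print(clean_categories(tokens))
```

Rows for which is_padding_row returns True (item 0 in the generated file) can be skipped when writing item.txt, so that item ids start at 1 as in the reference file.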

Need instruction on running the project

Hi,

Please give instructions on how to run the project. For someone getting started with OpenMatch, it is not easy to understand how to run it. I went through the README file but could not find anything about running it.

About evaluation during training

Evaluation is set to run every 5000 steps. When I follow your steps, I run into a problem during validation in training. I read your code and found that the trainer does not define a prediction_step function, which may cause the list error (prediction_step needs a dictionary, while the inputs to your model are a list). Should I set eval_steps to 0 so that no validation runs during training, and then find the best checkpoint with the evaluation bash script?
