prop's Issues

Questions about your fine-tuning process.

Hi, I recently read your new SIGIR '21 paper B-PROP. I'm trying to use your methods as baselines, and I have a few questions about your fine-tuning process:

  1. Did you also apply a linear warm-up and linear decay learning-rate schedule?
  2. What kind of loss did you use: a pairwise hinge loss, a cross-entropy loss, or something else? And for each relevant document of a query, how many negative documents did you sample?
  3. For the two large datasets, the paper sets the batch size to 144. Does that mean 144 (query, pos_doc, neg_doc) triples, or 144 (query, doc) pairs (i.e., 72 triples)?
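
For concreteness, this is roughly the pairwise setup I am assuming; the model, hyperparameters, and helper below are my own placeholders, not taken from your released code:

# Sketch of one pairwise fine-tuning step: each example is a
# (query, pos_doc, neg_doc) triple scored by a cross-encoder and trained
# with a margin (hinge) loss under linear warm-up / linear decay.
import torch
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)             # placeholder lr
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=1000,      # placeholder
                                            num_training_steps=100000)  # placeholder
hinge = torch.nn.MarginRankingLoss(margin=1.0)

def train_step(queries, pos_docs, neg_docs):
    # Under the "144 triples" reading, len(queries) == 144 here;
    # under the "144 (query, doc) pairs" reading it would be 72.
    pos = tokenizer(queries, pos_docs, truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
    neg = tokenizer(queries, neg_docs, truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
    s_pos = model(**pos).logits.squeeze(-1)   # score of the relevant document
    s_neg = model(**neg).logits.squeeze(-1)   # score of the sampled negative
    loss = hinge(s_pos, s_neg, torch.ones_like(s_pos))  # want s_pos > s_neg by the margin
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()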

Thanks for any possible help.

Question about your baseline Transformer_ICT

Hi, I'm very interested in your work and I am now following your experimental setup.
Did you implement Transformer_ICT yourselves? Could you please provide its pre-trained model?
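
For reference, my understanding of the ICT data construction is roughly the following (my own sketch, not your implementation):

import random

def ict_example(passage_sentences, keep_prob=0.1):
    # Inverse Cloze Task: pick one sentence as the pseudo-query; the rest of the
    # passage (or, with probability keep_prob, the full passage) is the positive context.
    i = random.randrange(len(passage_sentences))
    query = passage_sentences[i]
    if random.random() < keep_prob:
        context = passage_sentences                                  # keep the query sentence
    else:
        context = passage_sentences[:i] + passage_sentences[i + 1:]  # hold it out
    return query, " ".join(context)
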
Thank you.

Failed to download the pretrained model

Hi, this is great work! However, I was unable to download the pretrained model: the given link points to an empty folder. Is the pretrained model available yet?

Questions about text preprocessing.

Hello,
How should I clean the texts? Should I remove all numbers, equations, and punctuation marks?
Could you provide the code or a function for text preprocessing?
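
For example, right now I only do something like the cleanup below, and I am not sure whether dropping numbers and punctuation is appropriate (this is my own placeholder, not your pipeline):

import re

def clean_text(text, drop_numbers=False, drop_punct=False):
    # Illustrative cleanup only; whether to strip numbers/punctuation is the open question.
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    if drop_numbers:
        text = re.sub(r"\d+", " ", text)           # remove digit runs
    if drop_punct:
        text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation marks
    return re.sub(r"\s+", " ", text).strip()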

Do you truncate the token length to 512 for PROP?

Hello,
I see in your paper that, for vanilla BERT, you truncate the token length to 512.
What about PROP: does it use the same architecture as vanilla BERT and also truncate to 512 tokens,
or does it split long documents into multiple parts?
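
To make the question concrete, these are the two options I have in mind (my own sketch with a Hugging Face tokenizer, not your code):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
query_text = "example query"               # placeholder
doc_text = "a long document " * 400        # placeholder

# Option 1: truncate the document side of each (query, document) pair so the input
# fits in 512 tokens, as reported for vanilla BERT.
enc = tokenizer(query_text, doc_text, truncation="only_second",
                max_length=512, return_tensors="pt")

# Option 2: split a long document into overlapping 512-token windows and
# score each window separately.
windows = tokenizer(doc_text, truncation=True, max_length=512, stride=128,
                    return_overflowing_tokens=True)
print(enc["input_ids"].shape, len(windows["input_ids"]))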

Questions about the scripts

Hello,
Is this the code you used in your paper?
It looks like an older version, and errors occur even when running multiprocessing_generate_word_sets.py with proper data.
Did you check whether the code works with the data?

Issue with the data preprocessing script

When I run the ./scripts/process.sh command:

INPUT_FILE=./data/wiki_info
Bert_MODEL_DIR=../bert-base-uncased-py (there is no bert-base-uncased directory under the PROP directory; how is this supposed to be set up?)

python -m prop.preprocessing_data
--corpus_name wikipedia
--data_file ${INPUT_FILE}/wiki_info/wiki_toy.data \ (Q1: is wiki_info duplicated here by mistake? Q2: isn't wiki_toy.data supposed to be our input data, from which the corresponding JSON files are then generated?)
--bert_model ${Bert_MODEL_DIR}
--do_lower_case
--output_dir ${INPUT_FILE}/wiki_info/
