urvashik / knnlm
License: MIT License
Hello!
I'm working with knn-lms and was interested in reproducing an observation similar to the one in Appendix A of this repo's paper (i.e., the retrieved context and its probability). Any tips on how to do that easily? I couldn't find any hints in the paper.
Thank you!
Using the provided snippet, I am only seeing 17.80 PPL on the test set of Wikitext-103 with FAISS approximate distances, although the paper reports 16.50 PPL. Am I using the correct command?
python eval_lm.py data-bin/wikitext-103 \
--path wt103_checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024 \
--gen-subset test --dstore-filename dstore_train/dstore \
--indexfile dstore_train/knn.index \
--model-overrides "{'knn_keytype': 'last_ffn_input'}" \
--k 1024 --lmbda 0.25 --dstore-size 103225485 --knn-keytype last_ffn_input \
--probe 32 --knnlm --knn-sim-func "do_not_recomp_l2" --no-load-keys
I have rebuilt the knn index twice already, but I will try again.
git@github.com:urvashik/knnlm.git@1c0a4e0ee29fc037b53a1449e7724af0e07dcc41#egg=fairseq
pip install -e .
nvcc -V
I successfully replicated the instructions from the README, and am struggling to scale to larger data sizes. I was wondering whether you have access to the Wiki-3B index (or a path on the FAIR cluster), or instructions to create it?
Thanks!
Hi, thanks for open-sourcing the inspiring knn-LM model!
I'm trying to reproduce your results on the Book Corpus dataset, but I found there are no standard train/valid/test splits. Could you please describe how you split the Book Corpus dataset?
Hi!
I am replicating kNN-LM and was wondering if it would be possible to share dict.txt or a link to the original dataset. The original link has expired, so it turned out not to be possible to train on the dataset used in the experiments.
Thank you very much for your kind help!
Hi, I am reading the source code. In knnlm.py, there is a line (https://github.com/urvashik/knnlm/blob/master/fairseq/knnlm.py#L107):
index_mask = torch.eq(torch.from_numpy(self.vals[knns]).long().cuda().squeeze(-1), tgt[tgt != pad_idx].unsqueeze(-1)).float()
May I know what this line is used for?
Thanks!
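Not the maintainer, but from reading that line, it appears to build a 0/1 mask marking which retrieved neighbors store exactly the gold next token (the `tgt[tgt != pad_idx]` part just drops padding positions before the comparison). A minimal NumPy sketch with toy stand-ins for self.vals, knns, and tgt — all shapes and values here are made up for illustration:

```python
import numpy as np

# Hypothetical miniature datastore: vals[i] is the token id stored at entry i.
vals = np.array([[5], [7], [5], [9]])
# knns: for each of 2 query positions, the indices of 3 retrieved neighbors.
knns = np.array([[0, 1, 2], [3, 1, 0]])
# tgt: the gold next-token id at each query position.
tgt = np.array([5, 7])

# vals[knns] looks up the stored token for every retrieved neighbor;
# the comparison then marks neighbors whose stored value equals the target.
index_mask = (vals[knns].squeeze(-1) == tgt[:, None]).astype(float)
print(index_mask)
# row 0: neighbors store [5, 7, 5], target 5 -> [1., 0., 1.]
# row 1: neighbors store [9, 7, 5], target 7 -> [0., 1., 0.]
```

So the mask tells you, per position, which of the k retrieved entries actually point at the correct next token.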
Hi - I went through your whitepaper and I'm really excited about the potential applications it can bring.
I followed the README up until building the index, because I'm testing and evaluating on Colab, which sadly doesn't have 400GB of storage. I was able to load the checkpoint and evaluate the loss and perplexity without issues.
Is it possible to sample outputs given an initial context, as described in Figure 6 of the whitepaper?
The int16 dtype (when --dstore-fp16 is activated) will cause overflow issues, since int16 cannot represent all of the integer token ids in a vocabulary as large as Wikitext-103's. This makes the vocab ids of many words negative and leads to incorrect perplexities. I therefore recommend changing all int16 to int, given that the token ids do not take much space anyway.
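For illustration, here is a quick way to see the wraparound (the id 100000 below is just an example; Wikitext-103's vocabulary has roughly 267k types, well past int16's maximum of 32767):

```python
import numpy as np

# Casting an out-of-range token id down to int16 silently wraps around,
# producing a negative value that is no longer a valid vocab id.
token_id = np.array([100000], dtype=np.int64)
stored = token_id.astype(np.int16)
print(int(stored[0]))  # -31072
```

Any datastore entry for such a token would then never match its true id at lookup time, which is consistent with the incorrect perplexities described above.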