urvashik / knnlm
License: MIT License
Hello!
I'm working with knn-lms and was interested in reproducing an observation similar to the one in Appendix A of this repo's paper (i.e., the retrieved context and its probability). Any tips on how to do that easily? I couldn't find any hints in the paper.
Thank you!
Using the provided snippet, I am only seeing 17.80 PPL on the test set of Wikitext-103 with FAISS approximate distances, although the paper reports 16.50 PPL. Am I using the correct command?
python eval_lm.py data-bin/wikitext-103 \
--path wt103_checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024 \
--gen-subset test --dstore-filename dstore_train/dstore \
--indexfile dstore_train/knn.index \
--model-overrides "{'knn_keytype': 'last_ffn_input'}" \
--k 1024 --lmbda 0.25 --dstore-size 103225485 --knn-keytype last_ffn_input \
--probe 32 --knnlm --knn-sim-func "do_not_recomp_l2" --no-load-keys
I have rebuilt the knn index twice already, but I will try again.
git@github.com:urvashik/knnlm.git@1c0a4e0ee29fc037b53a1449e7724af0e07dcc41#egg=fairseq
pip install -e .
nvcc -V
I successfully replicated the instructions from the README, and am struggling to scale to larger data sizes. I was wondering whether you have access to the Wiki-3B index (or a path on the FAIR cluster), or instructions to create it?
Thanks!
Hi, thanks for open-sourcing the inspiring knn-LM model!
I'm trying to reproduce your results on the Book Corpus dataset, but I found there are no standard train/valid/test splits. Could you please describe how you split the Book Corpus dataset?
Hi!
I am replicating kNN-LM and was wondering if it would be possible to share dict.txt or a link to the original dataset. The original link has expired, so it turned out not to be possible to train on the dataset used in the experiments.
Thank you very much for your kind help!
Hi, I am reading the source code. In knnlm.py, there is a line (https://github.com/urvashik/knnlm/blob/master/fairseq/knnlm.py#L107):
index_mask = torch.eq(torch.from_numpy(self.vals[knns]).long().cuda().squeeze(-1), tgt[tgt != pad_idx].unsqueeze(-1)).float()
May I know what this line is used for?
Thanks!
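Not the maintainer, but from reading that line, it appears to build a 0/1 mask marking which retrieved neighbors store exactly the gold next token (the `tgt[tgt != pad_idx]` part just drops padding positions before the comparison). A minimal NumPy sketch with toy stand-ins for self.vals, knns, and tgt — all shapes and values here are made up for illustration:

```python
import numpy as np

# Hypothetical miniature datastore: vals[i] is the token id stored at entry i.
vals = np.array([[5], [7], [5], [9]])
# knns: for each of 2 query positions, the indices of 3 retrieved neighbors.
knns = np.array([[0, 1, 2], [3, 1, 0]])
# tgt: the gold next-token id at each query position.
tgt = np.array([5, 7])

# vals[knns] looks up the stored token for every retrieved neighbor;
# the comparison then marks neighbors whose stored value equals the target.
index_mask = (vals[knns].squeeze(-1) == tgt[:, None]).astype(float)
print(index_mask)
# row 0: neighbors store [5, 7, 5], target 5 -> [1., 0., 1.]
# row 1: neighbors store [9, 7, 5], target 7 -> [0., 1., 0.]
```

So the mask tells you, per position, which of the k retrieved entries actually point at the correct next token.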
Hi - I went through your whitepaper and I'm really excited about the potential applications it can bring.
I followed the README up until building the index, because I'm testing and evaluating on Colab, which sadly doesn't have 400GB of storage. I was able to load the checkpoint and evaluate the loss and perplexity without issues.
Is it possible to sample outputs given an initial context, as described in Figure 6 of the whitepaper?
The int16 dtype (when --dstore-fp16 is activated) will cause overflow issues, since int16 cannot represent all of the integer token ids in a vocabulary as large as Wikitext-103's. This makes the vocab ids of many words negative and leads to incorrect perplexities. I therefore recommend changing all int16 to int, given that the token ids do not take much space anyway.
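For illustration, here is a quick way to see the wraparound (the id 100000 below is just an example; Wikitext-103's vocabulary has roughly 267k types, well past int16's maximum of 32767):

```python
import numpy as np

# Casting an out-of-range token id down to int16 silently wraps around,
# producing a negative value that is no longer a valid vocab id.
token_id = np.array([100000], dtype=np.int64)
stored = token_id.astype(np.int16)
print(int(stored[0]))  # -31072
```

Any datastore entry for such a token would then never match its true id at lookup time, which is consistent with the incorrect perplexities described above.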