There appear to be inconsistencies between the vocabulary in vocab.txt, which is used for the GloVe embeddings, and the actual vocabulary of the Spider dataset. The number of unique words in vocab.txt does not match the number of unique words in the Spider questions alone, the queries alone, or the questions and queries combined.
Furthermore, when I sort each vocabulary by word frequency, the ordering in vocab.txt does not match the word frequencies of the Spider dataset either.
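
A minimal sketch of this kind of comparison is below. It makes several assumptions that may not hold: that each line of vocab.txt is a word followed by an integer count, that each Spider example is a dict with "question" and "query" strings, and that lowercased whitespace splitting is an acceptable tokenizer. The helper names (count_tokens, parse_vocab_lines, compare) and the toy data are hypothetical and only for illustration.

```python
from collections import Counter


def count_tokens(texts):
    """Count lowercased, whitespace-separated tokens over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def parse_vocab_lines(lines):
    """Parse lines of the ASSUMED form 'word count' into a Counter.

    Lines that do not have exactly two fields with an integer count are skipped.
    """
    vocab = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            vocab[parts[0]] = int(parts[1])
    return vocab


def compare(vocab, corpus):
    """Summarize how two word-count tables overlap."""
    vocab_words, corpus_words = set(vocab), set(corpus)
    return {
        "vocab_size": len(vocab_words),
        "corpus_size": len(corpus_words),
        "shared": len(vocab_words & corpus_words),
        "only_in_vocab": len(vocab_words - corpus_words),
        "only_in_corpus": len(corpus_words - vocab_words),
    }


# Toy stand-ins for the real files (made-up data, not taken from Spider).
examples = [
    {"question": "How many singers are there ?", "query": "SELECT count(*) FROM singer"},
    {"question": "List all singers", "query": "SELECT name FROM singer"},
]
vocab_lines = ["select 2", "from 2", "the 5"]

vocab = parse_vocab_lines(vocab_lines)
question_counts = count_tokens(ex["question"] for ex in examples)
query_counts = count_tokens(ex["query"] for ex in examples)
combined_counts = question_counts + query_counts

for name, counts in [("questions", question_counts),
                     ("queries", query_counts),
                     ("combined", combined_counts)]:
    print(name, compare(vocab, counts))

# Frequency ordering of each side, most frequent first.
print(vocab.most_common(3))
print(query_counts.most_common(3))
```

To run this on the real data, examples would be loaded from the Spider JSON file with json.load and vocab_lines read from vocab.txt. A different tokenizer or casing rule would change the counts, so a mismatch under this sketch alone does not show where vocab.txt comes from.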
My investigation has not explained the mismatch, so what does vocab.txt actually represent? Where do its words and frequencies come from, if not the Spider dataset?
Note: I downloaded vocab.txt from the download link on the Code page: https://drive.google.com/file/d/1L8sWlp7J9LWjw9MP2bHGsf0wC4xLAyxO/view