1-billion-word-language-modeling-benchmark's Issues

meaning of cost values in output.tar

The description on the smt website and the text above the example in the README say that output.tar contains probabilities. However, the values are often greater than 1 and correlate negatively with the expected probabilities. One could speculate that they are absolute values of log probabilities, but the term "cost", used both in the file and in the README, is not encouraging.
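To make the speculation concrete, here is a minimal sketch of how such values would map back to probabilities if a cost really were a negative log probability. Both the interpretation and the logarithm base are assumptions (neither is confirmed by the README), and the function names are hypothetical.

```python
import math


def cost_to_probability(cost, base=math.e):
    """Convert a cost to a probability, ASSUMING cost = -log_base(p).

    The base is an assumption; natural log and log10 are both plausible.
    """
    if cost < 0:
        raise ValueError("a negative log probability cannot be below zero")
    return base ** (-cost)


def perplexity(costs, base=math.e):
    """Perplexity implied by per-word costs under the same assumption."""
    costs = list(costs)
    if not costs:
        raise ValueError("need at least one cost value")
    return base ** (sum(costs) / len(costs))
```

Under this reading, a cost above 1 is unremarkable (it only means the probability is below 1/base), and a larger cost always means a smaller probability, which would match the negative correlation described above.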

Can you clarify what these values mean?

I also checked the Google papers that come up when searching for the "ProdLM" language model, but they do not answer the question.

If a word is not in the vocab, what should I do? Or can that never happen because of the FullTokenizer?

Some Training Data Duplicated in Heldout Data

While using the preprocessed data from http://www.statmt.org/lm-benchmark/ I noticed that some of the training data is duplicated in the heldout (aka test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.

Using a simple Python script that puts the sentences into a dict, I see 303,465 unique heldout sentences and 3,223 duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt, with the duplicates. You can easily verify this by grepping for them in the training directory.
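A check along these lines can be sketched as follows. This is not the script used for the numbers above: it uses a set rather than a dict, treats two sentences as duplicates only when the stripped lines match exactly, and the function names are hypothetical.

```python
def read_lines(paths):
    """Yield stripped, non-empty lines from each file in turn."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def find_heldout_duplicates(heldout_lines, train_lines):
    """Return (unique heldout sentences, heldout sentences also seen in training).

    Only the heldout set is kept in memory; training lines are streamed.
    """
    heldout = set(heldout_lines)
    duplicates = {line for line in train_lines if line in heldout}
    return heldout, duplicates
```

To run it on the real data, pass `read_lines(...)` over the heldout files and over the training files. The file train/news.en-00000-of-00100 would need to be left out of the training list, since it appears to be a full copy of the heldout data and would otherwise mark every heldout sentence as a duplicate.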

Is this a known issue? My concern is that many people use this data for benchmarking language models, and about 1% of the test sentences also appear in the training data. That is probably not going to change the results much, but it is not desirable either.

question on the corpus size / script

Hi,
I have a question about the "prepare data" script.
I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not provide the 2.9B words stated in the paper and on the README page; it is far less. Does this mean I need to download more monolingual data from WMT11? If so, it does not seem to be included in the script.

The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, or 2.6B words after dedup/sorting.
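For reference, this is roughly how such counts can be compared. It is a sketch under two assumptions that may not match the benchmark's own scripts: a word is any whitespace-separated token, and dedup means dropping exact duplicate lines. The function names are hypothetical.

```python
def count_words(lines):
    """Count whitespace-separated tokens over all lines."""
    return sum(len(line.split()) for line in lines)


def deduplicate(lines):
    """Return the unique lines in sorted order (exact string matches only)."""
    return sorted(set(lines))
```

For a corpus of this size an in-memory set may not fit, so an external sort is likely needed in practice; the sketch only pins down what is being counted before and after dedup.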

Also, for the Interpolated KN 5-gram, was it just SRILM being used? Plain, or with make-big-lm?

Thanks.

Dev / Test set?

I noticed that there are ~50 files under the heldout-monolingual.tokenized.shuffled folder. Which of them are meant to be the test data? Is heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 for testing, while the rest of heldout-monolingual.tokenized.shuffled/news.en.heldout-00* can be used as a validation set?
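The split proposed in this question can be written down as a short sketch. It is only the asker's suggestion, not a confirmed convention, and it assumes exactly 50 shards named after the pattern quoted above; the function names are hypothetical.

```python
def heldout_shard_names(num_shards=50, prefix="news.en.heldout"):
    """Build shard file names such as news.en.heldout-00000-of-00050."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def split_test_validation(shards):
    """Use the first shard as test data and the remaining shards as validation."""
    return shards[:1], shards[1:]
```

With the defaults, `split_test_validation(heldout_shard_names())` yields one test shard and 49 validation shards.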

Dead code.google.com link

I thought I'd point out that the page linking to this GitHub repo (http://www.statmt.org/lm-benchmark/) contains a dead link to the bash and perl scripts: it points to the old code.google.com home of this repo instead of GitHub. The scripts are all here, but the link should be updated if possible.

CCing @phikoehn on this issue.

Data Sources for the Corpus

Hello, thank you for this awesome work.
May I ask what the source of the dataset is? A website, or anything else? I want to learn how to collect text data for building a large language model, so if you don't mind, I would like to know about the sources. Thanks in advance.

ciprian-chelba / 1-billion-word-language-modeling-benchmark
