1-billion-word-language-modeling-benchmark's Issues

meaning of cost values in output.tar

The description on the smt website and the text above the example in the README say that output.tar contains probabilities. However, the values are often greater than 1 and correlate negatively with the expected probabilities. One can speculate that these are absolute values of log probabilities, but the term "cost", used both in the file and in the README, is not encouraging.

Can you clarify what these values mean?

I also checked the Google papers that come up when searching for the "ProdLM" language model, but they do not answer this.
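To test the speculation above against known results, here is a minimal sketch that treats each cost as a negative log probability. Both that interpretation and the log base are assumptions, not something the README confirms, and the function names are made up for illustration.

```python
import math

def cost_to_probability(cost, base=math.e):
    """Treat `cost` as -log_base(p) and return p.

    ASSUMPTION: the values in output.tar are negative log probabilities.
    The base (e, 2 or 10) is unknown, so it is a parameter.
    """
    return base ** (-cost)

def perplexity(costs, base=math.e):
    """Perplexity implied by a list of per-word costs under the same assumption."""
    if not costs:
        raise ValueError("need at least one cost")
    return base ** (sum(costs) / len(costs))
```

If the perplexity computed this way over the whole heldout set matched a published figure for one choice of base, that would support the interpretation. A mismatch for every base would suggest the costs mean something else.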

Some Training Data Duplicated in Heldout Data

While using the preprocessed data from http://www.statmt.org/lm-benchmark/ I noticed that some of the training data is duplicated in the heldout (aka test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.

Using a simple Python script to put the sentences into a dict, I see 303,465 unique heldout sentences and 3,223 duplicates of sentences in the training directory. Attached is a file bw_duplicates.txt with the duplicates. You can easily verify this by grep'ing for them in the training directory.
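A check along these lines can be sketched as follows. This is not the original script, only an illustration of the same idea. The function names are hypothetical, and skipping train/news.en-00000-of-00100 rests on the observation above that it copies the heldout data.

```python
import glob
import os

def read_lines(pattern, skip_names=()):
    """Yield every line of every file matching `pattern`, skipping named files."""
    for path in sorted(glob.glob(pattern)):
        if os.path.basename(path) in skip_names:
            continue
        with open(path, encoding="utf-8") as f:
            yield from f

def find_duplicates(heldout_lines, training_lines):
    """Return the set of heldout sentences that also occur in the training lines."""
    heldout = {line.strip() for line in heldout_lines}
    heldout.discard("")  # ignore blank lines
    # Stream the training data so only the heldout set is held in memory.
    return {s for s in (line.strip() for line in training_lines) if s in heldout}
```

A call might look like `find_duplicates(read_lines(heldout_glob), read_lines(train_glob, skip_names=("news.en-00000-of-00100",)))`, where the two glob patterns point at wherever the heldout and training directories were unpacked.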

Is this a known issue? My concern is that many people use this data for benchmarking language models, and about 1% of the test data also appears in the training data. That's probably not going to change the results much, but it isn't desirable either.

question on the corpus size / script

Hi,
I have a question after looking at the "prepare data" script.
I downloaded the news.20XX.en.shuffled data from 2007 to 2011 and it does not provide 2.9B words
as stated in the paper and on the README page. It's far less. Does this mean I need to download more monolingual data from WMT11? If so, why is it not included in the script?

The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, or 2.6B words after dedup/sorting.
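For reference, the counts above correspond to something like the following sketch. It assumes whitespace tokenization and exact-match deduplication of whole lines, which may not be what the benchmark's own scripts do. The function names are only illustrative.

```python
def count_words(lines):
    """Count whitespace-separated tokens over an iterable of lines."""
    return sum(len(line.split()) for line in lines)

def dedup_sorted(lines):
    """Drop exact duplicate lines and return the rest sorted, like `sort -u`."""
    return sorted({line.rstrip("\n") for line in lines})
```

Running `count_words` on the raw lines and again on `dedup_sorted(lines)` gives the before and after figures. For corpora of this size, the lines would need to be streamed from disk and deduplicated with an external sort rather than held in memory.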

Also, for the interpolated KN 5-gram, was it just SRILM being used? Plain? make-big-lm?

Thanks.

Dev / Test set?

I noticed that there are ~50 files under the heldout-monolingual.tokenized.shuffled folder. Which of them are meant to be the test data? Is heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 for testing, while the rest of heldout-monolingual.tokenized.shuffled/news.en.heldout-00* can be used as a validation set?

Dead code.google.com link

I thought I'd point out that the page that links to this GitHub repo (http://www.statmt.org/lm-benchmark/) has a dead link to the bash and perl scripts. It points to the old code.google.com home of this repo instead of GitHub. The scripts are all here, but the link should be updated if possible.

CCing @phikoehn on this issue.

Data Sources for the Corpus

Hello, thank you for this awesome work.
May I ask what the source of the dataset is? Maybe a website or something else. I want to learn how to collect text data for building a large language model. Thanks in advance.
