Comments (34)

rodneykinney avatar rodneykinney commented on August 12, 2024 1

With 3.1B unique URLs per dump, it would take about 70GB of RAM to hash them into the same data structure used by cc_net for paragraph-level deduping. So we could do exact URL-level deduping across all dumps on a single machine.
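As a sanity check on that figure, a back-of-envelope sketch (the 8-byte hash size and the load factor are my assumptions, not measurements of the cc_net structure):

```python
# Back-of-envelope RAM estimate for exact URL-level deduping: one 64-bit hash
# per URL in a flat open-addressing table, kept sparse for fast probing.
n_urls = 3.1e9          # unique URLs per dump, from the crawl index
bytes_per_hash = 8      # assumed: 64-bit hash per URL
load_factor = 0.4       # assumed: table kept ~40% full for fast probing
ram_bytes = n_urls * bytes_per_hash / load_factor
ram_gb = ram_bytes / 1e9   # ~62 GB, the same order as the ~70 GB quoted above
```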

from olmo.

rodneykinney avatar rodneykinney commented on August 12, 2024

Using instructions here to get some basic statistics on overlap between different crawls based on the content_digest field.

rodneykinney avatar rodneykinney commented on August 12, 2024

3.1B distinct values for content_digest in the latest crawl

SELECT 
count(distinct content_digest)
FROM ai2_llm.ccindex
WHERE crawl in ('CC-MAIN-2023-06')
AND subset='warc'

3128644597

6.4B for the last two crawls:

SELECT 
count(distinct content_digest)
FROM ai2_llm.ccindex
WHERE crawl in ('CC-MAIN-2023-06', 'CC-MAIN-2022-49')
AND subset='warc'

6424142394

which is just the sum of the individual distinct content_digest counts, so the digest is not useful for deduping.

rodneykinney avatar rodneykinney commented on August 12, 2024

LLaMA uses a pipeline called cc_net

dirkgr avatar dirkgr commented on August 12, 2024

I have three instances running in AWS that are downloading the most recent three checkpoints from CC: https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Instances:v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false

They don't have proper AI2 users configured. You can log in as the ubuntu user using the key that's stored under the name "Dirk's Key" or something like that. The AllenNLP AWS account is pretty barren, so everything is easy to find.

The original C4 code starts here: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py#L506
It uses Apache Beam, so it's all written in this Apache Beam style.

PileV2 spreadsheet is here: https://docs.google.com/spreadsheets/d/19IAFhqRvhRxdUj-df8PUOBI2W8aEqGmJmBcZvXOuDZY/edit#gid=0
All the Reddit dumps point to a download location. I have not tried to see what happens when you download from there. For one thing, I don't know if you get the Reddit threads already straightened out, or if this is the raw version before any cleaning. Also, Eleuther being Eleuther, they are filtering toxic subreddits before they make it into the model. I don't think we should do that, but we should know how much toxic content there is. Our model needs to see some toxic content, so it can be used to filter later, but it should not see an overwhelming amount.

rodneykinney avatar rodneykinney commented on August 12, 2024

Steps to install CCNet on AMI ami-0d70546e43a941d70:

sudo apt install cmake
sudo apt install build-essential libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev
make install
pip install cc_net[getpy]

rodneykinney avatar rodneykinney commented on August 12, 2024

First snapshot processing failed overall, but did leave some partial output. It produces json-lines files segmented by language:

$ ls mined_split/2019-09/1581/ | head -10
af_all.json.gz
af_all.json.gz.index
als_all.json.gz
als_all.json.gz.index
am_all.json.gz
am_all.json.gz.index
an_all.json.gz
an_all.json.gz.index
ar_all.json.gz
ar_all.json.gz.index

Sample line from the en output:

{
  "url": "http://1019therock.com/couple-and-mother-charged-in-ludlow-meth-bust/",
  "date_download": "2019-02-24T04:11:06Z",
  "digest": "sha1:LVY5PMQCUPDAGSFETJH2N2HIKGBOJSV4",
  "length": 1548,
  "nlines": 10,
  "source_domain": "1019therock.com",
  "title": "Couple and Mother Charged in Ludlow Meth Bust",
  "raw_content": "Couple and Mother Charged in Ludlow Meth Bust\nFor the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.\nThe arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,\nAroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.\nThe three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.\nThe MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.\nLast June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.\nThis is the 12th meth related incident in Maine this year, McCausland said.\nNEXT: Presque Isle Woman Arrested in Alleged Arson Fire\nFiled Under: Aroostook, arrest, Ludlow",
  "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550249578748.86/wet/CC-MAIN-20190224023850-20190224045850-00520.warc.wet.gz",
  "original_nlines": 122,
  "original_length": 3275,
  "line_ids": [
    85,
    90,
    92,
    93,
    94,
    95,
    96,
    97,
    98,
    99
  ],
  "language": "en",
  "language_score": 0.98,
  "bucket": "all"
}

I interpret this to mean that the original doc had 122 lines, and only 10 remained after de-duping.
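A toy illustration of that reading of the `line_ids` field (assuming the ids are 0-based positions into the original document):

```python
# Stand-in for the original 122-line page; raw_content keeps only these lines.
original_lines = [f"line {i}" for i in range(122)]
line_ids = [85, 90, 92, 93, 94, 95, 96, 97, 98, 99]  # from the record above

deduped = "\n".join(original_lines[i] for i in line_ids)
# nlines == len(line_ids) == 10, original_nlines == len(original_lines) == 122
```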

rodneykinney avatar rodneykinney commented on August 12, 2024

Failure looks like this issue

rodneykinney avatar rodneykinney commented on August 12, 2024

The line-level deduping does a great job cleaning up the text. Here is the original document:

Couple and Mother Charged in Ludlow Meth Bust
What's Hot:
High School Basketball
Krazy Jake Live
Red Sox Road Trip
Community Spotlight
Jobs With Us
Patriots News
Newsletter
The Rock Mobile App
The Rock on Alexa
Maine News
New Brunswick News
Listen Live
Live In Concert
Golf Cards
Deals
Celtics Bus Trip
Quebec Winter Carnival
Pick 'Em 2018
Patriots Schedule
Sign In
Home
On Air
Full Schedule
Dick Palm
McKenzie Rae
Ultimate Classic Rock
Live In Concert
News on the Rock
Mark Shaw
Listen
Listen Live
Mobile App
Rock Squad
Pick 'Em 2018
Join Now
Rock Newsletter
Contests
Playlist
Events
Krazy Jake Live
Red Sox Road Trip
SOLD OUT: Celtics Bus Trip
SOLD OUT: Quebec Winter Carnival
Deals
Win Stuff
Contact
Help & Contact
Send Feedback
Advertise
Jobs With Us
More
Home
On Air
Full Schedule
Dick Palm
McKenzie Rae
Ultimate Classic Rock
Live In Concert
News on the Rock
Mark Shaw
Listen
Listen Live
Mobile App
Rock Squad
Pick 'Em 2018
Join Now
Rock Newsletter
Contests
Playlist
Events
Krazy Jake Live
Red Sox Road Trip
SOLD OUT: Celtics Bus Trip
SOLD OUT: Quebec Winter Carnival
Deals
Win Stuff
Contact
Help & Contact
Send Feedback
Advertise
Jobs With Us
Listen Now
101.9 The Rock101.9 The Rock
INSTAGRAM
Couple and Mother Charged in Ludlow Meth Bust
Mark Shaw
MDEA
Share on Twitter
Share on Facebook
For the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.
MDEA
The arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,
Aroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.
The three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.
The MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.
Last June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.
This is the 12th meth related incident in Maine this year, McCausland said.
NEXT: Presque Isle Woman Arrested in Alleged Arson Fire
Filed Under: Aroostook, arrest, Ludlow
Categories: Local News, Maine News
Comments
Leave A Comment
Back To Top
Featured
Patriots' Owner Robert Kraft Charged In Prostitution Sting
Recommended for You
Information
Loudwire Network
EEO
Marketing and Advertising Solutions
Public File
Report an Inaccuracy
Terms
VIP Terms
FAQ
Contest Rules
Privacy Policy (Updated: 12/14/18)
Contact
Business Listings
Follow Us
2019 101.9 The Rock is part of the Loudwire Network, Townsquare Media, Inc. All rights reserved.

After de-duping, it looks like this:

Couple and Mother Charged in Ludlow Meth Bust
For the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.
The arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,
Aroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.
The three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.
The MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.
Last June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.
This is the 12th meth related incident in Maine this year, McCausland said.
NEXT: Presque Isle Woman Arrested in Alleged Arson Fire
Filed Under: Aroostook, arrest, Ludlow

rodneykinney avatar rodneykinney commented on August 12, 2024

A snapshot is divided into 1590 shards. Here's a token count for English-classified documents from a single shard of the 2019-09 snapshot:

$ gunzip --stdout ./0718/en_all.json.gz | jq '.raw_content' --raw-output | tr -cd ' \n' | wc -c
270894893

That would give us approximately 430B English tokens for the entire snapshot.
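For reference, the shell pipeline above counts spaces and newlines as a cheap token proxy; a Python equivalent, with the extrapolation:

```python
def approx_tokens(text: str) -> int:
    r"""Whitespace-token proxy, equivalent to `tr -cd ' \n' | wc -c`."""
    return text.count(" ") + text.count("\n")

per_shard = 270_894_893     # whitespace count from shard 0718 above
num_shards = 1590
snapshot_tokens = per_shard * num_shards   # ~4.3e11, i.e. ~430B tokens
```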

dirkgr avatar dirkgr commented on August 12, 2024

How do the token counts fall off when we add more snapshots?

dirkgr avatar dirkgr commented on August 12, 2024

Ah, also: we've been counting tokens in the other data sources using the Unicode universal tokenizer. https://uniseg-py.readthedocs.io/en/latest/index.html is a Python version, but there are versions for at least C++ and Rust. For English it might not make a big difference, but it will for the other languages.

dirkgr avatar dirkgr commented on August 12, 2024

This is how I count tokens using uniseg: https://github.com/allenai/c5/blob/main/wet_path_to_pages.py#L17

rodneykinney avatar rodneykinney commented on August 12, 2024

How do the token counts fall off when we add more snapshots?

The CCNet paper asserts "There is little content overlap between monthly snapshots" without explicitly computing the drop-off. In practical terms, you don't have enough RAM to fully dedupe even a single snapshot. They do find that the token counts start to flatten out even below 10% of a single snapshot.

https://www.semanticscholar.org/paper/CCNet%3A-Extracting-High-Quality-Monolingual-Datasets-Wenzek-Lachaux/c20c68c45127439139a08adb0b1f2b8354a94d6c/figure/6

The RAM requirements for deduping are shown here:

https://www.semanticscholar.org/paper/CCNet%3A-Extracting-High-Quality-Monolingual-Datasets-Wenzek-Lachaux/c20c68c45127439139a08adb0b1f2b8354a94d6c/figure/7

They settled on 3% of hashes used for deduping, although to my eyes, even using 1.5% is a pretty good trade-off. The overall process is RAM-bound, so you can double throughput by using the 1.5% threshold.
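My reading of the hash-sampling scheme, as a hedged sketch (the cutoff-by-hash-value trick is one way to implement "X% of hashes"; the actual cc_net code may sample differently):

```python
import hashlib

RATE = 0.015                    # fraction of hash space used for deduping
CUTOFF = int(RATE * 2**64)

def para_hash(paragraph: str) -> int:
    return int.from_bytes(hashlib.sha1(paragraph.encode("utf-8")).digest()[:8], "big")

def dedup(paragraphs, cutoff=CUTOFF):
    """Drop repeats of paragraphs whose hash falls in the sampled region.

    RAM scales with `cutoff`, so halving the rate roughly doubles the number
    of threads that fit on a machine -- the trade-off described above.
    """
    seen = set()
    for p in paragraphs:
        h = para_hash(p)
        if h < cutoff:
            if h in seen:
                continue        # duplicate of a sampled paragraph: drop it
            seen.add(h)
        yield p
```

With `cutoff=2**64` this degenerates to exact paragraph dedup; at 1.5% it only ever holds 1.5% of the paragraph hashes in memory.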

dirkgr avatar dirkgr commented on August 12, 2024

Not sure how to interpret those graphs. Does that say that after de-duping a single snapshot, we should expect less than 30% of the original content to remain? The fact that it flattens out is also confusing. As we add more data, novelty increases? Why would this happen?

The fact that we can't de-dupe even a single snapshot this way seems problematic. You know, we could write the O(n) Bloom-filter deduplication step in Scala or Java as well. And in fact, in a language like that, where threads are easy and fast, maybe we could bake in some other tricks.

rodneykinney avatar rodneykinney commented on August 12, 2024

Yes, those graphs are saying that you are left with only 30% of the content after deduping each line with a random 1% sampling of other lines. It means that most of the content consists of lines that are repeated over and over. It makes total sense when you look at the example of the original and de-duped document I pasted above.

rodneykinney avatar rodneykinney commented on August 12, 2024

Is it measuring by number of paragraphs removed, or number of characters?

Those are characters.

rodneykinney avatar rodneykinney commented on August 12, 2024

I have the pipeline tuned and running end-to-end. I've uploaded some sample data to s3://ai2-llm/pretraining-data/sources/common-crawl/samples/2019-09

The data is split by language. For each language, we have the option to split it up by perplexity buckets (head, middle, tail). For simplicity, I'm inclined to do this for English only.

The process is memory-bound, at a cost of 20GB/thread, using a 1.5% sampling rate for deduping. An m6a.48xlarge has 768GB of RAM, so it can run ~35 threads. A single snapshot from 2019 yields about 400B tokens. The CCNet authors estimate it takes 5000 CPU hours to process; my own benchmarking suggests closer to 2500. That would be about 75 instance hours, or a dollar cost of $625. More recent snapshots are presumably larger. The cost per token will stay the same if you keep the RAM usage at 20GB/thread, but will go up if you maintain the 1.5% sampling rate.
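The arithmetic behind those numbers (the implied instance price is my inference, roughly on-demand pricing):

```python
cpu_hours = 2500                 # benchmarked estimate for one snapshot
ram_per_thread_gb = 20
instance_ram_gb = 768            # m6a.48xlarge
max_threads = instance_ram_gb // ram_per_thread_gb   # 38; ~35 leaves headroom
instance_hours = cpu_hours / 35                      # ~71
implied_hourly_rate = 625 / instance_hours           # ~$8.75/hr
```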

rodneykinney avatar rodneykinney commented on August 12, 2024

Running on a u-3tb1 gives you more RAM per CPU, so the wall-clock time and dollar cost would be lower, about 17 instance hours and $450.

rodneykinney avatar rodneykinney commented on August 12, 2024

Completed a run on a single snapshot to my satisfaction. Not uploading the full data to S3, but preserving it in this snapshot. I will tweak the configuration and start systematically processing snapshots next week.

rodneykinney avatar rodneykinney commented on August 12, 2024

Within a single dump, there is < 1% duplication by URL:

SELECT bucket, count(*)
FROM (
  SELECT url,
    CASE WHEN count = 1 THEN '1'
         WHEN count < 6 THEN '2-5'
         WHEN count < 11 THEN '6-10'
         WHEN count < 21 THEN '11-20'
         WHEN count < 51 THEN '21-50'
         ELSE '51+' END AS bucket
  FROM (
    SELECT url, count(1) AS count
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2023-06'
      AND subset = 'warc'
    GROUP BY url
  ) url_counts
) url_buckets
GROUP BY bucket
Occurrences	URLs
1	3,158,028,434
2-5	13,162,713
6-10	185,912
11-20	55,896
21-50	23,288
51+	7,628

kyleclo avatar kyleclo commented on August 12, 2024

@rodneykinney under sampling of exact URL matches, does the text look highly similar?

rodneykinney avatar rodneykinney commented on August 12, 2024

Within a single dump, there is < 1% duplication by URL

Athena timed out running the same query across multiple dumps

rodneykinney avatar rodneykinney commented on August 12, 2024

Observations on using bff for paragraph-level deduping:

Runs fine on the server machine. Run-time is about 2x that of the merger: 100 CPU hours per CC dump. I used a 150GB Bloom filter, which has an estimated 0.3% false-positive rate for 100B n-grams. (One dump has ~500B tokens over ~3B documents.)
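That false-positive estimate is consistent with the standard Bloom-filter formula (my check, assuming an optimally chosen number of hash functions):

```python
import math

m = 150 * 8 * 10**9      # filter size in bits (150 GB)
n = 100 * 10**9          # n-grams inserted
k = round(m / n * math.log(2))                 # optimal hash count: 8
fp_rate = (1 - math.exp(-k * n / m)) ** k      # ~0.003, i.e. ~0.3%
```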

Unfortunately, even though the false-positive rate is small, the rate of duplication is also small. Given that a paragraph was removed by the filter, the odds are about even that it was an actual duplicate vs. a false positive. I looked through examples of paragraphs that would have been removed, and the only examples I saw that were not false positives were duplicated within the document itself.
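The "odds are about even" observation is just Bayes' rule with comparable rates (illustrative numbers, not measured):

```python
fp = 0.003                # Bloom-filter false-positive rate (~0.3%)
dup = 0.003               # assumed true paragraph-duplication rate, same order
p_removed = dup + fp * (1 - dup)   # chance a paragraph gets flagged at all
p_real_dup = dup / p_removed       # chance a removed paragraph was a real dup
# with dup == fp this comes out near 0.5: even odds
```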

Given the cost, and the unknown effects of removing even < 1% of paragraphs at random, I don't think we should do probabilistic paragraph-level deduping. We should consider within-document exact paragraph-level deduping.

Exact URL deduping is tractable, but we don't have code that will do it. The CCNet code would only work single-threaded. Rust has a concurrent hash set, so we could implement it there. We could also make minor modifications to bff to do probabilistic URL deduping. It would run much faster: no tokenization, and only one thing to hash per document. Dropping a complete document due to a false positive is better than dropping a paragraph, because it doesn't leave incoherent text behind. And because we would be sending far fewer items through the filter, we could also make the false-positive rate much smaller.
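A minimal sketch of what the bff-style modification might look like (hypothetical code, not the actual bff branch; the class name and parameters are mine):

```python
import hashlib

class UrlBloom:
    """Minimal Bloom filter for probabilistic URL-level deduping."""

    def __init__(self, size_bits=1 << 20, num_hashes=8):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from 4-byte slices of one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.size

    def seen_before(self, url):
        """Return True if the URL was (probably) added already, then add it."""
        positions = list(self._positions(url))
        hit = all(self.bits[p // 8] >> (p % 8) & 1 for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return hit
```

A streaming job would keep only documents for which `seen_before(url)` is False; there is no tokenization, so the filter can be sized generously for a tiny false-positive rate.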

dirkgr avatar dirkgr commented on August 12, 2024

One more thought: the false-positive rate shown is the rate at the end of filtering, i.e., for the last n-gram inserted. For the first n-gram the false-positive probability is 0; it rises slowly for most of the process, then climbs sharply toward the end as the filter fills up.

rodneykinney avatar rodneykinney commented on August 12, 2024

From the analysis, there's about a 1% rate of duplication by URL within a dump. Paragraph-level deduping is probably not the right way to handle these even if the error rate were zero: at best, we'd remove the relevant content from the duplicates and leave behind a junk shell. Using the Bloom filter to dedupe by URL simply drops the dupes and is orders of magnitude faster. I've got a branch with a modified bff that I will test out.

rodneykinney avatar rodneykinney commented on August 12, 2024

Deduped two combined dumps by URL. Number of removed documents was still ~1%, suggesting little overlap between dumps.

dirkgr avatar dirkgr commented on August 12, 2024

I deduped one of the dumps that came out of the C5 repo, and it removed over 30% of the data. Where does the difference come from?

rodneykinney avatar rodneykinney commented on August 12, 2024

The deduping I'm running now is after the deduping already done by the CCNet code, which isn't exhaustive, but does remove a lot of the content. #1 (comment)

rodneykinney avatar rodneykinney commented on August 12, 2024

Here's some data on the duplication rate across CC dumps.

Using Dirk's bloom filter to discard documents with a URL that have been seen before, here is the fraction of documents that are retained as we stream over 25 dumps, going backwards from the most recent.

[image: fraction of documents retained per dump, streaming over 25 dumps]

The fraction of unseen URLs flattens out at about 30-40%, so each dump does continue to contribute distinct content. I would expect this to continue if we process more of them.

rodneykinney avatar rodneykinney commented on August 12, 2024

Uploaded 25 URL-deduped dumps into s3://ai2-llm/pretraining-data/sources/common-crawl/v1/documents

100% English
Compressed size is 11 TB
High/Mid/Low fluency split is 20/25/55 %
# of documents: ~3B
# of tokens: 4.8T
# of characters: ~30T
