centic9 / commoncrawldocumentdownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika.

License: BSD 2-Clause "Simplified" License

Topics: cdx-files, commoncrawl, mime-types, warc, java

commoncrawldocumentdownload's Introduction


This is a small tool to find matching URLs and download the corresponding binary data from the CommonCrawl indexes.

Support for the newer URL Index (http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/) is available; the older URL Index, as described at https://github.com/trivio/common_crawl_index and http://blog.commoncrawl.org/2013/01/common-crawl-url-index/, is still available in the "oldindex" package.

Please note that a full run usually finds a huge number of files, so downloading will require a large amount of time and lots of disk space if the data is stored locally!

NOTE: This project does not implement backoff on HTTP "too many requests" errors. Due to the current high rate of access by many GPT/LLM experiments, the CommonCrawl S3 bucket very often returns rate-limit errors. See https://github.com/tballison/commoncrawl-fetcher-lite for a newer implementation with more advanced functionality that works more reliably.
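If you need to work around the rate limiting yourself, wrapping each fetch in a retry loop with exponential backoff is the usual approach. A minimal sketch using only the JDK's java.net.http client (the helper name and retry limits are illustrative, not part of this project):

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BackoffFetch {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // Fetch a URL, retrying with exponential backoff while the server reports rate limiting
        public static byte[] fetchWithBackoff(String url, int maxRetries)
                throws IOException, InterruptedException {
            long delayMillis = 1_000;
            for (int attempt = 0; ; attempt++) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                HttpResponse<byte[]> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());
                if (response.statusCode() == 200) {
                    return response.body();
                }
                if (attempt >= maxRetries) {
                    throw new IOException("Giving up after " + attempt
                            + " retries, last status: " + response.statusCode());
                }
                Thread.sleep(delayMillis);
                delayMillis = Math.min(delayMillis * 2, 60_000); // cap the delay at one minute
            }
        }
    }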

NOTE: CommonCrawl only stores up to 1MB per file and cuts off any bytes exceeding this length, so larger documents are truncated and may no longer be valid and parsable. You can try to download the original file via the URL that is part of the crawl-data, but this project does not implement this due to potential "crawling" restrictions on target websites.
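Because of this cap, downloaded files that are exactly 1 MiB long are almost certainly truncated. A quick way to flag them after a run, assuming the downloads sit in a single directory (the ../download path matches the one mentioned in the issues below):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FindTruncated {
        private static final long CC_CAP = 1024 * 1024; // CommonCrawl stores at most 1 MiB per file

        public static void main(String[] args) throws IOException {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("../download"))) {
                for (Path file : files) {
                    if (Files.isRegularFile(file) && Files.size(file) >= CC_CAP) {
                        System.out.println("Likely truncated: " + file);
                    }
                }
            }
        }
    }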

Getting started

Grab it

git clone https://github.com/centic9/CommonCrawlDocumentDownload.git

Build it and create the distribution files

cd CommonCrawlDocumentDownload
./gradlew check

Run it

Fetch a list of interesting documents

./gradlew lookupURLs

Reads the current Common Crawl URL index data and extracts all URLs for interesting mime-types or file extensions, storing the URLs in a file called commoncrawl-CC-MAIN-<year>-<crawl>.txt.
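Each line of that file is one JSON record from the CDX index. A sketch of reading the records, assuming jackson-databind on the classpath and the public CDX JSON field names (url, mime, filename, offset, length):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ReadIndexLines {
        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();
            for (String line : Files.readAllLines(Paths.get("commoncrawl-CC-MAIN-2022-33.txt"))) {
                // each line is a self-contained JSON object describing one capture
                Map<?, ?> fields = mapper.readValue(line, Map.class);
                System.out.println(fields.get("url") + " -> " + fields.get("mime"));
            }
        }
    }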

Download documents

./gradlew downloadDocuments

Uses the URLs listed in commoncrawl-CC-MAIN-<year>-<crawl>.txt to download the documents from the Common Crawl.
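Each document is a single gzip-compressed WARC record inside a large archive file, so it can be fetched with an HTTP Range request built from the filename, offset and length fields of the index record. A minimal sketch of that mechanism (error handling and retries omitted; the data.commoncrawl.org URL scheme is the publicly documented access path):

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FetchRecord {
        // Fetch the raw gzip member holding the WARC headers plus the document body
        public static byte[] fetchRecord(String filename, long offset, long length)
                throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("https://data.commoncrawl.org/" + filename))
                    .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                    .build();
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofByteArray())
                    .body();
        }
    }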

Deduplicate files

./gradlew deduplicate

Some files have equal content; this task detects duplicates based on file-size and content-hash and moves them to a backup directory, leaving only unique files in place.
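The same size-then-hash idea can be sketched as follows (illustrative only, the task's actual implementation may differ; the backup directory name is made up):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Base64;
    import java.util.HashMap;
    import java.util.Map;

    public class Deduplicate {
        public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
            Path dir = Paths.get("../download");
            Path backup = Files.createDirectories(dir.resolveSibling("download-duplicates"));
            Map<String, Path> seen = new HashMap<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                for (Path file : files) {
                    if (!Files.isRegularFile(file)) continue;
                    // key on size plus SHA-256 so equal-sized but different files are kept apart
                    MessageDigest sha = MessageDigest.getInstance("SHA-256");
                    String key = Files.size(file) + ":" + Base64.getEncoder()
                            .encodeToString(sha.digest(Files.readAllBytes(file)));
                    if (seen.putIfAbsent(key, file) != null) {
                        Files.move(file, backup.resolve(file.getFileName())); // keep the first copy
                    }
                }
            }
        }
    }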

Deprecated: Download documents from the old-index

./gradlew downloadOldIndex

Starts downloading the URL index files from the old index, looks at each URL, and downloads the binary data from the Common Crawl archives.

The longer stuff

Change it

Run unit tests

./gradlew check jacocoTestReport

Adjust which files are found

There are a few things that you can tweak:

  • The file-extensions that are detected as download-able files are handled in the class Extensions.
  • The mime-types that are detected as download-able files are handled in the class MimeTypes (see the sketch after this list).
  • Adjust the name of the list of found files in DownloadURLIndex.COMMON_CRAWL_FILE.
  • Adjust the location where files are downloaded to in Utils.DOWNLOAD_DIR.
  • The starting file-index (of the approximately 300 cdx-files) is currently set as a constant in the class org.dstadler.commoncrawl.index.DownloadURLIndex; this also lets you re-start a download that was interrupted.
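For example, a filter narrowed down to PDFs only could look roughly like this (the class and method names are hypothetical; the real checks live in Extensions and MimeTypes):

    import java.util.Locale;
    import java.util.Set;

    public class PdfOnlyFilter {
        private static final Set<String> EXTENSIONS = Set.of(".pdf");
        private static final Set<String> MIME_TYPES = Set.of("application/pdf");

        // accept a record if either the URL's extension or the reported mime-type matches
        public static boolean matches(String url, String mime) {
            String lower = url.toLowerCase(Locale.ROOT);
            return EXTENSIONS.stream().anyMatch(lower::endsWith)
                    || (mime != null && MIME_TYPES.contains(mime.toLowerCase(Locale.ROOT)));
        }
    }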

Adjust which commoncrawl-index is fetched

CommonCrawl periodically runs crawls and publishes them. You can switch to newer crawls by adjusting the constant CURRENT_CRAWL in DownloadURLIndex.java to the proper <year>-<week> number of the newer crawl.

See https://commoncrawl.org/connect/blog/ for announcements of the latest crawls.
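For example (the value below is only illustrative; the declaration details may differ in the actual source):

    // in DownloadURLIndex.java: the <year>-<week> id of the crawl to process
    private static final String CURRENT_CRAWL = "2023-50";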

Ideas

  • Old Index: By adding a new implementation of BlockProcesser (likely reusing existing code by deriving from one of the available implementations), you can do things like streaming processing of the file instead of storing it locally, which avoids using too much disk space; see the sketch below.
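A rough shape of such a streaming implementation (entirely hypothetical; the actual BlockProcesser interface in the oldindex package may look different):

    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: hand each block to a consumer as a stream instead of writing it to disk
    public class StreamingBlockProcessor {
        public void process(InputStream block) throws IOException {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = block.read(buffer)) != -1) {
                handleChunk(buffer, read);
            }
        }

        private void handleChunk(byte[] data, int length) {
            // e.g. feed the bytes straight into Apache POI/Tika without touching the local disk
        }
    }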

Estimates (based on Old Index)

  • Size of overall URL Index: 233,689,120,776 bytes, i.e. 217GB
  • Header: 6 bytes
  • Index-Blocks: 2,644
  • Block-Size: 65,536 bytes
  • => Data-Blocks: 3,563,169
  • Approx. files per block: 2.421275
  • Resulting approx. number of files: 8,627,412
  • Avg. size per file: 221,613 bytes
  • Needed storage: 1,911,954,989,425 bytes = 1.7TB!
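These figures are internally consistent and can be re-derived from the first four constants (small differences come from rounding):

    public class Estimates {
        public static void main(String[] args) {
            long indexSize = 233_689_120_776L; // overall URL index size in bytes
            long header = 6, indexBlocks = 2_644, blockSize = 65_536;
            long dataBlocks = (indexSize - header - indexBlocks * blockSize) / blockSize;
            long files = Math.round(dataBlocks * 2.421275);
            long storage = files * 221_613;
            System.out.println(dataBlocks); // 3,563,169
            System.out.println(files);      // 8,627,412
            System.out.println(storage);    // ~1.91e12 bytes, i.e. roughly 1.7TB
        }
    }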

Related projects/pages

Release it

./gradlew --console=plain release && ./gradlew closeAndReleaseRepository
  • This should automatically release the new version on Maven Central
  • Afterwards, go to the GitHub releases page and add release notes

Support this project

If you find this library useful and would like to support it, you can sponsor the author.

Licensing
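This project is made available under the BSD 2-Clause "Simplified" License (see the license note at the top of this page).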

commoncrawldocumentdownload's People

Contributors: centic9, sebastian-nagel

commoncrawldocumentdownload's Issues

Unable to download

I got the following error when I tried to download. I ran this command: ./gradlew downloadDocuments

2017-10-30 06:35:53 INFO    [DownloadFromCommonCrawl] Downloading line 1: 0.0000%, having 0 downloaded: {"urlkey": "io,freshsales)/", "timestamp": "2017082...
Exception in thread "main" java.lang.IllegalStateException: Unknown field found: mime-detected
        at org.dstadler.commoncrawl.index.CDXItem.parse(CDXItem.java:60)
        at org.dstadler.commoncrawl.index.DownloadFromCommonCrawl.main(DownloadFromCommonCrawl.java:43)
:downloadDocuments FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':downloadDocuments'.
> Process 'command '/usr/lib/jvm/java-8-oracle/bin/java'' finished with non-zero exit value 1

Maximum file size

Hello,
This is an absolute great project. Thanks !
Is there any way to download files larger than 2.0M? We need it for our internal project.

Warm Regards !

Documentation needed

Looks like a great project, but I'm having trouble changing the Extensions.java file in order to find different extensions. I couldn't find any documentation for how to accomplish this, and when I change the file and re-run ./gradlew check, the build fails most of the time.

Please add documentation for how to alter the Extensions file successfully, and also how to run the code against the oldindex instead of the new one.

Downloaded PDF files are capped at 1 MB

I am trying to download a large number of PDF files from Common Crawl. Files that are smaller than 1 MB are downloaded without issue. They appear on my local hard drive (I'm using a Mac) and I can open them successfully. But for some reason none of the downloaded files are larger than 1 MB, and a suspicious number of them are exactly 1 MB in size. Furthermore, none of the 1 MB files can be opened. It seems like the file data is being truncated at 1 MB. Is there a size limit somewhere in the code that I should be aware of?

I don't think I am doing anything non-standard but for completeness here is the procedure I followed:

I modified the list of extensions to only download .pdf files. Then I ran ./gradlew lookupURLs. I stopped the execution after a minute or so. I checked the commoncrawl-CC-MAIN-2022-33.txt file and confirmed that there were many lines of json data.

Then I ran ./gradlew downloadDocuments and saw PDF files start to appear in the ../download directory. I am able to open most of the downloaded PDF files that are smaller than 1 MB without any issue. However, I see many PDF files that are exactly 1 MB in size and I am unable to open any of them. In some cases the URL for the 1 MB file in the json data still works; when I download the PDF file from the URL, it is typically much larger than 1 MB.

Can you think of any reason why the downloaded PDF files are being capped at 1 MB?
