
cdx-index-client's Introduction

CommonCrawl Index Client (CDX Index API Client)

A simple Python command-line tool for retrieving a list of urls in bulk using the CommonCrawl Index API at http://index.commoncrawl.org (or any other web archive CDX Server API).

Examples

The tool takes advantage of the CDX Server Pagination API and the Python multiprocessing support to load pages (chunks) of a large url index in parallel.

This may be especially useful for prefix/domain extraction.

To use, first install dependencies: pip install -r requirements.txt (The script has been tested on Python 2.7.x)

For example, to fetch all entries in the index for the url http://iana.org/ from the index CC-MAIN-2015-06, run: ./cdx-index-client.py -c CC-MAIN-2015-06 http://iana.org/

It is often a good idea to check how big the dataset is first: ./cdx-index-client.py -c CC-MAIN-2015-06 '*.io' --show-num-pages

This prints the number of pages that will be fetched to get a list of urls in the '*.io' domain.

The page count gives a relative sense of the size of the query. A query with thousands of pages may take a long time!
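The page count comes from the CDX Server pagination API: the same query with showNumPages=true returns a small JSON summary instead of results. A minimal sketch of the URL the client requests (the endpoint name is the Common Crawl convention of collection name plus -index):

```python
from urllib.parse import urlencode

# Hypothetical direct pagination query against the CC-MAIN-2015-06 index;
# showNumPages=true asks the CDX server for the page count only.
API = 'http://index.commoncrawl.org/CC-MAIN-2015-06-index'
params = urlencode({'url': '*.io', 'showNumPages': 'true'})
url = API + '?' + params
print(url)
# Fetching this URL returns a small JSON object with the page count
# (fields such as "pages" and "pageSize").
```
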

Then, you can fetch the list of urls from the index which are part of the *.io domain, as follows:

./cdx-index-client.py -c CC-MAIN-2015-06 '*.io' --fl url -z

The --fl flag specifies that only the url field should be fetched, instead of the entire index row.

The -z flag stores the output gzip-compressed.

For the above query, the output will be stored in files named domain-io-N.gz, where N is the page number (zero-padded to a fixed number of digits).
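The gzipped page files can be read back with the standard library; a sketch, assuming the domain-io-N.gz naming scheme described above:

```python
import glob
import gzip

def read_pages(pattern='domain-io-*.gz'):
    """Yield index lines from each gzipped page file, in page order."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, 'rt') as f:   # 'rt' decompresses to text
            for line in f:
                yield line.rstrip('\n')

for line in read_pages():
    print(line)
```
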

Usage Options

Below is the current list of options, also available by running ./cdx-index-client.py -h

usage: CDX Index API Client [-h] [-n] [-p PROCESSES] [--fl FL] [-j] [-z]
                            [-o OUTPUT_PREFIX] [-d DIRECTORY]
                            [--page-size PAGE_SIZE]
                            [-c COLL | --cdx-server-url CDX_SERVER_URL]
                            [--timeout TIMEOUT] [--max-retries MAX_RETRIES]
                            [-v] [--pages [PAGES [PAGES ...]]]
                            [--header [HEADER [HEADER ...]]] [--in-order]
                            url

positional arguments:
  url                   url to query in the index: For prefix, use:
                        http://example.com/* For domain query, use:
                        *.example.com

optional arguments:
  -h, --help            show this help message and exit
  -n, --show-num-pages  Show Number of Pages only and exit
  -p PROCESSES, --processes PROCESSES
                        Number of worker processes to use
  --fl FL               select fields to include: eg, --fl url,timestamp
  -j, --json            Use json output instead of cdx(j)
  -z, --gzipped         Store gzipped results, with .gz extensions
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Custom output prefix, append with -NN for each page
  -d DIRECTORY, --directory DIRECTORY
                        Specify custom output directory
  --page-size PAGE_SIZE
                        size of each page in blocks, >=1
  -c COLL, --coll COLL  The index collection to use or "all" to use all
                        available indexes. The default value is the most
                        recent available index
  --cdx-server-url CDX_SERVER_URL
                        Set endpoint for CDX Server API
  --timeout TIMEOUT     HTTP read timeout before retry
  --max-retries MAX_RETRIES
                        Number of retry attempts
  -v, --verbose         Verbose logging of debug msgs
  --pages [PAGES [PAGES ...]]
                        Get only the specified result page(s) instead of all
                        results
  --header [HEADER [HEADER ...]]
                        Add custom header to request
  --in-order            Fetch pages in order (default is to shuffle page list)

Additional Use Cases

While this tool was designed specifically for use with the index at http://index.commoncrawl.org, it can also be used with any other running cdx server, including pywb, OpenWayback and IA Wayback.

The client uses a common subset of the pywb CDX Server API and the original IA Wayback CDX Server API, and so should work with either of these tools.

To specify a custom api endpoint, simply use the --cdx-server-url flag. For example, to connect to a locally running server, you can run:

./cdx-index-client.py example.com/* --cdx-server-url http://localhost:8080/pywb-cdx
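Under the hood these are plain HTTP GET queries, so any compatible server can also be hit directly. The sketch below builds (but does not send) such a request against the hypothetical local pywb endpoint from the example above, using parameters (matchType, output, fl) from the common pywb/IA API subset:

```python
import requests

# Hypothetical local pywb endpoint from the example above.
CDX_API = 'http://localhost:8080/pywb-cdx'

def build_query(api_url, target, fields='url,timestamp'):
    """Prepare (without sending) a CDX query using the common
    pywb / IA Wayback parameter subset."""
    return requests.Request('GET', api_url, params={
        'url': target,
        'matchType': 'prefix',   # equivalent to a trailing '*' on the url
        'output': 'json',
        'fl': fields,
    }).prepare()

prepared = build_query(CDX_API, 'example.com/')
print(prepared.url)
# To execute: requests.Session().send(prepared)
```
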

cdx-index-client's People

Contributors

ddacunha, ikreymer, s-nt-s, samueltc


cdx-index-client's Issues

Error

I get the error

Traceback (most recent call last):
  File "cdx-index-client.py", line 382, in <module>
    main()
  File "cdx-index-client.py", line 379, in main
    read_index(r, info['cdx-api'], info['id'])
  File "cdx-index-client.py", line 292, in read_index
    num_pages = get_num_pages(api_url, r.url, r.page_size)
  File "cdx-index-client.py", line 33, in get_num_pages
    pages_info = r.json()
  File "/root/commoncrawl-env-py3/lib/python3.6/site-packages/requests/models.py", line 896, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

when using the command ./cdx-index-client.py -c all http://skcript.com/* --fl url -d crawled

Get a WARC archive with all files from a domain

I'd like to download all pages from the www.ipc.com domain as a WARC archive file (or several files), so I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here, I would expect to get some WARC entries for www.ipc.com, but I get a "random" chunk of the input file.
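The offset and length in the index address a byte range within the compressed .warc.gz file, not the gunzipped archive, so decompressing the whole file first invalidates the offsets. The usual approach is to fetch exactly that byte range with an HTTP Range request and gunzip only that member; a sketch, assuming requests (the http_get parameter is just an injection point for testing):

```python
import gzip
import requests

def fetch_warc_record(warc_url, offset, length, http_get=requests.get):
    """Fetch a single WARC record. The index 'offset' and 'length'
    address bytes in the *compressed* .warc.gz file, so we request
    exactly that range and decompress only that gzip member."""
    end = offset + length - 1
    resp = http_get(warc_url,
                    headers={'Range': 'bytes={}-{}'.format(offset, end)})
    return gzip.decompress(resp.content)

# With the values from the index record above, the call would look like:
# record = fetch_warc_record(
#     'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/'
#     'segments/1422115861027.55/warc/'
#     'CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz',
#     768421563, 9953)
```
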

No matching distribution found for urlparse

$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
...
  Could not find a version that satisfies the requirement urlparse (from -r requirements.txt (line 3)) (from versions: )
No matching distribution found for urlparse (from -r requirements.txt (line 3))

Python 2.7.10 (default, Oct 6 2017, 22:29:07)

new -c flag

I just tried running ./cdx-index-client.py using this example from the README.md with an up-to-date git clone:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages

It returns:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages
usage: CDX Index API Client [-h] [-n] [-p PROCESSES] [--fl FL] [-j] [-z]
                            [-o OUTPUT_PREFIX] [--page-size PAGE_SIZE]
                            [--cdx-server-url CDX_SERVER_URL]
                            [--timeout TIMEOUT] [--max-retries MAX_RETRIES]
                            [-v] [--pages [PAGES [PAGES ...]]]
                            [--header [HEADER [HEADER ...]]] [--in-order]
                            [collection] url
CDX Index API Client: error: unrecognized arguments: -c

I checked the script and there is no parser.add_argument() for -c. The last commit added -c to the README.md and that was it. Did the documentation get updated before the code?

JSONDecodeError

The below error is raised when running the example command python2 cdx-index-client.py -c CC-MAIN-2015-06 http://iana.org/

Traceback (most recent call last):
  File "cdx-index-client.py", line 382, in <module>
    main()
  File "cdx-index-client.py", line 379, in main
    read_index(r, info['cdx-api'], info['id'])
  File "cdx-index-client.py", line 292, in read_index
    num_pages = get_num_pages(api_url, r.url, r.page_size)
  File "cdx-index-client.py", line 33, in get_num_pages
    pages_info = r.json()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
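Both JSONDecodeError reports above fail inside r.json(), which means the server answered with something other than JSON (often an HTML error page, a rate-limit response, or an empty body). A small diagnostic wrapper can surface what actually came back; this is a sketch assuming requests, not the client's actual code, and the function name is illustrative:

```python
import requests

def get_pages_info(api_url, query_url, page_size=None):
    """Query the pagination API, but report non-JSON replies
    instead of crashing inside r.json()."""
    params = {'url': query_url, 'showNumPages': 'true'}
    if page_size:
        params['pageSize'] = page_size
    r = requests.get(api_url, params=params)
    try:
        return r.json()
    except ValueError:  # json and simplejson decode errors both subclass this
        raise RuntimeError('Non-JSON reply (HTTP {}): {!r}'.format(
            r.status_code, r.text[:200]))
```

Checking r.status_code and the first bytes of r.text this way usually identifies the cause.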

SyntaxError: Missing parentheses in call to 'print'

i@scheherezade:/opt/cdx-index-client$ sudo -E -H pip install -r requirements.txt 
[sudo] password for i: 
Requirement already satisfied: requests in /usr/local/lib/python3.5/site-packages (from -r requirements.txt (line 1))
Collecting beautifulsoup (from -r requirements.txt (line 2))
  Using cached BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/
