ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
License: MIT License
I get the following error:
Traceback (most recent call last):
File "cdx-index-client.py", line 382, in <module>
main()
File "cdx-index-client.py", line 379, in main
read_index(r, info['cdx-api'], info['id'])
File "cdx-index-client.py", line 292, in read_index
num_pages = get_num_pages(api_url, r.url, r.page_size)
File "cdx-index-client.py", line 33, in get_num_pages
pages_info = r.json()
File "/root/commoncrawl-env-py3/lib/python3.6/site-packages/requests/models.py", line 896, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This happens when running the command ./cdx-index-client.py -c all http://skcript.com/* --fl url -d crawled
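The traceback ends in r.json() inside get_num_pages, which means the response body was not valid JSON (for example an empty body or an HTML error page). The following is a minimal sketch of a more defensive parse, not the script's actual code. The helper name parse_pages_info is hypothetical, and the response is assumed to be a requests-style object exposing status_code, url and text.

```python
import json


def parse_pages_info(response):
    """Return the decoded JSON body, or raise a readable error.

    `response` is assumed to be a requests-style object exposing
    `status_code`, `url` and `text`. This helper is hypothetical and
    is not part of cdx-index-client.py.
    """
    if response.status_code != 200:
        raise RuntimeError(
            "Index server returned HTTP %s for %s"
            % (response.status_code, response.url)
        )
    try:
        return json.loads(response.text)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError on Python 3, and
        # Python 2's json module raises ValueError directly.
        snippet = response.text[:200]
        raise RuntimeError(
            "Expected JSON from %s but got: %r" % (response.url, snippet)
        )
```

With a check like this, the error message would show the HTTP status or the first bytes of the unexpected body instead of a bare "Expecting value".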
I just tried running ./cdx-index-client.py using this example from the README.md with an up-to-date git clone:
./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages
It returns:
./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages
usage: CDX Index API Client [-h] [-n] [-p PROCESSES] [--fl FL] [-j] [-z]
[-o OUTPUT_PREFIX] [--page-size PAGE_SIZE]
[--cdx-server-url CDX_SERVER_URL]
[--timeout TIMEOUT] [--max-retries MAX_RETRIES]
[-v] [--pages [PAGES [PAGES ...]]]
[--header [HEADER [HEADER ...]]] [--in-order]
[collection] url
CDX Index API Client: error: unrecognized arguments: -c
I checked the script and there is no parser.add_argument() call for -c. The last commit added -c to the README.md and nothing else. Did the documentation get updated before the code?
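For illustration, here is a sketch of how a -c option could be declared with argparse so that the README example parses. This is an assumption about one possible fix, not the project's code: the long name --coll, the prog string, and the mapping of -n to --show-num-pages are all guesses based on the usage text above.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the arguments needed for the README example.
    parser = argparse.ArgumentParser(prog="cdx-index-client-sketch")
    # Assumed option: take the collection via -c instead of a positional.
    parser.add_argument("-c", "--coll", default=None,
                        help="collection id, e.g. CC-MAIN-2015-06")
    parser.add_argument("url", help="url or url pattern to query")
    # Assumed to correspond to the [-n] flag shown in the usage output.
    parser.add_argument("-n", "--show-num-pages", action="store_true",
                        help="only print the number of result pages")
    return parser


args = build_parser().parse_args(
    ["-c", "CC-MAIN-2015-06", "*.io", "--show-num-pages"]
)
print(args.coll, args.url, args.show_num_pages)
```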
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
...
Could not find a version that satisfies the requirement urlparse (from -r requirements.txt (line 3)) (from versions: )
No matching distribution found for urlparse (from -r requirements.txt (line 3))
Python 2.7.10 (default, Oct 6 2017, 22:29:07)
i@scheherezade:/opt/cdx-index-client$ sudo -E -H pip install -r requirements.txt
[sudo] password for i:
Requirement already satisfied: requests in /usr/local/lib/python3.5/site-packages (from -r requirements.txt (line 1))
Collecting beautifulsoup (from -r requirements.txt (line 2))
Using cached BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/
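Both installs fail because of entries in requirements.txt. urlparse is a Python 2 standard-library module (renamed urllib.parse in Python 3), so pip has no package of that name to install. BeautifulSoup 3.2.1 is Python 2 only; the Python 3 compatible package is beautifulsoup4. A minimal sketch of an import that works on either interpreter without listing urlparse as a dependency (the helper host_of is hypothetical and only there to show usage):

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2 standard library


def host_of(url):
    """Return the network location (host) part of a URL."""
    return urlsplit(url).netloc


print(host_of("http://iana.org/"))
```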
The error below is raised when running the example command python2 cdx-index-client.py -c CC-MAIN-2015-06 http://iana.org/
Traceback (most recent call last):
File "cdx-index-client.py", line 382, in <module>
main()
File "cdx-index-client.py", line 379, in main
read_index(r, info['cdx-api'], info['id'])
File "cdx-index-client.py", line 292, in read_index
num_pages = get_num_pages(api_url, r.url, r.page_size)
File "cdx-index-client.py", line 33, in get_num_pages
pages_info = r.json()
File "/usr/lib/python2.7/dist-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I'd like to download all pages from the www.ipc.com domain into a WARC archive file (or several files), so I do the following:
$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]
$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc
Here I would expect to get some WARC entries for www.ipc.com, but I get a "random" chunk of the input file.
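The offset and length in the index line refer to the compressed .warc.gz file, where each record is stored as its own gzip member. Applying them after gunzip therefore lands at an unrelated position. Below is a minimal sketch that slices the compressed file first and decompresses only that slice. The function name read_warc_record is hypothetical, and the file is assumed to be already downloaded locally.

```python
import gzip


def read_warc_record(path, offset, length):
    """Return one decompressed WARC record from a local .warc.gz file.

    `offset` and `length` are the values from the index line and are
    byte positions in the *compressed* file.
    """
    with open(path, "rb") as f:
        f.seek(offset)              # zero-based byte offset
        compressed = f.read(length)
    # The slice is a complete gzip member, so it decompresses on its own.
    return gzip.decompress(compressed)
```

For the index line above this would be called as read_warc_record("CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz", 768421563, 9953). The same slice can be cut in the shell before gunzip, keeping in mind that tail -c +N counts from 1, so the zero-based offset needs one added to it.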