
cdx-index-client's Issues

Error

I get the error

Traceback (most recent call last):
  File "cdx-index-client.py", line 382, in <module>
    main()
  File "cdx-index-client.py", line 379, in main
    read_index(r, info['cdx-api'], info['id'])
  File "cdx-index-client.py", line 292, in read_index
    num_pages = get_num_pages(api_url, r.url, r.page_size)
  File "cdx-index-client.py", line 33, in get_num_pages
    pages_info = r.json()
  File "/root/commoncrawl-env-py3/lib/python3.6/site-packages/requests/models.py", line 896, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

when using the command ./cdx-index-client.py -c all http://skcript.com/* --fl url -d crawled

JSONDecodeError

The below error is raised when running the example command python2 cdx-index-client.py -c CC-MAIN-2015-06 http://iana.org/

Traceback (most recent call last):
  File "cdx-index-client.py", line 382, in <module>
    main()
  File "cdx-index-client.py", line 379, in main
    read_index(r, info['cdx-api'], info['id'])
  File "cdx-index-client.py", line 292, in read_index
    num_pages = get_num_pages(api_url, r.url, r.page_size)
  File "cdx-index-client.py", line 33, in get_num_pages
    pages_info = r.json()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib/python2.7/dist-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
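
The traceback above fails at character 0 of the body, which usually means the server returned an empty or non-JSON response (for example an HTML error page) and `r.json()` was called on it anyway. Below is a minimal sketch of a more defensive parse, assuming only that the caller has the HTTP status code and the response text at hand. `parse_pages_info` is a hypothetical helper for illustration and is not part of cdx-index-client.

```python
import json


def parse_pages_info(status_code, body):
    """Parse a JSON body, raising a descriptive error instead of a bare decode error.

    Hypothetical helper for illustration; not part of cdx-index-client.
    """
    if status_code != 200:
        # Show the start of the body so the real server message is visible.
        raise RuntimeError("server returned HTTP %d: %r" % (status_code, body[:200]))
    try:
        return json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        raise RuntimeError("response is not JSON (%s): %r" % (exc, body[:200]))
```

With `requests`, this could be called as `parse_pages_info(r.status_code, r.text)`, so that the error message includes whatever the server actually sent.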

new -c flag

I just tried running ./cdx-index-client.py using this example from the README.md with an up-to-date git clone:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages

It returns:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages
usage: CDX Index API Client [-h] [-n] [-p PROCESSES] [--fl FL] [-j] [-z]
                            [-o OUTPUT_PREFIX] [--page-size PAGE_SIZE]
                            [--cdx-server-url CDX_SERVER_URL]
                            [--timeout TIMEOUT] [--max-retries MAX_RETRIES]
                            [-v] [--pages [PAGES [PAGES ...]]]
                            [--header [HEADER [HEADER ...]]] [--in-order]
                            [collection] url
CDX Index API Client: error: unrecognized arguments: -c

I checked the script and there is no parser.add_argument() call for -c. The last commit added -c to the README.md and nothing else. Did the documentation get updated before the code?
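
For reference, here is a sketch of what registering such an option with `argparse` might look like. The long name `--coll`, the default value, and the help texts are assumptions made for illustration; this is not the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser that accepts -c, as the README example expects (sketch)."""
    parser = argparse.ArgumentParser(prog="CDX Index API Client")
    # Assumed option: the long name and default are hypothetical.
    parser.add_argument("-c", "--coll", default="all",
                        help="collection to query, e.g. CC-MAIN-2015-06")
    parser.add_argument("url", help="url or url pattern to query")
    parser.add_argument("--show-num-pages", action="store_true",
                        help="only print the number of pages")
    return parser


# The README example parses without the 'unrecognized arguments' error.
args = build_parser().parse_args(["-c", "CC-MAIN-2015-06", "*.io", "--show-num-pages"])
```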

SyntaxError: Missing parentheses in call to 'print'

i@scheherezade:/opt/cdx-index-client$ sudo -E -H pip install -r requirements.txt 
[sudo] password for i: 
Requirement already satisfied: requests in /usr/local/lib/python3.5/site-packages (from -r requirements.txt (line 1))
Collecting beautifulsoup (from -r requirements.txt (line 2))
  Using cached BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/user/1001/pip-build-3kgqmdp4/beautifulsoup/

get a WARC archive with all files from a domain

I'd like to download all pages from the www.ipc.com domain in a WARC archive file (or several files), so I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here, I would expect to get some WARC entries for www.ipc.com, but I get a "random" chunk of the input file.
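
A likely cause: the `offset` and `length` values in the index refer to byte positions in the compressed `.warc.gz` file, where each record is stored as its own gzip member, not to positions in the gunzipped file. Note also that `tail -c +N` counts from 1, while the index offset is 0-based. Below is a minimal sketch that slices the compressed file first and decompresses afterwards; `read_warc_record` is a hypothetical helper name.

```python
import gzip


def read_warc_record(warc_gz_path, offset, length):
    """Return the decompressed record stored at bytes [offset, offset + length).

    Hypothetical helper; offset and length come from the index line and
    refer to the compressed .warc.gz file.
    """
    with open(warc_gz_path, "rb") as f:
        f.seek(offset)           # 0-based position in the compressed file
        member = f.read(length)  # exactly one gzip member
    return gzip.decompress(member)
```

With the values from the index line above, the call would be `read_warc_record("CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz", 768421563, 9953)`, run against the file as downloaded, without gunzipping it first.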

No matching distribution found for urlparse

$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
...
  Could not find a version that satisfies the requirement urlparse (from -r requirements.txt (line 3)) (from versions: )
No matching distribution found for urlparse (from -r requirements.txt (line 3))

Python 2.7.10 (default, Oct 6 2017, 22:29:07)
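
For context, `urlparse` is a standard-library module on Python 2 and lives at `urllib.parse` on Python 3, so it ships with the interpreter in both cases and the entry in requirements.txt may be unnecessary. A sketch of an import that works on either version:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

parts = urlparse("http://iana.org/")
print(parts.netloc)  # prints "iana.org"
```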
