
clearcode-toolkit's Introduction

ClearCode toolkit

ClearCode is a simple tool to fetch and sync ClearlyDefined data to keep a local copy.

ClearlyDefined data are organized as deeply nested trees of JSON files.

The data synchronized with this tool include:

  • the "definitions" that contain the aggregated data from running multiple scan tools and if available a manual expert curation
  • the "harvests" that contain the actual detailed output of scans (e.g. scancode runs)

These items are not fetched for now:

  • the "attachments" that are whole original files such as a README file
  • the "deadletters" that are scan failure traces when things fail: these are not available through the API

Here are some stats on the ClearlyDefined data file set as of 2020-02-26, excluding "deadletters" and most attachments:

                    JSON Files   Directories   Files & Dirs   Gzipped Size (bytes)   On disk
  ScanCode scans     9,087,479    29,052,667     38,140,146        139,754,303,291   ~400 GB
  Defs. & misc.     38,796,760    44,825,854     83,622,614        304,861,913,800   ~1 TB
  Total             47,884,239    73,878,521    121,762,760        444,616,217,091   ~2 TB

Such a large number of files breaks just about any filesystem: a mere directory listing can take days to complete. To avoid these file size and number issues, the JSON data fetched from the ClearlyDefined API are stored as gzip-compressed JSON blobs in a PostgreSQL database, keyed by file path. That path is the same as the path used in the ClearlyDefined "blob" storage on Azure. You can also save these as real gzip-compressed JSON files on disk (with the caveat that this will make the filesystem crumble and may require a special mkfs invocation to create a filesystem with enough inodes).
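
For illustration, here is a minimal sketch of what such a storage model can look like in Django (the path and content field names match the CDitem model referenced in the issues below; the exact field types and options are assumptions):

    from django.db import models

    class CDitem(models.Model):
        # Same path as in the ClearlyDefined "blob" storage on Azure.
        path = models.CharField(max_length=2048, unique=True)
        # The gzip-compressed JSON payload, stored as a binary blob.
        content = models.BinaryField(default=b'')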

Requirements

To run this tool, you need:

  • a POSIX OS (Linux)
  • Python 3.6+
  • PostgreSQL 9.5+ if you want to handle a large dataset
  • plenty of space, bandwidth and CPU.

Quick start using a simple file storage

Run these commands to get started:

$ source configure
$ clearsync --help

For instance, try this command:

$ clearsync --output-dir clearly-local --verbose -n3

This will continuously fetch everything (definitions, harvests, etc.) from ClearlyDefined using three processes in parallel and save the output as JSON files in the clearly-local directory.

You can abort this at any time with Ctrl+C.

WARNING: this may create too many files and directories for your filesystem's sanity. Consider using the PostgreSQL storage instead.

Quick start using a database storage

First create a PostgreSQL database. This requires sudo access and has been tested on Debian and Ubuntu.

$ ./createdb.sh

Then run these commands to get started:

$ source configure
$ clearsync --help

For instance, try this command:

$ clearsync --save-to-db  --verbose -n3

This will fetch all the latest data items and save them in the "clearcode" PostgreSQL database, using three processes in parallel for fetching. You can abort this at any time with Ctrl+C.

Basic tests can be run with the following command:

$ ./manage.py test clearcode --verbosity=2

Using the Rest API and webserver to import and export items from ClearCode

This assumes you have already populated your database even partially. In a first shell, start the webserver:

$ source configure
$ ./manage.py runserver

You can then visit the API at http://127.0.0.1:8000/api/

In a second shell, you can run the command line API client tool to export data fetched from ClearlyDefined:

$ source configure
$ python etc/scripts/clearcode-api-backup.py \
  --api-root-url http://127.0.0.1:8000/api/ \
  --last-modified-date 2020-06-20

Starting backup from http://127.0.0.1:8000/api/
Collecting cditems...
821 total
[...........................................................................]
821 cditems collected.
Backup location: /etc/scripts/clearcode_backup_2020-06-23_00-30-22
Backup completed.

The exported backup is saved as a single JSON file in a directory created for this run named with a timestamp such as clearcode_backup_2020-06-22_21-04-48.

In that second shell, you can then run the command line API client tool to import data saved from the export/backup run above:

$ python etc/scripts/clearcode-api-import.py \
  --clearcode-target-api-url http://127.0.0.1:8000/api/ \
  --backup-directory etc/scripts/clearcode_backup_2020-06-23_00-30-22/

Importing objects from ../etc/scripts/clearcode_backup_2020-06-23_00-30-22 to http://127.0.0.1:8000/api/
Copying 821 cditems...........................................Copy completed.
Results saved in /etc/scripts/copy_results_2020-06-23_00-32-37.json

This is likely something you would run on an isolated ClearCode DB that you want to keep current with items exported from a live replicating DB.

Note that these tools have minimal external requirements (only the requests library) and have been designed as single files that can be copied around.

See the built-in help for more details on these two utilities:

$ python etc/scripts/clearcode-api-backup.py -h
$ python etc/scripts/clearcode-api-import.py -h

Support

File a ticket with bugs, issues or questions at https://github.com/nexB/clearcode-toolkit/

Join us to chat on Gitter (also reachable by IRC) at https://gitter.im/aboutcode-org/discuss

Release TODO

  • Merge in master and tag release.
  • pip install wheel twine
  • rm dist/*
  • python setup.py release
  • twine upload dist/*

License

Apache-2.0

clearcode-toolkit's People

Contributors

dennisclark, dependabot[bot], jonoyang, maxhbr, mjherzog, pombredanne, steven-esser


clearcode-toolkit's Issues

Also synchronize or process file "attachments"

We do not sync attachments for now. These are typically top-level key files in a package (such as a README) that ClearlyDefined mirrors and fetches a copy of directly into its database. They are problematic because they represent millions of plain text files that are serialized to JSON and that already exist in the original package anyway.
These attachments may represent a significant size and have little direct value.
If we really want them there are two options:

  1. fetch and sync them like the other CDitems (and migrate them to the DB, as we have already fetched the vast majority of these)
  2. use a pointer to an external source such as SWH (Software Heritage)

Some definitions are not fetched correctly from ClearlyDefined and are empty

There are a few definitions that are not fetched correctly from ClearlyDefined and are stored empty.

We need to track all these oddities as they seem to be mostly API errors that silently return an empty payload with no HTTP error instead of a correct value. An example:
https://api.clearlydefined.io/definitions/npm/npmjs/@zxing/ngx-scanner/1.0.0-dev.json

The symptom in the ClearCode DB is a mostly empty payload that is still stored gzip-compressed but fails to deserialize from JSON since it is empty.

There are two things to do there:

  1. a one-time fixup of the data: find the ones that we have in the DB and refetch them from the CD API.
  2. a permanent fix: when we receive a payload, check that we can load it as JSON and that it is not empty (or only whitespace). If that check fails, wait 30 to 60 seconds and retry the fetch. If it still fails, we could set a boolean flag so that the fetch is restarted later and/or record an error for it (see the sketch below).
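
A minimal sketch of that validate-and-retry logic (the fetch callable and the function names are hypothetical, not part of the codebase):

    import json
    import time

    def is_valid_payload(content):
        # A payload is valid only if it is non-empty, non-whitespace,
        # parseable JSON.
        if not content or not content.strip():
            return False
        try:
            json.loads(content)
            return True
        except ValueError:
            return False

    def fetch_with_retry(fetch, url, retries=2, wait=30):
        # Retry after `wait` seconds on an empty or invalid payload.
        # Return None if all attempts fail, so the caller can flag the
        # item to be restarted later and/or record an error.
        for _ in range(1 + retries):
            content = fetch(url)
            if is_valid_payload(content):
                return content
            time.sleep(wait)
        return None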

Note that the most likely cause is the Cloudflare front end that CD uses, which is at best capricious.
One possible workaround is to fetch from multiple IPs and hosts.

@MaJuRG how many such empty CDitems do we have roughly?

When running 'clearsync', --max-defs argument doesn't work

When running clearsync, the --max-defs command line argument doesn't work. This is because of a logic error here:

                    if max_def and max_def >= cycle_defs_count:
                        break

                if max_def and (max_def >= cycle_defs_count or max_def >= total_defs_count):
                    break

should be

                    if max_def and max_def <= cycle_defs_count:
                        break

                if max_def and (max_def <= cycle_defs_count or max_def <= total_defs_count):
                    break

Clearcode toolkit connection breaks

I am running clearcode following https://github.com/nexB/clearcode-toolkit#quick-start-using-a-database-storage. It stops partway through because the database connection breaks.
Python - 3.7
OS - Debian GNU/Linux
Postgres - PostgreSQL 12.5

Saved 0 defs and harvests, in: 4 sec.
TOTAL cycles: 5603 with: 161972 defs and combined harvests, in: 6611 sec.
Cycle completed at: 2021-02-20T13:28:29.950845 Sleeping for 60 seconds...
Fetched definitions from : npm/npmjs/@popperjs/core/2.8.0 to: npm/npmjs/@fluentui/react/7.161.0
TOTAL cycles: 5604 with: 161972 defs and combined harvests, in: 6611 sec.
Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
SSL connection has been closed unexpectedly
 
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/tg1999/clearcode-toolkit/bin/clearsync", line 11, in <module>
    load_entry_point('clearcode-toolkit', 'console_scripts', 'clearsync')()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 514, in cli
    for coordinate, file_path in definitions:
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 137, in fetch_and_save_latest_definitions
    saver=saver)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 258, in save_def
    return blob_path, saver(content=content, output_dir=output_dir, blob_path=blob_path)
  File "/home/tg1999/clearcode-toolkit/src/clearcode/sync.py", line 234, in db_saver
    cditem, created = models.CDitem.objects.get_or_create(path=blob_path)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 573, in get_or_create
    return self.get(**kwargs), False
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 425, in get
    num = len(clone)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 269, in __len__
    self._fetch_all()
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 1308, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1156, in execute_sql
    cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/tg1999/clearcode-toolkit/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
SSL connection has been closed unexpectedly

Make the cdutils.py usable as a standalone

I'd like to use the functionality of cdutils.py externally

I think the architecture of this module could be improved to make it portable and usable outside of the whole clearcode-toolkit context.
First, the module should not depend on any external libraries (except package-url for the related features): attr, click and requests should not be requirements for these low-level functionalities.

Add script to populate the initial DB

We need a script (for possibly a one time usage) that will:

  1. walk the directory tree that contains all the (millions of) files already fetched
  2. save them in the DB (see the sketch below)
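
A minimal sketch of such a script, assuming the files were fetched with clearsync as gzip-compressed JSON under an output directory, and reusing the CDitem model's path and content fields (the .json.gz layout and a configured Django environment are assumptions):

    import os

    from clearcode import models

    def populate_db(root_dir):
        # Walk the fetched tree and store each gzip-compressed JSON blob
        # in the DB, keyed by its ClearlyDefined blob path.
        for dirpath, _, filenames in os.walk(root_dir):
            for name in filenames:
                if not name.endswith('.json.gz'):
                    continue
                file_path = os.path.join(dirpath, name)
                # Assume the blob path is the path relative to the root,
                # without the .gz extension.
                blob_path = os.path.relpath(file_path, root_dir)[:-3]
                with open(file_path, 'rb') as f:
                    content = f.read()
                models.CDitem.objects.get_or_create(
                    path=blob_path, defaults={'content': content})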

Add script to export and import CDitems updated after a certain date

We need to add a script to export all CDitems that were added/updated after a given date. Ideally, we would want to export this data in JSON format, streamed via JSON lines.

However, JSON streaming will not work as we store binary data in the content field. We will need to find an alternative method that will:

  1. serialize binary data
  2. preferably stream that data in a segmented fashion

Two possible solutions are pickle, which is part of the standard library, or the protobuf library. pickle does not support streaming the way a JSON-lines solution would, and I have yet to look deeply into protobuf.
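
For comparison, a third option would be to keep JSON lines and base64-encode the binary content; a minimal sketch (the field names match the CDitem model; the record shape is an assumption):

    import base64
    import json

    def cditem_to_json_line(cditem):
        # Serialize one CDitem to a single JSON line; the binary
        # gzip-compressed content is base64-encoded so that it
        # survives the text-based format.
        record = {
            'path': cditem.path,
            'content': base64.b64encode(cditem.content).decode('ascii'),
        }
        return json.dumps(record) + '\n'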

Not able to get any data to map CD definition with SWH

{'described': {'releaseDate': '2014-07-16', 'tools': ['scancode/3.2.2'], 'toolScore': {'total': 30, 'date': 30, 'source': 0}, 'score': {'total': 30, 'date': 30, 'source': 0}}, 'coordinates': {'type': 'sourcearchive', 'provider': 'mavencentral', 'namespace': 'za.co.monadic', 'name': 'scopus_2.10', 'revision': '0.1.5'}, 'licensed': {'toolScore': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}, 'facets': {'core': {'attribution': {'unknown': 6, 'parties': ['Copyright David Weber 2014']}, 'discovered': {'unknown': 11}, 'files': 11}}, 'score': {'total': 0, 'declared': 0, 'discovered': 0, 'consistency': 0, 'spdx': 0, 'texts': 0}}, '_meta': {'schemaVersion': '1.6.1', 'updated': '2019-05-11T20:31:18.538Z'}, 'scores': {'effective': 15, 'tool': 15}}

I got this CD definition from the clearcode toolkit DB in clearcode_cditem. I was expecting some data such as hashes or a source location to map these definitions to SWH, but was not able to find any, so I wanted to ask why this one is different from the others?

Index data from ClearCode for matching

Changes need to be made to ClearCode to facilitate indexing.

A few changes to start would be:

  • Adding last_map_date and is_mappable fields to the CDitem model (see the sketch below)
    • This is to help our matching tools know which CDitems can be mapped and whether a CDitem has already been processed and looked at
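
A minimal sketch of the proposed model changes (field types and defaults are assumptions):

    from django.db import models

    class CDitem(models.Model):
        # ... existing fields such as path and content ...

        # When this item was last mapped by the matching tools, if ever.
        last_map_date = models.DateTimeField(null=True, blank=True)
        # Whether this item is a candidate for mapping at all.
        is_mappable = models.BooleanField(default=True)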
