Arche
The library helps verify data using a set of defined rules, for example:
- Validation with JSON schema
- Coverage
- Duplicates
- Garbage symbols
- Comparison of two jobs
We use it at Scrapinghub.
At the moment the tool supports only Scrapy Cloud job data as input. The core libraries are pandas, plotly and jsonschema.
Use case
- You need to perform QA on Scrapy Cloud jobs continuously. Say, you scraped some website and have the data ready in the cloud. A typical approach would be:
  - Create a JSON schema and validate the scraped data with it
  - Use the created schema in Spidermon Validation
- You want to use it in your application to verify Scrapy Cloud data
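The first step of the workflow above can be sketched with plain Python. The field names and the tiny validator below are illustrative only; a real setup would use the jsonschema package, which Arche builds on.

```python
# A minimal JSON schema sketch for scraped items (field names are
# illustrative, not taken from the Arche docs).
schema = {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
}

item = {"name": "Widget", "price": 9.99}

def validate(item, schema):
    """Tiny stand-in validator: checks required fields and basic types."""
    errors = []
    for field in schema["required"]:
        if field not in item:
            errors.append(f"missing required field: {field}")
    type_map = {"string": str, "number": (int, float)}
    for field, rules in schema["properties"].items():
        if field in item and not isinstance(item[field], type_map[rules["type"]]):
            errors.append(f"wrong type for field: {field}")
    return errors

print(validate(item, schema))  # → []
```

Once a schema like this holds up against a sample of items, the same schema file can be reused for Spidermon validation.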
Usage
The library is intended to work in a Jupyter environment and has its own plain text report module. It's assumed that:
- You have the library installed there with all dependencies
- The SH_APIKEY environment variable is set
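Assuming the package is published on PyPI under the name arche (check the project's install docs if in doubt), the setup might look like:

```shell
# Install the library with its dependencies and expose the
# Scrapy Cloud API key Arche reads from the environment
# (the key value below is a placeholder).
pip install arche
export SH_APIKEY="<your Scrapy Cloud API key>"
```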
A simple example will look like this:
```python
from arche import Arche

g = Arche(source="112358/13/21")
g.report_all()
g.data_quality_report()
```
The outcome of the executed rules will be printed, along with some fancy graphs.
Developer Setup
```shell
pipenv install --dev
pipenv shell
tox
```
Developer Usage
The library consists of two core modules: `arche.rules` and `arche.report`. If you wish to use just the rules and implement reporting yourself, here's one example of usage:

```python
import arche.rules.duplicates as dup_rules

result = dup_rules.check_uniqueness(df, tagged_fields)
```

Each rule returns an `arche.rules.result.Result` object which can be parsed however you like.
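To get a feel for what a uniqueness rule does, here is a rough stdlib-only stand-in (this is not Arche's actual implementation, and the field names are made up for illustration):

```python
from collections import Counter

# Sample scraped items; the "id" field is the one we expect to be unique.
items = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 2, "name": "c"},
]

def find_duplicates(items, unique_fields):
    """Report values that appear more than once in fields expected to be unique."""
    errors = {}
    for field in unique_fields:
        counts = Counter(item[field] for item in items)
        dupes = [value for value, n in counts.items() if n > 1]
        if dupes:
            errors[field] = dupes
    return errors

print(find_duplicates(items, ["id"]))  # → {'id': [2]}
```

Arche's real rules do the analogous work on a pandas DataFrame and wrap the findings in a `Result` for reporting.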
Documentation
Contribution
Any contributions, no matter how minor, are welcome!
- Fork or create a new branch
- Make desired changes
- Open a pull request
To update the docs, check the docs section of tox.ini.