
arche's Introduction

Scrapinghub command line client


shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.

Requirements

  • Python >= 3.6

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub

Please note that if you are using Python < 3.6, you should pin shub to 2.13.0 or lower.

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Documentation

Documentation is available online via Read the Docs: https://shub.readthedocs.io/, or in the docs directory.

arche's People

Contributors

andersonberg, gitter-badger, manycoding, oik741, simoess, tcurvelo, victor-torres


arche's Issues

Make 0 items outcome more visible

Currently, if a filtered job returns 0 items, the first test simply fails. There are some hints pointing at the number of returned items (the 0it output), but they are not visible enough. A fail-fast guard is sketched after the snippet below.

g = Gatf(source='2235/1276/18', 
         schema='schemas/Netflix-FTE/netflix-show.json',
         target='2235/1276/18',
         filters=[("META_TEMPLATE_NAME", "=", ["Title"])],
         expand=False)
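
A minimal sketch of such a guard, assuming items is the list fetched after applying filters:

def check_not_empty(items, source):
    # surface the 0-items outcome loudly instead of letting a later rule fail
    if not items:
        raise ValueError(
            f"{source} returned 0 items after filtering - "
            "check the filters and the job outcome"
        )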

Field counts don't trigger an error for jobs with lots of NaN

ghat = Gatf(source='364692/1/17', 
            schema='schemas/Global Strategies/amazon_product.json',
            target='364692/1/14',
#             schema=schema
           )

The price fields here differ by 15%, and yet there is no error in the log. Or perhaps we don't really want to flag improved coverage?

Output keys for compare_fields_counts

It's inspired by:

Do you have any tool to find missing fields compared to previous crawls?
I would like to know which items have lost these fields to see what happened here:

At the moment this rule outputs just the field name and the difference; perhaps it would be more helpful to also include the particular keys.

Consuming items data to df creates inconsistencies with jsonschema

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, given items data
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN

Yet in JSON Schema the null type means None, and a missing field is allowed by not listing it in required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", availability],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", null]},
    },
    "additionalProperties": False,
}

Last but not least, the inconsistencies between the JSON schema and the data persist when we feed a DataFrame directly (unless the user handles it themselves). One possible reconciliation is sketched below.
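
A sketch for row-by-row validation: drop NaN cells so a missing field stays missing, and cast integral floats back to int to undo the pandas upcast. Note it still cannot distinguish an intentional None from a missing field.

import math

def row_to_item(row):
    # drop NaN cells so a missing field stays missing for jsonschema,
    # and undo the float upcast pandas applies to int columns with NaN
    item = {}
    for key, value in row.items():
        if isinstance(value, float):
            if math.isnan(value):
                continue
            if value.is_integer():
                value = int(value)
        item[key] = value
    return item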

Is modin worth it

https://github.com/modin-project/modin
They claim a lot; let's see what we get with actual data (a benchmark sketch follows below).

I feel like the only thing which really makes a difference (100x) is numpy versus plain CPython. That's covered in fastai ML 2018, and perhaps in fastai Computational Linear Algebra too.
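
A minimal benchmark sketch, assuming a local items.json dump (hypothetical path); modin presents itself as a drop-in pandas replacement, so the comparison is the same call through both imports:

import time
import pandas as pd
import modin.pandas as mpd  # drop-in pandas API backed by Ray/Dask

for lib in (pd, mpd):
    start = time.time()
    df = lib.read_json("items.json")  # hypothetical local dump of job items
    print(lib.__name__, time.time() - start)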

Refactor rules which output dataframe or series to plot them

  1. The dataframe/series should be in message.stats
  2. Report plots message.stats as a horizontal bar chart (barh) for series, so this should be updated to also support dataframes (and check whether a barh plot actually makes sense in all cases for series; see the sketch after this list)
  3. Don't forget to actually ask people; perhaps they prefer text in some cases
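
A quick sketch of the difference with plain pandas plotting; a Series maps cleanly to barh, while a DataFrame produces grouped bars that may not read well for many rows:

import pandas as pd

stats = pd.Series({"name": 950, "price": 720, "url": 1000})
stats.plot.barh()  # single horizontal bar chart, one bar per field

df = pd.DataFrame({"source": stats, "target": stats * 0.9})
df.plot.barh()  # grouped bars per field - check readability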

Read data from dataframe

Support at least dataframes, which will allow reading data locally from any source (CSV, JSON, remote or local).

Currently the library relies on having _key to report items by. The implementation could look like this:

  1. Figure out a simple API (fastai-style datablock?), like:

items = Items.from_csv(...)  # or Items.from_job(...)
schema = Schema.get_schema(schema)
items.report_all(schema)

# And keep it granular enough so it can be used in Spidermon
arche.rules.duplicates.find_by(items.df, ["name", "title"])

  2. Add a _key column. Maybe it's easier to use _key as the index when it's present and report by index (see the sketch after this list)
  3. _type. So far nobody has really needed _type, since we can use filters.
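
A sketch of the _key handling from point 2, assuming a plain CSV read (hypothetical file name):

import pandas as pd

df = pd.read_csv("items.csv")
if "_key" not in df.columns:
    # synthesize keys from the index so reporting by _key keeps working
    df["_key"] = df.index.astype(str)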

Reading schemas from private repos

Some schemas live inside repos, and maybe they belong there along with the code.
Assuming this, it would be most convenient to fetch those schemas directly.

Both GitHub and Bitbucket provide tokens, so it's then just a matter of specifying the raw URL (a sketch follows below).
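
A minimal fetch sketch, assuming a GitHub personal access token and a hypothetical raw URL; Bitbucket works similarly with its own credentials:

import requests

resp = requests.get(
    "https://raw.githubusercontent.com/org/repo/master/schemas/item.json",
    headers={"Authorization": "token <GITHUB_TOKEN>"},  # personal access token
)
resp.raise_for_status()
schema = resp.json()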

Sort by difference in Fields Counts

The resulting df is sorted by field name; it makes more sense to sort it by difference instead (a one-line fix is sketched after the table).

                                                    Difference, %
_validation                                                   100
_validation/price_now                                         100
_validation/price_was                                         100
configurations/description                                     65
configurations/name                                            65
configurations/options                                         65
delivery_options/date_range                                    13
delivery_options/name                                          13
delivery_options/price                                         13
spec_10th_Hard_Drive                                          100
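
The fix itself is a one-liner; a self-contained sketch over a few of the rows above:

import pandas as pd

df = pd.DataFrame(
    {"Difference, %": [100, 65, 13]},
    index=["_validation", "configurations/name", "delivery_options/price"],
)
df = df.sort_values(by="Difference, %", ascending=False)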

Use `pool` to download items with `filter`

The current logic of dividing into batches by start_index and count doesn't account for filter.
When using filter, the returned items' _key values don't correspond to the actual index, so the data repeats. One possible direction is sketched below.
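
A sketch with a hypothetical fetch_batch(start, count) wrapper over the items API: paginate by the last returned _key instead of a computed start_index, so filtered-out items can't shift or duplicate batches.

def iter_filtered(fetch_batch, count=1000):
    # fetch_batch is a hypothetical wrapper taking an exclusive start key;
    # keying batches off the last seen _key keeps them disjoint even when
    # `filter` drops items
    start = None
    while True:
        batch = fetch_batch(start=start, count=count)
        if not batch:
            break
        yield from batch
        start = batch[-1]["_key"]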

Replace ' with "

basic_json_schema generates schemas with single quotes, and then tox complains with an error: "E Expecting property name enclosed in double quotes: line 1 column 2 (char 1)"
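
Serializing through the json module instead of printing the Python dict should fix this, since json.dumps always emits double quotes:

import json

schema = {"type": "object"}          # stands in for the generated schema
print(json.dumps(schema, indent=4))  # valid JSON with double quotes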

Find duplicates of items by chosen fields

Currently there's no convenient way to find duplicates by chosen fields, only by one field or by certain tags.
Make or update the rule which consumes fields and outputs items with equal values in those fields. E.g. I want to find duplicates by key and url, so given this data:

[{"key": 0, "name": "bob", "url": "example.com"}, {"key": 0, "name": "john", "url": "example.com"}]

I want to see 2 duplicates (a pandas sketch follows below).
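
With the items in a DataFrame, pandas already covers this; a sketch over the example data:

import pandas as pd

df = pd.DataFrame([
    {"key": 0, "name": "bob", "url": "example.com"},
    {"key": 0, "name": "john", "url": "example.com"},
])
dupes = df[df.duplicated(subset=["key", "url"], keep=False)]
print(len(dupes))  # 2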

flatten_df is too slow

  1. Can it be rewritten to avoid recursion? (see the sketch after this list)

  2. If not, profile and see how to improve. To test, use jobs with nested fields and a considerable amount of items.

  3. Add tests to check the speed
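
A non-recursive sketch for point 1, walking nested dicts and lists with an explicit stack:

def flatten_item(item, sep="_"):
    # iterative flattening: an explicit stack replaces the recursion
    flat = {}
    stack = [("", item)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            for k, v in value.items():
                stack.append((f"{prefix}{sep}{k}" if prefix else k, v))
        elif isinstance(value, list):
            for i, v in enumerate(value):
                stack.append((f"{prefix}{sep}{i}" if prefix else str(i), v))
        else:
            flat[prefix] = value
    return flat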

Configurable thresholds

Right now the thresholds are hardcoded in the boolean, category, and coverage comparison rules.
It makes sense to make them configurable so they can be changed globally (a sketch follows the snippet below).

    err_diffs = difs[difs > 10]
    if not err_diffs.empty:

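A minimal sketch of making this configurable: one module-level mapping that rules read instead of literals.

# e.g. in a hypothetical arche/settings.py
THRESHOLDS = {"fields_coverage_diff": 10}

# and in the rule, with difs as in the snippet above:
err_diffs = difs[difs > THRESHOLDS["fields_coverage_diff"]]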

JSON schema summary error

The summary should read "errors_count errors for items_count of items".
For example

JSON Schema Validation:
        551 errors for 277 of items

Currently, it's

JSON Schema Validation:
	500 items were checked, 9 error(s)

Rule - enum statistics

Get any enums and return a df which shows the corresponding value counts.

E.g. if the schema has a field with "enum": ["black", "white"],
it should return a similar df (perhaps pandas value_counts() suits this purpose; see the sketch after the table):

value name      percentage   count
total values    80%          40/50
black           50%          25/50
white           30%          15/50
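
value_counts() indeed gets most of the way there; a sketch reproducing the example numbers:

import pandas as pd

values = pd.Series(["black"] * 25 + ["white"] * 15 + [None] * 10)
counts = values.value_counts()         # black 25, white 15
coverage = counts / len(values) * 100  # black 50%, white 30%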

Documentation as ipynb

GitHub supports ipynb, so why not.

It is as simple as:

  1. Execute everything from the documentation (quickstart, etc.)
  2. Add the ipynb with links to the repo

Switch to another plot theme

Red is the error colour, and the current ggplot theme has too much of it.
Red normally signals something bad, an error, so I want to reserve that colour for errors (see the sketch below).
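
With cufflinks this is one config call; "pearl" here is just one candidate theme with less red:

import cufflinks as cf

cf.set_config_file(theme="pearl")  # any built-in theme other than ggplot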

Bitbucket auth to fetch files from private repos

The idea is to set an id/password (or other credentials) on the jupyterhub cluster so everyone can access raw schemas from there without any additional actions.

#65 is not the best solution since: 1. it requires real user credentials, meaning each user has to set their own id/password, and 2. most, if not all, Bitbucket ids use Google SSO, which doesn't work in that case.

After talking with @tcurvelo, we found there are 2 options:

OAuth2, manual steps:

  1. Create a 'client' app on some account (which account?) that has access to all repos
  2. Copy its client_id and secret and set them up in the jhub configuration

In the code:

  3. Before requesting a resource, authenticate and receive a temporary token
  4. Using the token, request the api.bitbucket.org endpoint

API

The idea is that we can auth with the API using an app password. The app password belongs to a user group which has all the access.

  1. Create an app password
  2. Use the password in API requests to Bitbucket (see the sketch below)
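
A sketch of the app password option; the account, repo, and password placeholders are hypothetical:

import requests

resp = requests.get(
    "https://api.bitbucket.org/2.0/repositories/<org>/<repo>/src/master/schema.json",
    auth=("shared-account", "<APP_PASSWORD>"),  # app password via basic auth
)
resp.raise_for_status()
schema = resp.json()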

Sort by coverage in scraped fields

  1. Separate the output with a newline - missing fields and new fields
  2. Sort by field coverage

spec_Joints_Qty - coverage - 0% - 2 items

New Fields

spec_Dimensions_with_stand_H_x_W_x_D - coverage - 0% - 2 items

g = Gatf(source="307140/6/75",
         schema='schemas/Dell/99595_dell_us.json',
         target="307141/21/36",
         expand=False)

Figure out trusted notebooks

JS from plotly/cufflinks is blocked by Jupyter as not trusted.
A workaround is to mark the notebook as trusted (e.g. jupyter trust notebook.ipynb), which requires an additional action.

See what can be done (why is plotly not trusted?)

Include field names in jsonschema Additional Properties error

Currently one has to guess which properties they are:

1626 items affected - Additional properties are not allowed: 1032 609 895 1064 13

Should be like

1626 items affected - Additional properties are not allowed - "SOMETHING", "SHOULDN'T", "BE HERE": 1032 609 895 1064 13
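
jsonschema doesn't expose these names as a structured field, but with the item and schema at hand they are one set difference away; a sketch:

def extra_fields(item, schema):
    # the offending names are whatever the item has beyond the schema
    return set(item) - set(schema.get("properties", {}))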

Tags

Check Tags - show possible tags along with the specified ones:

Tags:
    used: category, product_price_field, unique, product_url_field, name_field
    not used: unique
    'text' field(s) was not found in source, but specified in schema
    'text' field(s) was not found in target, but specified in schema
    Skipping tag rules

Tags Rules - skip them

Do not limit schema errors by default

E.g.

2028 items affected - Additional properties are not allowed: 1122, 151, 1365, 1257, 1799
43 items affected - part_number is not of type 'string': 560, 910, 820, 1809, 369
43 items affected - name is not of type 'string': 1376, 1684, 1691, 343, 135
43 items affected - availability is not of type 'integer': 1999, 265, 1852, 820, 1849
43 items affected - availability is not one of [0, 1, 2, 3, 4, 5, 6]: 1321, 1691, 1153, 1774, 963
Should print all 9 messages.
