
arche's Introduction

Scrapinghub command line client


shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.

Requirements

  • Python >= 3.6

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub

Please note that if you are using Python < 3.6, you should pin shub to 2.13.0 or lower.

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Documentation

Documentation is available online via Read the Docs: https://shub.readthedocs.io/, or in the docs directory.

arche's People

Contributors

andersonberg, gitter-badger, manycoding, oik741, simoess, tcurvelo, victor-torres


arche's Issues

Make 0 items outcome more visible

Currently, if a filtered job returns 0 items, the first test simply fails. There are some hints pointing at the number of returned items (the 0it output), but they are not visible enough. A fail-fast guard is sketched after the snippet below.

g = Gatf(source='2235/1276/18', 
         schema='schemas/Netflix-FTE/netflix-show.json',
         target='2235/1276/18',
         filters=[("META_TEMPLATE_NAME", "=", ["Title"])],
         expand=False)
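
A minimal sketch of such a guard, assuming items is the list fetched after applying filters:

def check_not_empty(items, source):
    # surface the 0-items outcome loudly instead of letting a later rule fail
    if not items:
        raise ValueError(
            f"{source} returned 0 items after filtering - "
            "check the filters and the job outcome"
        )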

Field counts don't trigger an error for jobs with lots of NaN

ghat = Gatf(source='364692/1/17', 
            schema='schemas/Global Strategies/amazon_product.json',
            target='364692/1/14',
#             schema=schema
           )

The price fields here differ by 15%, and yet there is no error in the log. Or perhaps we don't really want to flag improved coverage?

Output keys for compare_fields_counts

It's inspired by:

Do you have any tool to find missing fields compared to previous crawls?
I would like to know which items have lost these fields to see what happened here:

At the moment this rule outputs just the field name and the difference; perhaps it would be more helpful to also include the particular keys.

Consuming items data to df creates inconsistencies with jsonschema

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, given items data
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN

Yet in JSON Schema the null type means None, and a missing field is allowed by not listing it in required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", availability],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", null]},
    },
    "additionalProperties": False,
}

Last but not least, the inconsistencies between the JSON schema and the data persist when we feed a DataFrame directly (unless the user handles it themselves). One possible reconciliation is sketched below.
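
A sketch for row-by-row validation: drop NaN cells so a missing field stays missing, and cast integral floats back to int to undo the pandas upcast. Note it still cannot distinguish an intentional None from a missing field.

import math

def row_to_item(row):
    # drop NaN cells so a missing field stays missing for jsonschema,
    # and undo the float upcast pandas applies to int columns with NaN
    item = {}
    for key, value in row.items():
        if isinstance(value, float):
            if math.isnan(value):
                continue
            if value.is_integer():
                value = int(value)
        item[key] = value
    return item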

Is modin worth it

https://github.com/modin-project/modin
They claim a lot; let's see what we get with actual data (a benchmark sketch follows below).

I feel like the only thing which really makes a difference (100x) is numpy versus plain CPython. That's covered in fastai ML 2018, and perhaps in fastai Computational Linear Algebra too.
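
A minimal benchmark sketch, assuming a local items.json dump (hypothetical path); modin presents itself as a drop-in pandas replacement, so the comparison is the same call through both imports:

import time
import pandas as pd
import modin.pandas as mpd  # drop-in pandas API backed by Ray/Dask

for lib in (pd, mpd):
    start = time.time()
    df = lib.read_json("items.json")  # hypothetical local dump of job items
    print(lib.__name__, time.time() - start)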

Refactor rules which output dataframe or series to plot them

  1. The dataframe/series should be in message.stats
  2. Report plots message.stats as a horizontal bar chart (barh) for series, so this should be updated to also support dataframes (and check whether a barh plot actually makes sense in all cases for series; see the sketch after this list)
  3. Don't forget to actually ask people; perhaps they prefer text in some cases
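
A quick sketch of the difference with plain pandas plotting; a Series maps cleanly to barh, while a DataFrame produces grouped bars that may not read well for many rows:

import pandas as pd

stats = pd.Series({"name": 950, "price": 720, "url": 1000})
stats.plot.barh()  # single horizontal bar chart, one bar per field

df = pd.DataFrame({"source": stats, "target": stats * 0.9})
df.plot.barh()  # grouped bars per field - check readability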

Read data from dataframe

Support at least dataframes, which will allow reading data locally from any source (CSV, JSON, remote or local).

Currently the library relies on having _key to report items by. The implementation could look like this:

  1. Figure out a simple API (fastai-style datablock?), like:

items = Items.from_csv(...)  # or Items.from_job(...)
schema = Schema.get_schema(schema)
items.report_all(schema)

# And keep it granular enough so it can be used in Spidermon
arche.rules.duplicates.find_by(items.df, ["name", "title"])

  2. Add a _key column. Maybe it's easier to use _key as the index when it's present and report by index (see the sketch after this list)
  3. _type. So far nobody has really needed _type, since we can use filters.
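
A sketch of the _key handling from point 2, assuming a plain CSV read (hypothetical file name):

import pandas as pd

df = pd.read_csv("items.csv")
if "_key" not in df.columns:
    # synthesize keys from the index so reporting by _key keeps working
    df["_key"] = df.index.astype(str)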

Reading schemas from private repos

Some schemas live inside repos, and maybe they belong there along with the code.
Assuming this, it would be most convenient to fetch those schemas directly.

Both GitHub and Bitbucket provide tokens, so it's then just a matter of specifying the raw URL (a sketch follows below).
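
A minimal fetch sketch, assuming a GitHub personal access token and a hypothetical raw URL; Bitbucket works similarly with its own credentials:

import requests

resp = requests.get(
    "https://raw.githubusercontent.com/org/repo/master/schemas/item.json",
    headers={"Authorization": "token <GITHUB_TOKEN>"},  # personal access token
)
resp.raise_for_status()
schema = resp.json()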

Sort by difference in Fields Counts

The resulting df is sorted by field name; it makes more sense to sort it by difference instead (a one-line fix is sketched after the table).

                                                    Difference, %
_validation                                                   100
_validation/price_now                                         100
_validation/price_was                                         100
configurations/description                                     65
configurations/name                                            65
configurations/options                                         65
delivery_options/date_range                                    13
delivery_options/name                                          13
delivery_options/price                                         13
spec_10th_Hard_Drive                                          100
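
The fix itself is a one-liner; a self-contained sketch over a few of the rows above:

import pandas as pd

df = pd.DataFrame(
    {"Difference, %": [100, 65, 13]},
    index=["_validation", "configurations/name", "delivery_options/price"],
)
df = df.sort_values(by="Difference, %", ascending=False)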

Use `pool` to download items with `filter`

The current logic of dividing into batches by start_index and count doesn't account for filter.
When using filter, the returned items' _key values don't correspond to the actual index, so the data repeats. One possible direction is sketched below.
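
A sketch with a hypothetical fetch_batch(start, count) wrapper over the items API: paginate by the last returned _key instead of a computed start_index, so filtered-out items can't shift or duplicate batches.

def iter_filtered(fetch_batch, count=1000):
    # fetch_batch is a hypothetical wrapper taking an exclusive start key;
    # keying batches off the last seen _key keeps them disjoint even when
    # `filter` drops items
    start = None
    while True:
        batch = fetch_batch(start=start, count=count)
        if not batch:
            break
        yield from batch
        start = batch[-1]["_key"]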

Replace ' with "

basic_json_schema generates schemas with single quotes, and then tox complains with an error: "E Expecting property name enclosed in double quotes: line 1 column 2 (char 1)"
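
Serializing through the json module instead of printing the Python dict should fix this, since json.dumps always emits double quotes:

import json

schema = {"type": "object"}          # stands in for the generated schema
print(json.dumps(schema, indent=4))  # valid JSON with double quotes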

Find duplicates of items by chosen fields

Currently there's no convenient way to find duplicates by chosen fields, only by one field or by certain tags.
Make or update the rule which consumes fields and outputs items with equal values in those fields. E.g. I want to find duplicates by key and url, so given this data:

[{"key": 0, "name": "bob", "url": "example.com"}, {"key": 0, "name": "john", "url": "example.com"}]

I want to see 2 duplicates (a pandas sketch follows below).
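
With the items in a DataFrame, pandas already covers this; a sketch over the example data:

import pandas as pd

df = pd.DataFrame([
    {"key": 0, "name": "bob", "url": "example.com"},
    {"key": 0, "name": "john", "url": "example.com"},
])
dupes = df[df.duplicated(subset=["key", "url"], keep=False)]
print(len(dupes))  # 2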

flatten_df is too slow

  1. Can it be rewritten to avoid recursion? (see the sketch after this list)

  2. If not, profile and see how to improve. To test, use jobs with nested fields and a considerable amount of items.

  3. Add tests to check the speed
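
A non-recursive sketch for point 1, walking nested dicts and lists with an explicit stack:

def flatten_item(item, sep="_"):
    # iterative flattening: an explicit stack replaces the recursion
    flat = {}
    stack = [("", item)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            for k, v in value.items():
                stack.append((f"{prefix}{sep}{k}" if prefix else k, v))
        elif isinstance(value, list):
            for i, v in enumerate(value):
                stack.append((f"{prefix}{sep}{i}" if prefix else str(i), v))
        else:
            flat[prefix] = value
    return flat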

Configurable thresholds

Right now the thresholds are hardcoded in the boolean, category, and coverage comparison rules.
It makes sense to make them configurable so they can be changed globally (a sketch follows the snippet below).

    err_diffs = difs[difs > 10]
    if not err_diffs.empty:

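A minimal sketch of making this configurable: one module-level mapping that rules read instead of literals.

# e.g. in a hypothetical arche/settings.py
THRESHOLDS = {"fields_coverage_diff": 10}

# and in the rule, with difs as in the snippet above:
err_diffs = difs[difs > THRESHOLDS["fields_coverage_diff"]]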

JSON schema summary error

The summary should read "errors_count errors for items_count of items".
For example

JSON Schema Validation:
        551 errors for 277 of items

Currently, it's

JSON Schema Validation:
	500 items were checked, 9 error(s)

Rule - enum statistics

Get any enums and return a df which shows the corresponding value counts.

E.g. if the schema has a field with "enum": ["black", "white"],
it should return a similar df (perhaps pandas value_counts() suits this purpose; see the sketch after the table):

value name      percentage   count
total values    80%          40/50
black           50%          25/50
white           30%          15/50
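
value_counts() indeed gets most of the way there; a sketch reproducing the example numbers:

import pandas as pd

values = pd.Series(["black"] * 25 + ["white"] * 15 + [None] * 10)
counts = values.value_counts()         # black 25, white 15
coverage = counts / len(values) * 100  # black 50%, white 30%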

Documentation as ipynb

GitHub supports ipynb, so why not.

It is as simple as:

  1. Execute everything from the documentation (quickstart, etc.)
  2. Add the ipynb with links to the repo

Switch to another plot theme

Red is the error colour, and the current ggplot theme has too much of it.
Red normally signals something bad, an error, so I want to reserve that colour for errors (see the sketch below).
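
With cufflinks this is one config call; "pearl" here is just one candidate theme with less red:

import cufflinks as cf

cf.set_config_file(theme="pearl")  # any built-in theme other than ggplot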

Bitbucket auth to fetch files from private repos

The idea is to set an id/password (or other credentials) on the jupyterhub cluster so everyone can access raw schemas from there without any additional actions.

#65 is not the best solution since: 1. it requires real user credentials, meaning each user has to set their own id/password, and 2. most, if not all, Bitbucket ids use Google SSO, which doesn't work in that case.

After talking with @tcurvelo, we found there are 2 options:

OAuth2, manual steps:

  1. Create a 'client' app on some account (which account?) that has access to all repos
  2. Copy its client_id and secret and set them up in the jhub configuration

In the code:

  3. Before requesting a resource, authenticate and receive a temporary token
  4. Using the token, request the api.bitbucket.org endpoint

API

The idea is that we can auth with the API using an app password. The app password belongs to a user group which has all the access.

  1. Create an app password
  2. Use the password in API requests to Bitbucket (see the sketch below)
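
A sketch of the app password option; the account, repo, and password placeholders are hypothetical:

import requests

resp = requests.get(
    "https://api.bitbucket.org/2.0/repositories/<org>/<repo>/src/master/schema.json",
    auth=("shared-account", "<APP_PASSWORD>"),  # app password via basic auth
)
resp.raise_for_status()
schema = resp.json()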

Sort by coverage in scraped fields

  1. Separate the output with a newline - missing fields and new fields
  2. Sort by field coverage

spec_Joints_Qty - coverage - 0% - 2 items

New Fields

spec_Dimensions_with_stand_H_x_W_x_D - coverage - 0% - 2 items

g = Gatf(source="307140/6/75",
         schema='schemas/Dell/99595_dell_us.json',
         target="307141/21/36",
         expand=False)

Figure out trusted notebooks

JS from plotly/cufflinks is blocked by Jupyter as not trusted.
A workaround is to mark the notebook as trusted (e.g. jupyter trust notebook.ipynb), which requires an additional action.

See what can be done (why is plotly not trusted?)

Include field names in jsonschema Additional Properties error

Currently one has to guess which properties they are:

1626 items affected - Additional properties are not allowed: 1032 609 895 1064 13

Should be like

1626 items affected - Additional properties are not allowed - "SOMETHING", "SHOULDN'T", "BE HERE": 1032 609 895 1064 13
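
jsonschema doesn't expose these names as a structured field, but with the item and schema at hand they are one set difference away; a sketch:

def extra_fields(item, schema):
    # the offending names are whatever the item has beyond the schema
    return set(item) - set(schema.get("properties", {}))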

Tags

Check Tags - show possible tags along with the specified ones:

Tags:
    used: category, product_price_field, unique, product_url_field, name_field
    not used: unique
    'text' field(s) was not found in source, but specified in schema
    'text' field(s) was not found in target, but specified in schema
    Skipping tag rules

Tags Rules - skip them

Do not limit schema errors by default

E.g.

2028 items affected - Additional properties are not allowed: 1122, 151, 1365, 1257, 1799
43 items affected - part_number is not of type 'string': 560, 910, 820, 1809, 369
43 items affected - name is not of type 'string': 1376, 1684, 1691, 343, 135
43 items affected - availability is not of type 'integer': 1999, 265, 1852, 820, 1849
43 items affected - availability is not one of [0, 1, 2, 3, 4, 5, 6]: 1321, 1691, 1153, 1774, 963
Should print all 9 messages.
