datopian / assembler
The DataHub data assembly line
License: MIT License
Choose a good number of rows for previewing the data (based on our current core datasets).
E.g. the DPP_ELASTICSEARCH variable (with an example to illustrate whether it should be fully qualified or not ...)
Problem: when you republish a dataset it takes a few minutes, even if only the metadata changed.
Proposal: split the pipeline into separate metadata and data pipelines.
Copying from openknowledge-archive/dpr-api#273
When we upload any file to the pkgstore, the content type is set to binary/octet-stream by default. But we want text/plain:
e.g. datapackage.json has binary/octet-stream but should be json. Try:
curl -I http://pkgstore-testing.datahub.io/core/finance-vix/latest/datapackage.json
Why is this a problem:
We want the right content type for JSON, and text/plain for everything else (or we could even just have text/plain for JSON too).
Add a Content-Type (e.g. text/plain) while publishing data to S3.
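A minimal sketch of what this could look like, assuming boto3 is the client used for the pkgstore upload; the bucket and key are taken from the curl example above:

import boto3

s3 = boto3.client('s3')
# Upload the descriptor with an explicit Content-Type instead of the
# binary/octet-stream default
with open('datapackage.json', 'rb') as f:
    s3.put_object(
        Bucket='pkgstore-testing.datahub.io',
        Key='core/finance-vix/latest/datapackage.json',
        Body=f.read(),
        ContentType='application/json',  # or 'text/plain' for other files
    )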
If you visit http://datahub.io/anuveyatsu/gdp/pipelines, you can see that it has a status of INVALID with these errors:
Dirty dependency: Cannot run until all dependencies are executed
Dependency unsuccessful: Cannot run until all dependencies are successfully executed
We need two new processors to support preview resources: one for getting the first N rows from a given resource, and one for adding views to the datapackage.json.
Parameters: limit (will default to 10000)
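A minimal sketch of the "first N rows" processor, assuming the datapackage-pipelines ingest/spew wrapper; the processor name and wiring are illustrative:

import itertools
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()
limit = parameters.get('limit', 10000)

def limited(resources):
    for resource in resources:
        # Stop each resource stream after `limit` rows
        yield itertools.islice(resource, limit)

spew(datapackage, limited(resource_iterator))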
One of our core datasets, finance-vix, which should be updated daily, is stuck on the 2nd of October - the scheduled job is running but the data is never updated.
This is happening because only the "main" pipeline runs on schedule while the derived ones remain the same.
We need to export data to sqlite if requested.
Parameters:
rewrite (always creates new tables), append (append new rows) and update (update if row exists) - but let's keep it simple.
Path on S3: data/sqlite/data/{resourcename}.db
# request body from CLI
{
...
kind: sqlite
}
meta:
owner: <owner username>
ownerid: <owner unique id>
dataset: <dataset name>
version: 1
findability: <published/unlisted/private>
inputs:
- # only one input is supported atm
kind: datapackage
url: <datapackage-url>
parameters:
resource-mapping:
<resource-name-or-path>: <resource-url>
outputs:
-
kind: sqlite
When sqlite is in outputs, we need to add two processors:
# pipeline-spec
meta:
...
inputs:
-
kind: datapackage
...
outputs:
-
kind: sqlite
# generator.py in assembler
pipeline = [current_pipeline]
for output in outputs:
    if output['kind'] == 'sqlite':
        pipeline.append({
            'run': 'dump.to_sql',
            'parameters': {'engine': 'sqlite:///'}
        })
        # etc.
yield pipeline_id, {'pipeline': pipeline}
Questions:
What should the path for it be?
We are introducing private datasets for users. We need to test that the pipelines execute fine and that the links return 403.
When developing locally and running all pipelines, not all of them are executed - no errors, no warnings, they are marked as successful, but some of them are actually not executed at all. (I strongly believe this is happening not only locally, but in production as well - see the second comment.)
Recently we've implemented a new pipeline that generates the preview resource. This pipeline cannot be executed until derived/csv is finished. Also, "the main" pipeline, which depends on all the others being finished, cannot be executed either. So:
Running dpp run all the first time executes non-tabular plus all derived ones (including preview) successfully - there are no error messages or warnings. However, the final one (the one that should contain everything together plus the original) is not there.
Running dpp run all again, that final one gets there and we have the complete dataset.
For example, dpp lists all pipelines with their errors. Running dpp for the first time (none of the pipelines are executed yet) logs:
Available Pipelines:
INFO :Main :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular (*)
- ./core/finance-vix:vix-daily_csv (*)
- ./core/finance-vix:vix-daily_csv_preview (E)
Dirty dependency: Cannot run until all dependencies are executed
Dependency unsuccessful: Cannot run until all dependencies are successfully executed
- ./core/finance-vix:vix-daily_json (*)
- ./core/finance-vix (E)
Dirty dependency: Cannot run until all dependencies are executed
Dependency unsuccessful: Cannot run until all dependencies are successfully executed
Dirty dependency: Cannot run until all dependencies are executed
Dependency unsuccessful: Cannot run until all dependencies are successfully executed
Running dpp for the second time (after the first dpp run all finishes):
Available Pipelines:
INFO :Main :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular
- ./core/finance-vix:vix-daily_csv
- ./core/finance-vix:vix-daily_csv_preview (*)
- ./core/finance-vix:vix-daily_json
- ./core/finance-vix (E)
Dirty dependency: Cannot run until all dependencies are executed
Dependency unsuccessful: Cannot run until all dependencies are successfully executed
Running dpp now (after the second dpp run all finished) - now everything is fine:
Available Pipelines:
INFO :Main :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular
- ./core/finance-vix:vix-daily_csv
- ./core/finance-vix:vix-daily_csv_preview
- ./core/finance-vix:vix-daily_json
- ./core/finance-vix (*)
How dpp run all works: each pipeline starts with dirty=False.
In our case, two pipelines have dependencies: derived_preview and the main pipeline. So when iterating it goes like this: dirty=False, dirty=False ... but for some reason this change is not happening. I cannot find the reason why the status of a successfully executed pipeline is still dirty on the final iteration.
NOTE: this is not happening if you run dpp run dirty. All pipelines are executed fine in that case.
Feedback after reviewing export to zip:
out-file should default to dataset-name.zip instead of datahub.zip [0.5]

I have pushed a new source spec for one of the core datasets using the specstore API. It returned success=true.
The source spec I pushed (and currently in the DB):
meta:
dataset: finance-vix
findability: published
owner: core
ownerid: core
version: 2
inputs:
- kind: datapackage
parameters:
resource-mapping:
vix-daily: http://www.cboe.com/publish/ScheduledTask/MktData/datahouse/vixcurrent.csv
url: https://raw.githubusercontent.com/datasets/finance-vix/master/.datahub/datapackage.json
processing:
-
input: vix-daily
tabulator:
skip_rows: 2
headers:
- Date
- VIXOpen
- VIXHigh
- VIXLow
- VIXClose
output: vix-daily
schedule:
crontab: '0 0 * * *'
Source spec from dashboard http://api.datahub.io/pipelines/#anchor-ALL-core-finance-vix
inputs:
- kind: datapackage
parameters:
resource-mapping:
vix-daily: https://s3.amazonaws.com/rawstore.datahub.io/9c46b3948d297fa1c4e92ead714f0399
url: https://s3.amazonaws.com/rawstore.datahub.io/66387913a43fc2a04ca602ce0d529b1c
meta:
dataset: finance-vix
findability: unlisted
owner: core
ownerid: core
update_time: '2017-09-26T10:02:32.931873'
version: 1
Currently, we are using SourceSpecRegistry to read from the DB which pipelines need to be run. Since we extracted the planning part of assembling and changed the flow management, SourceSpecRegistry is not useful anymore.
Generated artifacts should be stored on the pkgstore S3 based on their hash, not their path or pipeline-id properties.
TODO: proper description and tasks
We have 2 core data packages that have not passed the pipeline. The problem is the geopoint type.
The first time, it errored with invalid geopoint type. Then I fixed both of them. Now we have a different error, probably in the jsontableschema library.
Error output from the pipeline:
AttributeError: 'list' object has no attribute 'split'
It seems the best way to handle the README is to store it inline in the datapackage.json in the rawstore.
This means the assembler pipeline must extract the readme from the readme property and dump it to README.md (and delete that property from the datapackage.json) when dumping to the pkgstore.
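A minimal sketch of that step, assuming the datapackage-pipelines process helper; the file handling is illustrative:

from datapackage_pipelines.wrapper import process

def modify_datapackage(datapackage, parameters, stats):
    # Pull the inline readme out of the descriptor and write it to disk
    readme = datapackage.pop('readme', None)
    if readme is not None:
        with open('README.md', 'w') as f:
            f.write(readme)
    return datapackage

process(modify_datapackage=modify_datapackage)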
How are we going to deal with sorting data that is time sensitive? E.g. for Finance VIX, which is updated daily, we need the last 200 rows (or maybe the first for other datasets) to display the latest data. We could add a {desc: true/false} option to the processing object of the source-spec.
cc @rufuspollock and @akariv - what do you think about this?
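A rough sketch of keeping only the last N rows of a streamed resource, using a bounded buffer; the function name and default are hypothetical:

from collections import deque

def last_n_rows(rows, limit=200):
    # A deque with maxlen keeps only the most recent `limit` rows
    buffer = deque(rows, maxlen=limit)
    yield from buffer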
As a Publisher I want to upload a 50Mb CSV and have the showcase page work
basic_nodes.py (#22)
node_collector.py with the new processing node
load_modified_resources.py to handle dp.json with views
We will need two new processors: generate_views (for adding the views) and one that just takes derived resources (json) and yields only the first 10000 rows.
Modifications for basic_nodes.py:
class DerivedFormatProcessingNode(BaseProcessingNode):
    def get_artifacts(self):
        for artifact in self.available_artifacts:
            if artifact.datahub_type == 'derived/json':  # or csv
                datahub_type = 'derived/preview'
                resource_name = artifact.resource_name + '_preview.json'
                output = ProcessingArtifact(
                    datahub_type, resource_name,
                    [artifact], [],
                    [],  # new processors + old go here
                    True
                )
                yield output

class DerivedPreviewProcessingNode(...):
    def __init__(self):
        ...  # (all same)
Inside node_collector.py we will have to add DerivedPreviewProcessingNode in the appropriate order (it should come after derived json, as it depends on that). And we will have to modify load_modified_resources.py to handle "views", as views are not yet supported there.
Example datapackage.json after running pipelines:
{
  "resources": [
    {
      "name": "vix-daily",
      "path": "data/vix-daily.csv",
      "format": "csv",
      "mediatype": "text/csv",
      "schema": {...}
    },
    {}, // derived csv
    {}, // derived json
    { // even though this is json data, we have a table schema
      "name": "vix-daily_csv_preview",
      "path": "data/preview/vix-daily.json",
      "datahub": {
        "derivedFrom": [
          "vix-daily_csv"
        ],
        "forView": [
          "datahub-preview-vix-daily_csv_preview"
        ],
        "type": "derived/preview" // data will be json ...
      }
    }
  ],
  "views": [
    {
      "name": "graph",
      "title": "VIX - CBOE Volatility Index",
      "specType": "simple",
      "spec": {
        "type": "line",
        "group": "Date",
        "series": [
          "VIXClose"
        ]
      }
    },
    {
      "name": "datahub-preview-vix-daily_csv_preview",
      "specType": "table",
      "datahub": {
        "type": "preview"
      },
      "transform": {
        "limit": 10000
      },
      "resources": [
        "vix-daily_csv_preview"
      ]
    }
  ]
}
We need a new processing node to define one more processing flow (preview datasets). For that, we need to implement a class inheriting from BaseProcessingNode, and add that class name to the ORDERED_NODE_CLASSES array. The new processing node will iterate over all resources and add processors that generate the first 10000 rows and views to the ones that have type derived/json.
Use load_previews and load_views from #23
Add class DerivedPreviewProcessingNode that will inherit from DerivedFormatProcessingNode
In DerivedPreviewProcessingNode, override get_artifacts and iterate over existing artefacts (resources)
Add DerivedPreviewProcessingNode to ORDERED_NODE_CLASSES in the correct position (last in the list)
Example of how the refactored basic_nodes.py may look:
class DerivedFormatProcessingNode(BaseProcessingNode):
    def get_artifacts(self):
        for artifact in self.available_artifacts:
            if artifact.datahub_type == 'derived/csv':
                datahub_type = 'derived/preview'
                resource_name = artifact.resource_name + '_preview.'
                output = ProcessingArtifact(
                    datahub_type, resource_name,
                    [artifact], [],
                    [],  # new processors + old go here
                    True
                )
                yield output

class DerivedPreviewProcessingNode(...):
    def __init__(self):
        ...  # (all same)
The resulting datapackage should have the bytes and rowCount set correctly:
rowCount should be the sum of all row counts of derived/csv resources.
bytes should be the sum of all bytes from all resources.
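A minimal sketch of this aggregation over a parsed descriptor; the key names follow the datahub conventions used elsewhere in this document:

def dataset_stats(descriptor):
    resources = descriptor.get('resources', [])
    # rowCount: only derived/csv resources count towards it
    row_count = sum(r.get('rowCount', 0) for r in resources
                    if r.get('datahub', {}).get('type') == 'derived/csv')
    # bytes: every resource counts
    total_bytes = sum(r.get('bytes', 0) for r in resources)
    return {'rowCount': row_count, 'bytes': total_bytes}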
If you try to download zip files, the data inside is placed in the following structure:
|- data/data.csv
|- data/filehash1/data_csv.csv
|- data/filehash2/data_json.csv
|- datapackage.json
We don't need the hashes - this should be fixed in dump.to_zip in outputNodes.
Currently, we are copying non-tabular resources across without any modification to the dp.json. We need to change this behaviour, as in some cases we have to reuse one and add it as a new resource to the dp.json. The add_resource processor searches for the resource in the dp.json by its name, and as non-tabular resources are copied across, their names are copied from the originals. So we are getting AssertionError: Failed to find a resource with the index or name matching 'non-tabular', as the planner is trying to add the resource using the artefact name ((add_resource, {resource: resource_info[required_artifact.resource_name]}) ...).
Fix NonTabularProcessingNode in basic_nodes.py to yield artifacts with correct resource names: {resource name}_non_tabular
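A hypothetical sketch of that change, following the ProcessingArtifact shape from the snippets above; the type check is an assumption:

class NonTabularProcessingNode(BaseProcessingNode):
    def get_artifacts(self):
        for artifact in self.available_artifacts:
            if artifact.datahub_type == 'non-tabular':  # assumed type name
                # Yield under a unique name so add_resource can find it
                resource_name = '{}_non_tabular'.format(artifact.resource_name)
                yield ProcessingArtifact(
                    artifact.datahub_type, resource_name,
                    [artifact], [], [], True)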
Besides metadata, we want to store events in Elasticsearch:
{
"timestamp": ...
// type of event
"event_entity": "flow|account|..."
"event_action": "create/finished/deleted/...
// event filters
"owner": ...
"dataset": ... // may not always be there ...
// results
"outcome": "ok/error", or "status": "good/bad/ugly",
"messsage": ... // mostly blank
"findability": ...
"payload": {
// anything you want add - customizing the event info per object
flow-id (aka pipeline-id)
}
}
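A hedged sketch of writing one such event with the official Python client; the index name, host and field values are assumptions:

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
event = {
    'timestamp': datetime.utcnow().isoformat(),
    'event_entity': 'flow',
    'event_action': 'finished',
    'owner': 'core',
    'dataset': 'finance-vix',
    'outcome': 'ok',
    'message': '',
    'findability': 'published',
    'payload': {'flow-id': 'core/finance-vix'},
}
es.index(index='events', doc_type='event', body=event)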
When writing datapackage.json to the pkgstore, pretty-print the JSON so it is nice for users to read (this is convenient when browsing directly).
Not sure this is that important ;-)
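A one-line sketch with the standard library; the descriptor and the two-space indent are illustrative:

import json

descriptor = {'name': 'finance-vix', 'resources': []}  # illustrative
with open('datapackage.json', 'w') as f:
    json.dump(descriptor, f, indent=2, ensure_ascii=False)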
stream_remote_resources: WARNING :Error while opening resource from url https://s3.amazonaws.com/rawstore-testing.datahub.io/sQqpgDlCdaDFdRjzxbZN9Q==: FormatError('Format "None" is not supported',)
stream_remote_resources: Traceback (most recent call last):
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 144, in opener
stream_remote_resources: _stream.open()
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/tabulator/stream.py", line 132, in open
stream_remote_resources: raise exceptions.FormatError(message)
stream_remote_resources: tabulator.exceptions.FormatError: Format "None" is not supported
stream_remote_resources: During handling of the above exception, another exception occurred:
stream_remote_resources: Traceback (most recent call last):
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 201, in
stream_remote_resources: rows = stream_reader(resource, url, ignore_missing or url == "")
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 159, in stream_reader
stream_remote_resources: schema, headers, stream, close = get_opener(url, _resource)()
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 153, in opener
stream_remote_resources: _stream.close()
stream_remote_resources: File "/usr/local/lib/python3.6/site-packages/tabulator/stream.py", line 156, in close
stream_remote_resources: self.__parser.close()
stream_remote_resources: AttributeError: 'NoneType' object has no attribute 'close'
If you take a look at the paths of the resources in the datapackage.json, they look like this: "https://pkgstore.datahub.io/core/finance-vix:non-tabular/data/vix-daily.csv"
The structure on the S3 is the same:
| pkgstore.datahub.io
--| core
--|--| finance-vix:non-tabular
--|--|--| data
--|--|--|--| vix-daily.csv
@akariv Is this done on purpose, or should there be a / there (and this is a bug)?
Example dp.json: https://pkgstore.datahub.io/core/finance-vix/latest/datapackage.json
As an admin, I want to record most activities (sign up, sign in, push etc.) so that I know how frequently users are using the platform and how often they achieve expected results
As a Publisher, I want to see what has happened recently with my data (and in future what my team has done) so that I know things are working, what's changed etc
As a Viewer I want to see what a Publisher has been up to so that I know whether they are active, what's new and what's hot (so that ...)
In order to do statistical analyses, we need to record as many activities on the platform as possible. A good place for storing this info would be Elasticsearch.
What are the events?
{
"timestamp": ...
// type of event
"event_entity": "flow|account|..."
"event_action": "create/finished/deleted/...
// event filters
"owner": ...
"dataset": ... // may not always be there ...
// results
"outcome": "ok/error", or "status": "good/bad/ugly",
"messsage": ... // mostly blank
"findability": ...
"payload": {
// anything you want add - customizing the event info per object
flow-id (aka pipeline-id)
}
}
{
"timestamp":
"owner":
"payload": ...
}
For user dashboard:
SELECT * FROM events WHERE event.owner = userid ORDER BY timestamp DESC;
SELECT * FROM events WHERE ... AND ...
For user public profile
SELECT * FROM events WHERE event.owner = profileid AND findability = 'public' AND type != 'login'
Internally
SELECT count(*) FROM events WHERE type = 'dataset/push' AND outcome = 'failed'
Recently, we have implemented generating derived/preview resources, which are the first 10k rows of the original resource. It was decided to make them JSON format for convenience in the frontend. However, as we can see now, the JSON versions are usually larger than the CSV versions. E.g., for the finance-vix dataset, the original resource is ~228KB, while the derived/preview version is ~528KB. This becomes crucial when loading resources that are several MB in size.
Another point: we could skip generating derived/preview if the number of rows is <10k, as it does not make a lot of sense - or am I missing something? The frontend would use the original resource for the preview if derived/preview is not found. Note that we have to generate the preview views anyway.
In addition, we could reduce the number of rows for preview resources, e.g., to only the first 1k rows. Consider https://pkgstore.datahub.io/f512ef8d39ada702fde60efe1ca59c17/farm-url/latest/datapackage.json which has a resource with 4987 rows. What do you think?
If you check this dataset - https://pkgstore.datahub.io/f512ef8d39ada702fde60efe1ca59c17/farm-url/latest/datapackage.json - it has bytes: 51811260, which is ~52MB, but in fact it is ~17MB.

We have a failing test on Travis. There are multiple revisions saved while there should be one per dataset. This is happening because we are erasing data from the DB only once (at the start) - we need to do that after each test.
@zelima commented on Thu Nov 09 2017
Depending on the config passed to the planner, we have different kinds of outputs. We need to test all of them.
As a user, I want to see datapackage.json prettified rather than on one line, so that it is easy for me to read.
Research and implement storing data in S3 so that it's downloaded using gzip compression (it might already be this way; need to check first).
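A hedged sketch of a gzip-compressed upload, assuming boto3; the bucket and key names are illustrative. S3 then serves the object with Content-Encoding: gzip so clients decompress transparently:

import gzip
import boto3

s3 = boto3.client('s3')
with open('data/vix-daily.csv', 'rb') as f:
    body = gzip.compress(f.read())
s3.put_object(
    Bucket='pkgstore.datahub.io',
    Key='core/finance-vix/latest/data/vix-daily.csv',
    Body=body,
    ContentType='text/csv',
    ContentEncoding='gzip',
)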
RDS is getting too many connections, because each new FlowRegistry object in the Generator creates a new connection to RDS. One such object should suffice. (See analysis.)
Current state:
class Generator(GeneratorBase):
    @classmethod
    def get_schema(cls):
        return json.load(open(SCHEMA_FILE))

    @classmethod
    def generate_pipeline(cls, source):
        registry = FlowRegistry(DB_ENGINE)
        count = 0
        for pipeline in registry.list_pipelines():  # type: Pipelines
            yield pipeline.pipeline_id, pipeline.pipeline_details
            count += 1
        logging.error('assembler sent %d pipelines', count)
Solution:
REGISTRY = FlowRegistry(DB_ENGINE)

class Generator(GeneratorBase):
    @classmethod
    def get_schema(cls):
        return json.load(open(SCHEMA_FILE))

    @classmethod
    def generate_pipeline(cls, source):
        count = 0
        for pipeline in REGISTRY.list_pipelines():  # type: Pipelines
            yield pipeline.pipeline_id, pipeline.pipeline_details
            count += 1
        logging.error('assembler sent %d pipelines', count)
We need to export data into a zip if requested. Set the dpp:streamedFrom property to the absolute path of the resource, like it is done in load_modified_resources: https://github.com/datahq/assembler/blob/master/datapackage_pipelines_assembler/processors/load_modified_resources.py#L41

NTS: I think we are starting to want a method that "in-directories" data: given a datapackage.json, it downloads it and retrieves all data files, converting path URLs into local relative paths.
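A rough sketch of such a helper, using only the standard library; the function name and behaviour are illustrative and error handling is omitted:

import json
import os
import urllib.request

def indirectory(dp_url, dest='data'):
    # Download the descriptor, fetch each remote resource and rewrite
    # its path to a local relative one
    descriptor = json.load(urllib.request.urlopen(dp_url))
    os.makedirs(dest, exist_ok=True)
    for resource in descriptor.get('resources', []):
        path = resource.get('path', '')
        if path.startswith('http'):
            local = os.path.join(dest, os.path.basename(path))
            urllib.request.urlretrieve(path, local)
            resource['path'] = local
    with open('datapackage.json', 'w') as f:
        json.dump(descriptor, f, indent=2)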
Full datapackage.json
datapackage.json
resources - are in the data directory
...
=> zip that directory ...
"Good" DataPackage - local data etc
datapackage.json
data/
// only the "primary" data (good csv if csv, excel if an excel etc)
archive/
// original data ...
=> zipped
Original DataPackage
datapackage.json
# if path was url it remains url, but we do reinline from rawstore
+ whatever files actually came with it in their correct locations
What do we want to be zipped? We have several different options:
What should the datapackage.json look like?
What about other output formats? E.g. sqlite - should it be zipped if both are present?
If zip occurs as a kind in the source-spec, we should append the new dump.to_zip processor to the pipeline, then add it as a resource and export to S3.
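A sketch of the generator change, mirroring the sqlite snippet earlier; pipeline, outputs and dataset_name are assumed to come from the surrounding generator context:

for output in outputs:
    if output['kind'] == 'zip':
        # out-file defaults to {dataset-name}.zip, per the parameter below
        out_file = output.get('parameters', {}).get(
            'out-file', '{}.zip'.format(dataset_name))
        pipeline.append({
            'run': 'dump.to_zip',
            'parameters': {'out-file': out_file}
        })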
Available parameter(s):
out-file: optional #defaults to {dataset-name}.zip
meta:
owner: <owner username>
ownerid: <owner unique id>
dataset: <dataset name>
version: 1
findability: <published/unlisted/private>
inputs:
-
kind: datapackage
url: <datapackage-url>
parameters:
resource-mapping:
<resource-name-or-path>: <resource-url>
outputs:
-
kind: zip
parameters:
out-file: my path
After a dataset is published, if you try to download any file in JSON format you'll get something like this: https://pkgstore-testing.datahub.io/0e9b59cd50f1884058c1aa242d71a228/finance-vix/latest/.datahub/json/data/vix-daily.csv.json
Plus, if you try to open the file, it is still CSV.
This also impacts resource paths in datapackage.json and crashes graphs and tables:
...
resources: [{
name: ...
path: ".datahub/json/data/vix-daily.csv.json"
}]
As I understand it, we're prepending the json version of each csv resource in a data package. However, in the views property of a descriptor we reference resources using their initial indexes. So now our 0th resource is not the expected csv file but the json version of it.
Questions:
Could we put the json versions at the end of resources? This way it would work as expected.
What about json (or geojson, topojson etc.)?

As a developer I want to have the whole system under test so that whenever I make new changes I am sure that nothing else is broken.
As a product owner I want to be sure that site will work fine before some new feature is launched to production, so that our users have a good experience
As a developer I want to be able to add new features quickly which means e.g. I don't have to worry I've broken other stuff or do manual testing on the site all the time so that our users get new features faster
We are missing stats (especially bytes) attributes for packages AND for resources, see http://pkgstore.datahub.io/examples/geojson-tutorial/latest/datapackage.json