
assembler's Issues

Parallel pipelines

Problem: when you republish a dataset it takes a few minutes even if only the metadata changed.

Split the pipeline into: metadata and data pipelines

pkgstore files should have file type of text/plain

Copying from openknowledge-archive/dpr-api#273

When we upload any file to the pkgstore, the file type is set to binary/octet-stream by default. But we want text/plain:

e.g. datapackage.json has binary/octet-stream but should be JSON. Try:

curl -I http://pkgstore-testing.datahub.io/core/finance-vix/latest/datapackage.json

Why is this a problem:

  • when you click on Metadata on the views page, the datapackage.json downloads rather than being viewable in the browser.
  • if you view these files you get an auto-download. Like GitHub raw, you want text/plain (this is also good for security)

We want the right content type for JSON, and text/plain for everything else (or we could even just have text/plain for JSON too).

Solution

Add Content-Type while publishing data to S3
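
A minimal sketch of what setting the content type at upload time could look like, assuming boto3 and hypothetical bucket/key names (the assembler's actual publishing code may differ):

import mimetypes
import boto3

def upload_with_content_type(local_path, bucket, key):
    # guess the content type from the extension; fall back to text/plain
    content_type, _ = mimetypes.guess_type(local_path)
    if content_type is None:
        content_type = 'text/plain'
    s3 = boto3.client('s3')
    s3.upload_file(local_path, bucket, key,
                   ExtraArgs={'ContentType': content_type})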

Acceptance Criteria:

  • In pkgstore all data should be of type text/plain

Black box tests for the website

Acceptance criteria

  • pages with views, without views, large, small, (non-tabular?) are tested

Tasks

  • Choose a few pages that we want to always work
  • Verify that pages load in a reasonable time
    • homepage - < 1sec
    • showcase page - < 15 sec
  • Verify that download links work
  • Verify that downloaded data makes sense
  • Verify that views appear correctly
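
A sketch of how a few of these checks could look, assuming pytest and requests, with hypothetical URLs standing in for the real pages:

import requests

HOMEPAGE = 'https://datahub.io'                    # hypothetical URLs for illustration
SHOWCASE = 'https://datahub.io/core/finance-vix'

def test_homepage_loads_quickly():
    # homepage should answer within ~1 second
    response = requests.get(HOMEPAGE, timeout=1)
    assert response.status_code == 200

def test_showcase_page_loads():
    # showcase pages get a looser budget (~15 seconds)
    response = requests.get(SHOWCASE, timeout=15)
    assert response.status_code == 200

def test_download_link_returns_data():
    # a hypothetical download link; the body should look like CSV
    response = requests.get(SHOWCASE + '/r/vix-daily.csv', timeout=15)
    assert response.status_code == 200
    assert ',' in response.text.splitlines()[0]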

Processors for getting preview

We need two new processors to be able to have preview resources: one for getting the first N rows from a given resource, and one for adding views to the datapackage.json.

parameters: limit (will default to 10000)

Acceptance Criteria

  • after the processors run we have a separate resource with only N rows

Tasks

  • add a processor that yields only the first N (default 10000) rows (see the sketch after this list)
  • add a processor that creates (or modifies) views
  • tests
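
A possible sketch of the row-limiting processor, using the standard datapackage-pipelines ingest/spew wrapper; the parameter name and default are taken from this issue, the rest is illustrative:

from itertools import islice

from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()
limit = parameters.get('limit', 10000)

def limited(resources):
    # pass through only the first `limit` rows of each resource
    for resource in resources:
        yield islice(resource, limit)

spew(datapackage, limited(resource_iterator))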

Data Update is not happening with scheduled jobs

One of our core datasets, finance-vix, which should be updated daily, is stuck on the 2nd of October - the scheduled job is running but the data is never updated.

Acceptance criteria

  • we have the latest data after a scheduled run

Tasks

  • reproduce locally
  • investigate reason (see analysis)
  • fix

Analysis

This is happening because only the "main" pipeline runs at the scheduled time and the derived ones remain the same.

Export resource(s) to sqlite

We need to export the data into an SQLite database if requested.

Acceptance criteria

  • There is an SQLite file on S3

Tasks

  • do analysis
  • function for generating processor (to append to pipeline list)
  • edit source spec generator

Analysis

Parameters:

  • file-name: optional # defaults to <>.db
  • resource-names: [resource-one, resource-two] # required
  • table names will be the same as the resource names (with underscores _)
  • mode will always be rewrite (always creates new tables)
    • other options would be append (append new rows) and update (update if a row exists), but let's keep it simple

path on S3: data/sqlite/data/{resourcename}.db

Spec

# request body from CLI
{
  ...
  kind: sqlite
}

Analysis

meta:
  owner: <owner username>
  ownerid: <owner unique id>
  dataset: <dataset name>
  version: 1
  findability: <published/unlisted/private>
inputs:
  -  # only one input is supported atm
    kind: datapackage
    url: <datapackage-url>
    parameters:
      resource-mapping:
        <resource-name-or-path>: <resource-url>
outputs:
  -
    kind: sqlite

when sqlite is in outputs, we need to add two processors:

  • dump.to_sql into a temporary file
  • add_resource to add that resource to the datapackage (with the proper path and datahub type to indicate which resource it is a derivative of)
# pipeline-spec
meta:
  ...
inputs:
  - 
    kind: datapackage
    ...
outputs: 
  -
    kind: sqlite

# generator.py in assembler
pipeline = [current_pipeline]
for output in outputs:
    if output['kind'] == 'sqlite':
        pipeline.append({
            'run': 'dump.to_sql',
            'parameters': {'engine': 'sqlite:///...'}
        })
    # etc.

yield pipeline_id, {'pipeline': pipeline}

Questions:

What should the path for it be?

BBTest for private dataset

We are introducing private datasets for users. We need to test that the pipelines executed fine and that the links return 403.
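
A minimal sketch of the 403 check, with a hypothetical private-dataset URL:

import requests

# hypothetical URL of a file belonging to a private dataset
PRIVATE_FILE = 'https://pkgstore.datahub.io/<ownerid>/<private-dataset>/latest/datapackage.json'

def test_private_dataset_file_returns_403():
    # an unauthenticated request to a private dataset's files should be rejected
    response = requests.get(PRIVATE_FILE)
    assert response.status_code == 403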

Not all pipelines are executed after they are finished

When developing locally and running all pipelines, not all of them are executed - no errors, no warnings, they are marked as successful, but actually some of them are not executed at all. (I strongly believe this is happening not only locally, but in production as well - see the second comment.)

Acceptance criteria

  • all pipelines should be executed

Tasks

  • analysis
  • come up with a solution
  • fix

Analysis

Recently we've implemented a new pipeline that generates a preview resource. This pipeline cannot be executed until derived/csv is done. Likewise, "the main" pipeline, which depends on all the others being finished, cannot be executed. So:

  • When I execute the pipelines for the first time (dpp run all), non-tabular plus all the derived ones (including preview) are executed successfully, and there are no error messages or warnings. However, the final one (the one that should contain everything together plus the original) is missing
  • After executing the same exact command a second time (dpp run all), that final one gets there and we have a complete dataset

For example, running dpp lists all pipelines with their errors. Running dpp for the first time (none of the pipelines are executed yet) logs:

Available Pipelines:
INFO    :Main                            :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular (*)
- ./core/finance-vix:vix-daily_csv (*)
- ./core/finance-vix:vix-daily_csv_preview (E)
    Dirty dependency: Cannot run until all dependencies are executed
    Dependency unsuccessful: Cannot run until all dependencies are successfully executed
- ./core/finance-vix:vix-daily_json (*)
- ./core/finance-vix (E)
    Dirty dependency: Cannot run until all dependencies are executed
    Dependency unsuccessful: Cannot run until all dependencies are successfully executed
    Dirty dependency: Cannot run until all dependencies are executed
    Dependency unsuccessful: Cannot run until all dependencies are successfully executed

Running dpp for second time (after first dpp run all finishes):

Available Pipelines:
INFO    :Main                            :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular 
- ./core/finance-vix:vix-daily_csv 
- ./core/finance-vix:vix-daily_csv_preview (*)
- ./core/finance-vix:vix-daily_json 
- ./core/finance-vix (E)
    Dirty dependency: Cannot run until all dependencies are executed
    Dependency unsuccessful: Cannot run until all dependencies are successfully executed

Running dpp now (after second dpp run all finished) - now everything is fine

Available Pipelines:
INFO    :Main                            :Skipping redis connection, host:None, port:6379
- ./core/finance-vix:non-tabular 
- ./core/finance-vix:vix-daily_csv 
- ./core/finance-vix:vix-daily_csv_preview 
- ./core/finance-vix:vix-daily_json 
- ./core/finance-vix (*)

how dpp run all works

  • checks for available pipelines -> iterates over them -> checks if a pipeline has errors or is already executed (skips if so) -> executes -> updates the pipeline's status to dirty=False
  • If at least one was executed in the pass above -> again checks all pipelines -> iterates -> checks for errors / already executed -> executes -> updates
  • does the same until all of them are executed -> nothing executed / no more dirty -> Finish

in our case, two pipelines have dependencies: derived_preview and main pipeline. So when iterating it goes like this:

  1. Take a look at all of them -> skip preview and main for now -> execute the others -> change dirty=False
  2. Take a look at all of them -> skip main (as preview is still dirty) and skip all the others (as they are already executed) -> execute preview -> change dirty=False. For some reason this change is not happening
  3. Take a look at all of them -> skip main again (as preview is still marked as dirty) -> nothing is executed -> Finish

I cannot find the reason why the status of the successfully executed pipeline is still dirty in the final iteration.

NOTE: this is not happening if you run dpp run dirty. All pipelines are executed fine in that case

problems with exported zip files

Feedback after reviewing export to zip.

  • the zip file should be named after the dataset name
  • the zipped JSON has wrong encoding
  • when generating the zip, resources should have descriptions saying what they are
  • all datasets should have a zip resource by default

Acceptance Criteria

  • able to read the JSON files after unzipping
  • named after the dataset and with a small description
  • all datasets have a zip resource

Tasks

  • set out-file to dataset-name.zip instead of datahub.zip [0.5]
  • reproduce the zip with wrong JSON encoding locally and debug [3]
  • define (find out) what the description should say and update it [0.5]
  • refactor the code so that all datasets have a zipped resource [1]

Pipelines are generated according to old/cached source-spec

I have pushed a new source spec for one of the core datasets using the specstore API.

  • API responds with success=true
  • manually checked the database and the new spec is there
  • the Assembler sees that something changed in the DB and triggers the planner
  • However, the dashboard shows the "old" source-spec and pipelines are run according to that
  • Result: No changes to my dataset

Acceptance Criteria

  • pipelines are run according to newest source-spec pushed

Analysis

The source spec I pushed (and currently in DB):

meta:
  dataset: finance-vix
  findability: published
  owner: core
  ownerid: core
  version: 2
inputs:
- kind: datapackage
  parameters:
    resource-mapping:
      vix-daily: http://www.cboe.com/publish/ScheduledTask/MktData/datahouse/vixcurrent.csv
  url: https://raw.githubusercontent.com/datasets/finance-vix/master/.datahub/datapackage.json
processing:
  -
    input: vix-daily
    tabulator:
      skip_rows: 2
      headers:
        - Date
        - VIXOpen
        - VIXHigh
        - VIXLow
        - VIXClose
    output: vix-daily
schedule:
    crontab: '0 0 * * *'

Source spec from dashboard http://api.datahub.io/pipelines/#anchor-ALL-core-finance-vix

inputs:
- kind: datapackage
  parameters:
    resource-mapping:
      vix-daily: https://s3.amazonaws.com/rawstore.datahub.io/9c46b3948d297fa1c4e92ead714f0399
  url: https://s3.amazonaws.com/rawstore.datahub.io/66387913a43fc2a04ca602ce0d529b1c
meta:
  dataset: finance-vix
  findability: unlisted
  owner: core
  ownerid: core
  update_time: '2017-09-26T10:02:32.931873'
  version: 1

generator to read pipelines from the flowmanager db.

Currently, we are using SourceSpecRegistry to read from the DB which pipelines need to be run. Since we extracted the planning part of assembling and changed the flow management, SourceSpecRegistry is not useful anymore.

Acceptance Criteria

  • generator reads from new db and runs pipelines accordingly

Tasks

  • use FlowManager as dependency and query the DB
  • integrate with BB tests

2 core data packages have not passed the pipeline

We have 2 core data packages that have not passed the pipeline. The problem is the geopoint type.
The first time, it errored with an invalid geopoint type. Then I fixed both of them. Now we have a different error, probably in the jsontableschema library.

Error outcome from pipeline:

AttributeError: 'list' object has no attribute 'split'

Dump to pkgstore should extract readme and push to README.md

It seems the best way to handle the README is to store it inline in datapackage.json in the rawstore.

This means the assembler pipeline must extract the readme from the readme property and dump it to README.md (and delete that property from the datapackage.json) when dumping to the pkgstore.
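
A sketch of what such a processor could look like, assuming the ingest/spew wrapper and a readme property on the datapackage; how the resulting README.md is actually shipped to the pkgstore is left out:

from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

# pull the readme out of the datapackage and drop the property
readme = datapackage.pop('readme', None)
if readme:
    # hypothetical target path; the dump step would place it next to datapackage.json
    with open('README.md', 'w') as f:
        f.write(readme)

spew(datapackage, resource_iterator)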

Report events to ES: WHO did WHAT WHEN

Tasks

  • analyse how to gather the needed info
  • analyse how to put this on Elasticsearch
  • Check the following:
    • [Flow] Push completed
      • Success / Failure
    • Dataset created (a special case of push - where the push is a new dataset)
    • [Flow] Push started (time elapsed to success can be calculated)
    • Download (clicks, GET on data file)
    • User sign up
    • User sign in
    • (Page view)
      • Search of term

Deal with generating first/last XXX rows for preview resources

How are we going to deal with sorting data that is time sensitive? E.g. for Finance Vix, which is updated daily, we need the last 200 rows (or maybe the first rows for other datasets) to display the latest data.

Acceptance Criteria

  • we have latest data for preview

Tasks

  • List possible solutions
  • Choose best for us
  • implement

Analysis

  • One solution would be to include some info in the views, e.g. {desc: true/false}
  • Another option would be to include that info inside the processing object of the source-spec

cc @rufuspollock and @akariv what do you think about this?

Preview large datasets

As a Publisher I want to upload a 50Mb CSV and have the showcase page work

Acceptance criteria

  • We have a preview version of all resources

Tasks

  • read the refactored code and understand how it works
  • add a new custom processor that will stop iterating at 10000 rows (in any case - even if the data is < 50MB) #23
    • The number should be configurable. Also this should come from the views, right? - it doesn't
  • add a new custom processor that will modify dp.json and add views #23
  • add a new Processing node for Preview in basic_nodes.py #22
  • update node_collector.py with the new processing node
  • refactor load_modified_resources.py to handle dp.json with views

analysis

We will need two new processors: generate_views (for adding the views) and one that just takes derived resources (json) and yields only the first 10000 rows.

Modifications for basic_nodes.py:

class DerivedFormatProcessingNode(BaseProcessingNode):

  def get_artifacts(self):
    for artifact in self.available_artifacts:
      if artifact.datahub_type == 'derived/json':  # or csv
        datahub_type = 'derived/preview'
        resource_name = artifact.resource_name + '_preview.json'
        output = ProcessingArtifact(
          datahub_type, resource_name,
          [artifact], [],
          [(new processors + old go here)],
          True
        )
        yield output

class DerivedPreviewProcessingNode(...):
  def __init__(self):
    ... # (all same)

Inside node_collector.py we will have to add DerivedPreviewProcessingNode in the appropriate order (it should come after derived json as it depends on that).

And we will have to modify load_modified_resources.py to handle "views" as the views are not yet supported there.

Example datapackage.json after running pipelines:

{
  "resources": [
    {
      "name": "vix-daily",
      "path": "data/vix-daily.csv",
      "format": "csv",
      "mediatype": "text/csv",
      "schema": {...}
    },
    {}, // derived csv
    {}, // derived json
    { // even though json data we have a table schema
      "name": "vix-daily_csv_preview",
      "path": "data/preview/vix-daily.json",
      "datahub": {
        "derivedFrom": [
          "vix-daily_csv"
        ],
        "forView": [
          "datahub-preview-vix-daily_csv_preview"
        ],
        "type": "derived/preview" // data will be json ...
      }
    }
  ],
  "views": [
    {
      "name": "graph",
      "title": "VIX - CBOE Volatility Index",
      "specType": "simple",
      "spec": {
        "type": "line",
        "group": "Date",
        "series": [
          "VIXClose"
        ]
      }
    },
    {
      "name": "datahub-preview-vix-daily_csv_preview",
      "specType": "table",
      "datahub": {
        "type": "preview"
      },
      "transform": {
        "limit": 10000
      },
      "resources": [
        "vix-daily_csv_preview"
      ]
    }
  ]
}

Processing node for preview datasets

We need a new processing node to define one more processing flow (preview datasets). For that, we need to implement a class inheriting from BaseProcessingNode and add the class name to the ORDERED_NODE_CLASSES array. The new processing node will iterate over all resources and, for the ones that have type derived/json, add the processors that generate the first 10000 rows and the views.

Acceptance criteria

  • afterwards we are able to run load_previews and load_views from #23

Tasks

  • define a new class for the preview processing node, class DerivedPreviewProcessingNode, that will inherit DerivedFormatProcessingNode
  • in DerivedPreviewProcessingNode override get_artifacts and iterate over the existing artifacts (resources)
  • check for the ones that are derived/json (or csv?)
  • and pass the processors that we need to get the desired resources and views.
  • add DerivedPreviewProcessingNode to ORDERED_NODE_CLASSES in the correct position (last in the list)

Analysis

An example of how the refactored basic_nodes.py may look:

class DerivedFormatProcessingNode(BaseProcessingNode):

  def get_artifacts(self):
    for artifact in self.available_artifacts:
      if artifact.datahub_type == 'derived/csv':
        datahub_type = 'derived/preview'
        resource_name = artifact.resource_name + '_preview.'
        output = ProcessingArtifact(
          datahub_type, resource_name,
          [artifact], [],
          [(new processors + old go here)],
          True
        )
        yield output

class DerivedPreviewProcessingNode(...):
  def __init__(self):
    ... # (all same)

Files in Exported zip files are placed in "hash" folders

If you try to download the zip files, the data inside is placed in the following structure:

|- data/data.csv
|- data/filehash1/data_csv.csv
|- data/filehash2/data_json.csv
|- datapackage.json

We don't need hashes

Acceptance Criteria

  • no "hash folders" inside zip

Tasks

  • create a new processor that removes hashes from file paths (see the sketch after this list)
  • update the planner to run that processor before dump.to_zip in outputNodes
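
A sketch of the path-rewriting processor mentioned above, assuming resource paths of the form data/<hash>/<file> (the exact layout may differ):

from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

for resource in datapackage.get('resources', []):
    path = resource.get('path')
    if isinstance(path, str):
        parts = path.split('/')
        # drop the hash segment: data/<hash>/file.csv -> data/file.csv
        if len(parts) == 3 and parts[0] == 'data':
            resource['path'] = '/'.join([parts[0], parts[2]])

spew(datapackage, resource_iterator)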

Split the non-tabular node to generate one item per resource

Currently, we are copying non-tabular resources across without any modification to dp.json. We need to change this behaviour as in some cases we have to reuse them and add them as new resources to dp.json. The add_resource processor searches for a resource in dp.json by its name, and as non-tabular ones are copied across, the resource names are copied from the originals. So we are getting AssertionError: Failed to find a resource with the index or name matching 'non-tabular' as the planner tries to add the resource using the artefact name (add_resource, {resource: resource_info[required_artifact.resource_name]}...)

Acceptance Criteria

  • generate one item per resource for non-tabular ones as well

Tasks

  • refactor NonTabularProcessingNode in basic_nodes.py to yield artifact with correct resource names {resource name}_non_tabular

Put events on Elastic Search

Besides metadata, we want to store events on Elastic Search

Acceptance Criteria

  • able to query the existing Elastic Search Eg: with Kibana for now (later with API) and get successful push events

Tasks

  • create a new index "events"
  • for now save only successful executions
  • Design the index in the following way:
{
  "timestamp": ...,
  // type of event
  "event_entity": "flow|account|...",
  "event_action": "create/finished/deleted/...",
  // event filters
  "owner": ...,
  "dataset": ..., // may not always be there ...
  // results
  "outcome": "ok/error", or "status": "good/bad/ugly",
  "message": ..., // mostly blank
  "findability": ...,
  "payload": {
    // anything you want to add - customizing the event info per object
    flow-id (aka pipeline-id)
  }
}
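
A minimal sketch of indexing one such event, using the official elasticsearch Python client; the host and the event values are hypothetical:

from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])  # hypothetical host

event = {
    'timestamp': datetime.now(timezone.utc).isoformat(),
    'event_entity': 'flow',
    'event_action': 'finished',
    'owner': 'core',
    'dataset': 'finance-vix',
    'outcome': 'ok',
    'findability': 'published',
    'payload': {'flow-id': 'core/finance-vix'},
}

# write the event into the "events" index
es.index(index='events', body=event)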

Pretty print datapackage.json (?)

When writing datapackage.json to pkgstore pretty print json so it is nice to read for users (this is convenient when directly browsing).

Not sure this is that important ;-)

Resource without a format field results in a crash of the pipeline

stream_remote_resources: WARNING :Error while opening resource from url https://s3.amazonaws.com/rawstore-testing.datahub.io/sQqpgDlCdaDFdRjzxbZN9Q==: FormatError('Format "None" is not supported',)
stream_remote_resources: Traceback (most recent call last):
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 144, in opener
stream_remote_resources:     _stream.open()
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/tabulator/stream.py", line 132, in open
stream_remote_resources:     raise exceptions.FormatError(message)
stream_remote_resources: tabulator.exceptions.FormatError: Format "None" is not supported
stream_remote_resources: During handling of the above exception, another exception occurred:
stream_remote_resources: Traceback (most recent call last):
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 201, in 
stream_remote_resources:     rows = stream_reader(resource, url, ignore_missing or url == "")
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 159, in stream_reader
stream_remote_resources:     schema, headers, stream, close = get_opener(url, _resource)()
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/stream_remote_resources.py", line 153, in opener
stream_remote_resources:     _stream.close()
stream_remote_resources:   File "/usr/local/lib/python3.6/site-packages/tabulator/stream.py", line 156, in close
stream_remote_resources:     self.__parser.close()
stream_remote_resources: AttributeError: 'NoneType' object has no attribute 'close'
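
One possible mitigation (a sketch, not necessarily the real fix): have a small processor set a default format on resources that lack one before stream_remote_resources runs, so tabulator never sees format "None":

from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

for resource in datapackage.get('resources', []):
    # tabulator cannot handle a missing format; default to csv (an assumption)
    if not resource.get('format'):
        resource['format'] = parameters.get('default-format', 'csv')

spew(datapackage, resource_iterator)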

Why pkgstore URLs have ":" in the middle rather than "/"

If you take a look at the paths of the resources in the datapackage.json they look like this: "https://pkgstore.datahub.io/core/finance-vix:non-tabular/data/vix-daily.csv"

The structure on the S3 is the same:

| pkgstore.datahub.io
--| core
--|--| finance-vix:non-tabular
--|--|--| data
--|--|--|--| vix-daily.csv

@akariv Is this done on purpose, or should there be a / and this is a bug?
Example dp.json: https://pkgstore.datahub.io/core/finance-vix/latest/datapackage.json

Acceptance criteria

  • we know the reason why it is there
  • if that is a bug, replace it with /

Tasks

  • investigate reason
  • fix if necessary
  • rerun assembler

[EPIC] Event store so that we can measure statistical quality of factory and frontend

As an admin, I want to record most activities (sign up, sign in, push etc.) so that I know how frequently users are using the platform and how often they achieve the expected results.

As a Publisher, I want to see what has happened recently with my data (and in future what my team has done) so that I know things are working, what's changed etc

As a Viewer I want to see what a Publisher has been up to so that I know whether they are active, what's new and what's hot (so that ...)

Acceptance Criteria

  • We have easily measurable statistics

Tasks

  • Report events to ES: WHO did WHAT WHEN #52
    • analyse how to gather the needed info
    • analyse how to put this on Elasticsearch
  • Record push completion (always successful)
  • Display push completion in dashboard (?)
  • Record push fail
  • Record push start
  • Record ...
  • Write blog post about new dashboard

Analysis

In order to do statistical analyses we need to record as many activities as possible on the platform. A good place for storing this info would be Elasticsearch.

What are the events?

  • [Flow] Push completed [assembler]
    • Success / Failure
  • Dataset created (special case of push - where push is a new dataset) [spec store?]
  • [Flow] Push started (time elapsed to success can be calculated) [assembler]
  • Download (clicks, GET on data file) [Google Analytics]
  • User sign up (happens through our system)
  • (?) User sign in to frontend (?) (happens through our system)
  • (Page view) [Google Analytics]
    • Search of term [Google Analytics]

Design of the ES index

{
  "timestamp": ...,
  // type of event
  "event_entity": "flow|account|...",
  "event_action": "create/finished/deleted/...",
  // event filters
  "owner": ...,
  "dataset": ..., // may not always be there ...
  // results
  "outcome": "ok/error", or "status": "good/bad/ugly",
  "message": ..., // mostly blank
  "findability": ...,
  "payload": {
    // anything you want to add - customizing the event info per object
    flow-id (aka pipeline-id)
  }
}
{
  "timestamp":
  "owner":
  "payload": ...
}

Desired Queries

For user dashboard:

SELECT * FROM events where event.owner = userid SORT DESC timestamp;

SELECT * FROM events where ... AND

For user public profile

SELECT * FROM events where event.owner = profileid AND findability == 'public' AND type != 'login'

Internally

SELECT count(*) FROM events where type = "dataset/push" AND outcome == "failed"
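
The queries above are written as SQL for readability; against Elasticsearch the dashboard query might look roughly like this (a sketch of the query DSL via the Python client, with field names assumed from the index design above):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])  # hypothetical host

# user dashboard: all events owned by the user, newest first
es.search(index='events', body={
    'query': {'bool': {'filter': [{'term': {'owner': '<userid>'}}]}},
    'sort': [{'timestamp': {'order': 'desc'}}],
})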

Collection

  • should collection of data happen daily, or on every event?
    • Depends on where we are getting the data from, but preferably on every event (push style). For some external data, e.g. downloads, we may need to pull on a regular basis.
  • How do we automate this?
  • How do we gather the data as it is split across different services/APIs?

Derived preview resources are too large

Recently, we have implemented generating derived/preview resources, which are the first 10k rows of the original resource. It was decided to make them JSON format for convenience in the frontend. However, as we can see now, the JSON versions are usually larger than the CSV versions. E.g., for the finance-vix dataset, the original resource is ~228KB, while the derived/preview version is ~528KB. This becomes crucial when loading resources that are several MB in size.

Another point: we could skip generating derived/preview if the number of rows is <10k, as it does not make a lot of sense - or am I missing something? The frontend would use the original resource for the preview if derived/preview is not found. Note that we have to generate the preview views anyway.

In addition, we could reduce the number of rows for preview resources, e.g. only the first 1k rows. Consider https://pkgstore.datahub.io/f512ef8d39ada702fde60efe1ca59c17/farm-url/latest/datapackage.json which has a resource with 4987 rows. What do you think?

Tests are failing because of incorrect DB setup

We have failing tests on Travis. Multiple revisions are saved while there should be one per dataset. This is happening because we are erasing data from the DB only once (at the start) - we need to do that after each test.

Acceptance criteria

  • Travis is passing

Tasks

  • delete from the tables after each test (see the sketch below)
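
A sketch of how this could be done with a pytest fixture and SQLAlchemy, assuming a test database URL; the real test setup may hook in elsewhere:

import pytest
from sqlalchemy import MetaData, create_engine

ENGINE = create_engine('sqlite:///test.db')  # hypothetical test database URL

@pytest.fixture(autouse=True)
def clean_tables():
    yield
    # after each test, empty every table so revisions don't accumulate between tests
    metadata = MetaData()
    metadata.reflect(bind=ENGINE)
    with ENGINE.begin() as conn:
        for table in reversed(metadata.sorted_tables):
            conn.execute(table.delete())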

Test for granular output types

@zelima commented on Thu Nov 09 2017

Depending on the config passed to the planner we have different kinds of outputs. We need to test all of them.

Acceptance Criteria

  • tested most of the possible outputs

Tasks

  • produces a datapackage.json (does nothing) - (keeps original resources)
    • Planner that does basically nothing - just does the datapackage.json copy across ...
  • produces just (derived) csv
  • produces csv, json
  • produces csv, json, preview
  • produces csv, json, preview, zip

Pretty print datapackage.json

As a user, I want to see datapackage.json prettified rather than on one line, so that it is easy for me to read.

Acceptance Criteria

  • datapackage.json is pretty printed

Tasks

  • Do analysis: explore the code where JSON resources are generated
  • refactor it to write a prettified version of the JSON (see the sketch below)
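
A tiny sketch of the change, assuming the descriptor is written out with the standard json module:

import json

def write_pretty_datapackage(descriptor, path='datapackage.json'):
    # serialize with indentation instead of a single line
    with open(path, 'w') as f:
        json.dump(descriptor, f, indent=2, ensure_ascii=False)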

Black box tests for assembler + pipelines

Acceptance criteria

  • covers most cases of spec - regular, remote, zip, etc...

Tasks

  • Choose a few sample flow specs that we want to always work
  • Run them as part of the test suite using dpp
  • Verify the outputs

Getting too many connections when running the pipelines

RDS is getting too many connections, because each new FlowRegistry object in the Generator creates a new connection to RDS. One such object should suffice. (see analysis)

Acceptance criteria

  • RDS is not shutting down because of too many connections

Tasks

  • Move

Analysis

current state

class Generator(GeneratorBase):

    @classmethod
    def get_schema(cls):
        return json.load(open(SCHEMA_FILE))

    @classmethod
    def generate_pipeline(cls, source):
        registry = FlowRegistry(DB_ENGINE)
        count = 0
        for pipeline in registry.list_pipelines():  # type: Pipelines
            yield pipeline.pipeline_id, pipeline.pipeline_details
            count += 1
        logging.error('assember sent %d pipelines', count)

Solution

REGISTRY = FlowRegistry(DB_ENGINE)
class Generator(GeneratorBase):

    @classmethod
    def get_schema(cls):
        return json.load(open(SCHEMA_FILE))

    @classmethod
    def generate_pipeline(cls, source):
        count = 0
        for pipeline in REGISTRY.list_pipelines():  # type: Pipelines
            yield pipeline.pipeline_id, pipeline.pipeline_details
            count += 1
        logging.error('assember sent %d pipelines', count)

Export datapackage to zip

We need to export data into a zip if requested

Acceptance criteria

  • There is zip file on S3

Tasks

  • do analysis
  • grab all derived + source + non-tabular (if exists) resources in one dp.json
    • Allow loading of derived non-streamed dependencies - fixed in b3c8593
  • Split the non-tabular node to generate one item per resource, so we can have them in dp.json #40
  • zip in temp directory
    • find a way to zip in a temp directory and not have the pipeline always dirty (as the tmpdir is always new)
  • processor for setting correct paths to resources
  • add zip as another resource
    • processor to erase resources from the dp.json of the zip

Analysis

NTS: I think we are starting to want a method that "in-directories" data: given a datapackage.json it downloads it and retrieves all data files, converting path URLs into local relative paths.
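
A rough sketch of such a helper, using only requests and the standard library; the paths and layout are assumptions:

import json
import os
from urllib.parse import urlparse

import requests

def localize_datapackage(dp_url, target_dir):
    # download datapackage.json and pull remote resources into a local data/ directory
    descriptor = requests.get(dp_url).json()
    os.makedirs(os.path.join(target_dir, 'data'), exist_ok=True)
    for resource in descriptor.get('resources', []):
        path = resource.get('path')
        if isinstance(path, str) and path.startswith('http'):
            filename = os.path.basename(urlparse(path).path)
            local_path = os.path.join('data', filename)
            with open(os.path.join(target_dir, local_path), 'wb') as f:
                f.write(requests.get(path).content)
            resource['path'] = local_path  # rewrite the URL to a relative local path
    with open(os.path.join(target_dir, 'datapackage.json'), 'w') as f:
        json.dump(descriptor, f, indent=2)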

Full datapackage.json

datapackage.json
  resources - are in directory
data
  ...

=> zip that directory ...

"Good" DataPackage - local data etc

datapackage.json
data/
  // only the "primary" data (good csv if csv, excel if an excel etc)
archive/
  // original data ...

=> zipped

Original DataPackage

datapackage.json
# if path was url it remains url, but we do reinline from rawstore
+ whatever files actually came with it in their correct locations

What do we want to be zipped? We have several different options:

  • Only original source (may be broken csv, json etc..)
  • Only good data (derived csv, json, etc...)
  • Original data + good data (derived csv, json etc...)

What should datapackage.json look like?

  • Has relative paths for resources
  • Has URL for resources
  • Both?
  • What about the zip itself? It is one of the resources and should be included in the dp.json that will be exported to S3, but the zip itself needs a dp.json that includes all the resources in order to zip them

What about other output formats? Eg sqlite - should it be zipped if both present?


If zip occurs as a kind in the source spec, we should append a new processor, dump.to_zip, to the pipeline, then add it as a resource and export it to S3.

Available parameter(s):
out-file: optional #defaults to {dataset-name}.zip

Spec for source

meta:
  owner: <owner username>
  ownerid: <owner unique id>
  dataset: <dataset name>
  version: 1
  findability: <published/unlisted/private>
inputs:
  - 
    kind: datapackage
    url: <datapackage-url>
    parameters:
      resource-mapping:
        <resource-name-or-path>: <resource-url>
outputs:
  -
    kind: zip
    parameters:
       out-file: my path

Prepending json version of csv resources causing unexpected behaviour in the frontend

As I understand it, we're prepending a JSON version of each CSV resource in a data package. However, in the views property of a descriptor we reference resources using the initial indexes. So now our 0 resource is not the expected CSV file but the JSON version of it.

Questions:

  • Can we append these JSON versions at the end of the resources? That way it would work as expected.
  • What if a resource's format is json (or geojson, topojson etc.)?

[EPIC] Maintaining and improving Quality of Factory

As a developer I want to have the whole system under test so that whenever I make new changes I am sure that nothing else is broken.

As a product owner I want to be sure that the site will work fine before a new feature is launched to production, so that our users have a good experience.

As a developer I want to be able to add new features quickly, which means e.g. I don't have to worry that I've broken other stuff or do manual testing on the site all the time, so that our users get new features faster.

Acceptance criteria

  • Black box tests for the website covering:
  • Input/output tests for assembler + dpp so that we have good coverage, avoid regressions and can quickly debug problems
  • Unit tests for assembler
  • deploy to testing first and test there before deploying to production (once a day?)

Tasks

  • Black box tests for assembler + pipelines - #48
  • Don’t auto-redeploy containers on production #50
  • Run black box tests on testing constantly #51
  • redeploy production if all is fine in testing (if necessary)

Analysis
