
sayn's People

Contributors

adrian-173, dependabot[bot], hustic, iadrich, robin-173, sotos-173, timofeysugaipov

sayn's Issues

Improve column type conversion for copy tasks

At the moment we're relying on sqlalchemy data type translations when the columns in copy tasks are not specified, but they don't necessarily cover all cases. The proposed actions are:

  • Review the current code as the issue could be due to improper use of sqlalchemy
  • Determine the type of field from the python type of the first record obtained (this requires delaying the table creation)
  • Have a list of translations as a fallback mechanism

While reviewing this, it might be good to look into #36, particularly if point 2 is done (table creation can be part of loading).
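A minimal sketch of the second and third points above, assuming records arrive as dictionaries. The function name `sql_types_from_record`, the `PYTHON_TO_SQL` table and the SQL type names are hypothetical and are not part of SAYN; a real mapping would likely differ per database.

```python
import datetime
import decimal

# Hypothetical fallback translations from Python types to generic SQL type names.
# The type names are illustrative only; a real table would be per database dialect.
PYTHON_TO_SQL = {
    bool: "BOOLEAN",
    int: "BIGINT",
    float: "FLOAT",
    decimal.Decimal: "NUMERIC",
    str: "TEXT",
    bytes: "BLOB",
    datetime.datetime: "TIMESTAMP",
    datetime.date: "DATE",
}


def sql_types_from_record(record, default="TEXT"):
    """Map each column of one record (a dict) to a SQL type name."""
    # Exact type lookup, so bool is not treated as int and
    # datetime is not treated as date.
    return {
        column: PYTHON_TO_SQL.get(type(value), default)
        for column, value in record.items()
    }
```

Types missing from the table fall back to `default`, which is where the list of translations would act as a safety net.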

Ability To Execute AutoSQL/SQL/Copy Task On Non Default Database

Current SQL/AutoSQL/Copy tasks are directed towards the default DB. Add the option to direct those towards other DBs defined in the project. Suggested approach:

  • SQL tasks: add db entry
  • AutoSQL tasks: add db entry in destination
  • Copy tasks: add db entry in destination

Copy Append Only Mode

We need to add support for append only mode with the copy task. This will require a sayn load time field to be added to the loads.

Humanising Errors

Database

Currently, database errors raise exceptions and so the message is just the error produced by the driver. We can improve this in known cases like autosql and copy tasks or with methods of the Database class so that we output a more informative error. Some examples:

  • In SQLite, when load_data fails with "Too many variables", we could suggest setting max_batch_rows in the credentials.
  • In general, errors are too verbose. The main information required is the line of the error and its detail. We should not print the query on screen.
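The two examples above could be handled by a small translation layer. This is only a sketch: `humanise_db_error` is a hypothetical helper, not an existing SAYN function, and it assumes the driver message contains the phrase "too many ... variables".

```python
import sqlite3


def humanise_db_error(exc):
    """Return a short, user-facing message for a database exception."""
    message = str(exc)
    lowered = message.lower()
    if (
        isinstance(exc, sqlite3.OperationalError)
        and "too many" in lowered
        and "variables" in lowered
    ):
        return (
            "SQLite rejected the batch because it contains too many variables. "
            "Try setting max_batch_rows in the credentials to load smaller batches."
        )
    # Fallback: keep only the first line of the driver message, without the query.
    return message.splitlines()[0] if message else type(exc).__name__
```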

Preset

Currently, when you run a task with a preset that doesn't exist it throws this sort of traceback:

Result.Err (get_tasks_dict::task_parsing_error): {'errors': {'dim_customers': Result.Err (get_task_dict::missing_preset): {'group': 'core', 'task': 'dim_customers', 'preset': 'modelling'}

It would be nicer to display "missing_preset in task: dim_customers" instead.
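As an illustration of the desired output, assuming the error details are available as a dictionary shaped like the one in the traceback (the helper name `humanise_missing_preset` is hypothetical):

```python
def humanise_missing_preset(details):
    """Format the details of a missing_preset error as a one-line message."""
    return f"missing_preset in task: {details['task']}"


details = {"group": "core", "task": "dim_customers", "preset": "modelling"}
message = humanise_missing_preset(details)
```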

DDL & Database API Amends

We need to revamp the DDL so it is more efficient in handling how different databases implement DDLs (e.g. SQLite cannot amend the primary key with an ALTER statement).

Simplify what is exposed to the user in the database API. Keep only the following methods:

  • load_data
  • select -> read_data
  • execute
  • grant_permissions

Project Initialisation Scaffolding

Integrate the option to initialise different projects:

  1. Barebones project (default sayn init), this should include only the minimum required to get a SAYN project going from scratch.
  2. Example project. Pass an argument to init to specify which example project to initialise.
  3. Github project. Initialise a SAYN project based on a github repo.

CLI spinner printer line at end of setup

The DB refactoring changes introduced the following cosmetic issues on the CLI:

  • setup line of last task to be setup remains printed on screen
  • for BigQuery, the "Starting run at ..." line is not printed

Credentials YAML validation

Run a set of YAML validations on the credentials based on their type. For example:

  • BigQuery expects a project param. At the moment a typo would throw an error that is not human friendly.
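A rough sketch of such a check, assuming each credentials entry has been parsed into a dictionary with a `type` key. `REQUIRED_PARAMS` and `validate_credentials` are hypothetical names; only the BigQuery `project` requirement comes from the text above.

```python
import difflib

# Hypothetical required parameters per credentials type.
REQUIRED_PARAMS = {"bigquery": {"project"}}


def validate_credentials(name, settings):
    """Return a list of human-friendly problems found in one credentials entry."""
    problems = []
    db_type = settings.get("type")
    required = REQUIRED_PARAMS.get(db_type, set())
    for param in sorted(required - set(settings)):
        hint = ""
        # Suggest a likely typo among the keys that were provided.
        close = difflib.get_close_matches(param, list(settings), n=1)
        if close:
            hint = f" (found '{close[0]}', did you mean '{param}'?)"
        problems.append(
            f"Credentials '{name}' of type '{db_type}' need '{param}'{hint}"
        )
    return problems
```

With this approach a misspelt key such as `projct` would be reported with a suggestion instead of surfacing as a driver error.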

Rethink Tutorial

We should rethink the flow of the first time user experience for people that get started with SAYN:

  • Is the current tutorial the most relevant one? Should we switch to something more business friendly? (e.g. ecommerce shop or subscription business).
  • When revamping tutorial, split each stage of the tutorial based on what we want the user to learn. Maybe introduce each stage by what you will learn as bullet points.
  • Revamp the init command to be able to set either very light project or the tutorial.

Separate DDL SQL Functions

  • Currently we have a few functions which cover multiple DDL statements (e.g. move uses drop).
  • Separate all those functions so we can use them as independent items.

Documentation changes

  • The documentation on the copy task is outdated and needs to be updated to match 0.4 (e.g. models.yaml no longer exists)

load_data fails when None in the first row

When load_data is called with no ddl information, table structure is determined by the first row. If this row contains None values the type can't be determined and table creation will fail.
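One possible way around this, sketched under the assumption that rows are dictionaries held in a list, is to look past the first row until every column has a non-None value. `infer_python_types` is a hypothetical helper, not part of the Database API.

```python
def infer_python_types(rows, max_rows=1000):
    """Infer one Python type per column, skipping None values.

    rows is an iterable of dicts. Columns whose values are all None within the
    first max_rows rows map to None, so the caller can apply a default type.
    """
    types = {}
    for index, row in enumerate(rows):
        if index >= max_rows:
            break
        for column, value in row.items():
            if types.get(column) is None:
                types[column] = None if value is None else type(value)
        # Stop early once every known column has a type.
        if types and all(t is not None for t in types.values()):
            break
    return types
```

If `rows` is a one-shot iterator, the rows read here would need to be buffered by the caller before loading.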

View To Table Switch Failing In BigQuery

If a model is a view in BigQuery and is changed to table materialisation, the current process will fail as it does not issue a DROP VIEW. To do:

  • Fix this bug so the change of materialisation is processed automatically
  • Check that this works as expected on databases other than BigQuery

Humanising Errors

We need to make the following error messages human friendly.

  • Running SAYN outside of the project folder: Result.Err (parsing::read_file_error): {'filename': PosixPath('project.yaml')}
  • An invalid task type (also breaks the spinner): Result.Err (task_type::invalid_task_type_error): {'type': 'nodummy'}
  • Exceptions print the error twice: first line is the exception message, then comes the traceback and when the message is long it covers the whole screen. Particularly problematic with DBI exceptions

Add SQL detail to compile command on debug mode

Currently sayn compile produces compiled sql files for sql and autosql task types. For autosql tasks the output is only the SELECT statement provided by the user. Copy tasks don't produce any output anymore.
When using sayn compile -d the full SQL to be run should be produced in the compile folder as it's now in 0.3.0.

load_data improvements

Database.load_data needs some improvements:

  • The buffer size is set at 50000 records which is a bit random. In some cases like old versions of sqlite the limit should be a lot less and based on amount of data, rather than number of records
  • There should be an option to create tables. The code to figure out the column types should be consistent with the column translation for copy tasks.
  • It should support multiple types of input:
    • list of dictionary like currently
    • iterator, so it can be merged with load_data_stream
    • list of tuples with some ddl info for the column names
    • pandas dataframe (but shouldn't be a dependency of sayn)
  • Per database supported, we should implement a more efficient method of loading data like the COPY statement we have for postgresql
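To illustrate how a list and an iterator could share one code path, here is a minimal batching sketch. `batched_rows` is a hypothetical helper; it limits batches by row count only, so sizing by amount of data, as suggested above, would need extra logic.

```python
from itertools import islice


def batched_rows(rows, max_batch_rows=50000):
    """Yield lists of at most max_batch_rows rows from a list or an iterator."""
    if max_batch_rows < 1:
        raise ValueError("max_batch_rows must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, max_batch_rows))
        if not batch:
            return
        yield batch
```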

Amend command to run / exclude multiple tasks

Currently if running / excluding multiple tasks, you have to chain the flags such as:

  • sayn run -t t1 -t t2
  • sayn run -x t1 -x t2

We want to alter this behaviour so all tasks can be added after one flag only:

  • sayn run -t t1 t2
  • sayn run -x t1 t2
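The requested behaviour can be sketched with the standard library's argparse. This is not SAYN's actual CLI code, and the long option names `--tasks` and `--exclude` are assumptions made for the example.

```python
import argparse

parser = argparse.ArgumentParser(prog="sayn")
subparsers = parser.add_subparsers(dest="command")
run = subparsers.add_parser("run")
# nargs="+" lets one flag take several task names; action="extend" (Python 3.8+)
# also keeps the chained form (-t t1 -t t2) working.
run.add_argument("-t", "--tasks", nargs="+", action="extend", default=[])
run.add_argument("-x", "--exclude", nargs="+", action="extend", default=[])

args = parser.parse_args(["run", "-t", "t1", "t2", "-x", "t3"])
print(args.tasks, args.exclude)
```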

Automatically import macros in specific folder

  • Macros in macros folder in sql folder should automatically be imported and made available in the Jinja environment.
  • Investigate how to make those macros available through the Python tasks.

Fix missing parent in dag error message

The current error message if a parent is not present in any dag is reversed. It seems to be structured as:

"Some parents are missing from dag
In task {parent_task_name}: {task_name}"

It should be the opposite:
"Some parents are missing from dag
In task {task_name}: {parent_task_name}"

Add spinner to UI

Add a spinner to the new UI in order to reduce the verbosity of usual SAYN runs:

  • The spinner should contain task number, task name, start time
  • Then we add info of the current step executing
  • Once the spinner finishes, it turns into a green tick or red cross depending on success or failure.
  • If a warning happened, we should keep the warning in the final statements on the console.

BUG Test in view causes only the tested columns to materialise

This only happens when you specify the test in the config of a model that is materialised as a view in the project.yaml

Database: Snowflake
SAYN version: 0.6.5

Line causing the issue:
{{ config(columns=[{'name': 'payer_id', 'tests': ['unique', 'not_null']}]) }}

BigQuery Documentation Fixes

Currently, the BigQuery part of the documentation does not have an example of implementation which makes it difficult to use. Also, the credentials_path is mislabeled as required when it's not if you have gcloud installed.

Re-running SAYN on init project crashes

There is an issue when re-running sayn run. Please see the detailed steps below:

  • sayn init -> works fine
  • sayn run (1st time) -> works fine
  • sayn run (2nd time) -> errors

Please see the details of the error below:

✖ Result.Err (database_error::sql_execution_error): {'exception': OperationalError('error in view f_rankings: no such table: main.f_fighter_results'), 'db': 'warehouse', 'script': 'ALTER TABLE sayn_tmp_dim_tournaments RENAME TO dim_tournaments;\n'}
✖ Result.Err (database_error::sql_execution_error): {'exception': OperationalError('error in view f_rankings: no such table: main.f_fighter_results'), 'db': 'warehouse', 'script': 'ALTER TABLE sayn_tmp_dim_tournaments RENAME TO dim_tournaments;\n'}

Copy Task | Failure With Columns But No Type

What:
Copy tasks with columns in the ddl attribute fail if the column types are not specified. The table creation statement is empty (i.e. it does not define the column names / types). See the create_table output below:

CREATE OR REPLACE TABLE test_staging.sayn_tmp_xxx
;

Other Info:
Task definition:

    type: copy
    source:
      db: db1
      schema: s1
      table: t1
    destination:
      db: db2
      tmp_schema: ts1
      schema: s2
      table: t2
    ddl:
      columns:
        - c1
        - c2

Change dag concept to group

The name "dag" to separate tasks is a bit misleading as it implies SAYN creates multiple DAGs whilst only one is created currently. We will call our concept of "dag" a "group" from now on. The following changes need to be implemented:

  • Folder "dags" to be renamed "tasks".
  • Change command sayn run -t dag:x to sayn run -t group:x.
  • Remove the necessity to define "dags" in "project.yaml". We will now scan for all YAML files in the "tasks" folders.
  • Change the init project.
  • Change the documentation. Explain the "group" concept on the "Task" documentation overview.
  • Review README.md

Improve Python Task Errors

Review the overall errors from Python tasks. We need to be able to easily differentiate errors caused by the SAYN setup (e.g. missing module file, wrong name of module file) from errors in the Python code. List of things to fix:

  • wrong reference to class in task definition leads to fairly generic error message (e.g. No module named 'sayn_python_tasks.test') -> capture when the file does not exist (bear in mind the class could be defined at multiple levels: Test, test.Test, test.test.Test).
  • prettify the missing __init__.py error
  • make the error message clearer when missing Result object returned (i.e. return self.success()/self.fail()). The current error message is: ✖ Result.Err (task_result::missing_result_error): {'result': None}
  • review overall python errors

Issue recreating tables in BigQuery

When a table in BigQuery has partitioning or clustering set, a create or replace table that changes that definition will fail. Need to update the logic to perform a DROP TABLE and then a CREATE TABLE instead.

Implement Data Testing & DDL Replacement

Implement the data testing feature using a sayn test command with:

  • Generic tests (unicity, nullity, allowed value).
  • Custom tests.

As part of this, we will replace the DDL settings with a columns parameter. Generic tests will be defined in this definition. We will introduce a parameter for table properties (e.g. distribution key, etc.) currently set in the DDL and post-hook (for example, this will replace the indexes DDL on PostgreSQL).

Copy task fails for tables with `json` columns

I'm running a copy task with a table users that contains a metadata column with type json.

tasks:
  copy_users:  
    type: copy
    source:
      db: input_db
      table: users
    destination:
      db: output_db
      table: users

And when running the pipeline I get the following error:

INFO|Run Steps: Prepare Load, Load Data, Move Table
INFO|[1/3]  Executing Prepare Load
INFO|[1/3]  Prepare Load (97.8ms)
INFO|[2/3]  Executing Load Data
ERROR|Failed (2s) invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: {'...
COPY sayn_tmp_users, line 10, column metadata: "{'title': 'Engineer'}"

I've successfully copied other tables that don't have json columns.

Both input and output databases are PostgreSQL, running in Python 3.9.13, SAYN version 0.6.5.

Many thanks in advance!
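The error text suggests the metadata value reached COPY as the Python representation of a dict (single quotes) rather than as JSON text. One possible workaround, shown here only as a sketch and not confirmed as the fix adopted in SAYN, is to serialise dict and list values before loading. `prepare_value_for_copy` is a hypothetical helper.

```python
import json


def prepare_value_for_copy(value):
    """Serialise dict and list values as JSON text; leave other values as-is."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


row = {"id": 10, "metadata": {"title": "Engineer"}}
prepared = {column: prepare_value_for_copy(value) for column, value in row.items()}
```

After this step the metadata value uses double quotes, which is what the PostgreSQL json type expects.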

Core Unit Tests

Add further unit testing to test core tasks:

  • SQL
  • AutoSQL
  • Copy

Skipping Message Includes Step Of Failing Parent

When a parent fails and the child is skipped, the message for the skipped child appears to include the step at which the parent failed. See the message below:

:warning: [7/7] f_rankings (0ms) On step 5/6 Move: Skipping due to parent errors (0ms)

We need to remove the step part which is not relevant as we are skipping the task.
