173tech / sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Home Page: https://173tech.github.io/sayn
License: Apache License 2.0
At the moment we rely on SQLAlchemy data type translations when the columns in copy tasks are not specified, but these don't necessarily cover all cases. The proposed actions here:
While reviewing this, it might be good to look into #36, particularly if point 2 is done (table creation can be part of loading).
Windows makes environment variables upper case, but SAYN expects them to match the case of the names used internally. Example execution on GitHub Actions: https://github.com/173TECH/sayn/pull/71/checks?check_run_id=1308232133
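A minimal illustration of the mismatch, and of a case-insensitive lookup that would sidestep it. The variable name used here is purely illustrative, not one SAYN actually reads:

```python
import os

# Simulate Windows upper-casing the stored variable name.
os.environ["SAYN_PARAMETER_SCHEMA"] = "analytics"

def get_env_case_insensitive(name):
    """Look an environment variable up ignoring the case of its name."""
    for key, value in os.environ.items():
        if key.upper() == name.upper():
            return value
    return None

# An exact-case lookup with the internal (lower-case) name would fail on
# Windows; the case-insensitive one still finds it.
print(get_env_case_insensitive("sayn_parameter_schema"))  # analytics
```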
Current SQL/AutoSQL/Copy tasks are directed towards the default DB. Add the option to direct those towards other DBs defined in the project. Suggested approach: add a db entry in the destination of these tasks.
The dependency graph is reversed at the moment when we produce it with sayn dag-image, making parents look like children.
The current config for pydantic allows extra fields to be present when validating, but this can be confusing so we should remove that. To do that, the extra value should be set to forbid in all our validators (more info: https://pydantic-docs.helpmanual.io/usage/model_config/).
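In pydantic v1 this is done per model with `class Config: extra = "forbid"`. The stdlib-only sketch below shows the intended behaviour, unknown keys rejected instead of silently accepted:

```python
# Sketch of "forbid extra fields" validation; with pydantic the same effect
# comes from setting extra = "forbid" in each model's Config.
def validate_fields(data, allowed):
    extra = set(data) - set(allowed)
    if extra:
        raise ValueError(f"extra fields not permitted: {sorted(extra)}")
    return data

validate_fields({"name": "t1", "type": "sql"}, {"name", "type"})  # passes
```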
For incremental tasks, we currently abort if the DDL column order does not match the column order of the table. Replace that with dynamic column ordering.
Add an example of chaining tasks in the SAYN documentation, e.g. sayn run -t task_1 -t task_2.
We need to add support for an append-only mode with the copy task. This will require a SAYN load-time field to be added to the loads.
Currently, database errors raise exceptions and so the message is just the error produced by the driver. We can improve this in known cases like autosql and copy tasks, or with methods of the Database class, so that we output a more informative error. Some examples: for load_data, on "Too many variables" we could suggest using max_batch_rows in the credentials.
Currently, when you run a task with a preset that doesn't exist it throws this sort of traceback:
Result.Err (get_tasks_dict::task_parsing_error): {'errors': {'dim_customers': Result.Err (get_task_dict::missing_preset): {'group': 'core', 'task': 'dim_customers', 'preset': 'modelling'}
Would be nice to just display "missing_preset in task: dim_customers" instead
We need to revamp the DDL so it is more efficient in handling how different databases implement DDLs (e.g. SQLite cannot amend the primary key with an ALTER statement).
Simplify what is exposed to the user in the database API. Keep only the following methods:
load_data crashes when records are not homogeneous. We can simply reshape the batch before issuing the insert statement to fix it here: sayn/sayn/database/__init__.py, line 298 in d1cf36b.
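One way to reshape such a batch is to pad every record with the union of keys seen across the batch; this is a hypothetical sketch, not SAYN's actual implementation:

```python
def reshape_batch(records):
    """Return records padded so all share the union of keys (missing -> None)."""
    keys = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                keys.append(key)
    return [{key: record.get(key) for key in keys} for record in records]

batch = [{"id": 1, "name": "a"}, {"id": 2, "extra": True}]
print(reshape_batch(batch))
```

With the padded records, a single INSERT statement with one fixed column list can be issued for the whole batch.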
Integrate the option to initialise different projects with sayn init; this should include only the minimum required to get a SAYN project going from scratch.
The DB refactoring changes introduced the following cosmetic issues on the CLI:
Would be great to have the possibility to work with DB2 databases. It should be fairly easy to implement.
Run a set of YAML validations on the credentials based on the type. For example:
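A hypothetical sketch of such per-type validation; the required fields listed here are assumptions for illustration, not SAYN's actual credential schema:

```python
# Required fields per credential type (illustrative only).
REQUIRED_FIELDS = {
    "postgresql": {"host", "user", "password", "database"},
    "sqlite": {"database"},
}

def validate_credentials(creds):
    """Raise ValueError if the credentials dict is missing required fields."""
    required = REQUIRED_FIELDS.get(creds.get("type"))
    if required is None:
        raise ValueError(f"unknown credential type: {creds.get('type')}")
    missing = required - creds.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

validate_credentials({"type": "sqlite", "database": "test.db"})  # passes
```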
There is currently an issue on Windows which prevents SAYN from running. Please see attachments.
We should rethink the flow of the first-time user experience for people who are getting started with SAYN:
When load_data is called with no ddl information, table structure is determined by the first row. If this row contains None values the type can't be determined and table creation will fail.
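A possible fix is to infer each column's type from the first non-None value seen in any row, rather than from the first row alone; a minimal sketch of that idea (not SAYN's actual code):

```python
def infer_types(rows):
    """Map each column to the type of its first non-None value across rows."""
    types = {}
    for row in rows:
        for col, val in row.items():
            if col not in types and val is not None:
                types[col] = type(val)
    return types

# The first row alone would leave "id" untyped; scanning further rows fixes it.
rows = [{"id": None, "name": "a"}, {"id": 1, "name": None}]
print(infer_types(rows))
```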
If a model is a view in BigQuery and is changed to table materialisation, the current process will fail as it does not issue a DROP VIEW. To do:
We need to make the following error messages human friendly.
Result.Err (parsing::read_file_error): {'filename': PosixPath('project.yaml')}
Result.Err (task_type::invalid_task_type_error): {'type': 'nodummy'}
Currently sayn compile produces compiled SQL files for sql and autosql task types. For autosql tasks the output is only the SELECT statement provided by the user. Copy tasks don't produce any output anymore. When using sayn compile -d, the full SQL to be run should be produced in the compile folder, as it is now in 0.3.0.
Database.load_data needs some improvements: add a load_data_stream based on the COPY statement we have for PostgreSQL.
Currently if running / excluding multiple tasks, you have to chain the flags such as:
sayn run -t t1 -t t2
sayn run -x t1 -x t2
We want to alter this behaviour so all tasks can be added after one flag only:
sayn run -t t1 t2
sayn run -x t1 t2
A macros folder in the sql folder should automatically be imported and made available in the Jinja environment.
The current error message if a parent is not present in any dag is reversed. It seems to be structured as:
"Some parents are missing from dag
In task {parent_task_name}: {task_name}"
It should be the opposite:
"Some parents are missing from dag
In task {task_name}: {parent_task_name}"
Add a spinner to the new UI in order to reduce the verbosity of usual SAYN runs:
with self.step('Execute SQL'):
    return self.failed()
With this code, the step is tracked as successful. We need a better way to fail a step when using the context.
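One option is to have the failure call raise a sentinel exception that the step context manager catches and records; the names below mirror the snippet above, but the implementation is a sketch, not SAYN's actual design:

```python
from contextlib import contextmanager

class StepFailed(Exception):
    """Sentinel raised to mark the enclosing step as failed."""

class Task:
    def __init__(self):
        self.step_results = {}

    @contextmanager
    def step(self, name):
        try:
            yield
        except StepFailed:
            self.step_results[name] = "failed"
        else:
            self.step_results[name] = "success"

    def fail_step(self):
        # Raising makes the surrounding `with self.step(...)` record a failure
        # instead of silently completing the block.
        raise StepFailed()

task = Task()
with task.step("Execute SQL"):
    task.fail_step()
print(task.step_results)  # {'Execute SQL': 'failed'}
```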
This only happens when you specify the test in the config of a model that is materialised as a view in the project.yaml
Database: Snowflake
SAYN version: 0.6.5
Line causing the issue:
{{ config(columns=[{'name': 'payer_id', 'tests': ['unique', 'not_null']}]) }}
Currently, the BigQuery part of the documentation does not have an implementation example, which makes it difficult to use. Also, credentials_path is mislabelled as required when it is not if you have gcloud installed.
There is an issue when re-running sayn run, please see the detailed steps below:
sayn init -> works fine
sayn run (1st time) -> works fine
sayn run (2nd time) -> errors
Please see the detail of the error below:
✖ Result.Err (database_error::sql_execution_error): {'exception': OperationalError('error in view f_rankings: no such table: main.f_fighter_results'), 'db': 'warehouse', 'script': 'ALTER TABLE sayn_tmp_dim_tournaments RENAME TO dim_tournaments;\n'}
To avoid errors when multiple instances run at the same time, we should include a pid with something like this: https://pypi.org/project/simple-pid/
What:
Copy tasks with columns in the ddl attribute fail if the column types are not specified. The table creation statements are empty (i.e. they do not define the column names / types). See the create_table output below:
CREATE OR REPLACE TABLE test_staging.sayn_tmp_xxx
;
Other Info:
Task definition:
type: copy
source:
  db: db1
  schema: s1
  table: t1
destination:
  db: db2
  tmp_schema: ts1
  schema: s2
  table: t2
ddl:
  columns:
    - c1
    - c2
The name "dag" to separate tasks is a bit misleading as it somehow implies SAYN creates multiple DAGs, whilst only one is created currently. We will change our concept of "dag" to "group" from now on. The following changes need to be implemented:
sayn run -t dag:x becomes sayn run -t group:x.
Currently, the load_data method on the database API only inserts. It would be nice to allow the possibility to automatically create the table if it does not exist, and append otherwise. A bit similar to the if_exists param from pandas.to_sql.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
In addition, we should add the ability to accept dataframes, list of lists and list of dicts.
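A sketch of that create-if-missing-then-append behaviour using sqlite3 for illustration; the function name and signature are assumptions, not SAYN's actual API:

```python
import sqlite3

def load_data(conn, table, records):
    """Create the table if it does not exist, then append the records."""
    cols = list(records[0])
    # CREATE TABLE IF NOT EXISTS makes repeated calls append rather than fail.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(r[c] for c in cols) for r in records],
    )

conn = sqlite3.connect(":memory:")
load_data(conn, "users", [{"id": 1}, {"id": 2}])
load_data(conn, "users", [{"id": 3}])  # table exists -> rows are appended
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 3
```

Accepting dataframes, lists of lists and lists of dicts would then only require normalising the input into this records shape before inserting.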
Review the overall errors from Python tasks. We need to be able to easily differentiate errors which are caused by SAYN setup (e.g. missing module file, wrong name of module file) from Python code errors. List of things to fix:
Test, test.Test, test.test.Test
init.py error
✖ Result.Err (task_result::missing_result_error): {'result': None}
When a table in BigQuery has partitioning or clustering set, a create or replace table that changes that definition will fail. Need to update the logic to perform a DROP TABLE and then a CREATE TABLE instead.
We currently have the sayn init unit test commented out; we need to get this re-implemented.
Implement the data testing feature using a sayn test command with:
As part of this, we will replace the DDL settings with a columns parameter. Generic tests will be defined in this definition. We will introduce a parameter for table properties (e.g. distribution key, etc.) currently set in the DDL, and a post-hook (which will replace the indexes DDL on PostgreSQL, for example).
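A hypothetical sketch of what the columns parameter could look like in a task definition, mirroring the config(columns=...) call shown elsewhere on this page; the exact shape is an assumption:

```yaml
columns:
  - name: payer_id
    tests:
      - unique
      - not_null
```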
As it currently stands, if a task fails or is skipped, its children will be skipped. We should add the ability to control this behaviour on the task definitions in the DAG.
I'm running a copy task with a table users that contains a metadata column with type json.
tasks:
  copy_users:
    type: copy
    source:
      db: input_db
      table: users
    destination:
      db: output_db
      table: users
And when running the pipeline I get the following error:
INFO|Run Steps: Prepare Load, Load Data, Move Table
INFO|[1/3] Executing Prepare Load
INFO|[1/3] Prepare Load (97.8ms)
INFO|[2/3] Executing Load Data
ERROR|Failed (2s) invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: {'...
COPY sayn_tmp_users, line 10, column metadata: "{'title': 'Engineer'}"
I've successfully copied other tables that don't have json columns.
Both input and output databases are PostgreSQL, running on Python 3.9.13, SAYN version 0.6.5.
Many thanks in advance!
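The rejected value {'title': 'Engineer'} is a Python dict repr (single quotes), which PostgreSQL rejects as json; serialising with json.dumps produces valid JSON. This only illustrates the error, it is not SAYN's code:

```python
import json

record = {"title": "Engineer"}
print(str(record))         # {'title': 'Engineer'}  -> Python repr, invalid JSON
print(json.dumps(record))  # {"title": "Engineer"}  -> valid JSON
```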
Add further unit testing to test core tasks:
When a parent fails and the child is skipped, we currently display the step at which the parent failed; see the error message below:
:warning: [7/7] f_rankings (0ms) On step 5/6 Move: Skipping due to parent errors (0ms)
We need to remove the step part which is not relevant as we are skipping the task.