173tech / sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Home Page: https://173tech.github.io/sayn
License: Apache License 2.0
At the moment we rely on SQLAlchemy data type translations when the columns in copy tasks are not specified, but these don't necessarily cover all cases. The proposed actions here:
While reviewing this, it might be good to look into #36, particularly if point 2 is done (table creation can be part of loading).
Windows makes environment variables upper case, but SAYN expects them to match the case of the names used internally. Example execution on GitHub Actions: https://github.com/173TECH/sayn/pull/71/checks?check_run_id=1308232133
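A minimal illustration of the mismatch, and of a case-insensitive lookup that would sidestep it. The variable name used here is purely illustrative, not one SAYN actually reads:

```python
import os

# Simulate Windows upper-casing the stored variable name.
os.environ["SAYN_PARAMETER_SCHEMA"] = "analytics"

def get_env_case_insensitive(name):
    """Look an environment variable up ignoring the case of its name."""
    for key, value in os.environ.items():
        if key.upper() == name.upper():
            return value
    return None

# An exact-case lookup with the internal (lower-case) name would fail on
# Windows; the case-insensitive one still finds it.
print(get_env_case_insensitive("sayn_parameter_schema"))  # analytics
```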
Current SQL/AutoSQL/Copy tasks are directed towards the default DB. Add the option to direct those towards other DBs defined in the project. Suggested approach: add a db entry in the destination of these tasks.
The dependency graph is reversed at the moment when we produce it with sayn dag-image, making parents look like children.
The current config for pydantic allows extra fields to be present when validating, but this can be confusing so we should remove that. To do that, the extra value should be set to forbid in all our validators (more info: https://pydantic-docs.helpmanual.io/usage/model_config/).
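In pydantic v1 this is done per model with `class Config: extra = "forbid"`. The stdlib-only sketch below shows the intended behaviour, unknown keys rejected instead of silently accepted:

```python
# Sketch of "forbid extra fields" validation; with pydantic the same effect
# comes from setting extra = "forbid" in each model's Config.
def validate_fields(data, allowed):
    extra = set(data) - set(allowed)
    if extra:
        raise ValueError(f"extra fields not permitted: {sorted(extra)}")
    return data

validate_fields({"name": "t1", "type": "sql"}, {"name", "type"})  # passes
```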
For incremental tasks, we currently abort if the DDL column order does not match the column order of the table. Replace that with dynamic column ordering.
Add an example of chaining tasks in the SAYN documentation, e.g. sayn run -t task_1 -t task_2.
We need to add support for an append-only mode with the copy task. This will require a SAYN load-time field to be added to the loads.
Currently, database errors raise exceptions and so the message is just the error produced by the driver. We can improve this in known cases like autosql and copy tasks, or with methods of the Database class, so that we output a more informative error. Some examples: for load_data, on "Too many variables" we could suggest using max_batch_rows in the credentials.
Currently, when you run a task with a preset that doesn't exist it throws this sort of traceback:
Result.Err (get_tasks_dict::task_parsing_error): {'errors': {'dim_customers': Result.Err (get_task_dict::missing_preset): {'group': 'core', 'task': 'dim_customers', 'preset': 'modelling'}
Would be nice to just display "missing_preset in task: dim_customers" instead
We need to revamp the DDL so it is more efficient in handling how different databases implement DDLs (e.g. SQLite cannot amend the primary key with an ALTER statement).
Simplify what is exposed to the user in the database API. Keep only the following methods:
load_data crashes when records are not homogeneous. We can simply reshape the batch before issuing the insert statement to fix it here: sayn/sayn/database/__init__.py, line 298 in d1cf36b.
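One way to reshape such a batch is to pad every record with the union of keys seen across the batch; this is a hypothetical sketch, not SAYN's actual implementation:

```python
def reshape_batch(records):
    """Return records padded so all share the union of keys (missing -> None)."""
    keys = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                keys.append(key)
    return [{key: record.get(key) for key in keys} for record in records]

batch = [{"id": 1, "name": "a"}, {"id": 2, "extra": True}]
print(reshape_batch(batch))
```

With the padded records, a single INSERT statement with one fixed column list can be issued for the whole batch.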
Integrate the option to initialise different projects with sayn init; this should include only the minimum required to get a SAYN project going from scratch.
The DB refactoring changes introduced the following cosmetic issues on the CLI:
Would be great to have the possibility to work with DB2 databases. It should be fairly easy to implement.
Run a set of YAML validations on the credentials based on the type. For example:
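A hypothetical sketch of such per-type validation; the required fields listed here are assumptions for illustration, not SAYN's actual credential schema:

```python
# Required fields per credential type (illustrative only).
REQUIRED_FIELDS = {
    "postgresql": {"host", "user", "password", "database"},
    "sqlite": {"database"},
}

def validate_credentials(creds):
    """Raise ValueError if the credentials dict is missing required fields."""
    required = REQUIRED_FIELDS.get(creds.get("type"))
    if required is None:
        raise ValueError(f"unknown credential type: {creds.get('type')}")
    missing = required - creds.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

validate_credentials({"type": "sqlite", "database": "test.db"})  # passes
```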
There is currently an issue on Windows which prevents SAYN from running. Please see attachments.
We should rethink the flow of the first-time user experience for people who are getting started with SAYN:
When load_data is called with no ddl information, table structure is determined by the first row. If this row contains None values the type can't be determined and table creation will fail.
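A possible fix is to infer each column's type from the first non-None value seen in any row, rather than from the first row alone; a minimal sketch of that idea (not SAYN's actual code):

```python
def infer_types(rows):
    """Map each column to the type of its first non-None value across rows."""
    types = {}
    for row in rows:
        for col, val in row.items():
            if col not in types and val is not None:
                types[col] = type(val)
    return types

# The first row alone would leave "id" untyped; scanning further rows fixes it.
rows = [{"id": None, "name": "a"}, {"id": 1, "name": None}]
print(infer_types(rows))
```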
If a model is a view in BigQuery and is changed to table materialisation, the current process will fail as it does not issue a DROP VIEW. To do:
We need to make the following error messages human friendly.
Result.Err (parsing::read_file_error): {'filename': PosixPath('project.yaml')}
Result.Err (task_type::invalid_task_type_error): {'type': 'nodummy'}
Currently sayn compile produces compiled SQL files for sql and autosql task types. For autosql tasks the output is only the SELECT statement provided by the user. Copy tasks don't produce any output anymore. When using sayn compile -d, the full SQL to be run should be produced in the compile folder, as it is now in 0.3.0.
Database.load_data needs some improvements: add a load_data_stream based on the COPY statement we have for PostgreSQL.
Currently if running / excluding multiple tasks, you have to chain the flags such as:
sayn run -t t1 -t t2
sayn run -x t1 -x t2
We want to alter this behaviour so all tasks can be added after one flag only:
sayn run -t t1 t2
sayn run -x t1 t2
A macros folder in the sql folder should automatically be imported and made available in the Jinja environment.
The current error message if a parent is not present in any dag is reversed. It seems to be structured as:
"Some parents are missing from dag
In task {parent_task_name}: {task_name}"
It should be the opposite:
"Some parents are missing from dag
In task {task_name}: {parent_task_name}"
Add a spinner to the new UI in order to reduce the verbosity of usual SAYN runs:
with self.step('Execute SQL'):
    return self.failed()
With this code, the step is tracked as successful. We need a better way to fail a step when using the context.
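One option is to have the failure call raise a sentinel exception that the step context manager catches and records; the names below mirror the snippet above, but the implementation is a sketch, not SAYN's actual design:

```python
from contextlib import contextmanager

class StepFailed(Exception):
    """Sentinel raised to mark the enclosing step as failed."""

class Task:
    def __init__(self):
        self.step_results = {}

    @contextmanager
    def step(self, name):
        try:
            yield
        except StepFailed:
            self.step_results[name] = "failed"
        else:
            self.step_results[name] = "success"

    def fail_step(self):
        # Raising makes the surrounding `with self.step(...)` record a failure
        # instead of silently completing the block.
        raise StepFailed()

task = Task()
with task.step("Execute SQL"):
    task.fail_step()
print(task.step_results)  # {'Execute SQL': 'failed'}
```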
This only happens when you specify the test in the config of a model that is materialised as a view in the project.yaml
Database: Snowflake
SAYN version: 0.6.5
Line causing the issue:
{{ config(columns=[{'name': 'payer_id', 'tests': ['unique', 'not_null']}]) }}
Currently, the BigQuery part of the documentation does not have an implementation example, which makes it difficult to use. Also, credentials_path is mislabelled as required when it is not if you have gcloud installed.
There is an issue when re-running sayn run, please see the detailed steps below:
sayn init -> works fine
sayn run (1st time) -> works fine
sayn run (2nd time) -> errors
Please see the detail of the error below:
✖ Result.Err (database_error::sql_execution_error): {'exception': OperationalError('error in view f_rankings: no such table: main.f_fighter_results'), 'db': 'warehouse', 'script': 'ALTER TABLE sayn_tmp_dim_tournaments RENAME TO dim_tournaments;\n'}
To avoid errors when multiple instances run at the same time, we should include a pid with something like this: https://pypi.org/project/simple-pid/
What:
Copy tasks with columns in the ddl attribute fail if the column types are not specified. The table creation statements are empty (i.e. they do not define the column names / types). See the create_table output below:
CREATE OR REPLACE TABLE test_staging.sayn_tmp_xxx
;
Other Info:
Task definition:
type: copy
source:
  db: db1
  schema: s1
  table: t1
destination:
  db: db2
  tmp_schema: ts1
  schema: s2
  table: t2
ddl:
  columns:
    - c1
    - c2
The name "dag" to separate tasks is a bit misleading as it somehow implies SAYN creates multiple DAGs, whilst only one is created currently. We will change our concept of "dag" to "group" from now on. The following changes need to be implemented:
sayn run -t dag:x becomes sayn run -t group:x.
Currently, the load_data method on the database API only inserts. It would be nice to allow the possibility to automatically create the table if it does not exist, and append otherwise. A bit similar to the if_exists param from pandas.to_sql.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
In addition, we should add the ability to accept dataframes, list of lists and list of dicts.
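A sketch of that create-if-missing-then-append behaviour using sqlite3 for illustration; the function name and signature are assumptions, not SAYN's actual API:

```python
import sqlite3

def load_data(conn, table, records):
    """Create the table if it does not exist, then append the records."""
    cols = list(records[0])
    # CREATE TABLE IF NOT EXISTS makes repeated calls append rather than fail.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(r[c] for c in cols) for r in records],
    )

conn = sqlite3.connect(":memory:")
load_data(conn, "users", [{"id": 1}, {"id": 2}])
load_data(conn, "users", [{"id": 3}])  # table exists -> rows are appended
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 3
```

Accepting dataframes, lists of lists and lists of dicts would then only require normalising the input into this records shape before inserting.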
Review the overall errors from Python tasks. We need to be able to easily differentiate errors which are caused by SAYN setup (e.g. missing module file, wrong name of module file) from Python code errors. List of things to fix:
Test, test.Test, test.test.Test
init.py error
✖ Result.Err (task_result::missing_result_error): {'result': None}
When a table in BigQuery has partitioning or clustering set, a create or replace table that changes that definition will fail. Need to update the logic to perform a DROP TABLE and then a CREATE TABLE instead.
We currently have the sayn init unit test commented out; we need to get this re-implemented.
Implement the data testing feature using a sayn test command with:
As part of this, we will replace the DDL settings with a columns parameter. Generic tests will be defined in this definition. We will introduce a parameter for table properties (e.g. distribution key, etc.) currently set in the DDL, and a post-hook (which will replace the indexes DDL on PostgreSQL, for example).
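A hypothetical sketch of what the columns parameter could look like in a task definition, mirroring the config(columns=...) call shown elsewhere on this page; the exact shape is an assumption:

```yaml
columns:
  - name: payer_id
    tests:
      - unique
      - not_null
```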
As it currently stands, if a task fails or is skipped, its children will be skipped. We should add the ability to control this behaviour on the task definitions in the DAG.
I'm running a copy task with a table users that contains a metadata column with type json.
tasks:
  copy_users:
    type: copy
    source:
      db: input_db
      table: users
    destination:
      db: output_db
      table: users
And when running the pipeline I get the following error:
INFO|Run Steps: Prepare Load, Load Data, Move Table
INFO|[1/3] Executing Prepare Load
INFO|[1/3] Prepare Load (97.8ms)
INFO|[2/3] Executing Load Data
ERROR|Failed (2s) invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: {'...
COPY sayn_tmp_users, line 10, column metadata: "{'title': 'Engineer'}"
I've successfully copied other tables that don't have json columns.
Both input and output databases are PostgreSQL, running on Python 3.9.13, SAYN version 0.6.5.
Many thanks in advance!
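The rejected value {'title': 'Engineer'} is a Python dict repr (single quotes), which PostgreSQL rejects as json; serialising with json.dumps produces valid JSON. This only illustrates the error, it is not SAYN's code:

```python
import json

record = {"title": "Engineer"}
print(str(record))         # {'title': 'Engineer'}  -> Python repr, invalid JSON
print(json.dumps(record))  # {"title": "Engineer"}  -> valid JSON
```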
Add further unit testing to test core tasks:
When a parent fails and the child is skipped, we currently display the step at which the parent failed; see the error message below:
:warning: [7/7] f_rankings (0ms) On step 5/6 Move: Skipping due to parent errors (0ms)
We need to remove the step part which is not relevant as we are skipping the task.