DaFlow is an Apache Spark based Data Flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
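For orientation, here is a minimal sketch of the extract → transform → load shape the framework automates. This is plain Spark with hypothetical paths, not DaFlow's actual API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daflow-sketch").getOrCreate()

    // Extract: read from one of several supported source types.
    val source: DataFrame = spark.read.json("/data/in/feed.json")

    // Transform: apply one category of transformation rule (here, a filter).
    val transformed = source.filter("amount > 0")

    // Load: write to one of several supported destination types.
    transformed.write.mode("overwrite").parquet("/data/out/feed")

    spark.stop()
  }
}
```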
Apache Calcite is a foundational library for SQL analysis and parsing, used in almost every SQL-based big data project. Explore Calcite from a usage perspective within the ETL framework.
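As a starting point for the exploration, a minimal sketch of parsing a SQL statement into an AST with Calcite's standard parser API (the query itself is illustrative):

```scala
import org.apache.calcite.sql.parser.SqlParser

object CalciteParseSketch {
  def main(args: Array[String]): Unit = {
    val parser = SqlParser.create("SELECT id, name FROM users WHERE age > 21")
    val ast = parser.parseQuery() // SqlNode tree for the statement
    println(ast.getKind)          // e.g. SELECT
    println(ast.toString)         // pretty-printed SQL
  }
}
```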
Currently, the ETL Job Launcher is tightly coupled to schema validation of the transformed data, and the load stage always consumes the validation step's results. This should be made more generic so that a boolean flag from the job config decides whether or not to validate the transformed data.
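A minimal sketch of the intended decoupling, assuming a hypothetical `validateSchema` boolean in the job config (field and method names are illustrative, not the framework's actual API):

```scala
import org.apache.spark.sql.DataFrame

object LaunchSketch {
  // Hypothetical job config carrying the validation toggle.
  case class JobConfig(validateSchema: Boolean)

  def runLoad(conf: JobConfig, transformed: DataFrame): Unit = {
    // Validate only when the job config asks for it; otherwise load directly.
    val toLoad =
      if (conf.validateSchema) validate(transformed)
      else transformed
    load(toLoad)
  }

  // Placeholder for the actual validation step.
  def validate(df: DataFrame): DataFrame = df

  def load(df: DataFrame): Unit =
    df.write.mode("append").parquet("/data/out/feed")
}
```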
Currently, extraction supports multiple feeds, but passing multiple feeds through the transformation stage and finally loading them is not supported. End-to-end support for multiple feeds is required. A strategy is also needed for guaranteeing atomicity when multiple feeds are loaded; see the sketch below.
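One possible shape for multi-feed support (all names hypothetical; the atomicity strategy, e.g. staging all outputs and committing together, remains an open question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiFeedSketch {
  case class Feed(name: String, path: String)

  def run(spark: SparkSession, feeds: Seq[Feed]): Unit = {
    // Transform every feed first, so a failure aborts before anything is loaded.
    val transformed: Seq[(Feed, DataFrame)] =
      feeds.map(f => f -> spark.read.parquet(f.path).filter("amount > 0"))

    // Load all feeds; true atomicity would need staging + commit, which is TBD.
    transformed.foreach { case (f, df) =>
      df.write.mode("overwrite").parquet(s"/data/out/${f.name}")
    }
  }
}
```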
Is your feature request related to a problem? Please describe.
https://absaoss.github.io/spline/
Spline is a data lineage tracking and visualization tool for Apache Spark™. It would be good to analyze metrics with the output of Spline.
Describe the solution you'd like
Integrate Spline with the DaFlow job flow.
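For reference, the Spline Spark agent can be enabled programmatically, per the Spline documentation (agent artifact and version depend on the Spark version in use):

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

object SplineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daflow-with-spline").getOrCreate()

    // Attach the Spline listener so every write action emits a lineage event.
    spark.enableLineageTracking()

    // ... run the DaFlow job as usual; lineage is captured transparently.
  }
}
```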
Is your feature request related to a problem? Please describe.
Apache Iceberg is a new table format for large, slow-moving tabular data. From the load perspective, the ETL framework requires support for Iceberg.
Describe the solution you'd like
Exploration and code implementation are required to support the new format in the framework.
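A minimal sketch of what loading to Iceberg could look like from Spark 3, using the standard Iceberg Spark catalog integration (catalog, warehouse path, and table names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object IcebergLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daflow-iceberg-sketch")
      // Iceberg is exposed through a Spark catalog; property names per Iceberg docs.
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/data/warehouse")
      .getOrCreate()

    val df = spark.read.parquet("/data/out/feed")
    // DataFrameWriterV2 path: create or replace an Iceberg table from the feed.
    df.writeTo("demo.db.feed").using("iceberg").createOrReplace()
  }
}
```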
Is your feature request related to a problem? Please describe.
Currently, DaFlow job definitions are XML based only. The framework should accept job definitions in other formats; YAML is one of the most popular.
Describe the solution you'd like
Build parser classes that parse DaFlow job definitions from YAML files.
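A minimal parsing sketch, assuming Jackson's YAML dataformat and Scala module on the classpath (the `DaFlowJob` fields are hypothetical placeholders for the real job model):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical, simplified job model; the real one would mirror the XML definition.
case class DaFlowJob(name: String, source: String, sink: String)

object YamlJobParser {
  private val mapper = new ObjectMapper(new YAMLFactory())
  mapper.registerModule(DefaultScalaModule)

  def parse(yaml: String): DaFlowJob =
    mapper.readValue(yaml, classOf[DaFlowJob])
}

// Usage:
//   YamlJobParser.parse("name: demo\nsource: /in\nsink: /out")
```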
Is your feature request related to a problem? Please describe.
Currently, the ETL framework is built with the SBT build tool. However, for managing a multi-module project, the Maven build tool is easier to use and more extensible. Moving the build from SBT to Maven is also much needed for refactoring the code.
The ETL framework currently supports basic transformation functions such as filter, explode, and select. Joining two feeds is one of the most common and basic ETL operations.
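For context, joining two feeds is a one-liner at the Spark DataFrame level; the framework would need to expose something equivalent through its transformation rules (column and feed names here are illustrative):

```scala
import org.apache.spark.sql.DataFrame

object JoinSketch {
  // Inner join on a shared key; the join type would come from the job config.
  def joinFeeds(orders: DataFrame, customers: DataFrame): DataFrame =
    orders.join(customers, Seq("customer_id"), "inner")
}
```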
Is your feature request related to a problem? Please describe.
DaFlow is a complex project with several modules built on several technologies, so it is necessary to have a simple, easy-to-follow usage showcase.
Describe the solution you'd like
A Docker container-based demo would be an easily achievable way to showcase DaFlow usage.
A functionality/module is required for validating and transforming feed schemas. Maintaining schema versions is one of the basic requirements. Schemas should also be easily accessible from various endpoints via different methods.
In the future, the schema registry framework can be extended to store data-type mappings for different vendors.
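A minimal sketch of what the module's surface could look like, using Spark's StructType to represent schemas (all names are hypothetical):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical contract for the schema registry module.
trait SchemaRegistry {
  /** Register a new version of a feed's schema and return the version number. */
  def register(feed: String, schema: StructType): Int

  /** Fetch a specific version of a feed's schema, if it exists. */
  def fetch(feed: String, version: Int): Option[StructType]

  /** Fetch the latest registered version of a feed's schema. */
  def latest(feed: String): Option[StructType]

  /** Validate an actual feed schema against the latest registered version. */
  def validate(feed: String, actual: StructType): Boolean
}
```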