DaFlow is an Apache Spark based Data Flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
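For orientation, here is a minimal sketch of the extract → transform → load shape the framework automates. This is plain Spark with hypothetical paths, not DaFlow's actual API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daflow-sketch").getOrCreate()

    // Extract: read from one of several supported source types.
    val source: DataFrame = spark.read.json("/data/in/feed.json")

    // Transform: apply one category of transformation rule (here, a filter).
    val transformed = source.filter("amount > 0")

    // Load: write to one of several supported destination types.
    transformed.write.mode("overwrite").parquet("/data/out/feed")

    spark.stop()
  }
}
```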
Apache Calcite is a foundational library for SQL analysis and parsing, used in almost every SQL-based big data project. Explore Calcite from a usage perspective within the ETL framework.
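As a starting point for the exploration, a minimal sketch of parsing a SQL statement into an AST with Calcite's standard parser API (the query itself is illustrative):

```scala
import org.apache.calcite.sql.parser.SqlParser

object CalciteParseSketch {
  def main(args: Array[String]): Unit = {
    val parser = SqlParser.create("SELECT id, name FROM users WHERE age > 21")
    val ast = parser.parseQuery() // SqlNode tree for the statement
    println(ast.getKind)          // e.g. SELECT
    println(ast.toString)         // pretty-printed SQL
  }
}
```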
Currently, the ETL Job Launcher is tightly coupled to schema validation of the transformed data, and the load stage always consumes the validation step's results. This should be made more generic so that a boolean flag from the job config decides whether or not to validate the transformed data.
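A minimal sketch of the intended decoupling, assuming a hypothetical `validateSchema` boolean in the job config (field and method names are illustrative, not the framework's actual API):

```scala
import org.apache.spark.sql.DataFrame

object LaunchSketch {
  // Hypothetical job config carrying the validation toggle.
  case class JobConfig(validateSchema: Boolean)

  def runLoad(conf: JobConfig, transformed: DataFrame): Unit = {
    // Validate only when the job config asks for it; otherwise load directly.
    val toLoad =
      if (conf.validateSchema) validate(transformed)
      else transformed
    load(toLoad)
  }

  // Placeholder for the actual validation step.
  def validate(df: DataFrame): DataFrame = df

  def load(df: DataFrame): Unit =
    df.write.mode("append").parquet("/data/out/feed")
}
```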
Currently, extraction supports multiple feeds, but passing multiple feeds through the transformation stage and finally loading them is not supported. End-to-end support for multiple feeds is required. A strategy is also needed for guaranteeing atomicity when multiple feeds are loaded; see the sketch below.
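One possible shape for multi-feed support (all names hypothetical; the atomicity strategy, e.g. staging all outputs and committing together, remains an open question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiFeedSketch {
  case class Feed(name: String, path: String)

  def run(spark: SparkSession, feeds: Seq[Feed]): Unit = {
    // Transform every feed first, so a failure aborts before anything is loaded.
    val transformed: Seq[(Feed, DataFrame)] =
      feeds.map(f => f -> spark.read.parquet(f.path).filter("amount > 0"))

    // Load all feeds; true atomicity would need staging + commit, which is TBD.
    transformed.foreach { case (f, df) =>
      df.write.mode("overwrite").parquet(s"/data/out/${f.name}")
    }
  }
}
```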
Is your feature request related to a problem? Please describe.
https://absaoss.github.io/spline/
Spline is a data lineage tracking and visualization tool for Apache Spark™. It would be good to analyze metrics with the output of Spline.
Describe the solution you'd like
Integrate Spline with the DaFlow job flow.
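For reference, the Spline Spark agent can be enabled programmatically, per the Spline documentation (agent artifact and version depend on the Spark version in use):

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

object SplineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daflow-with-spline").getOrCreate()

    // Attach the Spline listener so every write action emits a lineage event.
    spark.enableLineageTracking()

    // ... run the DaFlow job as usual; lineage is captured transparently.
  }
}
```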
Is your feature request related to a problem? Please describe.
Apache Iceberg is a new table format for large, slow-moving tabular data. From the load perspective, the ETL framework requires support for Iceberg.
Describe the solution you'd like
Exploration and code implementation are required to support the new format in the framework.
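A minimal sketch of what loading to Iceberg could look like from Spark 3, using the standard Iceberg Spark catalog integration (catalog, warehouse path, and table names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object IcebergLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daflow-iceberg-sketch")
      // Iceberg is exposed through a Spark catalog; property names per Iceberg docs.
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/data/warehouse")
      .getOrCreate()

    val df = spark.read.parquet("/data/out/feed")
    // DataFrameWriterV2 path: create or replace an Iceberg table from the feed.
    df.writeTo("demo.db.feed").using("iceberg").createOrReplace()
  }
}
```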
Is your feature request related to a problem? Please describe.
Currently, DaFlow job definitions are XML based only. The framework should accept job definitions in other formats; YAML is one of the most popular.
Describe the solution you'd like
Build parser classes that parse DaFlow job definitions from YAML files.
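A minimal parsing sketch, assuming Jackson's YAML dataformat and Scala module on the classpath (the `DaFlowJob` fields are hypothetical placeholders for the real job model):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical, simplified job model; the real one would mirror the XML definition.
case class DaFlowJob(name: String, source: String, sink: String)

object YamlJobParser {
  private val mapper = new ObjectMapper(new YAMLFactory())
  mapper.registerModule(DefaultScalaModule)

  def parse(yaml: String): DaFlowJob =
    mapper.readValue(yaml, classOf[DaFlowJob])
}

// Usage:
//   YamlJobParser.parse("name: demo\nsource: /in\nsink: /out")
```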
Is your feature request related to a problem? Please describe.
Currently, the ETL framework is built with the SBT build tool. However, for managing a multi-module project, the Maven build tool is easier to use and more extensible. Moving the build from SBT to Maven is also much needed for refactoring the code.
The ETL framework currently supports basic transformation functions such as filter, explode, and select. Joining two feeds is one of the most common and basic ETL operations.
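For context, joining two feeds is a one-liner at the Spark DataFrame level; the framework would need to expose something equivalent through its transformation rules (column and feed names here are illustrative):

```scala
import org.apache.spark.sql.DataFrame

object JoinSketch {
  // Inner join on a shared key; the join type would come from the job config.
  def joinFeeds(orders: DataFrame, customers: DataFrame): DataFrame =
    orders.join(customers, Seq("customer_id"), "inner")
}
```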
Is your feature request related to a problem? Please describe.
DaFlow is a complex project with several modules built on several technologies, so it is necessary to have a simple, easy-to-follow usage showcase.
Describe the solution you'd like
A Docker container-based demo would be an easily achievable way to showcase DaFlow usage.
A functionality/module is required for validating and transforming feed schemas. Maintaining schema versions is one of the basic requirements. Schemas should also be easily accessible from various endpoints via different methods.
In the future, the schema registry framework can be extended to store data-type mappings for different vendors.
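A minimal sketch of what the module's surface could look like, using Spark's StructType to represent schemas (all names are hypothetical):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical contract for the schema registry module.
trait SchemaRegistry {
  /** Register a new version of a feed's schema and return the version number. */
  def register(feed: String, schema: StructType): Int

  /** Fetch a specific version of a feed's schema, if it exists. */
  def fetch(feed: String, version: Int): Option[StructType]

  /** Fetch the latest registered version of a feed's schema. */
  def latest(feed: String): Option[StructType]

  /** Validate an actual feed schema against the latest registered version. */
  def validate(feed: String, actual: StructType): Boolean
}
```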