uralian / ignition



Creating reusable workflows for Apache Spark

License: Apache License 2.0

Scala 100.00%

ignition's Issues

Implement frame JDBC Input step

Need to implement a step for importing data from a jdbc-compliant database.

Implement step and flow listeners

Implement the following listeners:

Step listener to be notified on step computation
Flow listener to be notified when flow starts/stops
Stream data listener to be notified on each batch
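
One possible shape for these listeners is a small event hierarchy plus a registry that fans events out to subscribers. This is only a sketch: `FlowEvent`, `FlowListener`, `ListenerBus` and the event case classes are hypothetical names, not part of the ignition code base.

```scala
// Hypothetical listener API; none of these names come from ignition itself.
object ListenerSketch {
  // Events matching the three notifications listed above.
  sealed trait FlowEvent
  case class StepComputed(stepId: String) extends FlowEvent
  case class FlowStarted(flowName: String) extends FlowEvent
  case class FlowStopped(flowName: String) extends FlowEvent
  case class BatchProcessed(stepId: String, rowCount: Long) extends FlowEvent

  trait FlowListener {
    def onEvent(event: FlowEvent): Unit
  }

  // Minimal registry: listeners are notified in registration order.
  class ListenerBus {
    private var listeners = Vector.empty[FlowListener]

    def register(listener: FlowListener): Unit = synchronized {
      listeners = listeners :+ listener
    }

    def publish(event: FlowEvent): Unit = {
      // Take a snapshot so a concurrent register() cannot affect this pass.
      val snapshot = synchronized(listeners)
      snapshot.foreach(_.onEvent(event))
    }
  }
}
```

A single event type keeps the registry small; separate traits per listener kind would work equally well if callers only care about one category.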

Implement frame JDBC Output step

Need to implement a step for writing data into a JDBC-compliant database.

Add universal package and assembly options to the build script

Need to provide both options for further deployment:

Building fat jar with all dependencies included
Building distribution package with all dependencies in lib folder

Implement stream UpdateState function as a Merger construct

The UpdateState function can be exposed as a Merger step for 2 inputs:

the first argument is a DataFrame wrapper around Seq[Row] - the input data
the second argument is the optional state as a DataFrame (0 rows corresponds to None)
the output is the result state as a DataFrame (0 rows corresponds to None)
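
The mapping between the two shapes can be sketched without Spark. The example below assumes the UpdateState function is the one passed to Spark Streaming's `updateStateByKey`, whose shape is `(Seq[V], Option[S]) => Option[S]`. `Row` and `Frame` are plain-collection stand-ins for Spark's `Row` and `DataFrame`, and `asUpdateFunc` is a hypothetical helper name.

```scala
// Plain-Scala sketch; Row and Frame stand in for Spark's Row and DataFrame.
object UpdateStateSketch {
  type Row = Seq[Any]
  type Frame = Seq[Row]

  // "0 rows corresponds to None", in both directions.
  def toFrame(state: Option[Frame]): Frame = state.getOrElse(Seq.empty)
  def toOption(frame: Frame): Option[Frame] =
    if (frame.isEmpty) None else Some(frame)

  // Lifts a two-input merger (input, state) => newState into the
  // (Seq[Row], Option[State]) => Option[State] shape.
  def asUpdateFunc(merge: (Frame, Frame) => Frame): (Seq[Row], Option[Frame]) => Option[Frame] =
    (rows, state) => toOption(merge(rows, toFrame(state)))
}
```

With this convention, a merger that returns an empty frame drops the state for that key, which mirrors returning `None` from the update function.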

Rewrite stream test harness using Spark listeners instead of timeouts

Rework artifact into multiple artifacts + all

Currently there's only one artifact, ignition.jar.
To make it more flexible, need to refactor that into multiple jar files:

ignition-core
ignition-db
ignition-dsa
etc.

Plus one fat ignition-all.jar

Implement step cache, to avoid recomputing outputs

Currently, each output value is recomputed every time the output is accessed. Need to implement an internal step cache to avoid that, plus a reset operation that clears the cached value and forces recomputation.
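
A minimal sketch of the cache-and-reset behaviour, assuming each step output can be modelled as a zero-argument function. `CachedOutput` is a hypothetical name, not an ignition class.

```scala
// Hypothetical helper; not taken from the ignition code base.
object StepCacheSketch {
  // Holds the value of one step output. `compute` runs at most once
  // between resets, no matter how often `value` is read.
  class CachedOutput[A](compute: () => A) {
    private var cached: Option[A] = None

    def value: A = synchronized {
      cached.getOrElse {
        val v = compute()
        cached = Some(v)
        v
      }
    }

    // Drops the cached value so the next read recomputes it.
    def reset(): Unit = synchronized { cached = None }
  }
}
```

A step with upstream inputs would probably also need to propagate `reset()` to its downstream steps, since their cached values depend on this one.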

Fix Date functions in stream.Filter

Date functions are not working because of the serialization to string.

Create integration tests

Need to create IT configuration and move the appropriate unit tests there or create new ones:

RestClient
Cassandra
Mongo
Kafka

Fix stream.Filter and stream.FilterSpec after rework

Combine CsvFileInput and TextFileInput

Combine the two steps into one and extend its functionality to provide the following:

Row separation strategy
- newline - use textFile
- regex - use Source.fromFile, then split
- none - use Source.fromFile
Column separation strategy
- regex - use split on each row
- fixed - use take(), etc.
- none - whole row
Column names/types
- none - use COL0, COL1, etc. with String type
- schema - validate and apply conversion
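
The three choices above could be modelled as small sealed hierarchies. The sketch below works on an in-memory string rather than on `textFile` or `Source.fromFile`, and every name in it (`RowSplit`, `ColumnSplit`, `splitRows`, `splitColumns`, `defaultNames`) is hypothetical.

```scala
// Illustrative only; operates on a String instead of a file or RDD.
object TextInputSketch {
  sealed trait RowSplit
  case object ByNewline extends RowSplit
  case class ByRowRegex(pattern: String) extends RowSplit
  case object WholeFile extends RowSplit

  sealed trait ColumnSplit
  case class ByColumnRegex(pattern: String) extends ColumnSplit
  case class FixedWidth(widths: Seq[Int]) extends ColumnSplit
  case object WholeRow extends ColumnSplit

  def splitRows(text: String, strategy: RowSplit): Seq[String] = strategy match {
    case ByNewline     => text.split("\r?\n").toSeq
    case ByRowRegex(p) => text.split(p).toSeq
    case WholeFile     => Seq(text)
  }

  def splitColumns(row: String, strategy: ColumnSplit): Seq[String] = strategy match {
    // limit -1 keeps trailing empty columns
    case ByColumnRegex(p) => row.split(p, -1).toSeq
    case FixedWidth(widths) =>
      // scanLeft yields the start offset of each field
      val starts = widths.scanLeft(0)(_ + _)
      starts.zip(widths).map { case (start, width) => row.slice(start, start + width) }
    case WholeRow => Seq(row)
  }

  // Default column names when no schema is supplied: COL0, COL1, ...
  def defaultNames(count: Int): Seq[String] = (0 until count).map(i => s"COL$i")
}
```

Schema validation and type conversion are left out; with no schema, every column would stay a String under the default names.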

Implement Cache step

Implement a step to provide caching of intermediate results

Upgrade dslink-scala-spark library to 0.2.0

Change spark library scope to "provided"

Currently the Spark libraries and their dependencies are added to the distribution. Change their scope to "provided", but still allow them to be added to the classpath at runtime when running the examples.

Fix Cassandra unit tests

Implement stream DSA Output step

Fix stream steps, commented out after the Step hierarchy refactoring

How to use

How can this be used directly?

Implement RangeInput step

Implement a step for generating a set of numeric values in a given range

Implement frame DSA Input step

Implement Repartition step

Implement a step which would allow increasing or decreasing the number of data partitions.

AddFields step is missing in FrameStepFactory

Need to add AddFields step to the mix

It is not possible to run data flows and stream flows from the same VM

frame.Main and stream.Main each initialize their own context; need to refactor that so both use a single SparkContext.

Fix IN operator for stream.Filter

Operator IN does not work for the stream filter, because the filter cannot parse it from the String. IN needs to be added as a UDF.

Formula schema calculation needs to be fixed

Currently, Formula will fail if the input step does not have any rows, because of the way the schema gets computed. This needs to be fixed.

Make XML and JSON tag names consistent

There's inconsistency in naming various elements of step representations: "group-by" vs "groupBy", "columns" vs "fields" etc. Need to make it consistent throughout the app. Also, think of making the tags shorter (like "csv-input" vs "csv-file-input", "debug" vs "debug-output" etc.)