
hurricane's People

Contributors

babymastodon, vedharaju, vikasvelagapudi


Forkers

jtwarren

hurricane's Issues

Figure out how to precompile the UDFs

Currently, the "go run" command incurs significant overhead (about 1-2 seconds) each time a UDF is invoked. We need to compile the UDFs into binaries before worker nodes are started. The existing demo workflow files should be updated to point to the compiled binaries rather than the source files.
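Once the UDFs are precompiled, the worker side reduces to launching a binary as a subprocess. A minimal sketch of that invocation path, where `runUDF` is a hypothetical helper and `echo` stands in for a compiled UDF binary:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runUDF launches a precompiled UDF binary as a subprocess and returns its
// stdout. Pointing workers at a compiled binary avoids the 1-2 second
// compile overhead that "go run" pays on every invocation.
func runUDF(binaryPath string, args ...string) (string, error) {
	out, err := exec.Command(binaryPath, args...).Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// "echo" stands in here for a compiled UDF binary.
	out, err := runUDF("echo", "tuple1", "tuple2")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```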

Express RDD partitions in data models and workflow syntax

RDDs are partitioned by a vector index such as (1, 3), which means: partition by the second and fourth fields of the tuple (fields are 0-indexed). Additionally, the number of partition buckets and the number of segments must be specified.

RDDs need to specify:

  1. partition vector index
  2. number of segments (i.e., tasks)
  3. number of partition buckets
  4. a flag indicating whether a task must receive all of the tuples in a single partition per input (e.g., whether it is a reduce task)

There are a few rules:

  1. the number of segments may not exceed the number of partition buckets for an input with the reduce flag set
  2. all inputs with the reduce flag set must have the same number of partition buckets

When the reduce flag is not set, the scheduler is free to arbitrarily shuffle input partitions among the tasks/segments for the next job.
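The bucket assignment described above can be sketched as a hash of the fields selected by the partition vector index. This is an illustrative sketch, not the project's actual implementation; `bucketFor` and the FNV hash choice are assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor computes the partition bucket for a tuple, given the partition
// vector index (e.g., []int{1, 3} selects the second and fourth fields,
// 0-indexed) and the number of partition buckets.
func bucketFor(tuple []string, index []int, numBuckets int) int {
	h := fnv.New32a()
	for _, i := range index {
		h.Write([]byte(tuple[i]))
		h.Write([]byte{0}) // field separator so ("ab","c") != ("a","bc")
	}
	return int(h.Sum32()) % numBuckets
}

func main() {
	tuple := []string{"user42", "click", "page7", "2024-01-01"}
	fmt.Println(bucketFor(tuple, []int{1, 3}, 8))
}
```

Because the hash is deterministic, every task that sees the same tuple assigns it to the same bucket, which is what allows reduce tasks to receive whole partitions.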

Mock input/output brokers

Figure out (and implement) the best way to get test data in and out of the system. This will require writing custom UDFs for reading/recording test data.

Source UDFs: The input data generally comes from a single job defined by a single UDF command. This job will have multiple tasks/segments. The UDF command can take a command-line argument giving the integer index of the task. This index can be used to select what kind of data to generate (in a real deployment, it would correspond to the ID of the Kafka broker to fetch data from). The UDF will also receive command-line arguments for the start time and duration of the batch. Multiple calls to the UDF with the same start time and duration should return exactly the same data (e.g., using a deterministic pseudo-random number generator).
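The determinism requirement above can be met by deriving the PRNG seed from the task index and batch start time. A minimal sketch, assuming a hypothetical `generateBatch` signature and seed scheme:

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateBatch sketches a source UDF: the PRNG is seeded deterministically
// from the task index and batch start time, so repeated calls for the same
// batch return exactly the same tuples (needed for replay after failures).
func generateBatch(taskIndex int, startTime, duration int64, n int) []int64 {
	seed := int64(taskIndex)<<32 ^ startTime // hypothetical seed scheme
	rng := rand.New(rand.NewSource(seed))
	tuples := make([]int64, n)
	for i := range tuples {
		tuples[i] = startTime + rng.Int63n(duration) // event times within the batch
	}
	return tuples
}

func main() {
	fmt.Println(generateBatch(0, 1000, 30, 5))
}
```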

Sink UDFs: Probably just mock UDFs that will discard the data that they receive. However, they could be used to write data to an output file for debugging.

Figure out demo workflows

Describe some example workflows that will be used to test our system. The most important example is a moving sum such as: "count the number of actions per user in the past 30 seconds". This requires a map, reduce, and a windowed sum (dependency on previous RDD).

A workflow syntax file, as well as code for the UDFs, should be created and executed on the system.
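The windowed-sum stage of the example can be sketched as summing per-batch per-user counts (the output of the map and reduce stages) over the last few batches. The function name and data shapes here are assumptions for illustration:

```go
package main

import "fmt"

// windowedSum sums per-batch per-user counts over the last `window`
// batches, e.g. "count the number of actions per user in the past 30
// seconds" with 10-second batches and window == 3.
func windowedSum(batches []map[string]int, window int) map[string]int {
	total := make(map[string]int)
	start := len(batches) - window
	if start < 0 {
		start = 0 // fewer batches than the window so far
	}
	for _, batch := range batches[start:] {
		for user, count := range batch {
			total[user] += count
		}
	}
	return total
}

func main() {
	batches := []map[string]int{
		{"alice": 1},
		{"alice": 2, "bob": 1},
		{"alice": 3},
	}
	fmt.Println(windowedSum(batches, 2))
}
```

Note the dependency structure: each windowed result reads several previous batches, which is exactly the "depend on previous RDD" capability the next issue asks for.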

Add to syntax: RDDs can depend on previous batches

Augment the "WorkflowEdge" model with a "delay" parameter indicating the number of batches to delay the output (delay=0 refers to the current batch). Also add this to the workflow syntax.

A separate pull request describes some further syntax additions for partitioning, so figure out a way to easily attach extra data to Protojobs and edges.

@vedharaju @vikasvelagapudi
