vedharaju / hurricane
Hurricane performs chained Map-Reduce jobs in small batches at a high frequency (about once per second).
Tasks:
Add an "index" field to the segment data type, representing the index of the segment within the RDD.
Currently, the "go run" command incurs significant overhead every time each UDF is invoked (about 1-2 seconds). We need to compile the UDFs into binaries before worker nodes are started. The existing demo workflow files should be updated to point to the compiled binaries rather than the source files.
@vedharaju
@hogbait
RDDs are partitioned by a vector index such as (1,3), which means partitioning by the second and fourth fields of the tuple (fields are 0-indexed). The number of partition buckets and the number of segments must also be specified.
Rdds need to specify
There are a few rules:
When the reduce flag is not set, the scheduler is free to arbitrarily shuffle input partitions among the tasks/segments for the next job.
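The vector-index partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the `Tuple` type, the `partitionBucket` function, and the choice of an FNV hash are hypothetical and not taken from the Hurricane code base, and the mapping from buckets to segments is not shown.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Tuple is a hypothetical stand-in for one record in an RDD.
type Tuple []string

// partitionBucket hashes the fields selected by index (0-indexed) and maps
// the hash onto one of numBuckets buckets. The FNV hash is an assumption
// made for this sketch.
func partitionBucket(t Tuple, index []int, numBuckets int) (int, error) {
	if numBuckets <= 0 {
		return 0, fmt.Errorf("numBuckets must be positive, got %d", numBuckets)
	}
	h := fnv.New32a()
	for _, i := range index {
		if i < 0 || i >= len(t) {
			return 0, fmt.Errorf("field %d out of range for tuple of length %d", i, len(t))
		}
		h.Write([]byte(t[i]))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	}
	return int(h.Sum32() % uint32(numBuckets)), nil
}

func main() {
	a := Tuple{"click", "alice", "2015-05-01", "us"}
	b := Tuple{"view", "alice", "2015-05-02", "us"}
	index := []int{1, 3} // partition by the second and fourth fields

	ba, _ := partitionBucket(a, index, 4)
	bb, _ := partitionBucket(b, index, 4)
	fmt.Println(ba == bb) // the key fields match, so both tuples share a bucket
}
```

Tuples that agree on the selected fields always land in the same bucket, which is the property a reduce step relies on.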
Figure out (and implement) the best way to get test data in and out of the system. This will require writing custom UDFs for reading/recording test data.
Source UDFs: The input data generally comes from a single job defined by a single UDF command. This job will have multiple tasks/segments. The UDF command can take a command-line argument that is the integer index of the task. This index can be used to specify what kind of data to generate (in real life, it would correspond to the ID of the Kafka broker to fetch data from). The UDF will also receive command-line arguments corresponding to the start time and duration of the batch. Multiple calls to the UDF with the same start time and duration should return exactly the same data (e.g., by using a deterministic pseudo-random number generator).
Sink UDFs: Probably just mock UDFs that will discard the data that they receive. However, they could be used to write data to an output file for debugging.
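A source UDF with the determinism property described above might look like the sketch below. The argument order (`<task-index> <start> <duration>`), the record format, the seed formula, and the one-record-per-line output on stdout are all assumptions made for illustration; the source does not specify them.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"strconv"
)

// generateBatch returns n pseudo-random records for one task and one batch.
// The generator is seeded only from the arguments, so repeated calls with
// the same arguments return the same records.
func generateBatch(taskIndex int, start, duration int64, n int) []string {
	// Hypothetical seed formula; any deterministic function of the
	// arguments would do.
	seed := int64(taskIndex)*1000003 + start*31 + duration
	rng := rand.New(rand.NewSource(seed))
	records := make([]string, n)
	for i := range records {
		records[i] = fmt.Sprintf("user%d,action%d", rng.Intn(5), rng.Intn(3))
	}
	return records
}

func main() {
	// Assumed argument order: <task-index> <start> <duration>.
	// Defaults are used when no arguments are given, for demonstration only.
	args := []string{"0", "0", "1"}
	if len(os.Args) == 4 {
		args = os.Args[1:]
	}
	task, err1 := strconv.Atoi(args[0])
	start, err2 := strconv.ParseInt(args[1], 10, 64)
	duration, err3 := strconv.ParseInt(args[2], 10, 64)
	if err1 != nil || err2 != nil || err3 != nil {
		fmt.Fprintln(os.Stderr, "usage: source <task-index> <start> <duration>")
		os.Exit(1)
	}
	for _, r := range generateBatch(task, start, duration, 5) {
		fmt.Println(r)
	}
}
```

Because the seed depends only on the task index, start time, and duration, a batch can be regenerated later and will contain exactly the same records.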
Describe some example workflows that will be used to test our system. The most important example is a moving sum such as: "count the number of actions per user in the past 30 seconds". This requires a map, reduce, and a windowed sum (dependency on previous RDD).
A workflow syntax file, as well as code for the UDFs, should be created and executed on the system.
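The moving-sum example ("count the number of actions per user in the past 30 seconds") can be illustrated in plain Go, outside of Hurricane itself. In this sketch, `countActions` plays the role of the map and reduce steps for one batch, and the hypothetical `MovingSum` type plays the role of the windowed sum that depends on earlier batches. With roughly one batch per second, a 30-second window would correspond to about 30 batches; the demo uses a window of 2 to stay short.

```go
package main

import "fmt"

// countActions counts actions per user within a single batch
// (the map + reduce step).
func countActions(users []string) map[string]int {
	counts := make(map[string]int)
	for _, u := range users {
		counts[u]++
	}
	return counts
}

// MovingSum keeps per-user totals over the last `window` batches.
type MovingSum struct {
	window  int
	batches []map[string]int // per-batch counts still inside the window
	totals  map[string]int
}

func NewMovingSum(window int) *MovingSum {
	return &MovingSum{window: window, totals: make(map[string]int)}
}

// Add folds in a new batch and subtracts the batch that just left the
// window. It returns the internal totals map; callers should not modify it.
func (m *MovingSum) Add(batch map[string]int) map[string]int {
	for u, c := range batch {
		m.totals[u] += c
	}
	m.batches = append(m.batches, batch)
	if len(m.batches) > m.window {
		old := m.batches[0]
		m.batches = m.batches[1:]
		for u, c := range old {
			m.totals[u] -= c
			if m.totals[u] == 0 {
				delete(m.totals, u)
			}
		}
	}
	return m.totals
}

func main() {
	ms := NewMovingSum(2)
	ms.Add(countActions([]string{"alice", "bob", "alice"}))
	ms.Add(countActions([]string{"alice"}))
	totals := ms.Add(countActions([]string{"bob"}))
	fmt.Println(totals["alice"], totals["bob"]) // prints "1 1"
}
```

Updating the total incrementally (add the newest batch, subtract the expired one) avoids re-reading every batch in the window on each step.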
Augment the "WorkflowEdge" model with a "delay" parameter indicating the number of batches to delay the output (delay=0 refers to the current batch). Also add this to the workflow syntax.
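One possible reading of the "delay" parameter is sketched below. The struct is a simplified, hypothetical stand-in for the project's actual "WorkflowEdge" model, and the field names, the `InputBatch` method, and the assumption that batches are numbered from 0 are all illustrative.

```go
package main

import "fmt"

// WorkflowEdge is a simplified, hypothetical version of the model.
type WorkflowEdge struct {
	SourceJob string
	DestJob   string
	Delay     int // number of batches to delay the output; 0 means the current batch
}

// InputBatch returns the batch of SourceJob that DestJob reads while
// processing batch `current`. ok is false when that batch does not exist
// yet (assuming batches are numbered from 0).
func (e WorkflowEdge) InputBatch(current int) (batch int, ok bool) {
	b := current - e.Delay
	if b < 0 {
		return 0, false
	}
	return b, true
}

func main() {
	e := WorkflowEdge{SourceJob: "count", DestJob: "window", Delay: 1}
	for current := 0; current < 3; current++ {
		if b, ok := e.InputBatch(current); ok {
			fmt.Printf("batch %d reads batch %d\n", current, b)
		} else {
			fmt.Printf("batch %d has no delayed input yet\n", current)
		}
	}
}
```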
A separate pull request describes some more syntax additions for partitioning, so figure out a way to easily tag more data to Protojobs and edges
Caching layer for performance.
Possibly use map for each structure with ID --> object.
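A minimal sketch of the "map from ID to object" idea follows. The `Segment` type, the `SegmentCache` type, and the loader callback are hypothetical; the source does not say which structures would be cached or where they are loaded from.

```go
package main

import (
	"fmt"
	"sync"
)

// Segment is a hypothetical stand-in for any stored structure with an ID.
type Segment struct {
	ID   int64
	Data string
}

// SegmentCache memoizes lookups by ID in front of a slower loader
// (for example, a database query).
type SegmentCache struct {
	mu   sync.Mutex
	byID map[int64]*Segment
	load func(id int64) (*Segment, error)
}

func NewSegmentCache(load func(id int64) (*Segment, error)) *SegmentCache {
	return &SegmentCache{byID: make(map[int64]*Segment), load: load}
}

// Get returns the cached object for id, calling the loader only on a miss.
// The lock is held during the load, which keeps the sketch simple but
// serializes concurrent misses.
func (c *SegmentCache) Get(id int64) (*Segment, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.byID[id]; ok {
		return s, nil
	}
	s, err := c.load(id)
	if err != nil {
		return nil, err
	}
	c.byID[id] = s
	return s, nil
}

func main() {
	loads := 0
	cache := NewSegmentCache(func(id int64) (*Segment, error) {
		loads++
		return &Segment{ID: id, Data: fmt.Sprintf("segment-%d", id)}, nil
	})
	cache.Get(7)
	cache.Get(7)
	fmt.Println(loads) // prints 1: the second lookup is served from the map
}
```

One such map per structure type would match the suggestion above; invalidation when the underlying object changes is not addressed here.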
@hogbait