
hurricane's People

Contributors

babymastodon, vedharaju, vikasvelagapudi


Forkers

jtwarren

hurricane's Issues

Figure out how to precompile the UDFs

Currently, the "go run" command incurs significant overhead (about 1-2 seconds) each time a UDF is invoked. We need to compile the UDFs into binaries before worker nodes are started. The existing demo workflow files should be updated to point to the compiled binaries rather than the source files.
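Once the UDFs are precompiled, the worker side reduces to launching a binary as a subprocess. A minimal sketch of that invocation path, where `runUDF` is a hypothetical helper and `echo` stands in for a compiled UDF binary:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runUDF launches a precompiled UDF binary as a subprocess and returns its
// stdout. Pointing workers at a compiled binary avoids the 1-2 second
// compile overhead that "go run" pays on every invocation.
func runUDF(binaryPath string, args ...string) (string, error) {
	out, err := exec.Command(binaryPath, args...).Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// "echo" stands in here for a compiled UDF binary.
	out, err := runUDF("echo", "tuple1", "tuple2")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```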

Express RDD partitions in data models and workflow syntax

RDDs are partitioned by a vector index such as (1, 3), which means: partition by the second and fourth fields of the tuple (fields are 0-indexed). Additionally, the number of partition buckets and the number of segments must be specified.

RDDs need to specify:

  1. partition vector index
  2. number of segments (i.e., tasks)
  3. number of partition buckets
  4. a flag indicating whether a task must receive all of the tuples in a single partition per input (e.g., whether it is a reduce task)

There are a few rules:

  1. the number of segments may not exceed the number of partition buckets for an input with the reduce flag set
  2. all inputs with the reduce flag set must have the same number of partition buckets

When the reduce flag is not set, the scheduler is free to arbitrarily shuffle input partitions among the tasks/segments for the next job.
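The bucket assignment described above can be sketched as a hash of the fields selected by the partition vector index. This is an illustrative sketch, not the project's actual implementation; `bucketFor` and the FNV hash choice are assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor computes the partition bucket for a tuple, given the partition
// vector index (e.g., []int{1, 3} selects the second and fourth fields,
// 0-indexed) and the number of partition buckets.
func bucketFor(tuple []string, index []int, numBuckets int) int {
	h := fnv.New32a()
	for _, i := range index {
		h.Write([]byte(tuple[i]))
		h.Write([]byte{0}) // field separator so ("ab","c") != ("a","bc")
	}
	return int(h.Sum32()) % numBuckets
}

func main() {
	tuple := []string{"user42", "click", "page7", "2024-01-01"}
	fmt.Println(bucketFor(tuple, []int{1, 3}, 8))
}
```

Because the hash is deterministic, every task that sees the same tuple assigns it to the same bucket, which is what allows reduce tasks to receive whole partitions.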

Mock input/output brokers

Figure out (and implement) the best way to get test data in and out of the system. This will require writing custom UDFs for reading/recording test data.

Source UDFs: The input data generally comes from a single job defined by a single UDF command. This job will have multiple tasks/segments. The UDF command can take a command-line argument giving the integer index of the task. This index can be used to select what kind of data to generate (in a real deployment, it would correspond to the ID of the Kafka broker to fetch data from). The UDF will also receive command-line arguments for the start time and duration of the batch. Multiple calls to the UDF with the same start time and duration should return exactly the same data (e.g., using a deterministic pseudo-random number generator).
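The determinism requirement above can be met by deriving the PRNG seed from the task index and batch start time. A minimal sketch, assuming a hypothetical `generateBatch` signature and seed scheme:

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateBatch sketches a source UDF: the PRNG is seeded deterministically
// from the task index and batch start time, so repeated calls for the same
// batch return exactly the same tuples (needed for replay after failures).
func generateBatch(taskIndex int, startTime, duration int64, n int) []int64 {
	seed := int64(taskIndex)<<32 ^ startTime // hypothetical seed scheme
	rng := rand.New(rand.NewSource(seed))
	tuples := make([]int64, n)
	for i := range tuples {
		tuples[i] = startTime + rng.Int63n(duration) // event times within the batch
	}
	return tuples
}

func main() {
	fmt.Println(generateBatch(0, 1000, 30, 5))
}
```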

Sink UDFs: Probably just mock UDFs that will discard the data that they receive. However, they could be used to write data to an output file for debugging.

Figure out demo workflows

Describe some example workflows that will be used to test our system. The most important example is a moving sum such as: "count the number of actions per user in the past 30 seconds". This requires a map, reduce, and a windowed sum (dependency on previous RDD).

A workflow syntax file, as well as code for the UDFs, should be created and executed on the system.
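The windowed-sum stage of the example can be sketched as summing per-batch per-user counts (the output of the map and reduce stages) over the last few batches. The function name and data shapes here are assumptions for illustration:

```go
package main

import "fmt"

// windowedSum sums per-batch per-user counts over the last `window`
// batches, e.g. "count the number of actions per user in the past 30
// seconds" with 10-second batches and window == 3.
func windowedSum(batches []map[string]int, window int) map[string]int {
	total := make(map[string]int)
	start := len(batches) - window
	if start < 0 {
		start = 0 // fewer batches than the window so far
	}
	for _, batch := range batches[start:] {
		for user, count := range batch {
			total[user] += count
		}
	}
	return total
}

func main() {
	batches := []map[string]int{
		{"alice": 1},
		{"alice": 2, "bob": 1},
		{"alice": 3},
	}
	fmt.Println(windowedSum(batches, 2))
}
```

Note the dependency structure: each windowed result reads several previous batches, which is exactly the "depend on previous RDD" capability the next issue asks for.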

Add to syntax: RDDs can depend on previous batches

Augment the "WorkflowEdge" model with a "delay" parameter indicating the number of batches to delay the output (delay=0 refers to the current batch). Also add this to the workflow syntax.

A separate pull request describes some further syntax additions for partitioning, so figure out a way to easily attach extra data to Protojobs and edges.

@vedharaju @vikasvelagapudi
