pluto's People

Contributors

bellatoris, bgchun, bokyoungbin, differentsc, junggil, kyungissac, pigbug419, sanha, seojangho, taegeonum, taehunkim


pluto's Issues

WordCounter on Spark Streaming

We have already investigated the limitations of micro-streams in current stream systems such as Storm and REEF.

We found that we also have to perform a similar investigation on Apache Spark Streaming.

We will implement WordGenerator and WordAggregator, and measure the maximum number of queries that can be processed at the same time.

Explore existing streaming API

Before we design our user API, we need to fully understand the APIs of state-of-the-art stream processing systems.

To begin with, we decided to explore the user API of Apache Flink and Spark Streaming. This exploration should begin after we get sufficient background on stream processing frameworks.

API: Support KAFKA for data source

Messaging systems (like Kafka, Kinesis, MQTT, ...) are widely used as data sources in stream processing. We are planning to support Kafka first and make the others available in the future.

Should be addressed with #11

Task: Implement basic stream operators

Mist should support various streaming operations, such as filter, flatMap, map, and reduce.
In MistTask, these operations are represented as operators.
As a stepping stone, we need to implement basic stream operators: MapOperator, FilterOperator, and so on.
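As a rough sketch of what such operators might look like (the `Operator` interface and class shapes here are illustrative assumptions, not MIST's actual classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical operator interface: each operator consumes one input
// and emits zero or more outputs downstream.
interface Operator<I, O> {
  List<O> process(I input);
}

// MapOperator applies a user-supplied function to every input.
class MapOperator<I, O> implements Operator<I, O> {
  private final Function<I, O> func;
  MapOperator(final Function<I, O> func) { this.func = func; }
  public List<O> process(final I input) {
    final List<O> out = new ArrayList<>();
    out.add(func.apply(input));
    return out;
  }
}

// FilterOperator forwards only inputs that satisfy a predicate.
class FilterOperator<I> implements Operator<I, I> {
  private final Predicate<I> pred;
  FilterOperator(final Predicate<I> pred) { this.pred = pred; }
  public List<I> process(final I input) {
    final List<I> out = new ArrayList<>();
    if (pred.test(input)) { out.add(input); }
    return out;
  }
}
```

Returning a list (rather than a single value) lets the same interface cover flatMap- and filter-style operators that emit zero or many outputs per input.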

Add checkstyle to mist

To automatically check code readability and formatting, we need to add checkstyle to mist.

Implement internal query interface

We need to implement an internal operation representation inside the MIST query. Specifically, we need to define how a single data item is processed inside MIST.

We should have a clear definition of ...

  • How is stream input data represented inside the query? (#15)
  • How is input data transformed? (#16)
  • How is state changed by input data? Which window should be used for managing the state? (#17)

Each will be addressed in a separate issue.

Implement simple word generator-aggregator model on REEF

As the first step of the MIST project, we need to implement a simple word generator-aggregator stream processing application on REEF. This issue is important for the two reasons below.

  • We can measure the number of streaming queries that can be processed in an environment without ZooKeeper.
  • We can practice development in the REEF environment (especially Tang and Wake) and get used to useful tools for building MIST, such as RemoteManager and NetworkService. We can also think about how to provide a simple API that hides REEF's complexities.

Implement a basic physical planner which supports a DAG operation

The physical planner converts the logical plan into a physical plan according to 1) the structure of the logical plan (DAG), 2) element sharing, and 3) parallelism.

First, we consider 1).
We need to implement a basic physical planner that simply converts the logical plan into a physical plan.
The planner should decide which elements are executed in synchronous or asynchronous stages.
If the elements run linearly, the planner allocates them to synchronous stages (in the same thread).
If there are branches in the logical plan, the planner allocates asynchronous stages to the downstream elements (in different threads).
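The staging rule above could be sketched as follows. This is one possible interpretation, assuming the DAG is given as an adjacency list; `StagePlanner` and `assignStages` are hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assigns a stage id to each vertex of the DAG: a vertex stays in its
// parent's (synchronous) stage when it is the only child; a branch
// opens a new (asynchronous) stage per downstream element.
final class StagePlanner {
  static Map<String, Integer> assignStages(final String root,
                                           final Map<String, List<String>> dag) {
    final Map<String, Integer> stageOf = new HashMap<>();
    int nextStage = 0;
    final Deque<String> work = new ArrayDeque<>();
    stageOf.put(root, nextStage++);
    work.push(root);
    while (!work.isEmpty()) {
      final String v = work.pop();
      final List<String> children = dag.getOrDefault(v, List.of());
      for (final String c : children) {
        if (!stageOf.containsKey(c)) {
          // Linear chain: inherit the parent's stage (same thread).
          // Branch: start a new asynchronous stage (different thread).
          stageOf.put(c, children.size() == 1 ? stageOf.get(v) : nextStage++);
          work.push(c);
        }
      }
    }
    return stageOf;
  }
}
```

Elements sharing a stage id would then be chained in one thread, while each new stage would be backed by its own executor.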

Simple word counter with one task

As the first step of Issue #1, I will implement a simple word counter with one task.

The task performs both word generation and aggregation.

I'm going to implement it, starting from the HelloREEF example application.

Leverage disk for storing query states

To process multiple queries on one machine, we need to store the states of those queries. However, memory may not be sufficient to hold all the states. This can happen with queries such as online ML, which need to store many parameters.

The basic approach is to store a query's state on disk while the query is inactive, and reload it into memory for processing when data comes in. However, reading from disk is slow, so processing time can increase. Because of that, we need to address the points below.

  • Which data should stay in memory? Which criteria (query SLA, data arrival frequency, ...) should we use?
  • Can we apply batched processing in some cases? I.e., stack multiple data items and update the state in one go while the state is in memory. This reduces the number of disk reads/writes but delays state updates.
  • Can we predict the data arrival time and apply pre-fetching for some queries?

Task: Profiling physical operators

We need to measure each physical operator's

  1. resource usage
  2. input/output flow rate

to figure out the behavior of operators and user queries.
Perhaps this can be used for 1) load balancing and 2) auto-parallelism.

Implement on-demand query state loading

As the first step of Issue #7, we will implement the on-demand query state loading feature in MIST. With this feature:

  • Query state will be stored on local disk.
  • When data arrives, the state will be loaded into memory, and the updated state will be saved back to disk.

We'll measure the performance of this approach and look for ways to improve it.
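A minimal sketch of this load-update-store cycle, assuming a simple numeric state persisted as a text file per query (the `DiskStateStore` class and file layout are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// On-demand state loading: the state lives on disk, is read into
// memory when data arrives, updated, and written back immediately.
final class DiskStateStore {
  private final Path dir;
  DiskStateStore(final Path dir) { this.dir = dir; }

  long process(final String queryId, final long delta) throws IOException {
    final Path file = dir.resolve(queryId + ".state");
    // Load the state on demand; a missing file means an initial state of 0.
    long state = Files.exists(file)
        ? Long.parseLong(Files.readString(file).trim()) : 0L;
    state += delta;                                   // update with incoming data
    Files.writeString(file, Long.toString(state));    // persist back to disk
    return state;
  }
}
```

The synchronous write-back after every update is the simplest correct policy; measuring its cost is exactly what would motivate the batching and pre-fetching ideas from Issue #7.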

Review necessary papers

Because we need background on stream processing, we should review some important papers in the area. The papers I consider important are listed below.

Taehun and I will review these papers shortly and have a short discussion on them. This issue will remain open in case we need more papers to read.

Please comment on this issue if you come up with more papers worth reading.

Implement internal query state transformation interface

After data is transformed, the internal query state should be updated with the newly arrived data. For that, we should provide an interface for

(1) defining the internal state
(2) an update function taking the old state and the input data
(3) defining window information (window size & slide interval)
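The three pieces above might fit together as follows; `StatefulOp` and `WindowSpec` are hypothetical names for this sketch, not an actual MIST API:

```java
import java.util.function.BiFunction;

// (3) window information: size and slide interval, in milliseconds.
final class WindowSpec {
  final long sizeMs, slideMs;
  WindowSpec(final long sizeMs, final long slideMs) {
    this.sizeMs = sizeMs;
    this.slideMs = slideMs;
  }
}

// A stateful operation combining (1) an initial internal state,
// (2) an update function (old state + input -> new state), and
// (3) the window spec used for managing the state.
final class StatefulOp<S, I> {
  private S state;
  private final BiFunction<S, I, S> update;
  final WindowSpec window;

  StatefulOp(final S init, final BiFunction<S, I, S> update, final WindowSpec window) {
    this.state = init;
    this.update = update;
    this.window = window;
  }

  // Apply the user's update function when new data arrives.
  S onData(final I input) {
    state = update.apply(state, input);
    return state;
  }
}
```

For example, a running sum would be `new StatefulOp<>(0, Integer::sum, new WindowSpec(1000, 500))`.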

API: Implement generic interface for outputting result

We need a generic interface for outputting the query result. Two questions arise:

  • How do we represent the result?
  • Where do we output the result?

By separating these two concerns, we gain more flexibility in defining output methods.

Query-evaluator scheduling

In addition to scheduling within an Evaluator, we need to assign each streaming query to an appropriate evaluator.

This scheduling can be tricky, so we can start with a simple approach (such as random assignment or round-robin) and measure the performance. After that, we can improve it by leveraging performance metrics (CPU, memory, ...) for query assignment.

Define input data stream inside the internal query

We need to define a way for the input data stream to be represented inside the internal query interface.

I think we can use a simple tuple interface in which each field's name is defined. We do not have to define a specific key here.
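A minimal named-field tuple along these lines could look like this (an illustrative sketch, not MIST's actual class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simple tuple: fields are stored and looked up by name, with no
// designated key field, matching the proposal above.
final class Tuple {
  private final Map<String, Object> fields = new LinkedHashMap<>();

  Tuple set(final String name, final Object value) {
    fields.put(name, value);
    return this; // allow chained construction
  }

  Object get(final String name) {
    return fields.get(name);
  }
}
```

Usage: `new Tuple().set("word", "mist").set("count", 1)`. A typed schema could later replace the raw `Object` values if needed.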

Implement low-level API

We need a low-level representation of the MIST query. This should be done before Issue #5.

I don't think we need a data-flow representation as Storm has, because we assume each query is processed on one machine. So I think we can focus on the issues below when designing the internal APIs.

  1. How can we fetch the data?
  2. How can we transform the query state for the given data?
  3. How can we output the result?

Task: Implement a simple round-robin allocator

MistTask allocates OperatorChains to executors.
The executors then run and activate their dedicated OperatorChains when inputs arrive.

As the first step of the above logic, we will implement a simple round-robin allocator
that assigns OperatorChains to executors in a round-robin manner.
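The round-robin assignment could be sketched as below, with executors modeled as plain lists of assigned chains (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Hands each incoming OperatorChain to the next executor in cyclic
// order. C stands in for the OperatorChain type.
final class RoundRobinAllocator<C> {
  private final List<List<C>> executors;
  private int next = 0;

  RoundRobinAllocator(final int numExecutors) {
    executors = new ArrayList<>();
    for (int i = 0; i < numExecutors; i++) {
      executors.add(new ArrayList<>());
    }
  }

  // Assign the chain and return the index of the chosen executor.
  int allocate(final C chain) {
    final int idx = next;
    executors.get(idx).add(chain);
    next = (next + 1) % executors.size();
    return idx;
  }
}
```

Round-robin ignores chain cost, so a later allocator could use the profiling metrics from the operator-profiling task to balance load instead.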

API: Support user-defined functions

To support operations that cannot be expressed via the basic operators, we need to provide an interface for defining UDFs in the MIST API.

SSM: Read/Write states

Implement getState() and setState() to allow query elements to read/write state from/to the SSM DB.
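An in-memory stand-in for that read/write path might look like this (the real SSM would back the map with its DB; this sketch only fixes the shape of the two calls):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal SSM facade: states are keyed by query element id.
// A disk/DB-backed implementation would replace the HashMap.
final class Ssm {
  private final Map<String, Object> store = new HashMap<>();

  void setState(final String elementId, final Object state) {
    store.put(elementId, state);
  }

  Object getState(final String elementId) {
    return store.get(elementId);
  }
}
```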

Support multiple queries in one Evaluator

To increase the number of queries that can be processed on one machine, we need to unify the runtime environment for many queries, reducing the overhead of maintaining a large number of them. Currently, a REEF Evaluator is a separate process in one container allocated by the RM, so we need to process multiple queries in a single Evaluator.

In particular, we need to address the issues below.

  • Put multiple queries into one Task via the user API (Issue #5)
  • Separate each query's processing within one REEF Evaluator and schedule the queries inside the Evaluator
  • Leverage disk space to store a large number of query states

Implement input data transformation interface

The internal query should have concrete information about how data is transformed. We will address this issue by providing an interface for (1) setting the input data from Issue #11, (2) registering arbitrary transformation functions to the query, and (3) chaining those functions in order. In one query, multiple chains could be used, e.g., in the case of a join operation.

For (2), we could leverage lambda functions to keep the interface simple.
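Points (2) and (3) might combine as follows, with transforms registered as lambdas and applied in chaining order (`TransformChain` is a hypothetical name for this sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Arbitrary transform functions are registered as lambdas and applied
// in the order they were chained.
final class TransformChain<T> {
  private final List<Function<T, T>> stages = new ArrayList<>();

  TransformChain<T> then(final Function<T, T> f) {
    stages.add(f);
    return this;
  }

  T apply(T input) {
    for (final Function<T, T> f : stages) {
      input = f.apply(input);
    }
    return input;
  }
}
```

Usage: `new TransformChain<String>().then(String::trim).then(String::toUpperCase)`. A join would hold two such chains, one per input stream.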

Shorten the gap of generating words

The basic experiments show that the most important bottleneck for micro-streams in the current REEF system is memory.

However, CPU can also become a bottleneck if queries require more computation.

In this experiment, we examine this by shortening the word-generation interval from 1 second to lower values.

SSM: Putting queries into the SSM

Currently, we have only thought about putting query elements' states into the SSM;
we should also consider putting the Query itself into the SSM (for query statistics, etc.).

API: Make basic MISTStream interface

We need to make a basic MISTStream interface to represent user-created data streams. The basic MISTStream interfaces are:

  • SourceStream
  • OperatorStream

For these streams, the MIST system needs to know the type and other necessary information, so MISTStream and its derived classes should have methods exposing that information.
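One possible shape of that hierarchy is sketched below. This is an assumption about the design, not MIST's actual API; in particular, the transform function passed to `map` would be recorded in the logical plan, which is omitted here:

```java
import java.util.function.Function;

// Every stream can report its type so the system can inspect it.
interface MISTStream<T> {
  String getStreamType(); // e.g. "source" or "operator"
}

// A stream fed directly by an external source.
final class SourceStream<T> implements MISTStream<T> {
  public String getStreamType() { return "source"; }

  // Chaining an operator yields an OperatorStream whose upstream is this.
  <O> OperatorStream<O> map(final Function<T, O> f) {
    return new OperatorStream<>(this);
  }
}

// A stream produced by applying an operator to an upstream stream.
final class OperatorStream<T> implements MISTStream<T> {
  private final MISTStream<?> upstream;
  OperatorStream(final MISTStream<?> upstream) { this.upstream = upstream; }
  public String getStreamType() { return "operator"; }
  MISTStream<?> getUpstream() { return upstream; }
}
```

Keeping the upstream reference in each OperatorStream is what lets the system later walk the chain to build a logical plan.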

SSM: Memory caching

The SSM should control which states go to disk and which remain in memory.
We need to decide what kind of caching strategy to use.

Design user API

We need a good user API for the system. In particular, we need to hide MIST's internal implementation, including the REEF components. I think we should also hide the data-flow model underlying streaming queries and provide a data-centric interface for users.

To be more specific, we need to provide the three features below via the MIST API.

  • Data fetching from diverse sources (e.g. HDFS, KAFKA, ...)
  • Data transformation within queries. It should be explicit and general enough to support various operations.
  • Convenient output of query results

Issue #3 should be addressed before this issue.

Task: Convert Logical plan to physical plan

A logical plan is represented as a data flow, which is generated by the user-level API.
A physical plan is represented by actual mist operators, in which the degree of parallelism is set.

We should convert a logical plan into a physical plan in the mist task.

Task: Implement mist executor

As a first step, we need to implement a simple mist executor whose scheduling policy is FIFO.
One executor consists of a queue, a thread, and a scheduler. The thread fetches tasks from the queue and processes them.

We will make the scheduling policy pluggable so that the policy can be changed easily.

Designing an optimal scheduling policy is future work.
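The queue-plus-thread design above could be sketched as follows (a minimal sketch; here the FIFO policy is encoded by the BlockingQueue itself, and a pluggable scheduler would replace that ordering):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One executor: a FIFO queue plus a single worker thread that drains
// and runs submitted tasks in arrival order.
final class FifoExecutor {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread worker;

  FifoExecutor() {
    worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          queue.take().run(); // fetch and process the next task
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // exit on shutdown
      }
    });
    worker.start();
  }

  void submit(final Runnable task) {
    queue.add(task);
  }

  void shutdown() {
    worker.interrupt();
  }
}
```

Swapping `LinkedBlockingQueue` for a priority queue is one way the pluggable-policy requirement could later be met without touching the worker loop.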

API: Support NCS for data source

Our API needs to support network input connections for fetching data. For the network implementation, we can leverage the NCS implementation from Wake.

Should be addressed with issue #11

Task: Support auto-parallelism in operators

We need to think about how to parallelize operators. There are two approaches:

  1. Statically decide the degree of parallelism when generating physical plans, using cost-based optimization(?).
  2. Dynamically change the degree of parallelism by monitoring the operator stages and detecting bottlenecks.

This topic needs discussion.

As a first step, we can let users explicitly configure operator parallelism before implementing auto-parallelism.

API: Make NCS source configuration builder for providing stream source configuration

We need a source configuration builder that lets users provide configurations for stream sources.

It will store its configuration in the form of a key-value map, and each type of source will have different setter methods according to its type.

Note that this is unrelated to Tang's Configuration and ConfigurationBuilder.
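A sketch of such a builder, assuming NCS sources need a host and port (the class name, setters, and key strings are illustrative; a Kafka-style builder would expose different setters over the same key-value map):

```java
import java.util.HashMap;
import java.util.Map;

// Collects NCS source options into a key-value map via type-specific
// setters. Unrelated to Tang's ConfigurationBuilder.
final class NcsSourceConfBuilder {
  private final Map<String, String> conf = new HashMap<>();

  NcsSourceConfBuilder setHost(final String host) {
    conf.put("source.ncs.host", host);
    return this;
  }

  NcsSourceConfBuilder setPort(final int port) {
    conf.put("source.ncs.port", Integer.toString(port));
    return this;
  }

  // Returns a defensive copy so the built configuration is immutable
  // from the builder's point of view.
  Map<String, String> build() {
    return new HashMap<>(conf);
  }
}
```

Usage: `new NcsSourceConfBuilder().setHost("localhost").setPort(7777).build()`.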
