snuspl / pluto
MIST: High-performance IoT Stream Processing
License: Apache License 2.0
After implementing #42, we need to develop an improved allocation policy to reduce data processing latency.
This needs a discussion.
We already investigated the limitations of micro-streams in current stream systems such as Storm and REEF.
We found that we also have to perform a similar investigation on Apache Spark Streaming.
We will implement WordGenerator and WordAggregator, and measure the limit on the number of queries processed at the same time.
Before we design our user API, we need to fully understand the APIs of state-of-the-art stream processing systems.
To begin with, we decided to explore the user APIs of Apache Flink and Spark Streaming. This exploration should begin after we gain sufficient background on stream processing frameworks.
As a start-up task, we should add a Maven configuration and a .gitignore file for the project.
We need a basic REEF driver for MIST. Right now, the MIST driver will just interconnect the MIST Client and the MIST Task.
Messaging systems (like Kafka, Kinesis, MQTT, ...) are widely used as data sources in stream processing. We are planning to support Kafka first and make others available in the future.
Should be addressed with #11
MISTExecutionEnvironment is responsible for controlling user queries. We first need to implement the user query submission part, including Avro serialization & query verification.
Our API should be able to fetch data from multiple sources (network, HDFS, Kafka, ...), so we need to implement a generic interface for configuring those data sources.
MIST should support various streaming operations, such as filter, flatMap, map, and reduce.
In MistTask, the operations are represented as operators.
As a stepping stone, we need to implement basic stream operators: MapOperator, FilterOperator, and so on.
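As a rough illustration, such basic operators could look like the following minimal sketch. The `Operator` interface, its `onNext` signature, and the downstream `Consumer` are hypothetical assumptions for this sketch, not the actual MIST API:

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical operator interface: an operator consumes one input element
// and emits zero or more outputs to its downstream consumer.
interface Operator<I, O> {
    void onNext(I input, Consumer<O> downstream);
}

// MapOperator applies a user function to every input element.
final class MapOperator<I, O> implements Operator<I, O> {
    private final Function<I, O> mapFunc;

    MapOperator(final Function<I, O> mapFunc) {
        this.mapFunc = mapFunc;
    }

    @Override
    public void onNext(final I input, final Consumer<O> downstream) {
        downstream.accept(mapFunc.apply(input));
    }
}

// FilterOperator forwards only the elements that satisfy the predicate.
final class FilterOperator<I> implements Operator<I, I> {
    private final Predicate<I> predicate;

    FilterOperator(final Predicate<I> predicate) {
        this.predicate = predicate;
    }

    @Override
    public void onNext(final I input, final Consumer<I> downstream) {
        if (predicate.test(input)) {
            downstream.accept(input);
        }
    }
}
```

flatMap and reduce would follow the same shape, emitting several outputs or folding into a state, respectively.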
For automatic checks of code readability and formatting, we need to add Checkstyle to MIST.
We need to implement an internal operation representation inside the MIST query. Specifically, we need to define a procedure of how a single data would be processed inside MIST.
We should have a clear definition of ...
Each would be addressed in a separate issue.
As the first step of MIST project, we need to implement a simple word generator-aggregator stream processing application on REEF. This issue is important for two reasons below.
The physical planner converts the logical plan into a physical plan according to 1) the structure of the logical plan (DAG), 2) element sharing, and 3) parallelism.
First, we consider 1).
We need to implement a basic physical planner which just converts the logical plan into a physical plan.
The planner should decide which elements are executed in synchronous or asynchronous stages.
If the elements are performed linearly, the planner just allocates the elements into synchronous stages (in the same thread).
If there are branches in the logical plan, the planner allocates asynchronous stages to the downstream elements (in different threads).
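The stage-assignment rule above can be sketched roughly as follows. Class and method names here are illustrative, not actual MIST code; the DAG is simplified to a string-keyed adjacency map:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the rule described above: elements on a linear chain share a
// synchronous stage (same thread); each branch in the DAG starts new
// asynchronous stages (different threads) for the downstream elements.
final class SimplePhysicalPlanner {

    // adjacency: element -> its downstream elements in the logical DAG.
    static Map<String, Integer> assignStages(final String root,
                                             final Map<String, List<String>> adjacency) {
        final Map<String, Integer> stageOf = new HashMap<>();
        final int[] nextStage = {0};
        assign(root, nextStage[0]++, adjacency, stageOf, nextStage);
        return stageOf;
    }

    private static void assign(final String elem, final int stage,
                               final Map<String, List<String>> adj,
                               final Map<String, Integer> stageOf,
                               final int[] nextStage) {
        stageOf.put(elem, stage);
        final List<String> downs = adj.getOrDefault(elem, Collections.emptyList());
        if (downs.size() == 1) {
            // Linear: keep the downstream element in the same synchronous stage.
            assign(downs.get(0), stage, adj, stageOf, nextStage);
        } else {
            // Branch (or sink): each downstream element gets a fresh asynchronous stage.
            for (final String d : downs) {
                assign(d, nextStage[0]++, adj, stageOf, nextStage);
            }
        }
    }
}
```

For a plan `src -> map -> {f1, f2}`, this assigns `src` and `map` to the same stage while `f1` and `f2` each get their own stage.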
As the first step of issue #1, I will implement a simple word counter with one task.
The task performs both word generating and aggregating.
I'm going to implement it, starting with the HelloREEF example REEF application.
To process multiple queries on one machine, we need to store the states of those queries. However, memory space may not be enough to store all the states. This can happen with queries like online ML (which need to store many parameters).
The basic approach is to store a query's state on disk when the query is inactive, and reload it into memory for processing when its data comes in. However, reading data from disk can be slow, so processing time can increase. Because of that, we need to address the things below.
We need to measure physical operator's
As a first step of issue #7, we will implement the on-demand query state loading feature in MIST. With this feature,
we'll measure the performance of this approach and find ways to improve it.
Some Java objects could be too complicated to be represented as a simple string, so we should provide a more elaborate way of serializing them. We can use Apache Avro for this.
Currently, the WordCounter app contains WordGeneratorTask and WordAggregatorTask.
I'll separate the two tasks into two different applications to make them run on different machines.
Because we need some background on stream processing, we need to review some important papers on the topic. The papers I consider important are listed below.
Taehun and I will review those papers shortly and have a short discussion on them. This issue will remain open in case we need more papers to read.
Please comment on this issue if you come up with more papers worth reading.
After data is transformed, the internal query state should be updated by the newly input data. For that, we should provide an interface for
(1) defining the internal state
(2) an updating function taking the old state and input data
(3) defining window information (window size & slide interval)
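A minimal sketch of the three pieces above might look like this. The class name, field names, and the `apply` helper are hypothetical, not actual MIST interfaces:

```java
import java.util.function.BiFunction;

// Hypothetical spec bundling (1) an initial internal state, (2) an update
// function combining the old state with an input, and (3) window information.
final class WindowedStateSpec<S, I> {
    final S initialState;                 // (1) internal state definition
    final BiFunction<S, I, S> updateFunc; // (2) (old state, input) -> new state
    final long windowSizeMs;              // (3) window size
    final long slideIntervalMs;           // (3) slide interval

    WindowedStateSpec(final S initialState,
                      final BiFunction<S, I, S> updateFunc,
                      final long windowSizeMs,
                      final long slideIntervalMs) {
        this.initialState = initialState;
        this.updateFunc = updateFunc;
        this.windowSizeMs = windowSizeMs;
        this.slideIntervalMs = slideIntervalMs;
    }

    // Fold a batch of inputs into the state using the update function.
    S apply(final Iterable<I> inputs) {
        S state = initialState;
        for (final I input : inputs) {
            state = updateFunc.apply(state, input);
        }
        return state;
    }
}
```

For example, a running count over a window could be declared as `new WindowedStateSpec<>(0, Integer::sum, 1000, 500)`.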
For outputting the result, we need a generic output interface.
By separating those two things, we can have more flexibility in defining the output method.
In addition to scheduling within an Evaluator, we need to assign each streaming query to an appropriate evaluator.
This scheduling can be tricky, so we can start with a simple approach (like random assignment or round-robin) and measure the performance. After that, we can improve performance by leveraging some performance metrics (CPU, memory, ...) for query assignment.
As a basic method of outputting result, we should provide a way to print the result on stdout.
As a simple output method along with stdout printing, we should also support local file output.
We need to define a way for the input data stream to be represented inside the internal query interface.
I think we can use a simple tuple interface with each field's name defined for this. We do not have to define a specific key here.
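The tuple idea above could be sketched as follows. `NamedTuple` and its methods are illustrative only; each field is addressed by name, and no field is designated as a key:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a field-named tuple for representing input stream data.
// Fields keep insertion order; none of them is a designated key.
final class NamedTuple {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // Chainable setter so tuples can be built fluently.
    NamedTuple set(final String name, final Object value) {
        fields.put(name, value);
        return this;
    }

    Object get(final String name) {
        return fields.get(name);
    }
}
```

Usage could look like `new NamedTuple().set("word", "mist").set("count", 2)`.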
We need to have a low-level representation of MIST query. This should be done before issue #5.
I don't think we need to have a data-flow representation as Storm does, because we assumed that each query is processed in one machine. So, I think we can focus on the issues below when designing internal APIs.
We need to make HDFS available as one of the data source options, alongside the other options.
Should be addressed with #11
Current operations, such as filter and map, generate a new List object to create outputs.
This may be inefficient in terms of memory usage. https://github.com/cmssnu/mist/pull/58/files#diff-4514470886a0b95fa75ed1d825e66dddR48
We need to take a look at it at some point.
We need to support saving the output to distributed file systems, like HDFS.
MistTask allocates OperatorChains to executors.
Then, the executors run and activate their dedicated OperatorChains when inputs arrive.
As a first step of the above logic, we will implement a simple round-robin allocator
which dedicates OperatorChains to executors in a round-robin way.
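The round-robin dedication can be sketched in a few lines (the class name is illustrative): the i-th OperatorChain is dedicated to executor `i mod numExecutors`.

```java
// Sketch of a round-robin allocator: chain i goes to executor (i % numExecutors).
final class RoundRobinAllocator {
    static int[] allocate(final int numChains, final int numExecutors) {
        final int[] executorOf = new int[numChains];
        for (int i = 0; i < numChains; i++) {
            executorOf[i] = i % numExecutors;
        }
        return executorOf;
    }
}
```

For 5 chains and 2 executors this yields the assignment [0, 1, 0, 1, 0]; improved policies (e.g. load-aware allocation) can later replace this mapping.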
To support various operations which cannot be expressed via the basic operators, we need to provide an interface for defining UDFs in the MIST API.
To facilitate key-based operations like reduceByKey, we need to support Tuple as well as keyed streams in the MIST API.
Make getState() and setState() allow the query element to read from and write to the SSM DB.
To increase the number of queries that can be processed on one machine, we need to unify the runtime environment for many queries, reducing the overhead of maintaining a large number of queries. Currently, a REEF Evaluator is a separate process in one container from the RM, so we need to process multiple queries in a single Evaluator.
We need to address the particular issues below.
The internal query should have concrete information on how data should be transformed. We will address this issue by providing interfaces for (1) setting the input data from issue #11, (2) registering arbitrary transform functions to the query, and (3) chaining those functions in order. In one query, multiple chains could be used in the case of a join operation.
For (2), we could leverage lambda functions to keep the interface simple.
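Points (2) and (3) above can be sketched as follows: transform functions are registered as lambdas and applied in registration order. The `TransformChain` class and its method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of a chain of transform functions inside a query:
// functions are registered as lambdas and applied in order.
final class TransformChain {
    private final List<Function<Object, Object>> stages = new ArrayList<>();

    // (2) register an arbitrary transform function (here, a lambda).
    TransformChain then(final Function<Object, Object> f) {
        stages.add(f);
        return this;
    }

    // (3) apply the registered functions in registration order.
    Object process(final Object input) {
        Object value = input;
        for (final Function<Object, Object> f : stages) {
            value = f.apply(value);
        }
        return value;
    }
}
```

A query could then register its transforms fluently, e.g. `new TransformChain().then(...).then(...)`; a join would combine the outputs of two such chains.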
The basic experiments show that the most important bottleneck for micro-streams in the current REEF system is the memory limitation.
However, CPU can also become a bottleneck if queries require more computation.
In this experiment, we examine this by shortening the word-generation interval from 1 second downwards.
Add a default test package and related dependencies
Currently, we have only thought about putting query elements' states into the SSM;
we should also consider putting the Query itself into the SSM (for query statistics, etc.).
We need to make a basic MISTStream interface to represent user-created data streams. The basic MISTStream interfaces are
For those streams, the MIST system needs to know the type and other necessary information, so MISTStream and its derived classes should have methods for that information.
Of the DBs, find the best one and install it to test its performance.
The SSM should control which states go into the disk, and which remain in the memory.
We need to decide what kind of caching strategy we will use.
We need a good user API for the system. In particular, we need to hide the internal implementation of MIST, including the REEF components. I think we should also hide the data flow model underlying streaming queries and provide a data-centric interface for users.
To be more specific, we need to provide the three features below via our MIST API.
Issue #3 should be addressed before this issue.
Find appropriate DBs for the SSM and organize them into a single table showing their characteristics.
A logical plan is represented by a data flow, which is generated by the user-level API.
A physical plan is represented by actual MIST operators, in which the degree of parallelism is set.
We should convert a logical plan into a physical plan in the MIST task.
After implementing #41, we need to develop an improved scheduling policy in the executor.
This needs a discussion.
As a first step, we need to implement a simple MIST executor in which the scheduling policy is FIFO.
One executor consists of a queue, a thread, and a scheduler. The thread fetches tasks from the queue and processes them.
We will make the scheduling policy pluggable, in order to change the policy easily.
Designing an optimal scheduling policy is future work.
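The queue/thread/scheduler structure above, with a pluggable FIFO policy, could look roughly like this. All class names are illustrative, not actual MIST code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Pluggable scheduling policy: decides which queued task runs next.
interface Scheduler {
    Runnable pickNext(Queue<Runnable> queue);
}

// FIFO policy: simply take the head of the queue (oldest task first).
final class FifoScheduler implements Scheduler {
    @Override
    public Runnable pickNext(final Queue<Runnable> queue) {
        return queue.poll();
    }
}

// One executor = a queue + a thread + a scheduler.
final class SimpleExecutor {
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private final Scheduler scheduler;

    SimpleExecutor(final Scheduler scheduler) {
        this.scheduler = scheduler;
    }

    void submit(final Runnable task) {
        queue.add(task);
    }

    // Drain the queue on the executor's own thread; the scheduler picks tasks.
    void runAll() {
        final Thread worker = new Thread(() -> {
            Runnable task;
            while ((task = scheduler.pickNext(queue)) != null) {
                task.run();
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Swapping `FifoScheduler` for another `Scheduler` implementation changes the policy without touching the executor, which is the point of making it pluggable.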
Our API needs to support network input connections for fetching data. For the network implementation, we can leverage the NCS implementation from Wake.
Should be addressed with issue #11
We need to think about how to parallelize operators. There are two ways:
First, we can explicitly configure the operator parallelism before doing auto-parallelism.
To provide source configuration, we need a source configuration builder that lets users provide configurations for stream sources.
It will store its configuration in the form of a key-value map, and each different type of source will have different setter methods according to its type.
This is not related to Tang's Configuration and ConfigurationBuilder.
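A minimal sketch of that builder shape, assuming hypothetical class names and configuration keys (the Kafka setters are examples of per-source-type methods):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Base builder: stores the configuration as a key-value map.
// (Unrelated to Tang's ConfigurationBuilder, as noted above.)
class SourceConfigurationBuilder {
    protected final Map<String, String> conf = new HashMap<>();

    Map<String, String> build() {
        return Collections.unmodifiableMap(conf);
    }
}

// A Kafka-specific builder adds setters that make sense only for Kafka sources.
final class KafkaSourceConfigurationBuilder extends SourceConfigurationBuilder {
    KafkaSourceConfigurationBuilder setBrokers(final String brokers) {
        conf.put("kafka.brokers", brokers);
        return this;
    }

    KafkaSourceConfigurationBuilder setTopic(final String topic) {
        conf.put("kafka.topic", topic);
        return this;
    }
}
```

An HDFS or network source builder would extend the same base class with its own setters (path, host/port, etc.) while keeping the common key-value storage.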