snuspl / pluto
MIST: High-performance IoT Stream Processing
License: Apache License 2.0
After implementing #42, we need to develop an improved allocation policy to reduce data processing latency.
This needs a discussion.
We already investigated the limitations of micro-streams in current stream systems such as Storm and REEF.
We found that we also have to perform a similar investigation on Apache Spark Streaming.
We will implement WordGenerator and WordAggregator, and measure the limit on the number of queries processed at the same time.
Before we design our user API, we need to fully understand the APIs of state-of-the-art stream processing systems.
To begin with, we decided to explore the user APIs of Apache Flink and Spark Streaming. This exploration should begin after we gain sufficient background on stream processing frameworks.
As a start-up task, we should add a Maven configuration and a .gitignore file for the project.
We need a basic REEF driver for MIST. Right now, the MIST driver will just interconnect the MIST Client and the MIST Task.
Messaging systems (like Kafka, Kinesis, MQTT, ...) are widely used as data sources in stream processing. We are planning to support Kafka first and make others available in the future.
Should be addressed with #11
MISTExecutionEnvironment is responsible for controlling user queries. We first need to implement the user query submission part, including Avro serialization & query verification.
Our API should be able to fetch data from multiple sources (network, HDFS, Kafka, ...), so we need to implement a generic interface for configuring those data sources.
MIST should support various streaming operations, such as filter, flatMap, map, and reduce.
In MistTask, the operations are represented as operators.
As a stepping stone, we need to implement basic stream operators: MapOperator, FilterOperator, and so on.
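As a rough illustration, such basic operators could look like the following minimal sketch. The `Operator` interface, its `onNext` signature, and the downstream `Consumer` are hypothetical assumptions for this sketch, not the actual MIST API:

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical operator interface: an operator consumes one input element
// and emits zero or more outputs to its downstream consumer.
interface Operator<I, O> {
    void onNext(I input, Consumer<O> downstream);
}

// MapOperator applies a user function to every input element.
final class MapOperator<I, O> implements Operator<I, O> {
    private final Function<I, O> mapFunc;

    MapOperator(final Function<I, O> mapFunc) {
        this.mapFunc = mapFunc;
    }

    @Override
    public void onNext(final I input, final Consumer<O> downstream) {
        downstream.accept(mapFunc.apply(input));
    }
}

// FilterOperator forwards only the elements that satisfy the predicate.
final class FilterOperator<I> implements Operator<I, I> {
    private final Predicate<I> predicate;

    FilterOperator(final Predicate<I> predicate) {
        this.predicate = predicate;
    }

    @Override
    public void onNext(final I input, final Consumer<I> downstream) {
        if (predicate.test(input)) {
            downstream.accept(input);
        }
    }
}
```

flatMap and reduce would follow the same shape, emitting several outputs or folding into a state, respectively.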
For automatic checks of code readability and formatting, we need to add Checkstyle to MIST.
We need to implement an internal operation representation inside the MIST query. Specifically, we need to define a procedure of how a single data would be processed inside MIST.
We should have a clear definition of ...
Each would be addressed in a separate issue.
As the first step of MIST project, we need to implement a simple word generator-aggregator stream processing application on REEF. This issue is important for two reasons below.
The physical planner converts the logical plan into a physical plan according to 1) the structure of the logical plan (DAG), 2) element sharing, and 3) parallelism.
First, we consider 1).
We need to implement a basic physical planner which just converts the logical plan into a physical plan.
The planner should decide which elements are executed in synchronous or asynchronous stages.
If the elements are performed linearly, the planner just allocates the elements into synchronous stages (in the same thread).
If there are branches in the logical plan, the planner allocates asynchronous stages to the downstream elements (in different threads).
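The stage-assignment rule above can be sketched roughly as follows. Class and method names here are illustrative, not actual MIST code; the DAG is simplified to a string-keyed adjacency map:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the rule described above: elements on a linear chain share a
// synchronous stage (same thread); each branch in the DAG starts new
// asynchronous stages (different threads) for the downstream elements.
final class SimplePhysicalPlanner {

    // adjacency: element -> its downstream elements in the logical DAG.
    static Map<String, Integer> assignStages(final String root,
                                             final Map<String, List<String>> adjacency) {
        final Map<String, Integer> stageOf = new HashMap<>();
        final int[] nextStage = {0};
        assign(root, nextStage[0]++, adjacency, stageOf, nextStage);
        return stageOf;
    }

    private static void assign(final String elem, final int stage,
                               final Map<String, List<String>> adj,
                               final Map<String, Integer> stageOf,
                               final int[] nextStage) {
        stageOf.put(elem, stage);
        final List<String> downs = adj.getOrDefault(elem, Collections.emptyList());
        if (downs.size() == 1) {
            // Linear: keep the downstream element in the same synchronous stage.
            assign(downs.get(0), stage, adj, stageOf, nextStage);
        } else {
            // Branch (or sink): each downstream element gets a fresh asynchronous stage.
            for (final String d : downs) {
                assign(d, nextStage[0]++, adj, stageOf, nextStage);
            }
        }
    }
}
```

For a plan `src -> map -> {f1, f2}`, this assigns `src` and `map` to the same stage while `f1` and `f2` each get their own stage.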
As the first step of issue #1, I will implement a simple word counter with one task.
The task performs both word generating and aggregating.
I'm going to implement it, starting with the HelloREEF example REEF application.
To process multiple queries on one machine, we need to store the states of those queries. However, memory space may not be enough to store all the states. This can happen with queries like online ML (which need to store many parameters).
The basic approach is to store a query's state on disk when the query is inactive, and reload it into memory for processing when its data comes in. However, reading data from disk can be slow, so processing time can increase. Because of that, we need to address the things below.
We need to measure physical operator's
As a first step of issue #7, we will implement the on-demand query state loading feature in MIST. With this feature,
we'll measure the performance of this approach and find ways to improve it.
Some Java objects could be too complicated to be represented as a simple string, so we should provide a more elaborate way of serializing them. We can use Apache Avro for this.
Currently, the WordCounter app contains WordGeneratorTask and WordAggregatorTask.
I'll separate the two tasks into two different applications to make them run on different machines.
Because we need some background on stream processing, we need to review some important papers on the topic. The papers I consider important are listed below.
Taehun and I will review those papers shortly and have a short discussion on them. This issue will remain open in case we need more papers to read.
Please comment on this issue if you come up with more papers worth reading.
After data is transformed, the internal query state should be updated by the newly input data. For that, we should provide an interface for
(1) defining the internal state
(2) an updating function taking the old state and input data
(3) defining window information (window size & slide interval)
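A minimal sketch of the three pieces above might look like this. The class name, field names, and the `apply` helper are hypothetical, not actual MIST interfaces:

```java
import java.util.function.BiFunction;

// Hypothetical spec bundling (1) an initial internal state, (2) an update
// function combining the old state with an input, and (3) window information.
final class WindowedStateSpec<S, I> {
    final S initialState;                 // (1) internal state definition
    final BiFunction<S, I, S> updateFunc; // (2) (old state, input) -> new state
    final long windowSizeMs;              // (3) window size
    final long slideIntervalMs;           // (3) slide interval

    WindowedStateSpec(final S initialState,
                      final BiFunction<S, I, S> updateFunc,
                      final long windowSizeMs,
                      final long slideIntervalMs) {
        this.initialState = initialState;
        this.updateFunc = updateFunc;
        this.windowSizeMs = windowSizeMs;
        this.slideIntervalMs = slideIntervalMs;
    }

    // Fold a batch of inputs into the state using the update function.
    S apply(final Iterable<I> inputs) {
        S state = initialState;
        for (final I input : inputs) {
            state = updateFunc.apply(state, input);
        }
        return state;
    }
}
```

For example, a running count over a window could be declared as `new WindowedStateSpec<>(0, Integer::sum, 1000, 500)`.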
For outputting the result, we need a generic output interface.
By separating those two things, we can have more flexibility in defining the output method.
In addition to scheduling within an Evaluator, we need to assign each streaming query to an appropriate evaluator.
This scheduling can be tricky, so we can start with a simple approach (like random assignment or round-robin) and measure the performance. After that, we can improve performance by leveraging some performance metrics (CPU, memory, ...) for query assignment.
As a basic method of outputting result, we should provide a way to print the result on stdout.
As a simple output method along with stdout printing, we should also support local file output.
We need to define a way for the input data stream to be represented inside the internal query interface.
I think we can use a simple tuple interface with each field's name defined for this. We do not have to define a specific key here.
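The tuple idea above could be sketched as follows. `NamedTuple` and its methods are illustrative only; each field is addressed by name, and no field is designated as a key:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a field-named tuple for representing input stream data.
// Fields keep insertion order; none of them is a designated key.
final class NamedTuple {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // Chainable setter so tuples can be built fluently.
    NamedTuple set(final String name, final Object value) {
        fields.put(name, value);
        return this;
    }

    Object get(final String name) {
        return fields.get(name);
    }
}
```

Usage could look like `new NamedTuple().set("word", "mist").set("count", 2)`.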
We need to have a low-level representation of MIST query. This should be done before issue #5.
I don't think we need to have a data-flow representation as Storm does, because we assumed that each query is processed in one machine. So, I think we can focus on the issues below when designing internal APIs.
We need to make HDFS available as one of the data source options, alongside the other options.
Should be addressed with #11
Current operations, such as filter and map, generate a new List object to create outputs.
This may be inefficient in terms of memory usage. https://github.com/cmssnu/mist/pull/58/files#diff-4514470886a0b95fa75ed1d825e66dddR48
We need to take a look at it at some point.
We need to support saving the output to distributed file systems, like HDFS.
MistTask allocates OperatorChains to executors.
Then, the executors run and activate their dedicated OperatorChains when inputs arrive.
As a first step of the above logic, we will implement a simple round-robin allocator
which dedicates OperatorChains to executors in a round-robin way.
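The round-robin dedication can be sketched in a few lines (the class name is illustrative): the i-th OperatorChain is dedicated to executor `i mod numExecutors`.

```java
// Sketch of a round-robin allocator: chain i goes to executor (i % numExecutors).
final class RoundRobinAllocator {
    static int[] allocate(final int numChains, final int numExecutors) {
        final int[] executorOf = new int[numChains];
        for (int i = 0; i < numChains; i++) {
            executorOf[i] = i % numExecutors;
        }
        return executorOf;
    }
}
```

For 5 chains and 2 executors this yields the assignment [0, 1, 0, 1, 0]; improved policies (e.g. load-aware allocation) can later replace this mapping.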
To support various operations which cannot be expressed via the basic operators, we need to provide an interface for defining UDFs in the MIST API.
To facilitate key-based operations like reduceByKey, we need to support Tuple as well as keyed streams in the MIST API.
Make getState() and setState() allow the query element to read from and write to the SSM DB.
To increase the number of queries that can be processed on one machine, we need to unify the runtime environment for many queries, reducing the overhead of maintaining a large number of queries. Currently, a REEF Evaluator is a separate process in one container from the RM, so we need to process multiple queries in a single Evaluator.
We need to address the particular issues below.
The internal query should have concrete information on how data should be transformed. We will address this issue by providing interfaces for (1) setting the input data from issue #11, (2) registering arbitrary transform functions to the query, and (3) chaining those functions in order. In one query, multiple chains could be used in the case of a join operation.
For (2), we could leverage lambda functions to keep the interface simple.
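Points (2) and (3) above can be sketched as follows: transform functions are registered as lambdas and applied in registration order. The `TransformChain` class and its method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of a chain of transform functions inside a query:
// functions are registered as lambdas and applied in order.
final class TransformChain {
    private final List<Function<Object, Object>> stages = new ArrayList<>();

    // (2) register an arbitrary transform function (here, a lambda).
    TransformChain then(final Function<Object, Object> f) {
        stages.add(f);
        return this;
    }

    // (3) apply the registered functions in registration order.
    Object process(final Object input) {
        Object value = input;
        for (final Function<Object, Object> f : stages) {
            value = f.apply(value);
        }
        return value;
    }
}
```

A query could then register its transforms fluently, e.g. `new TransformChain().then(...).then(...)`; a join would combine the outputs of two such chains.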
The basic experiments show that the most important bottleneck for micro-streams in the current REEF system is the memory limitation.
However, CPU can also become a bottleneck if queries require more computation.
In this experiment, we examine this by shortening the word-generation interval from 1 second downwards.
Add a default test package and related dependencies
Currently, we have only thought about putting query elements' states into the SSM;
we should also consider putting the Query itself into the SSM (for query statistics, etc.).
We need to make a basic MISTStream interface to represent user-created data streams. The basic MISTStream interfaces are
For those streams, the MIST system needs to know the type and other necessary information, so MISTStream and its derived classes should have methods for that information.
Of the DBs, find the best one and install it to test its performance.
The SSM should control which states go into the disk, and which remain in the memory.
We need to decide what kind of caching strategy we will use.
We need a good user API for the system. In particular, we need to hide the internal implementation of MIST, including the REEF components. I think we should also hide the data flow model underlying streaming queries and provide a data-centric interface for users.
To be more specific, we need to provide the three features below via our MIST API.
Issue #3 should be addressed before this issue.
Find appropriate DBs for the SSM and organize them into a single table showing their characteristics.
A logical plan is represented by a data flow, which is generated by the user-level API.
A physical plan is represented by actual MIST operators, in which the degree of parallelism is set.
We should convert a logical plan into a physical plan in the MIST task.
After implementing #41, we need to develop an improved scheduling policy in the executor.
This needs a discussion.
As a first step, we need to implement a simple MIST executor in which the scheduling policy is FIFO.
One executor consists of a queue, a thread, and a scheduler. The thread fetches tasks from the queue and processes them.
We will make the scheduling policy pluggable, in order to change the policy easily.
Designing an optimal scheduling policy is future work.
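The queue/thread/scheduler structure above, with a pluggable FIFO policy, could look roughly like this. All class names are illustrative, not actual MIST code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Pluggable scheduling policy: decides which queued task runs next.
interface Scheduler {
    Runnable pickNext(Queue<Runnable> queue);
}

// FIFO policy: simply take the head of the queue (oldest task first).
final class FifoScheduler implements Scheduler {
    @Override
    public Runnable pickNext(final Queue<Runnable> queue) {
        return queue.poll();
    }
}

// One executor = a queue + a thread + a scheduler.
final class SimpleExecutor {
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private final Scheduler scheduler;

    SimpleExecutor(final Scheduler scheduler) {
        this.scheduler = scheduler;
    }

    void submit(final Runnable task) {
        queue.add(task);
    }

    // Drain the queue on the executor's own thread; the scheduler picks tasks.
    void runAll() {
        final Thread worker = new Thread(() -> {
            Runnable task;
            while ((task = scheduler.pickNext(queue)) != null) {
                task.run();
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Swapping `FifoScheduler` for another `Scheduler` implementation changes the policy without touching the executor, which is the point of making it pluggable.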
Our API needs to support network input connections for fetching data. For the network implementation, we can leverage the NCS implementation from Wake.
Should be addressed with issue #11
We need to think about how to parallelize operators. There are two ways:
First, we can explicitly configure the operator parallelism before doing auto-parallelism.
To provide source configuration, we need a source configuration builder that lets users provide configurations for stream sources.
It will store its configuration in the form of a key-value map, and each different type of source will have different setter methods according to its type.
This is not related to Tang's Configuration and ConfigurationBuilder.
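A minimal sketch of that builder shape, assuming hypothetical class names and configuration keys (the Kafka setters are examples of per-source-type methods):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Base builder: stores the configuration as a key-value map.
// (Unrelated to Tang's ConfigurationBuilder, as noted above.)
class SourceConfigurationBuilder {
    protected final Map<String, String> conf = new HashMap<>();

    Map<String, String> build() {
        return Collections.unmodifiableMap(conf);
    }
}

// A Kafka-specific builder adds setters that make sense only for Kafka sources.
final class KafkaSourceConfigurationBuilder extends SourceConfigurationBuilder {
    KafkaSourceConfigurationBuilder setBrokers(final String brokers) {
        conf.put("kafka.brokers", brokers);
        return this;
    }

    KafkaSourceConfigurationBuilder setTopic(final String topic) {
        conf.put("kafka.topic", topic);
        return this;
    }
}
```

An HDFS or network source builder would extend the same base class with its own setters (path, host/port, etc.) while keeping the common key-value storage.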