datafibers / df

Big Data Swiss Army Knife

Home Page: http://www.datafibers.com

License: Apache License 2.0

Java 100.00%

df's Introduction

DataFibers Smart GW

## 1. Overview

DataFibers (DF) is an open source big data smart gateway and data bus for enterprise big data projects. It implements a generic architecture for both batch and real-time processing.

This project uses, or will use, the following technologies:

Vert.x (Java 8)
Kafka (API, Connect, Stream)
HDFS API
Flink|Spark

It is a Maven multi-module project containing the following modules:

df-reactive-client: Reads a very large file and streams it to the server.
df-reactive-server: Non-blocking server that reads the stream of data from the client, parses it, and sends it to a Kafka queue.
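The client-side idea of streaming a very large file without loading it into memory can be sketched with plain JDK classes. This is a minimal illustration, not the project's code: `ChunkedStreamer`, its `stream` method, and the `sink` callback (a stand-in for "send this chunk to the server") are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Consumer;

/** Hypothetical helper; not part of the DF code base. */
final class ChunkedStreamer {
    private ChunkedStreamer() {}

    /**
     * Reads the input in chunks of at most chunkSize bytes and hands each
     * chunk to the sink, so only one chunk is buffered at a time.
     * Returns the total number of bytes streamed.
     */
    static long stream(InputStream in, int chunkSize, Consumer<byte[]> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Copy only the bytes actually read; a chunk may be shorter than chunkSize.
                sink.accept(Arrays.copyOf(buffer, read));
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In a real client the sink would write to a socket or HTTP request body rather than collect chunks in memory.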

## 2. TODO

Streaming files to Kafka - DONE
Streaming metadata to Kafka - DONE
Streaming files to HDFS - DONE
Batching files to HDFS
Batching files to HIVE
Metadata Store
File watcher
Dashboard for metadata
Transformation framework
Persist framework
Query framework
Integrate Kibana and Elasticsearch
File ingestion and conversion: flat, XML, CSV, mainframe
File header and trailer validation
Data replication across clusters, databases, tables, etc
Data policy support, such as purging or retaining some rows for compliance reasons
Automatically register data with Hive
Data format interchange
Data deduplication and merge
Data job management and monitoring
Web UI

df's Issues

CSV file ingestion

Ingest CSV files into HDFS.

Need a demo for streaming

Create a demo for streaming that covers:

Real-time client
Reporting from Kafka

Streaming a file to HDFS produces duplicated data

The fix is as follows:

Remove the handshake from the while loop, since each request needs only one handshake
Use the regular expression .+

Need documentation wiki for setup

VM setup
Steps for the demo

Configurable persist layer on HDFS

Make sure the data is archived into Hadoop and/or a file storage web service before it expires from Kafka.
This is a long-term feature, recorded here as a placeholder.

DF Agent non-blocking issue

DF Agent is now non-blocking with Vert.x.
If a second thread starts streaming before the first thread has finished, the data from the two threads can get mixed up. As a result, Kafka receives bad data.

The resolution is to make streaming blocking, so that files are streamed one at a time.
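One way to get the one-file-at-a-time behavior, sketched with JDK classes only, is to push every streaming job through a single worker thread so that two files can never interleave. `SequentialFileStreamer` and its methods are hypothetical names for illustration; the issue does not say how the agent actually implements this.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of the DF code base. */
final class SequentialFileStreamer {
    // A single worker thread runs submitted jobs strictly in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues one "stream this file" job; it starts only after earlier jobs finish. */
    void submit(Runnable streamOneFile) {
        worker.execute(streamOneFile);
    }

    /** Stops accepting jobs and waits for queued ones; returns true if all finished in time. */
    boolean shutdownAndWait(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Callers stay non-blocking because `submit` returns immediately; only the streaming itself is serialized.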

Stream file needs to watch for folder changes

The stream file function requires the following improvements:

A while loop that watches for folder changes, with a timeout
Support for file filters, so that we do not stream files that are still arriving
Archiving of streamed files somewhere, so that we do not mix them up with unprocessed files
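The watch loop with a timeout and a file filter could look roughly like the sketch below. It polls the folder with JDK classes only; `FolderPoller`, `waitForFiles`, and `isReady` are hypothetical names, and treating a leading `_` as "still arriving" is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper; not part of the DF code base. */
final class FolderPoller {
    private FolderPoller() {}

    /**
     * Polls dir until at least one regular file passes the filter or the
     * timeout elapses. Returns the matching files sorted by path, or an
     * empty list on timeout.
     */
    static List<Path> waitForFiles(Path dir, Predicate<Path> filter,
                                   long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            List<Path> matches = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isRegularFile(entry) && filter.test(entry)) {
                        matches.add(entry);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (!matches.isEmpty()) {
                Collections.sort(matches);
                return matches;
            }
            if (System.nanoTime() - deadline >= 0) {
                return matches; // timed out: empty list
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return matches;
            }
        }
    }

    /** Assumed convention: a name starting with "_" marks a file that is still arriving. */
    static boolean isReady(Path file) {
        return !file.getFileName().toString().startsWith("_");
    }
}
```

The JDK's `java.nio.file.WatchService` is an event-based alternative to polling; the polling version is shown because its timeout behavior is easier to follow.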

Need to add test cases

Test cases are needed for both client and server

Metadata Logic Improvement

Need to update the job status in the metadata
For MongoDB, we can update the existing record
For Kafka, we can send another message

Need better logging system

We need a better logging system, such as log4j.
We also need to log each job separately.

Streaming to HDFS needs improvement

Currently, the streamed data is first saved to a local file on the DF server and then uploaded to HDFS. If the file is too big, it does not show up in HDFS. A better way is to start writing to HDFS as soon as the block size is reached. Later, we can merge the files together.
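The "write as soon as the block size is reached" idea can be sketched as a small buffer that flushes a full block to a sink and keeps the remainder. This is an illustration under assumptions: `BlockFlushingBuffer` is a hypothetical class, and `blockSink` stands in for whatever would actually write a part file to HDFS.

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;

/** Hypothetical helper; not part of the DF code base. */
final class BlockFlushingBuffer {
    private final int blockSize;
    private final Consumer<byte[]> blockSink; // stand-in for "write one part to HDFS"
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    BlockFlushingBuffer(int blockSize, Consumer<byte[]> blockSink) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
        this.blockSink = blockSink;
    }

    /** Buffers incoming data and emits a block each time blockSize bytes are pending. */
    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int room = blockSize - pending.size();
            int n = Math.min(room, data.length - offset);
            pending.write(data, offset, n);
            offset += n;
            if (pending.size() == blockSize) {
                flush();
            }
        }
    }

    /** Emits whatever is left as a final, possibly shorter, block. */
    void finish() {
        if (pending.size() > 0) {
            flush();
        }
    }

    private void flush() {
        blockSink.accept(pending.toByteArray());
        pending.reset();
    }
}
```

With this shape the server never holds more than one block in memory, and the emitted parts can be merged later in the order they were written.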

df-data-collector needs a 24-hour function

The DF demo df-data-collector can only get updated data while the US stock market is open. For demo purposes, we also need data to be available when the market is closed. We will consider using spoofed data, and also adding the China market as another option.

DF Active Server Refactoring

Need to refactor the server of DF as follows

Need a document for the introduction

Need to polish the introduction so it can be published on the main site

Batching files to HDFS

This is a new feature to batch load files to HDFS.
In this case, the file needs to arrive on the DF server first. DF then just moves or copies the file to HDFS.

Add filter and move options for stream files

We should add an option to move processed files to an archive folder, so that we know which files have been processed.
We also need to support filtering the files to be processed, such as files with a leading _

With these in place, we can work together with the stream generator to consume files smoothly.
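The move option could be as small as the sketch below, which relocates a processed file into an archive folder using the JDK's `Files.move`. `ProcessedFileArchiver` and its `archive` method are hypothetical names, and overwriting an existing archived file with the same name is an assumption, not something the issue specifies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of the DF code base. */
final class ProcessedFileArchiver {
    private ProcessedFileArchiver() {}

    /**
     * Moves a processed file into archiveDir (created if missing) and
     * returns its new location. An archived file with the same name is
     * replaced; that policy is an assumption for this sketch.
     */
    static Path archive(Path file, Path archiveDir) {
        try {
            Files.createDirectories(archiveDir);
            Path target = archiveDir.resolve(file.getFileName());
            return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the file leaves the watched folder once it is archived, a later scan of that folder will not pick it up a second time.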
