
df's Introduction

DataFibers Smart GW

##1.Overview

DataFibers (DF) is an open source smart gateway and data bus for enterprise big data projects. It implements a generic architecture for both batch and real-time processing.

This project uses, or plans to use, the following technologies:

  • Vert.x (Java 8)
  • Kafka (API, Connect, Streams)
  • HDFS API
  • Flink|Spark

It is a Maven multi-module project containing the following modules:

  • df-reactive-client: Reads a very large file and streams it to the server.
  • df-reactive-server: Non-blocking server that reads the data stream from the client, parses it, and sends it to a Kafka queue.

##2.TODO

  • Streaming files to Kafka - DONE
  • Streaming metadata to Kafka - DONE
  • Streaming files to HDFS - DONE
  • Batching files to HDFS
  • Batching files to HIVE
  • Metadata Store
  • File watcher
  • Dashboard for metadata
  • Transformation framework
  • Persist framework
  • Query framework
  • Integrate Kibana and Elastic
  • File ingestion and conversion: flat, XML, CSV, mainframe
  • File header and trailer validation
  • Data replication across clusters, databases, tables, etc.
  • Data policy support, such as purging or retaining some rows for compliance reasons
  • Automatically register data with Hive
  • Data format interchange
  • Data deduplication and merge
  • Data job management and monitoring
  • Web UI


df's Issues

Configurable persist layer on HDFS

Make sure the data is archived into Hadoop and/or a file storage web service before it expires from Kafka.
This is a long-term feature; it is listed here as a placeholder.

DF Agent non-blocking issue

The DF Agent is currently non-blocking with Vert.x.
If a second thread starts before the first thread has finished streaming, the data from the two threads can get mixed up. As a result, Kafka receives bad data.

The resolution is to make streaming blocking, so that files are streamed one at a time.
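
The "one file at a time" resolution could look roughly like the sketch below. It is not the DF Agent code: `SequentialStreamer` is a hypothetical helper, an in-memory list stands in for the Kafka topic, and a plain single-thread executor is used instead of Vert.x so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of DF: streams queued files strictly one after another.
class SequentialStreamer {
    // A single worker thread means file N is fully streamed before file N+1 starts.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for the Kafka topic (assumption): keeps records in the order they were sent.
    private final List<String> sink = Collections.synchronizedList(new ArrayList<String>());

    // Queue one file, represented here by its name and its lines, for streaming.
    void submit(final String fileName, final List<String> lines) {
        worker.execute(() -> {
            for (String line : lines) {
                sink.add(fileName + ":" + line);
            }
        });
    }

    // Stop accepting files, wait for the queued ones to finish, and return what was sent.
    List<String> drain() throws InterruptedException {
        worker.shutdown();
        if (!worker.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("streaming did not finish in time");
        }
        return sink;
    }

    public static void main(String[] args) throws InterruptedException {
        SequentialStreamer streamer = new SequentialStreamer();
        streamer.submit("a.csv", Arrays.asList("1", "2"));
        streamer.submit("b.csv", Arrays.asList("3", "4"));
        // prints [a.csv:1, a.csv:2, b.csv:3, b.csv:4]
        System.out.println(streamer.drain());
    }
}
```

Callers can still submit files from several threads; only the streaming itself is serialized, so records from different files never interleave.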

Stream file needs to watch folder changes

The stream file function requires the following improvements:

  • A while loop that watches for folder changes, with a timeout
  • Support for file filters, so that we do not stream files that are still arriving
  • Archiving of streamed files, so that we do not mix them up with unprocessed ones
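
A minimal sketch of the watch loop with a timeout, assuming simple directory polling, is shown below. `FolderPoller`, `awaitNewFiles`, and the polling interval are hypothetical names and choices for illustration, not existing DF code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of DF: polls a folder for files that were not seen before.
class FolderPoller {
    // Returns the new regular files found in dir, or an empty list if none
    // appear before timeoutMillis elapses. Returned files are added to seen.
    static List<Path> awaitNewFiles(Path dir, Set<Path> seen, long timeoutMillis, long intervalMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            List<Path> fresh = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    // Set.add returns true only for entries we have not reported yet.
                    if (Files.isRegularFile(entry) && seen.add(entry)) {
                        fresh.add(entry);
                    }
                }
            }
            if (!fresh.isEmpty() || System.nanoTime() >= deadline) {
                return fresh;
            }
            Thread.sleep(intervalMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("df-watch");
        Set<Path> seen = new HashSet<>();
        Files.createFile(dir.resolve("quotes.csv"));
        System.out.println(awaitNewFiles(dir, seen, 1000, 100)); // the file just created
        System.out.println(awaitNewFiles(dir, seen, 300, 100));  // empty list once the timeout expires
    }
}
```

Polling keeps the sketch short and predictable. The JDK's `java.nio.file.WatchService` is an event-based alternative, and its `poll(timeout, unit)` method gives a similar loop-with-timeout shape.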

Metadata Logic Improvement

  • Need to update the job status in the metadata
  • For MongoDB, we can update the existing record
  • For Kafka, we can send another message

Streaming to HDFS needs improvement

Currently, the streamed data is first saved to a local file on the DF server and then uploaded to HDFS. If the file is too big, we will not see it in HDFS. A better way is to start writing to HDFS once the block size is reached. Later, we can merge the files together.
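
The "write once the block size is reached, merge later" idea can be sketched as follows. This is only an illustration under loud assumptions: a local directory stands in for HDFS, `BlockRollingWriter` and the `.partN` naming are hypothetical, and a real implementation would write through the Hadoop file system API instead of `java.nio.file`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DF: rolls incoming data into block-sized part files.
class BlockRollingWriter {
    // Copies the stream into numbered part files under targetDir (a stand-in for HDFS),
    // starting a new part each time blockSize bytes have been collected.
    static List<Path> writeInParts(InputStream in, Path targetDir, String baseName, int blockSize)
            throws IOException {
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[blockSize];
        int filled;
        while ((filled = fill(in, buffer)) > 0) {
            Path part = targetDir.resolve(baseName + ".part" + parts.size());
            try (OutputStream out = Files.newOutputStream(part)) {
                out.write(buffer, 0, filled);
            }
            parts.add(part);
        }
        return parts;
    }

    // Reads until the buffer is full or the stream ends; returns the number of bytes read.
    private static int fill(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    // Concatenates the parts, in order, into a single merged file.
    static void merge(List<Path> parts, Path merged) throws IOException {
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

Each part becomes visible as soon as it is closed, so progress shows up before the whole file has arrived. The sketch buffers one full block in memory, which is acceptable for small test blocks; with real HDFS block sizes a smaller copy buffer with a byte counter would be the safer choice.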

df-data-collector needs a 24-hour function

The DF demo df-data-collector can only get updated data while the US stock market is open. For demo purposes, we also need data to be available when the market is closed. We will consider using spoofed data, and also adding the China market as another option.

DF Active Server Refactoring

Need to refactor the DF server as follows:

  • Use the event bus
  • Split the code into different service verticles
  • Redefine/polish the metadata
  • Redefine/polish the communication protocol
  • Use HTTPS
  • Add authentication

Batching files to HDFS

This is a new feature to batch load files to HDFS.
In this case, the file needs to arrive on the DF server first. DF then just moves or copies the file to HDFS.

Add filter and move options for stream files

We should add an option to move processed files to an archive folder, so that we know which files have been processed.
We also need to support filtering the files to be processed, for example by a leading _ in the file name.

With these options, we can collaborate with the stream generator to consume files smoothly.
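
A minimal sketch of the filter and move options is given below. It assumes that a leading `_` marks a file the generator is still writing, so such files are skipped; the issue text does not spell this out. `StreamFileArchiver`, `isReady`, and the folder layout are hypothetical names for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of DF: skips unfinished files and archives processed ones.
class StreamFileArchiver {
    // Assumed convention: a leading "_" marks a file that is not ready to be streamed.
    static boolean isReady(Path file) {
        return !file.getFileName().toString().startsWith("_");
    }

    // Processes every ready file in inbox, then moves it into archive.
    // Returns the archived paths in file-name order.
    static List<Path> processAndArchive(Path inbox, Path archive) throws IOException {
        Files.createDirectories(archive);
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inbox)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && isReady(entry)) {
                    ready.add(entry);
                }
            }
        }
        Collections.sort(ready);
        List<Path> archived = new ArrayList<>();
        for (Path file : ready) {
            // Streaming the file to Kafka would happen here (omitted in this sketch).
            Path target = archive.resolve(file.getFileName());
            archived.add(Files.move(file, target, StandardCopyOption.REPLACE_EXISTING));
        }
        return archived;
    }
}
```

Under this convention the stream generator can write to `_name`, rename the file to `name` when it is complete, and rely on the archive folder to tell which files have already been consumed.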
