dataflowkit

A framework for collaboration between data scientists and engineers.

Installation

pip install dataflowkit
pip3 install dataflowkit

Examples

[Figure: example]

Class Diagram

[Figure: class diagram]

Description

Data scientists and engineers have different skill sets. While data scientists focus on algorithms and prediction accuracy, engineers focus on data storage and maintenance. This framework borrows heavily from functional programming and from other data flow management tools such as SAS.

A program is decoupled into individual components. With the aid of a design diagram, both parties can easily understand the flow and how to integrate their work. Following best practices can greatly shorten development time.

Unlike other data flow management tools, dataflowkit focuses on individual data flows rather than batch data flows.

General Idea

  • A program is decoupled into Recipes and Datasets.
  • A Recipe is the calculation component, which provides one public method: execute(ins, outs)
  • A Dataset is the storage component, which provides two public methods: save(data) and load()
  • Datasets can be InMemory, S3, Local, MySql and others
  • Recipes and Datasets are linked to each other, and data should be transferred as a dataframe or a dict that can be converted to a dataframe.
  • No cyclic flow is allowed
  • Components can be replaced, so code refactoring is easy
  • Design can be improved and integrated since it follows functional design (components can be merged or split)
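
The ideas above can be illustrated with a minimal sketch. It is not dataflowkit's actual API: the class names Dataset, InMemoryDataset, Recipe and AddTotal are hypothetical, and passing ins and outs as lists of Datasets is an assumption. Only the method names execute(ins, outs), save(data) and load() come from the description above. Data is passed as a plain dict of columns, which is one form of a dict that can be converted to a dataframe.

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Storage component (hypothetical base class): save(data) and load()."""

    @abstractmethod
    def save(self, data): ...

    @abstractmethod
    def load(self): ...


class InMemoryDataset(Dataset):
    """Keeps the data in a Python attribute; other storage backends
    could be swapped in as long as they offer save() and load()."""

    def __init__(self, data=None):
        self._data = data

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


class Recipe(ABC):
    """Calculation component (hypothetical base class): execute(ins, outs)."""

    @abstractmethod
    def execute(self, ins, outs): ...


class AddTotal(Recipe):
    """Hypothetical recipe: adds a 'total' column equal to price * quantity."""

    def execute(self, ins, outs):
        # Assumption: ins and outs are lists of Dataset objects.
        table = ins[0].load()
        total = [p * q for p, q in zip(table["price"], table["quantity"])]
        outs[0].save({**table, "total": total})


# Wire one Recipe between two Datasets and run it.
orders = InMemoryDataset({"price": [2.0, 5.0], "quantity": [3, 1]})
priced = InMemoryDataset()
AddTotal().execute([orders], [priced])
result = priced.load()
print(result["total"])
```

Because the recipe only calls load() and save(), replacing InMemoryDataset with a different storage class would not require touching AddTotal, which is the kind of component replacement the list describes.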

General Workflow

  1. Dataflow Analysis - figure out the key components (Recipe and Dataset) for the data flow
  2. Dataflow Design - figure out all components for the data flow
  3. Implementation - implement the algorithms
  4. Code Refactoring - refactor the code to increase readability and maintainability
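
During the design step, one way to keep the "no cyclic flow" rule honest is to write the flow down as a graph and check it before implementing anything. The sketch below is only an illustration using Python's standard graphlib module (Python 3.9+). The node names and the execution_order helper are hypothetical and not part of dataflowkit.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical design: each node maps to the set of nodes it reads from.
# Recipes and Datasets alternate along the flow.
flow = {
    "clean_orders": {"raw_orders"},    # Recipe reading a Dataset
    "orders_clean": {"clean_orders"},  # Dataset written by that Recipe
    "score_orders": {"orders_clean"},  # Recipe reading the cleaned Dataset
    "scores": {"score_orders"},        # final Dataset
}


def execution_order(graph):
    """Return the nodes in dependency order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # The second element of args holds the nodes forming the cycle.
        raise ValueError(f"cyclic flow detected: {err.args[1]}") from err


order = execution_order(flow)
print(order)
```

If a design accidentally makes raw_orders depend on scores, execution_order raises ValueError instead of returning an order, so the problem surfaces at design time rather than during implementation.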

Responsibilities

Teams have different sizes, and members have different skill sets. It is recommended that each team clearly agree on responsibilities at the beginning, according to its members' skill sets. In general, data scientists focus on algorithm implementation and engineers focus on code refactoring. Design is a task in which both parties should be involved. It is very important to design before implementing and to keep the data flow graph up to date.

Future Development

A web-based interface for data flow design and code generation will be the next task. You are very welcome to join us and contribute.

Author: [email protected]
