alfred

Alfred is Your Data Butler!

Alfred is a custom data ingestion engine that acts as a gatekeeper, preventing ungoverned data from being loaded into a data lake. It lets business users upload and analyze data themselves, and define and set up files for ingestion. Through a simple, intuitive user interface, the user provides the file details and submits them directly.
Submitting these details automatically performs much of the technical setup and configuration.
This lets users determine more quickly whether data has value and should be promoted to a production process.

The Technology Behind Alfred

Alfred's REST services are a Java 7 Spring Boot project. Java 7 was chosen for compatibility with the Java 7 installations on HDFS edge nodes. The UI is a React project. The ingestion scripts are written in Python 2.7. Alfred has so far been tested with Hive and HDFS on Unix-based systems. It has been tested on the Cloudera Quickstart VM, but it does not depend on Cloudera.

Alfred's Data Flow

There are three types of datasets within Alfred: Sandbox, Production, and Refined.

  • Sandbox: This is where business users will most often work. They create a sandbox dataset to explore data. For example, they may have a CSV file that looks useful, but want to see how it ties to other data before setting up a scheduled, production-stable dataset. A sandbox dataset lets them upload one-off files and discover what they need.
  • Production: This is for data that is regularly updated. It runs as a production-like job that automatically pulls in new data. The incoming data is validated to verify that it matches what is expected. This way, your data lake won't become a data swamp.
  • Refined: This is where data scientists and analysts record how they are using the data and which datasets they are creating from the "source" systems. This captures the lineage of the data.
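
The difference between Sandbox and Production handling can be illustrated with a minimal sketch. This is not Alfred's actual code: the function names, the column-based check, and the use of Python 3 (rather than the Python 2.7 of the ingestion scripts) are all assumptions made for illustration. It treats "validated" as meaning that the CSV header matches an expected list of columns, and it does not model Refined datasets, which describe lineage rather than file uploads.

```python
import csv
import io

# Hypothetical names for illustration only; not taken from Alfred's code base.
UPLOAD_TYPES = ("sandbox", "production")


def validate_header(csv_text, expected_columns):
    """Return a list of problems found in the CSV header (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # first row, or [] for an empty file
    problems = []
    missing = [c for c in expected_columns if c not in header]
    extra = [c for c in header if c not in expected_columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    return problems


def accept_file(dataset_type, csv_text, expected_columns):
    """Assumed policy: sandbox uploads are accepted as-is for exploration,
    while production files must match the expected columns."""
    if dataset_type not in UPLOAD_TYPES:
        raise ValueError("unknown dataset type: " + dataset_type)
    if dataset_type == "production":
        return not validate_header(csv_text, expected_columns)
    return True
```

Under these assumptions, a one-off file with unfamiliar columns would be accepted as a Sandbox upload but rejected by a Production job until its layout matches what the dataset expects.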
