Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Hadoop

  • Apache Hadoop - Apache Hadoop
  • Apache Tez
  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • dumbo - Python module that allows you to easily write and run Hadoop programs.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Kiji Project
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Kylin - Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
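
Several of the Python entries above (dumbo, hadoopy, mrjob) help you write MapReduce jobs. As a rough illustration of the map, shuffle/sort, reduce flow those jobs follow, here is a minimal single-process word count in plain Python. The names `mapper`, `reducer` and `run_local` are hypothetical and chosen for this sketch; they are not the API of any library listed here, and nothing below talks to a real Hadoop cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce inside one process (assumption: no cluster)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_local(["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]))
```

On a real cluster the sort between the two phases is performed by the framework, which is why the reducer here only assumes sorted input rather than sorting it itself.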

YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on YARN

NoSQL

Next-generation databases that mostly address some of these points: being non-relational, distributed, open-source and horizontally scalable.

  • Apache HBase - Apache HBase
  • Apache Phoenix - A SQL skin over HBase
  • happybase - A developer-friendly Python library to interact with Apache HBase.
  • Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
  • Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra

SQL on Hadoop

Workflow, Lifecycle and Governance

Data Ingestion and Integration

DSL

  • Apache Pig - Apache Pig
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
  • varaha - Machine learning and natural language processing with Apache Pig
  • packetpig - Open Source Big Data Security Analytics
  • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop
  • Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

  • Apache Spark
  • Apache Crunch
  • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Packaging, Provisioning and Monitoring

Search

Benchmark

Machine learning and Big Data analytics

  • Apache Mahout
  • Cloudera Oryx - The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure.
  • MLlib - MLlib is Apache Spark's scalable machine learning library.
  • R - R is a free software environment for statistical computing and graphics.
  • RHive - RHive is an R extension facilitating distributed computing via Apache Hive.
  • RHadoop

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Presentations

Books

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
