Giter VIP home page Giter VIP logo

carbondata's Introduction

CarbonData

CarbonData is a fully indexed columnar and hadoop native datastore designed for fast analytics and multi-dimension query on big data.In customer benchmarks, CarbonData has been shown to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).

Why CarbonData

The CaronData file format provides a highly efficient way to store structured data,it was designed to overcome limitations of the other Hadoop file formats.

  • CarbonData stores data along with index,the index is not stored separately but the carbondata itself is the index.Indexing helps to accelerate query performance and reduces the I/O scans to the minimum and reduces the CPU required for computation on the data.
  • CarbonData is designed to support very efficient compression and global encoding schemes,CarbonData can support actionable compression across data files for query engines to perform all processing on compressed/encoded data without having to convert the data. This improves the processing speed, especially for analytical queries. The data can be converted back to the user readable format just before returning the results to the user.
  • CarbonData is designed with various types of usecases in mind and provides flexible storage options. Data can be stored in completely columnar or row based formats or even in a mix of columnar and row based hybrid format.
  • CarbonData is designed to integrate into big data ecosystem, leverages Apache Hadoop,Apache Spark etc. for distributed query processing.

Features

  • Fast analytic queries in seconds using built-in index, optimized for interactive OLAP-style query, high through put scan query, low latency point query.
  • Fast data loading speed and support incremental load in period of minutes. 5 mins near realtime loading should be supported
  • Support concurrent query.
  • Support time based data retention.
  • Supports SQL based query interface.

CarbonData File Structure and Format

The online document at CarbonData File Structure and Format

Building CarbonData

Prerequisites for building CarbonData:

  • Unix-like environment (Linux, Mac OS X)
  • git
  • Apache Maven (we recommend version 3.0.4)
  • Java 7 or 8
  • Scala 2.10

I. Clone and build CarbonData

$ git clone https://github.com/HuaweiBigData/carbondata.git

II. Go to the root of the source tree

$ cd carbondata

III. Build the project Build without testing:

$ mvn -DskipTests clean install 

Or, build with testing:

$ mvn clean install

Developing CarbonData

The CarbonData committers use IntelliJ IDEA and Eclipse IDE to develop.

IntelliJ IDEA

  • Download IntelliJ at https://www.jetbrains.com/idea/ and install the Scala plug-in for IntelliJ at http://plugins.jetbrains.com/plugin/?id=1347
  • Go to "File -> Import Project", locate the CarbonData source directory, and select "Maven Project".
  • In the Import Wizard, select "Import Maven projects automatically" and leave other settings at their default.
  • Leave other settings at their default and you should be able to start your development.
  • When you run the scala test, sometimes you will get out of memory exception. You can increase your VM memory usage by the following setting, for example:
-XX:MaxPermSize=512m -Xmx3072m

You can also make those setting to be the default by setting to the "Defaults -> ScalaTest".

Eclipse

  • Download the Scala IDE (preferred) or install the scala plugin to Eclipse.
  • Import the CarbonData Maven projects ("File" -> "Import" -> "Maven" -> "Existing Maven Projects" -> locate the CarbonData source directory).

Fork and Contribute

This is an open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide document introduce how to contribute to CarbonData.

About

CarbonData project original contributed from the Huawei, in progress of donating this open source project to Apache Software Foundation for leveraging big data ecosystem.

carbondata's People

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.