Giter VIP home page Giter VIP logo

zebra's Introduction

zebra

Column-oriented binary format for immutable datasets

Overview

zebra is a typed data format for storing arbitrary combinations of sums, products and arrays in compressed form. It achieves high compression by decomposing this data in to a "struct of arrays" representation on disk. zebra stores data in blocks so that files can be streamed without loading the entire file in memory.

Column-oriented

Image from the Dremel paper

The "struct of arrays" representation used on disk is very similar to that used by Dremel and Parquet, and is known as column oriented, as opposed to row or record oriented. This allows for higher compression as similar data is grouped together. One notable difference that Zebra has from Dremel and Parquet is that Zebra supports full sum types (i.e. Rust-like enums) rather than just optional fields.

What this looks like on disk can be hard to visualise, here are a number of examples which show how logical types and data are flattened into the physical column oriented representation which is used on disk. You can see in the examples that arrays which exist in adjacent rows are simply concatenated and a segment descriptor with the array lengths is stored to keep track of which values belong to which row. Sum types use a tag to indicate which variant is relevant for a given row.

Example 1

Type

Data

Example 2

Type

Data

Example 3

Type

Data

Example 4

Type

Data

Compression

Zebra uses a number of techniques to achieve a relatively high compression ratio, given the speed at which it can be encoded and decoded.

When compressing integers, Zebra uses three techniques stacked on top of each other. First, it uses a frame of reference encoding which offsets each value in the column by the midpoint of the minimum and the maximum so that big numbers become small numbers.

Zebra then uses zig-zag encoding to make all numbers positive, so that small negative numbers can be encoded using a small number of bits.

Finally Zebra finds the maximum number of bits required to store each of the numbers in the column. If that happens to be 6-bits as in the example below, then 16 x 6-bit numbers can be packed in to 6 x 16-bit numbers. All of these techniques are detailed and benchmarked in Lemire 2012.

For strings, Zebra simply uses Snappy compression, however because data is organised in columns, it achieves a higher than usual compression ratio.

Mandatory Reviewers

Given the current state of this library, its performance sensitivity, and is criticality to ongoing production jobs that are still in a very early/tentative state. There should be some mandatory review, for production implications. Due to time and availability constraints this is currently @jystic. So any reviews should hold off on merge until he can sweep. Whilst this might be a bit more restrictive then we normally are on most libraries, it is the safest option for the short term while stability is low / churn is high, and it remains on critical path for many production tasks.

zebra's People

Contributors

amosr avatar erikd-ambiata avatar jacobstanley avatar markhibberd avatar nhibberd avatar tranma avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.