
amoeba's Introduction

Amoeba

Amoeba is a relational storage manager built on top of HDFS for the Spark/Hadoop stack.

Amoeba is geared towards accelerating ad-hoc query analysis. A user can start analyzing a dataset immediately after upload, and the partitioning layout of the data improves over time based on the queries that are run. Amoeba does this by splitting the dataset into block-sized partitions according to a partitioning tree: a multi-attribute binary tree index that is created with no workload knowledge and incrementally adapted based on user queries.
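Conceptually, each internal node of the tree holds one attribute and a cut point, and a tuple is routed left or right by comparing its value on that attribute; leaves correspond to block-sized buckets. A minimal sketch with hypothetical names (the repo's actual classes are RobustTree and RNode):

    // Conceptual sketch of a partitioning tree node (hypothetical names;
    // the repo's actual classes are RobustTree and RNode). Each internal
    // node splits on one attribute at a cut point; leaves map to buckets.
    class TreeNode {
        int attribute;        // index of the attribute this node splits on
        double cutPoint;      // values <= cutPoint go left, others go right
        TreeNode left, right; // both null at a leaf
        int bucketId;         // meaningful only at a leaf

        // Route a tuple (one value per attribute) to its leaf bucket.
        int route(double[] tuple) {
            if (left == null && right == null) {
                return bucketId;
            }
            return tuple[attribute] <= cutPoint ? left.route(tuple) : right.route(tuple);
        }
    }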

The user interacts via two components:

  • Upfront Partitioner: loads the data into Amoeba
  • Adaptive Query Executor: runs queries against the tables in Amoeba

Linux / Mac OS X

  • Install and run the following dependencies

    • hadoop-2.6
    • spark-1.6.0
    • zookeeper-3.4.6
  • Install gradle

    • Mac: brew install gradle
    • Ubuntu: sudo apt-get install gradle
  • Build the Amoeba JAR: cd amoeba; gradle shadowJar

  • Amoeba uses Fabric for automation. Install it with sudo pip install fabric

Running a simple example

  • Generate a simple dataset with 1000 tuples and 2 integer attributes A and B (a rough Java equivalent of this step is sketched after this list)

    cd data; python gen_simple.py

  • Modify the Amoeba configuration to reflect your current Hadoop/Spark/ZooKeeper installation

    cd ../conf; vim main.properties

  • Upload the dataset into Amoeba

    cd ../scripts
    fab setup:local_simple create_table_info bulk_sample_gen create_robust_tree write_partitions
    

    setup:local_simple picks up the configuration for the simple dataset from fabfile/confs.py. You need to add the configuration for your own dataset to this file.

    create_table_info creates a table entry for the dataset

    bulk_sample_gen samples the dataset in parallel

    create_robust_tree builds the partitioning tree for the dataset

    write_partitions writes the partitioned dataset into HDFS

  • To run a bunch of queries

    cp ../data/simple_queries.log ~/queries.log
    fab setup:local_simple run_queries_formatted
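
For reference (as noted in the first step above), here is a rough Java equivalent of what the gen_simple.py step produces: 1000 tuples with two integer attributes A and B. The output file name, delimiter, and value range are assumptions, not taken from the actual script.

    // Rough Java equivalent of the gen_simple.py step: 1000 tuples with two
    // integer attributes A and B. The output file name, delimiter, and value
    // range are assumptions, not taken from the actual script.
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Random;

    public class GenSimple {
        public static void main(String[] args) throws IOException {
            Random rand = new Random(42);
            try (FileWriter out = new FileWriter("simple.tbl")) {
                for (int i = 0; i < 1000; i++) {
                    out.write(rand.nextInt(1000) + "|" + rand.nextInt(1000) + "\n");
                }
            }
        }
    }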
    

The wiki contains additional information about setting up and using Amoeba.

amoeba's People

Contributors

aaronelmore, alekhjindal, anilshanbhag, luyi0619


amoeba's Issues

Handle skew for low distinct attributes

In the index builder, after splitting the sample on the chosen attribute, use the sample sizes of the two splits to adjust the allocation deducted in the left and right branches.
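One way to read this, sketched with hypothetical names (not the index builder's actual code): deduct allocation from each branch in proportion to its share of the sample rather than evenly.

// Hedged sketch: deduct allocation proportionally to each split's sample size
// instead of splitting it evenly. leftSize/rightSize are the sample sizes of
// the two splits produced by the chosen attribute.
static double[] adjustAllocation(double allocation, long leftSize, long rightSize) {
    double total = leftSize + rightSize;
    double leftDeduction = allocation * (leftSize / total);
    double rightDeduction = allocation * (rightSize / total);
    return new double[] { leftDeduction, rightDeduction };
}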

Create one index builder

We have many variations of the index builder. Remove all of them except the one that will actually be used.

Support adding multiple predicates into the tree.

Currently we just choose the best plan from a set of plans, each of which tries to add only one of the predicates.
The most obvious improvement would be to also add the remaining predicates into the chosen best plan.

Improve SparkFileSplit format.

The recommended size limit is 100KB but we exceed it.
SparkFileSplit stores the full paths[] explicitly. We can probably factor out the common base path into a single variable, which would greatly reduce the serialized size.
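A sketch of that idea with hypothetical field names: serialize one shared base path plus short relative suffixes instead of repeating the full path for every file.

// Sketch only: keep a single base path and store relative paths, instead of
// serializing the full absolute path for every file in the split.
class CompactFileSplit implements java.io.Serializable {
    String basePath;        // e.g. a table's root directory on HDFS, stored once
    String[] relativePaths; // one short suffix per file

    String fullPath(int i) {
        return basePath + "/" + relativePaths[i];
    }
}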

Perf Improvement Needed

Running the following query

SELECT COUNT(*) FROM tpch WHERE l_shipmode == "MAIL" AND l_receiptdate >= "1997-01-01" AND l_receiptdate < "1998-01-01";

We took 30.972s, while Spark took 21.73s.

The results did match, so that is good.

Too many files on HDFS opened

Reference URLs
http://hbase.apache.org/book.html#trouble
http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html

2015-09-09 17:07:53,370 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:722)

2015-09-09 17:07:53,384 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiver error processing WRITE_BLOCK operation  src: /128.30.77.86:53903 dst: /128.30.77.88:50010
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:672)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)

2015-09-09 17:07:53,383 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/data/mdindex/dfs/dfs/data/current]'}, localName='istc2:50010', datanodeUuid='96a62a6a-65de-4007-bc6c-b1acc64b9d14', xmitsInProgress=0}:Exception transfering block BP-336117526-127.0.1.1-1432094327835:blk_1077379710_21271759 to mirror 128.30.77.83:50010: java.io.EOFException: Premature EOF: no length prefix available

Don't keep files open. Open, write and close.
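A minimal sketch of that open-write-close pattern using the standard HDFS FileSystem API; the path handling and the append-vs-create choice are placeholders, not the repo's actual writer code.

// Sketch: write each batch with a short-lived stream instead of holding many
// open HDFS streams at once. Uses the standard FileSystem API; the path and
// the append-vs-create decision stand in for the real writer logic.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketWriter {
    public static void writeBatch(Configuration conf, Path bucketFile, byte[] batch)
            throws java.io.IOException {
        FileSystem fs = FileSystem.get(conf);
        // Open, write, close -- do not keep the stream around between batches.
        try (FSDataOutputStream out =
                 fs.exists(bucketFile) ? fs.append(bucketFile) : fs.create(bucketFile)) {
            out.write(batch);
        }
    }
}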

A bug in Predicate

Line 74:

if (TypeUtils.compareTo(this.value, value, this.type) < 0)

should be

if (TypeUtils.compareTo(this.value, value, this.type) > 0)

right?

Zookeeper bucket count incorrect

write_partitions completed, but I think it got stuck on the ZooKeeper count update.

Re-run write_partitions and verify, using the sum_bucket_counters call, that the bucket count was set correctly.

Sampling seems to be memory intensive

Tried to sample 700MB of data with a 512MB heap size. This should be easy, but we get an OOM.
One issue might be that we store the entire sample in memory; however, there might be other things we are storing accidentally.
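If storing the whole sample is the culprit, one generic way to bound memory is reservoir sampling, which keeps a fixed-size sample no matter how much data is scanned. This is a plain sketch of that technique, not the project's sampler.

// Generic reservoir-sampling sketch (Algorithm R): memory use is bounded by
// sampleSize regardless of input size. Not the project's actual sampler.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Reservoir {
    public static List<String> sample(Iterable<String> records, int sampleSize) {
        List<String> reservoir = new ArrayList<>(sampleSize);
        Random rand = new Random();
        long seen = 0;
        for (String record : records) {
            if (reservoir.size() < sampleSize) {
                reservoir.add(record);                 // fill the reservoir first
            } else {
                long j = (long) (rand.nextDouble() * (seen + 1)); // uniform in [0, seen]
                if (j < sampleSize) {
                    reservoir.set((int) j, record);    // replace with decreasing probability
                }
            }
            seen++;
        }
        return reservoir;
    }
}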

Change string limit

We set it to 20 or so right now. Make sure it is at least 25; as a bonus, make it a configurable variable.

A bug in RNode

Line 158:

if (TypeUtils.compareTo(p.value, value, type) < 0)

should be

if (TypeUtils.compareTo(p.value, value, type) <= 0)

right?

For example, if the value on the RNode is 10 and the predicate value is also 10, we still do not need to go right.

A bug in Query constructor

public Query(String predString) {
    String[] parts = predString.split(";");
    this.predicates = new Predicate[parts.length - 1];
    for (int i = 0; i < parts.length - 1; i++) {
        this.predicates[i] = new Predicate(parts[i]);
    }
}

should be

public Query(String predString) {
    String[] parts = predString.split(";");
    this.predicates = new Predicate[parts.length];
    for (int i = 0; i < parts.length; i++) {
        this.predicates[i] = new Predicate(parts[i]);
    }
}

Make everything thread safe

We have CartilageIndexKeyMT and TypeUtilsMT. Make sure they are correct, make them default and remove the old ones.

CSV parsing incorrect

We do not parse A,"B1,B2",C correctly: that is 3 fields, not 4.
We currently just do split(",").

A temporary hack is to use | as the separator.
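A minimal quote-aware split that treats A,"B1,B2",C as three fields is sketched below; it ignores escaped quotes, so a proper CSV library would still be the safer fix.

// Sketch of quote-aware CSV splitting: commas inside double quotes do not
// end a field, so A,"B1,B2",C yields three fields. Escaped quotes ("" inside
// a quoted field) are not handled here.
import java.util.ArrayList;
import java.util.List;

public class CsvSplit {
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle quoted state, drop the quote
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString());  // field boundary outside quotes
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());          // last field
        return fields;
    }
}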

Severe bug in InputReader.java

Lines 187 and 199.

If we put the trailing \n into the RawIndexKey and the last attribute of a record is a string, we cannot route the record to the correct bucket based on the RobustTree.

The result is that we have non-empty buckets in the index but empty data blocks.
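One possible remedy, not spelled out in the issue, is to strip the line terminator before the key is built, so a string last attribute compares equal to the values the tree was built from. An illustrative helper (the name is hypothetical, not InputReader's actual code):

// Illustrative only: drop a trailing \n (and \r\n) before building the key,
// so the last string attribute does not carry the line terminator.
static String trimLineTerminator(String record) {
    int end = record.length();
    if (end > 0 && record.charAt(end - 1) == '\n') end--;
    if (end > 0 && record.charAt(end - 1) == '\r') end--;
    return record.substring(0, end);
}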

Keep floats compact in the sample.

Not urgent.
Floats get expanded and are represented with full precision, so 1.02 becomes 1.020000001. As a result, a 227 MB input file grows to 277 MB after parsing and re-serializing.
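For context, in Java Float.toString already yields the shortest decimal string that round-trips to the same float, so keeping float values formatted that way avoids the blow-up; the expansion typically comes from widening to double or from fixed-precision formatting. A small illustration (not the project's serializer):

// Illustration: Float.toString produces the shortest decimal string that
// round-trips to the same float, so "1.02" stays "1.02"; widening to double
// or formatting with a fixed wide precision is what expands the text.
public class FloatFormat {
    public static void main(String[] args) {
        float f = Float.parseFloat("1.02");
        System.out.println(Float.toString(f));            // prints 1.02
        System.out.println(Double.toString((double) f));   // prints the full double expansion
        System.out.println(String.format("%.9f", f));      // fixed precision, also expanded
    }
}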

Some blocks could be sampled more than once!

We can reproduce the bug by adding System.out.println("~ pos: " + position); at line 143 in https://github.com/mitdbg/mdindex/blob/3ba081a7b345d7bd35566e7cef6f13d92a6b6a55/src/main/java/core/index/build/InputReader.java

You can see some duplicated positions. The reason is that if the test at line 154 fails, then this block will be sampled more than once!

Conf: /Users/ylu/Documents/workspace/mdindex/conf/cartilage.properties
~ pos: 30720
~ pos: 35840
~ pos: 35840
~ pos: 92160
~ pos: 194560
~ pos: 209920
~ pos: 368640
~ pos: 394240
~ pos: 419840
~ pos: 542720
~ pos: 593920
~ pos: 599040
~ pos: 599040
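Whatever the fix to the retry logic, a cheap guard (not in the current code) is to remember which positions have already been sampled and skip repeats. A sketch:

// Sketch of a duplicate-sampling guard: remember which block offsets have
// already been sampled and skip any offset seen before.
import java.util.HashSet;
import java.util.Set;

public class SampledBlocks {
    private final Set<Long> sampledOffsets = new HashSet<>();

    // Returns true the first time an offset is seen, false on repeats.
    public boolean markSampled(long position) {
        return sampledOffsets.add(position);
    }
}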
