deuce's Introduction

Deuce

Block-Level De-duplication as a Service.

Note: Deuce is a work in progress and is not currently recommended for production use.

What is Deuce?

In today's web-enabled world we generate a lot of data. This is especially true of cloud-based IaaS offerings, where servers can easily live and die within a matter of minutes. Much of this data is redundant, especially subsets of files. Deuce aims to let a user easily store data from disparate sources while de-duplicating it and storing the resulting blocks and metadata in trusted, durable backend stores (e.g. OpenStack Swift, Cassandra, etc.). Once a file has been successfully stored in Deuce, it can be retrieved via a simple GET operation.
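As a rough illustration of block-level de-duplication, the sketch below splits data into fixed-size blocks, identifies each block by a hash of its content, and stores each distinct block only once. The block size, the use of SHA-1, the in-memory dict standing in for a block store, and all function names are assumptions made for this example; they are not taken from Deuce's code or API.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, hypothetical block size so the example is easy to follow


def split_blocks(data, size=BLOCK_SIZE):
    """Yield fixed-size blocks of data; the last block may be shorter."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def store_file(data, block_store):
    """Store only unseen blocks and return the ordered list of block IDs."""
    block_ids = []
    for block in split_blocks(data):
        block_id = hashlib.sha1(block).hexdigest()  # content-derived ID (assumed)
        block_store.setdefault(block_id, block)     # a duplicate block is stored once
        block_ids.append(block_id)
    return block_ids


def retrieve_file(block_ids, block_store):
    """Reassemble a file from its ordered block IDs."""
    return b"".join(block_store[block_id] for block_id in block_ids)


store = {}  # stands in for a durable block backend
file_a = b"AAAABBBBCCCC"
file_b = b"AAAABBBBDDDD"
ids_a = store_file(file_a, store)
ids_b = store_file(file_b, store)
```

The two files share their first two blocks, so the store ends up holding four blocks rather than six, and either file can still be rebuilt from its list of block IDs.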

The focus of this project is intentionally very narrow, following the *NIX notion of small tools that can be pieced together with other tools to do amazing things.

API

Fine details of the API are still being worked out and are subject to change. API docs are available here. An API testing collection for the Postman Google Chrome app is in docs/deuce_postman.json.

Features

  • Python 3.3 and 3.4 are currently supported
  • Client-side de-duplication
  • Server-side reconstruction and retrieval of de-duplicated data
  • Pluggable driver support for metadata and block backend stores
  • Cassandra, MongoDB and sqlite drivers for metadata storage
  • Disk storage and OpenStack Swift supported for block storage
  • Designed from the ground up to work well in OpenStack environments

What is Deuce not?

  • A backup program. It is ideal for use by a backup system for implementing data, but it is not itself a backup program.
  • Block storage. Despite sharing some terminology, Deuce does not aim to compete with block storage solutions such as OpenStack Cinder.
  • Object storage (such as OpenStack Swift)

Installation

Trying out Deuce is simple. The default configuration is set up to use the sqlite and disk storage drivers, so you can be up and running quickly for development and evaluation purposes.

Clone this repo:

  git clone https://github.com/rackerlabs/deuce.git

Install the code:

  cd deuce
  virtualenv env
  . env/bin/activate
  python setup.py develop

Copy over config files:

  mkdir ~/.deuce
  cp ini/config.ini ~/.deuce/config.ini

Start it up:

  deuce-server

deuce's People

Contributors

benjamenmeyer, jc7998, pombredanne, powellchristoph, raxuanyu, sriram-mv

deuce's Issues

Web Hooks

Add webhooks for capturing certain information into other services.

Primary use-case: Validation Service, register/unregister Vaults on creation/deletion

Add ability to list non-Finalized Files

We need to be able to list non-finalized files. Files should also have a ref-modified field on them to support this for the Validation & Cleanup Service variant

Use Cases:

  • Client-side cleanup of non-finalized files
  • Validation & Cleanup Service: cleanup of non-finalized files

PostgreSQL in NoSQL mode

PostgreSQL is releasing functionality that enables master ↔ master synchronization, so that it can behave much like Cassandra and MongoDB while still providing the resilience of a standard SQL database. This configuration of PostgreSQL should be examined to see whether it can be used successfully in a cross-DC environment, and specifically whether it would reduce the cost of running Deuce compared to using Cassandra.

Optimization: Memoization of Database Calls

Let's make sure that any repeated calls to the databases are memoized, so that the same queries don't get executed twice.

It's also important that we un-memoize on change of state.

This should give us a nice speed boost.
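A minimal sketch of the idea, assuming a metadata driver that exposes a read query and a write operation. The class names, method names, and the fake driver are hypothetical and exist only for illustration; they are not Deuce's driver interface.

```python
class FakeDriver:
    """Hypothetical stand-in for a metadata driver; counts read queries."""

    def __init__(self):
        self.blocks = {}
        self.queries = 0

    def get_block_count(self, vault_id):
        self.queries += 1
        return len(self.blocks.get(vault_id, set()))

    def add_block(self, vault_id, block_id):
        self.blocks.setdefault(vault_id, set()).add(block_id)


class MemoizedMetadata:
    """Caches read results and drops them when the underlying state changes."""

    def __init__(self, driver):
        self._driver = driver
        self._cache = {}

    def get_block_count(self, vault_id):
        key = ("block_count", vault_id)
        if key not in self._cache:
            # Only the first call for this key reaches the database
            self._cache[key] = self._driver.get_block_count(vault_id)
        return self._cache[key]

    def add_block(self, vault_id, block_id):
        self._driver.add_block(vault_id, block_id)
        # State changed: un-memoize the affected entry
        self._cache.pop(("block_count", vault_id), None)


driver = FakeDriver()
meta = MemoizedMetadata(driver)
first = meta.get_block_count("v1")
second = meta.get_block_count("v1")      # served from the cache
queries_before_write = driver.queries
meta.add_block("v1", "b1")               # invalidates the cached count
third = meta.get_block_count("v1")       # queries the driver again
```

Keying the cache by query and vault keeps invalidation narrow: a write to one vault does not discard cached results for the others.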

Async Deletion

Add the ability to asynchronously delete items

  • Add DELETE verb support to the Blocks/Files/StorageBlocks/Vaults end points that takes a JSON list of the items (blockid/fileid/storage block id/vault name respectively) to be deleted.
  • All deletions must meet the same requirements as the current non-asynchronous versions

The goal is to speed up certain operations, such as multi-block and multi-file deletions and clean-up operations. The primary consumer in mind is the clean-up service (Deuce-Valere), which could use the multi-block/storage block deletions to speed up its cleanup operations.

Cassandra Driver can make Deuce lossy

I came across an issue where doing the following would result in data getting cleaned up improperly:

1) Create a vault
2) Add a file (create, assign, upload, finalize)
3) Run the Valere validation & cleanup (command-line) to get a reference point (should have 1 file, 1 block, both be 'current')
4) Delete the file and block
5) Delete the Vault
6) Repeat Steps 1-3; the current data will be incorrect

My test script (available here: https://github.com/BenjamenMeyer/deuce-tools/blob/master/deucevalere/build_env_exhibit_issue.sh) uploads two files, and creates four orphaned blocks as well. It also uploads a third file and deletes only the file. Output should match the following:

Initial Tally

counter   count  gigabytes               megabytes              kilobytes     bytes
Current   3      1.0734423995018005e-05  0.010992050170898438   11.255859375  11526
Missing   1      -                       -                      -             -
Expired   1      2.0274892449378967e-06  0.0020761489868164062  2.1259765625  2177
Orphaned  4      2.146884799003601e-05   0.021984100341796875   22.51171875   23052

Post Clean-up

counter             count  gigabytes               megabytes              kilobytes     bytes
Current             3      1.0734423995018005e-05  0.010992050170898438   11.255859375  11526
Missing             1      -                       -                      -             -
Expired             1      2.0274892449378967e-06  0.0020761489868164062  2.1259765625  2177
Deleted (Expired)   1      2.0274892449378967e-06  0.0020761489868164062  2.1259765625  2177
Orphaned            4      2.146884799003601e-05   0.021984100341796875   22.51171875   23052
Deleted (Orphaned)  4      2.146884799003601e-05   0.021984100341796875   22.51171875   23052
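The size columns in the tallies above are consistent with dividing the byte count by 1024 at each step (binary units). The helper below is a hypothetical illustration of that arithmetic, not the code the validation tool uses.

```python
def size_columns(num_bytes):
    """Return (gigabytes, megabytes, kilobytes, bytes) using 1024-based units."""
    kilobytes = num_bytes / 1024
    megabytes = kilobytes / 1024
    gigabytes = megabytes / 1024
    return gigabytes, megabytes, kilobytes, num_bytes


current = size_columns(11526)   # byte count from the "Current" row
expired = size_columns(2177)    # byte count from the "Expired" row
```

For example, 11526 bytes divided by 1024 gives the 11.255859375 kilobytes shown in the Current row.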

When I first found the issue, the count of current blocks in step #3 would be zero, despite there being 2 existing files that were supposed to reference them.

We thought that using consistency settings other than Consistency Level One would resolve this; however, that does not seem to be the case.

Areas to explore:

  • See if it's an issue in Cassandra itself:
    • Build a series of queries that match up with what the script does
    • Turn on Trace mode in the CQL Shell
    • Run the queries
  • See if #272 will solve the problem (it should as it makes each vault uniquely named)
  • Figure out if there is something in Deuce that is causing the problem

Error description message has multiple escape characters \\

This is the description of the error returned when trying to finalize a file without sending the file size header, X-File-Length:

""('Missing header', 'The \"x-file-length\" header is required.')""

Looks like we escape around x-file-length more than necessary.
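One plausible way to end up with a message shaped like this, offered as a guess rather than a confirmed diagnosis, is converting an exception's argument tuple to text with str() and then JSON-encoding that text. The tuple contents below are copied from the message above; the code path itself is an assumption and has not been checked against Deuce's error handling.

```python
import json

# Hypothetical error arguments, matching the text seen in the response
error_args = ('Missing header', 'The "x-file-length" header is required.')

# str() on a tuple yields its repr, including parentheses and quotes
as_text = str(error_args)

# JSON-encoding that text escapes every embedded double quote
encoded = json.dumps(as_text)
print(encoded)
# prints "('Missing header', 'The \"x-file-length\" header is required.')"
```

If this is the cause, passing only the description string to the serializer, rather than the stringified tuple, would avoid both the tuple syntax and the extra escaping. Each additional encoding pass would add a further layer of quotes and backslashes.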

Vault Name Separation

Add a new table/column family to Cassandra to separate the Vault names from internal references. Internally, Vaults should be referenced by a UUID value generated when the Vault is created.

Suggestion:

Vault Name (user specified)
Vault ID (UUID)
Vault Storage Name (some combined form of the other two)

I suggest the storage name so that we can track it easily to the storage layer without necessarily having to query Cassandra.
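A minimal sketch of the suggested record, assuming the storage name is simply the user-specified name joined to the UUID. The function name, the dict layout, and the combining rule are all hypothetical; the issue leaves the exact combined form open.

```python
import uuid


def new_vault_record(vault_name):
    """Build the three suggested fields for a newly created vault."""
    vault_id = uuid.uuid4()  # generated once, when the vault is created
    return {
        "vault_name": vault_name,                                  # user specified
        "vault_id": str(vault_id),                                 # internal reference
        "storage_name": "{0}_{1}".format(vault_name, vault_id.hex),  # assumed form
    }


first = new_vault_record("backups")
second = new_vault_record("backups")  # same name, e.g. after delete and re-create
```

Because the internal ID differs each time, a vault that is deleted and re-created under the same name would not share internal references with its predecessor.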

X-Expires-At header for blocks

This would be based on the x-ref-modified header plus a TTL. The TTL can be made configurable in config.ini, and validated and converted to an integer via configspec.ini.

Example:

x-ref-modified: 4
TTL per config.ini: 2
x-expires-at: 6

Remember, this header is only to be returned if the reference count is zero.

The cleanup and validation service should pick the block up after that time and delete it.
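The rule described above can be sketched as follows. The function name and the choice to return a dict of headers are assumptions for illustration; only the arithmetic and the zero-reference condition come from the description.

```python
def expires_at_header(ref_modified, ref_count, ttl):
    """Return the x-expires-at header, or no header while the block is referenced."""
    if ref_count != 0:
        return {}  # still referenced: the header must not be returned
    return {"x-expires-at": str(ref_modified + ttl)}


unreferenced = expires_at_header(ref_modified=4, ref_count=0, ttl=2)
referenced = expires_at_header(ref_modified=4, ref_count=1, ttl=2)
```

With x-ref-modified of 4 and a TTL of 2, the unreferenced block gets x-expires-at of 6, matching the example, while the referenced block gets no header.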

How does Deuce work with Encrypted Files?

Explore how Deuce works with encrypted files using AES CBC mode. We should determine if Deuce can successfully de-duplicate databases encrypted using the scheme in the agent.

List bad blocks per file

Create an endpoint that lists all the bad blocks per file.

GET v1/vaults/{vaultid}/files/{fileid}/blocks?bad_only=True

Pagination should be handled for this, and it would be preferable to store this information in a separate column family (since it would be easier to paginate and to cache).
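One way the pagination could work is marker-based paging over the bad blocks, ordered by their offset in the file. The sketch below assumes marker and limit parameters and an in-memory list of (offset, block ID) pairs; these names and the data layout are hypothetical and are not part of the proposed endpoint.

```python
def list_bad_blocks(bad_blocks, marker=0, limit=2):
    """Return one page of (offset, block_id) pairs and the next marker, or None."""
    ordered = sorted(bad_blocks)
    # Fetch one entry beyond the limit to learn whether another page exists
    window = [entry for entry in ordered if entry[0] >= marker][:limit + 1]
    next_marker = window[limit][0] if len(window) > limit else None
    return window[:limit], next_marker


bad = [(20, "block-c"), (0, "block-a"), (10, "block-b")]
page_one, marker_one = list_bad_blocks(bad)
page_two, marker_two = list_bad_blocks(bad, marker=marker_one)
```

Keeping bad blocks in their own column family, as suggested, would let a query like this read only the relevant rows instead of filtering every block in the file.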

Intermittent Failure: response does not contain headers (transaction-id)

This is an intermittent issue that only happens when the API Tests are run using Python 3 (3.4).
The test:
test_upload_storage_block (tests.api.tests.test_block_storage.TestBlockUploaded)

fails when trying to verify the headers returned.
The test tries to perform an illegal operation on the Block Storage Endpoint (try to upload/PUT a block).

The response obtained is:

RESPONSE RECEIVED

response status..: <Response [405]>
response time....: 0.14439654350280762
response headers.: {'Content-Length': '0', 'Date': 'Fri, 23 Jan 2015 18:32:31 GMT', 'Via': '1.0 570080-ATL1WWSG01.secops.rackspace.com', 'Connection': 'close', 'Age': '1'}
response body....: b''

Keep in mind that this is an intermittent issue that only happens when the API Tests are run with Python 3.

SQLite Driver does not support performance scenarios

The SQLite Metadata Driver is designed as a reference driver; however, a deployment that uses it with multi-processing will run into issues: clients will receive spurious errors, namely due to lock contention when accessing the in-memory SQLite database.

One way to solve this is to add another SQL-based metadata driver, e.g. PostgreSQL.
The intent was deployment on a NoSQL backend.

Deuce Load Testing

  1. Define the topology of the layout.
  2. Create a cluster of metadata servers (Mongo, Cassandra).
  3. Create a cluster of Deuce web heads and Load Balancers.
  4. Run stress tests and find any potential coding faults.

Structural Cleanups

The models and controllers have mixed logic. Logic should be in the models, supports, and drivers as appropriate, not in the controllers.

Error Conditions

Check the drivers to ensure that all errors are appropriately generated.

For instance, there are some cases where the Metadata drivers do not raise exceptions when they should.
