fluo's Introduction

Fluo

Apache Fluo is a distributed processing system that lets users make incremental updates to large data sets. With Apache Fluo, users can set up workflows that execute cross-node transactions when data changes. These workflows enable users to continuously join new data into large existing data sets without reprocessing all data. Apache Fluo is built on Apache Accumulo. Check out the Fluo project website for news and general information.

Getting Started

  • Take the Fluo Tour if you are completely new to Fluo.
  • Read the Fluo documentation to learn how to install Fluo and start a Fluo application on a cluster where Accumulo, Hadoop & Zookeeper are running. If you need help setting up these dependencies, see the related projects page for external projects that may help.

fluo's Issues

Create example application and library using CommonCrawl data

Currently, there is a phrasecount example for Accismus. While phrasecount is helpful in learning Accismus, it would be great to have an example app based on more complicated, real world data like a corpus of crawled web data. CommonCrawl provides this data free via AWS.

Create a library for loading CommonCrawl data into Accismus that could be used by developers learning Accismus. Next, create an example application that uses this library to analyze the CommonCrawl data with Accismus.

This library and example app could also be used to stress Accismus using real-world data.

Set up continuous integration server

Set up a CI server to build Accismus and run integration tests. It would be great if the CI could run the tests for all forks of Accismus so that @keith-turner can check the CI before pulling in commits from that fork.

Decouple Zookeeper configuration from Accumulo configuration

In order to effectively test #42, it would be nice if we could start a Curator TestingServer and stop/restart it for verifying that the OracleServer re-queues itself.

There's no easy way to do this because the Operations.initialize() method is heavily tied to the Accumulo connector. It would be nice to pull out the Zookeeper provisioning methods so that we can add the necessary configuration to Zookeeper without the need for having a running Accumulo instance.

Register io.fluo with Sonatype OSS

This is a way to get your artifacts into Maven Central when you are not backed by an organization like Apache or Codehaus. AFAIK, it's pretty much the only way. A ticket just needs to be filed in their JIRA. It should be posted here so we can track it.

Create verification Map Reduce for Benchmark

The benchmark can be run for an extended period of time to measure Percolator performance. It would be nice to be able to verify that the three indexes it creates are consistent. This could be done with a map reduce job.

Allow observer notification to be configured in AbstractObserver class

Currently, the notification column of each observer is configured in initialization.properties. In production, it will be more maintainable if developers can configure this in their implementation of AbstractObserver. One benefit of this is that developers do not need to track or modify initialization.properties. Also, any changes to the notification column will be tracked with the code of the observer.

Workers can check for conflicts when they start up to ensure that no two observers are notified on the same column.
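
The startup check described above could look roughly like the following. This is only a sketch: the class name, the method name, and the idea of passing a map from observer name to notification column are assumptions made for illustration and are not part of the Accismus API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical startup check; none of these names come from Accismus. */
final class ObserverColumnCheck {

  /**
   * Takes a map of observer name -> notification column and fails fast
   * if two observers claim the same column.
   */
  static void checkNoConflicts(Map<String, String> observerToColumn) {
    Map<String, String> columnToObserver = new HashMap<>();
    for (Map.Entry<String, String> e : observerToColumn.entrySet()) {
      // putIfAbsent returns the observer already registered on this column, if any
      String previous = columnToObserver.putIfAbsent(e.getValue(), e.getKey());
      if (previous != null) {
        throw new IllegalStateException("Observers " + previous + " and " + e.getKey()
            + " are both notified on column " + e.getValue());
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> observers = new LinkedHashMap<>();
    observers.put("DocumentObserver", "index:check"); // made-up observers and columns
    observers.put("PhraseObserver", "stat:check");
    checkNoConflicts(observers); // distinct columns, so no exception is thrown
    System.out.println("no conflicts");
  }
}
```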

Named snapshots

Accismus could possibly have the ability to create named snapshots, list them, read them, and delete them. For example, a named snapshot could be used as input to a map reduce job.

A really nice feature would be the ability to efficiently compute the difference between two named snapshots. ZFS supports a capability like this using merkle trees. Having this capability could make exporting data much simpler. A user could do something like the following:

  1. Create named snapshot snap1
  2. wait
  3. Create named snapshot snap2
  4. Use map reduce to write diffs between snap1 and snap2
  5. bulk import deltas into Accumulo table for query
  6. Rename snap2 to snap1
  7. goto 2

Named snapshots would prevent committed data that immediately precedes the snapshot from being garbage collected. At this point I am not sure how to efficiently compute the diffs between snapshots (w/o doing a full table scan).
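
To make the diff step concrete, here is a naive sketch that compares two snapshots held as in-memory sorted maps. It is exactly the full-scan approach the issue hopes to avoid, so it only illustrates what the deltas would contain. The class name and the key/value representation are assumptions for illustration.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical helper; models each named snapshot as a sorted key -> value map. */
final class SnapshotDiff {

  /**
   * Returns the changes needed to turn snap1 into snap2.
   * A null value in the result marks a key that was deleted.
   * Assumes the snapshots themselves contain no null values.
   */
  static Map<String, String> diff(Map<String, String> snap1, Map<String, String> snap2) {
    Map<String, String> deltas = new TreeMap<>();
    // Keys that are new or whose value changed
    for (Map.Entry<String, String> e : snap2.entrySet()) {
      if (!Objects.equals(snap1.get(e.getKey()), e.getValue())) {
        deltas.put(e.getKey(), e.getValue());
      }
    }
    // Keys that disappeared between the two snapshots
    for (String key : snap1.keySet()) {
      if (!snap2.containsKey(key)) {
        deltas.put(key, null);
      }
    }
    return deltas;
  }

  public static void main(String[] args) {
    Map<String, String> snap1 = new TreeMap<>();
    snap1.put("a", "1");
    snap1.put("b", "2");
    Map<String, String> snap2 = new TreeMap<>();
    snap2.put("a", "1");
    snap2.put("b", "5");
    System.out.println(diff(snap1, snap2));
  }
}
```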

Leader election for the Oracle Server

This should allow multiple Oracles to be running but only one to be handing out timestamps. In the event that a leader fails, one of the followers should become the leader.
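
The intended behavior can be modeled in memory as follows. This is a sketch of the common "lowest sequence number wins" election pattern, not an implementation against ZooKeeper or Curator. The class and method names are hypothetical, and a real Oracle would rely on ephemeral sequential nodes (or a library recipe) rather than a local map.

```java
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical in-memory model of Oracle leader election. */
final class OracleElection {
  // sequence number -> oracle id; the lowest live sequence number is the leader
  private final TreeMap<Long, String> candidates = new TreeMap<>();
  private long nextSeq = 0;

  /** Registers an oracle and returns the sequence number it was assigned. */
  synchronized long join(String oracleId) {
    long seq = nextSeq++;
    candidates.put(seq, oracleId);
    return seq;
  }

  /** Models an oracle process dying, like an ephemeral node disappearing. */
  synchronized void fail(long seq) {
    candidates.remove(seq);
  }

  /** Only the returned oracle should hand out timestamps. */
  synchronized Optional<String> leader() {
    return candidates.isEmpty()
        ? Optional.empty()
        : Optional.of(candidates.firstEntry().getValue());
  }

  public static void main(String[] args) {
    OracleElection election = new OracleElection();
    long first = election.join("oracle-1");
    election.join("oracle-2");
    System.out.println(election.leader()); // oracle-1 joined first
    election.fail(first);
    System.out.println(election.leader()); // a follower takes over
  }
}
```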

Collect Fluo metrics and expose them to monitoring tools

Make it easier for users to monitor and gain insight into a running Fluo cluster. Below are some possible questions to be answered by the tool:

How many notifications have been processed by each observer? In past hour, day, week, month?
How many notifications have been processed by each worker?
Are notifications waiting to be processed?
Are all workers running?
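
As a starting point for the first question, a worker could keep a thread-safe counter per observer. The sketch below uses only the JDK. The class and method names are hypothetical, and it tracks totals only, so per-hour or per-day figures would need time-bucketed counters or an external metrics library.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-observer notification counters. */
final class NotificationMetrics {
  private final ConcurrentHashMap<String, LongAdder> processed = new ConcurrentHashMap<>();

  /** Called by a worker each time an observer finishes processing a notification. */
  void recordProcessed(String observer) {
    processed.computeIfAbsent(observer, k -> new LongAdder()).increment();
  }

  /** Total notifications processed by the given observer so far. */
  long processedBy(String observer) {
    LongAdder counter = processed.get(observer);
    return counter == null ? 0L : counter.sum();
  }

  public static void main(String[] args) {
    NotificationMetrics metrics = new NotificationMetrics();
    metrics.recordProcessed("DocumentObserver"); // made-up observer name
    metrics.recordProcessed("DocumentObserver");
    System.out.println(metrics.processedBy("DocumentObserver"));
  }
}
```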

Create Fluo developer tool

While the monitoring tool (see #20) is geared for admins to track the health of a Fluo cluster, a tool is needed for developers to help them debug observers. While they have separate use cases, the developer & monitoring tool could be merged into one.

Below are some possible features for Fluo developer tool:

  1. Have worker store last 50 log messages (any level) and last 50 error messages for observers and have tool retrieve them.
  2. Retrieve statistics for individual observers.
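
Feature 1 amounts to a bounded buffer that discards the oldest entry once it is full. A minimal sketch follows, with a hypothetical class name and a capacity passed in by the caller (50 in the proposal above). How the tool would retrieve the buffer from a worker is left open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bounded buffer holding the most recent log messages. */
final class RecentMessages {
  private final int capacity;
  private final ArrayDeque<String> messages = new ArrayDeque<>();

  RecentMessages(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Adds a message, dropping the oldest one when the buffer is full. */
  synchronized void add(String message) {
    if (messages.size() == capacity) {
      messages.removeFirst();
    }
    messages.addLast(message);
  }

  /** Returns a copy of the retained messages, oldest first. */
  synchronized List<String> snapshot() {
    return new ArrayList<>(messages);
  }

  public static void main(String[] args) {
    RecentMessages recent = new RecentMessages(50);
    for (int i = 0; i < 60; i++) {
      recent.add("message " + i);
    }
    System.out.println(recent.snapshot().size());
  }
}
```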

Rename Accismus

The thing I like most about Accismus is that it's unique. One drawback is that it's hard to pronounce. Personally, I don't think pronunciation is a big problem, because it's a one-time cost.

The biggest problem is one that @mikewalch recently pointed out: it's so close to Accumulo. Sentences like "Accismus uses Accumulo" are hard on the human mind.

@ericnewton has suggested the name Avalanche. I like this suggestion. I like the implication of scale. I don't like that it's a very common word or that it has a connotation of disorder. Percolator continually imposes order on data at scale.

If you have suggestions for a name, leave a comment here.

Use logback in accismus-cluster

A dependency conflict in Accismus has been that Twill requires logback but Zookeeper requires log4j in MiniAccumuloCluster. Due to this conflict, if log4j is configured with Twill, debug-level log messages do not show. With the accismus-cluster module separated from accismus-core, this conflict can be resolved by making cluster depend on logback (for Twill) and core depend on log4j (for Zookeeper).

Aggressively test Oracle

The proper functioning of the Oracle is critical. We need an Oracle test suite that pauses and kills Oracle processes while allocating and verifying timestamps. Some of the situations the test suite should induce are:

  • Client connected to an oracle process that's no longer the leader
  • Client connected to an oracle process that's paused
  • Client connected to an oracle process that's dead

Create Accismus output formats

Since Percolator is designed for incremental processing, it will need to be initialized via map reduce. A file output format that creates rfiles containing data in the Accismus format that can be bulk imported would be one way to enable this. This would wrap the AccumuloFileOutputFormat.

Additionally, AccumuloOutputFormat could be wrapped to allow users to write directly to an Accumulo table using the Accismus data format.

Need to document how these differ from the map reduce output that runs load transactions.

Run Accismus workers in YARN

Using Twill, enable Accismus workers to run in YARN. Create scripts to simplify the process of submitting applications to YARN.

Initialize.sh script cannot be run twice

The exception below is being thrown if the initialize script is run for the second time (even if accismus.init.zookeeper.clear = true):

Exception in thread "main" java.lang.RuntimeException: org.apache.accumulo.core.client.TableExistsException: Table accismus exists
at accismus.api.Admin.initialize(Admin.java:91)
at accismus.tools.InitializeTool.main(InitializeTool.java:81)

NPE in TransactionImpl.checkForAckCollision()

Occurred during the following build while running the TrieIT:

https://travis-ci.org/mikewalch/Accismus/builds/28625179

java.lang.NullPointerException
at accismus.impl.TransactionImpl.checkForAckCollision(TransactionImpl.java:504)
at accismus.impl.TransactionImpl.preCommit(TransactionImpl.java:367)
at accismus.impl.TransactionImpl.preCommit(TransactionImpl.java:312)
at accismus.impl.TransactionImpl.commit(TransactionImpl.java:636)
at accismus.impl.Worker.processUpdates(Worker.java:175)
at accismus.impl.WorkerTask.run(WorkerTask.java:66)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Make recovery more efficient

In the case when a transaction fails, recovery is very inefficient. There are at least two problems with the SnapshotScanner.

  • For every locked column encountered, the primary column will be read to determine the status. This information should be cached in an LRU cache, instead of continually re-reading it.
  • For each lock encountered that is associated with a failed transaction it will roll it forward or back. The snapshot scanner should read ahead and attempt to find a batch of locked columns. Once a batch is found, all of them can be resolved at once instead of one at a time.
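
For the first point, the JDK's LinkedHashMap can serve as a small LRU cache when constructed in access order. The sketch below is one possible shape, assuming the key identifies a primary column and the value is its transaction status. The class name is hypothetical, and the map is not thread-safe, so it would need external synchronization if shared between scanner threads.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of primary column -> transaction status. */
final class TxStatusCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int maxEntries;

  TxStatusCache(int maxEntries) {
    // accessOrder = true keeps the least recently used entry first
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Called after each insert; evict once the cache grows past its limit
    return size() > maxEntries;
  }

  public static void main(String[] args) {
    TxStatusCache<String, String> cache = new TxStatusCache<>(2);
    cache.put("row1:primary", "COMMITTED"); // made-up keys and statuses
    cache.put("row2:primary", "LOCKED");
    cache.get("row1:primary");              // touch row1 so row2 becomes eldest
    cache.put("row3:primary", "ROLLED_BACK");
    System.out.println(cache.keySet());
  }
}
```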

The ParallelSnapShot scanner should also be improved.

MiniAccismus.waitForObservers() returns before processing is finished

In MiniAccismus, the waitForObservers() method is occasionally returning before all observers are done processing. This is because the method returns if there are no notifications available in the table. However, observers could still be running and creating new notifications.

The method should instead check that no observers are running and no new notifications are in the table before returning.
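
The proposed condition could be expressed as a polling loop like the one below. This is a sketch only: the two suppliers stand in for whatever MiniAccismus would use to detect running observers and pending notifications, and all names are hypothetical. The two checks are not atomic here, so a real fix would also need to make them consistent with each other.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical wait loop for the proposed waitForObservers() behavior. */
final class ObserverWait {

  /** Blocks until no observers are running and no notifications are pending. */
  static void waitUntilIdle(BooleanSupplier observersRunning,
                            BooleanSupplier notificationsPending,
                            long pollMillis) throws InterruptedException {
    while (observersRunning.getAsBoolean() || notificationsPending.getAsBoolean()) {
      Thread.sleep(pollMillis);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger remaining = new AtomicInteger(3); // pretend three polls see work in flight
    waitUntilIdle(() -> remaining.getAndDecrement() > 0, () -> false, 10L);
    System.out.println("idle");
  }
}
```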

Finalize API

  • Use ByteBuffers instead of ByteSequence
  • Minimize external dependencies in API
  • Put API in maven module

Add Observer Sets

@mikewalch would like to make it easier for multiple developers to run workers and observers. We were discussing how to do this. One possibility we came up with is having observer sets. Each observer set would have a list of observers. A developer could kick off a YARN application that runs workers and one or more observer sets. The YARN application would only execute the observers in its sets. The YARN application could reserve the observer set to keep other YARN applications from running those observers.

Support writing observers in Python

Using Jython to enable writing observers in Python would be very nice.

One possible way this could work is to have a JythonObserver that takes python code as configuration. The Transaction and Observer interfaces could be exposed in Python code.

Make garbage collection iterator determine oldest active snapshot

Currently the garbage collection iterator just keeps N versions of all data. It would be better if it determined the oldest active snapshot timestamp and used that for garbage collection. This could be done if workers, readers, and map reduce jobs periodically registered the oldest snapshot timestamp they were reading using ephemeral nodes in Zookeeper.
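
Once the registered timestamps have been read back, the computation itself is small. The sketch below assumes the timestamps are already available as a collection of longs and uses the usual multi-version reasoning that an old version is unneeded once the version that replaced it is visible to every active snapshot. The class and method names are hypothetical, and the exact boundary condition depends on how snapshot reads treat equal timestamps.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.OptionalLong;

/** Hypothetical helper for timestamp-based garbage collection. */
final class OldestSnapshot {

  /** Smallest timestamp registered by any worker, reader, or map reduce job. */
  static OptionalLong oldestActive(Collection<Long> registeredTimestamps) {
    return registeredTimestamps.stream().mapToLong(Long::longValue).min();
  }

  /**
   * A version can be dropped once the version that superseded it is visible
   * to the oldest active snapshot, since no reader can see the old one anymore.
   */
  static boolean isObsolete(long supersededAtTs, long oldestActiveTs) {
    return supersededAtTs <= oldestActiveTs;
  }

  public static void main(String[] args) {
    OptionalLong oldest = oldestActive(Arrays.asList(120L, 95L, 300L));
    System.out.println(oldest.getAsLong());
    System.out.println(isObsolete(90L, oldest.getAsLong()));
  }
}
```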

Register clients that perform transactions in Zookeeper

Register clients (aka transactors) that perform transactions in Zookeeper using an ephemeral node. During a transaction, a transactor references the ephemeral node in the primary column of the transaction.

If a transactor dies during a transaction, other clients can check for the existence of the referenced node before rolling back.

If the node is missing, roll back immediately.

If the node exists, the transactor process could be alive but broken. Therefore, eventually time out and roll back.
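
The decision described above can be written as a small pure function. In this sketch the existence of the transactor's node, the age of the lock, and the timeout are passed in as plain values. The enum, the class, and the parameter names are assumptions for illustration, and the Zookeeper lookup itself is not shown.

```java
/** Hypothetical outcomes when another client encounters a lock. */
enum LockAction { ROLL_BACK_NOW, WAIT, ROLL_BACK_AFTER_TIMEOUT }

/** Hypothetical policy for handling locks left by other transactors. */
final class StaleLockPolicy {

  static LockAction decide(boolean transactorNodeExists, long lockAgeMillis, long timeoutMillis) {
    if (!transactorNodeExists) {
      // The ephemeral node is gone, so the transactor is dead
      return LockAction.ROLL_BACK_NOW;
    }
    // The node exists, but the process may be alive and broken: give it a bounded grace period
    return lockAgeMillis >= timeoutMillis ? LockAction.ROLL_BACK_AFTER_TIMEOUT : LockAction.WAIT;
  }

  public static void main(String[] args) {
    System.out.println(decide(false, 10L, 60_000L));
    System.out.println(decide(true, 10L, 60_000L));
    System.out.println(decide(true, 120_000L, 60_000L));
  }
}
```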

Migrate logging api to slf4j

We can still use log4j as the implementation, but slf4j allows users to pick different logging implementations (like logback). This also fixes a logging issue when running workers in YARN/Twill.

Implement Accumulo's UtilWaitThread.sleep() in Accismus

Accismus depends on an Accumulo class called UtilWaitThread but this is not part of Accumulo's public API. The class could be moved/removed at any time. This class should be implemented in Accismus.

Also, the sleep() method logs an InterruptedException error when shutting down a worker. The worker should implement its own try/catch and check to see if the worker is shutting down before logging the error.
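
A replacement along these lines might look like the following sketch. The class name and the use of an AtomicBoolean as the shutdown flag are assumptions, and System.err stands in for whatever logger the worker uses.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical replacement for the sleep helper borrowed from Accumulo. */
final class WorkerSleep {

  /**
   * Sleeps for the given time. Returns true if the full sleep completed and
   * false if it was interrupted. An interrupt is only reported as an error
   * when the worker is not shutting down.
   */
  static boolean sleep(long millis, AtomicBoolean shuttingDown) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      if (!shuttingDown.get()) {
        System.err.println("Sleep interrupted unexpectedly: " + e);
      }
      return false;
    }
  }

  public static void main(String[] args) {
    AtomicBoolean shuttingDown = new AtomicBoolean(false);
    System.out.println(sleep(5L, shuttingDown));
  }
}
```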

Create cluster stress test suite

A cluster test suite is needed that can verify correctness. Suppose this test suite had the following operations:

  • Generate data to ingest
  • Initialize table via map reduce using generated data
  • Load generated data using load transactions
  • Transform loaded data
  • Verify data transformation

Those operations could then be combined to perform a test like the following:

  • Generate Data Set D1
  • Use map reduce to initialize Accismus table using D1 as input
  • Generate Data Set D2
  • Load D2 using load transactions
  • Wait for load and observers to complete
  • Verify table is correct

It would be great if the test suite also tested performance. Otherwise, another issue should be created for a performance benchmark.

Improve AccismusFormatter

Right now, the AccismusFormatter outputs columns in the shell using the following format:

row cf cq vis type ts val

Readability would be improved by changing to a format similar to Accumulo:

row cf:cq [vis] ts-type val
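
The proposed layout maps directly onto a format string. The sketch below is not the AccismusFormatter itself. The class name and the parameter types are assumptions, with every field passed as an already-decoded value.

```java
import java.util.Locale;

/** Hypothetical helper producing the proposed "row cf:cq [vis] ts-type val" layout. */
final class EntryFormat {

  static String format(String row, String cf, String cq, String vis,
                       long ts, String type, String val) {
    // Locale.ROOT keeps the timestamp digits independent of the default locale
    return String.format(Locale.ROOT, "%s %s:%s [%s] %d-%s %s", row, cf, cq, vis, ts, type, val);
  }

  public static void main(String[] args) {
    // Made-up entry for illustration
    System.out.println(format("r1", "doc", "uri", "", 7L, "WRITE", "abc"));
  }
}
```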

Transaction caching layer with read ahead

Currently, transactions have no caching or read-ahead. Implement a caching layer with read-ahead to satisfy the following:

If the user usually reads columns A and B, then when they read A, go ahead and fetch B too

If they read A twice, cache A for the second read
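
Both requirements can be sketched with a per-transaction map in front of a batched fetch function. Everything here is hypothetical: the class name, the string column names, the fetch function standing in for a read against the table, and the companion map that says which columns are usually read together (learning that map from past reads is not shown).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical per-transaction cache with read-ahead of companion columns. */
final class ReadAheadCache {
  private final Function<Set<String>, Map<String, String>> fetch; // batched read from the store
  private final Map<String, Set<String>> companions;              // column -> columns usually read with it
  private final Map<String, String> cache = new HashMap<>();

  ReadAheadCache(Function<Set<String>, Map<String, String>> fetch,
                 Map<String, Set<String>> companions) {
    this.fetch = fetch;
    this.companions = companions;
  }

  String get(String column) {
    if (!cache.containsKey(column)) {
      // Fetch the requested column together with its uncached companions in one batch
      Set<String> toFetch =
          new HashSet<>(companions.getOrDefault(column, Collections.<String>emptySet()));
      toFetch.add(column);
      toFetch.removeAll(cache.keySet());
      cache.putAll(fetch.apply(toFetch));
    }
    // Repeated reads, and reads of prefetched companions, are served from the cache
    return cache.get(column);
  }

  public static void main(String[] args) {
    Map<String, Set<String>> companions = new HashMap<>();
    companions.put("A", Collections.singleton("B"));
    ReadAheadCache cache = new ReadAheadCache(cols -> {
      Map<String, String> result = new HashMap<>();
      for (String c : cols) {
        result.put(c, "value-of-" + c); // stand-in for a real table read
      }
      return result;
    }, companions);
    System.out.println(cache.get("A"));
    System.out.println(cache.get("B"));
  }
}
```

Note that a column missing from the store is fetched again on every read in this sketch, so a fuller version would also cache absence.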

Rename AccismusProperties to ConnectionProperties

The intent of the AccismusProperties helper object was to encapsulate the info needed to connect to Accismus. However, the name of the class does not really convey this. I am thinking it should be renamed to ConnectionProperties.
