apache / fluo Goto Github PK

View Code? Open in Web Editor NEW

185.0 32.0 78.0 4.4 MB

Apache Fluo

Home Page: https://fluo.apache.org

License: Apache License 2.0

Thrift 0.07% Shell 2.51% Java 97.43%

big-data fluo accumulo hacktoberfest

fluo's Introduction

Apache Fluo is a distributed processing system that lets users make incremental updates to large data sets. With Apache Fluo, users can set up workflows that execute cross node transactions when data changes. These workflows enable users to continuously join new data into large existing data sets without reprocessing all data. Apache Fluo is built on Apache Accumulo. Check out the Fluo project website for news and general information.

Getting Started

Take the Fluo Tour if you are completely new to Fluo.
Read the Fluo documentation to learn how to install Fluo and start a Fluo application on a cluster where Accumulo, Hadoop & Zookeeper are running. If you need help setting up these dependencies, see the related projects page for external projects that may help.

fluo's People

Stargazers

Watchers

fluo's Issues

Heap Space Error when connecting http client to 9913

When connecting a web browser to 9913, a Heap Space Error is thrown

create script for running benchmark

Create example application and library using CommonCrawl data

Currently, there is a phrasecount example for Accismus. While phrasecount is helpful in learning Accismus, it would be great to have an example app based on more complicated, real world data like a corpus of crawled web data. CommonCrawl provides this data free via AWS.

Create a library for loading CommonCrawl data into Accismus that could be used by developers learning Accismus. Next, create an example application that analyzes the common crawl using Accismus using the library.

This library and example app could also be used to stress Accismus using real-world data.

Set up continuous integration server

Set up a CI server to build Accismus and run integration tests. It would be great if the CI could run the tests for all forks of Accismus so that @keith-turner can check the CI before pulling in commits from that fork.

Decouple Zookeeper configuration from Accumulo configuration

In order to effectively test #42, it would be nice if we could start a Curator TestingServer and stop/restart it for verifying that the OracleServer re-queues itself.

There's no easy way to do this because the Operations.initialize() method is heavily tied to the Accumulo connector. It would be nice to pull out the Zookeeper provisioning methods so that we can add the necessary configuration to Zookeeper without the need for having a running Accumulo instance.

Register io.fluo with Sonatype OSS

This is a way to get your artifacts into maven central when you are not backed by an organization like Apache or Codehaus. AFAIK, it's pretty much the only way. A jira ticket just needs to be filed in their JIRA. It should be posted here so we can track it.

Create verification Map Reduce for Benchmark

The bench mark can be run for an extended period of time to measure percolator performance. It would be nice to be able to verify that the three indexes it creates are consistent, this could be done with a map reduce job.

Allow worker to process notifications for multiple tables

Currently, workers only process notifications on a single Accumulo table.

However, workers can be started for different tables using YARN.

Allow observer notification to be configured in AbstractObserver class

Currently, the notification column of observers are configured in initialization.properties. In production, it will be more maintainable if developers can configure this in their implementation of AbstractObserver. One benefit of this is that developers do not need to track or modify initialization.properties. Also, any changes to the notification column will be tracked with the code of the observer.

Workers can check for conflicts when they start up to insure that two observers not notified on the same column.

Named snapshots

Accismus could possibly have the ability to create a named snapshots, list them, read them, and delete them. For example a named snapshot could be used as input to a map reduce job.

A really nice feature would be the ability to efficiently compute the difference between two named snapshots. ZFS supports a capability like this using merkle trees. Have this capability could make exporting data much simpler. A user could do something like the following

Create named snapshot snap1
wait
Create named snapshot snap2
Use map reduce to write diffs between snap1 and snap2
bulk import deltas into Accumulo table for query
Rename snap2 to snap1
goto 2

Named snapshots would prevent commited data that immediately precedes the snapshot from being garbage collected. At this point I am not sure how to efficiently compute the diffs between snapshots (w/o doing a full table scan).

Leader election for the Oracle Server

This should allow multiple Oracles to be running but only one to be handing out timestamps. In the event that a leader fails, one of the followers should become the leader.

Collect Fluo metrics and expose them to monitoring tools

Make it easier for users to monitor and gain insight into running Fluo cluster. Below are some possible questions to be answered by tool:

How many notifications have been processed by each observer? In past hour, day, week, month?
How many notifications have been processed by each worker?
Are notification waiting to be processed?
Are all workers running?

Change expected version of Java from 1.6 to 1.7

Modify pom.xml, make sure everything compiles, and integration tests run correctly.

Create Fluo developer tool

While the monitoring tool (see #20) is geared for admins to track the health of a Fluo cluster, a tool is needed for developers to help them debug observers. While they have separate use cases, the developer & monitoring tool could be merged into one.

Below are some possible features for Fluo developer tool:

Have worker store last 50 log messages (any level) and last 50 error messages for observers and have tool retrieve them.
Retrieve statistics for individual observers.

Upgrade zookeeper and Curator

Zookeeper 3.4.6 and Curator 2.5.0

OracleServer should check previous leading Oracle upon becoming the lead to make sure it is no longer serving timestamps

Rename Accismus

The thing I like most about Accismus is that its unique. One drawbacks is that its hard to pronounce. Personally, I don't think pronunciation is big problem, because its a one time cost.

The biggest problem is one that @mikewalch recently pointed out, its so close to Accumulo. Sentences like Accismus uses Accumulo are hard on the human mind.

@ericnewton has suggested the name Avalanche. I like this suggestion. I like the implication of scale. I don't like that its a very common word or that it has a connotation of disorder. Percolator continually imposes order on data at scale.

If you have suggestions for a name, leave a comment here.

Use logback in accismus-cluster

A dependency conflict in Accismus has been that Twill requires logback but Zookeeper requires log4j in MiniAccumuloCluster. Due to this conflict, if log4j is configured with Twill, debug-level log messages do not show. With the the accismus-cluster module seperated from accismus-core, this conflict can be resolved by making cluster depend on logback (for twill) and core depend on log4j (for zookeeper).

Change version in pom to 1.0.0-alpha-1-SNAPSHOT

The pom should indicate the version we are working towards releasing, which is currently 1.0.0 alpha.

Aggressively test Oracle

The proper functioning of the Oracle is critical. Need an Oracle test suite that pauses and kills Oracle processes while allocating and verifying timestamps. Some of the situations the test suite should induce are :

Client connected to an oracle process thats no longer the leader
Client connected to an oracle process thats paused
Client connected to an oracle process thats dead

Accismus configuration object should allow the Oracle's port to be set.

OracleClient should be notified when an oracle changes leadership.

This may be covered by the current implementation assuming the oracle process has physically idied then the client will reconnect. It may be possible to listen for the lock node as well to tell when it is updated

Create Accismus output formats

Since Percolator is designed for incremental processing, it will need to to be initialized via map reduce. A file output format that creates rfiles containing data in the Accismus format that can be bulk imported would be one way to enable this. This would wrap the AccumuloFileOutputFormat.

Additionally, AccumuloOutputFormat could be wrapped to allow users to write directly to an Accumulo table using the Accismus data format.

Need to document how these differ from the map reduce output that runs load transactions.

Scrape through codebase and properly mark fields as final

A good idea to make sure that things like Configuration objects on classes that should never change are clearly identified as such.

Run Accismus workers in YARN

Using Twill, enable Accismus workers to run in YARN. Create scripts to simplify the process of submitting applications to YARN.

Initialize.sh script cannot be run twice

The exception below is being thrown if the initialize script is run for the second time (even if accismus.init.zookeeper.clear = true):

Exception in thread "main" java.lang.RuntimeException: org.apache.accumulo.core.client.TableExistsException: Table accismus exists
at accismus.api.Admin.initialize(Admin.java:91)
at accismus.tools.InitializeTool.main(InitializeTool.java:81)

Create M/R input format

Need a map reduce input format thats able to read a snapshot of an Accismus table.

OracleServer should automatically requeue for leadership when zookeeper reconnect occurs

The ephemeral node for leadership election is lost when an OracleServer loses its connection to Zookeeper for any reason. Upon a reconnect event firing, the OracleServer should requeue for leadership.

NPE in TransactionImpl.checkForAckCollision()

Occurred during the following build while running the TrieIT:

https://travis-ci.org/mikewalch/Accismus/builds/28625179

java.lang.NullPointerException
at accismus.impl.TransactionImpl.checkForAckCollision(TransactionImpl.java:504)
at accismus.impl.TransactionImpl.preCommit(TransactionImpl.java:367)
at accismus.impl.TransactionImpl.preCommit(TransactionImpl.java:312)
at accismus.impl.TransactionImpl.commit(TransactionImpl.java:636)
at accismus.impl.Worker.processUpdates(Worker.java:175)
at accismus.impl.WorkerTask.run(WorkerTask.java:66)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Make recovery more efficient

In the case when a transaction fails, recovery is very inefficient. There are at least two problems with the SnapshotScanner.

For every locked column encountered, the primary column will read to determine the status. This information should be cached in a LRU cache, instead of continually re-reading it.
For each lock encountered that is associated with a failed transaction it will roll it forward or back. The snapshot scanner should read ahead and attempt to find a batch of locked columns. Once a batch is found, all of them can be resolved at once instead of one at a time.

The ParallelSnapShot scanner should also be improved.

MiniAccismus.waitForObservers() returns before processing is finished

In MiniAccismus, the waitForObservers() method is occasionally returning before all observers are done processing. This is because the method returns if there a no notifications available in the table. However, observers could still be running and creating new notifications.

The method should instead check that no observers are running and no new notifications are in the table before returning.

use Curator for leader election and leader discovery

The ORacle needs to use zookeeper for leader election and leader discovery. This can be accomplished with curator.

Finalize API

Use ByteBuffers instead of ByteSequence
Minimize external dependencies in API
Put API in maven module

Add Observer Sets

@mikewalch would like to make it easier for multiple developers to run workers and observers. We were discussing how to do this. One possibility we came up with is having observers sets. Each observer set would have a list of Observers. A developer could kick off a yarn application that runs workers and one or more observers sets. The yarn application would only execute the observers in its sets. The yarn application could reserve the observer set to keep other yarn applications from running those observers.

Support writing observers in Python

Using Jython to enable writing observers in Python would be very nice.

One possible way this could work is to have a JythonObserver that takes python code as configuration. The Transaction and Observer interfaces could be exposed in Python code.

Make workers detect collisions and jump

Need to make the worker detect when its collided with another worker and start scanning in another random place for notifications.

Make garbage collection iterator detemine oldest active snapshot

Currently the garbage collection iterator just keeps N versions of all data. It would be better if it determined the oldest active snapshot timestamp and used that for garbage collection. This could be done if workers, reader, and map reduce jobs periodically registered the oldest snapshot timestamp they were reading using ephemeral nodes in zookeeper.

Register clients that perform transactions in Zookeeper

Register clients (aka transactors) that perform transactions in zookeeper using a ephemeral node. During a transaction, a transactor references the ephemeral node in the primary column of the transaction.

If a transactor dies during a transaction, other clients can check for the existence of the referenced node before rolling back.

If the node is missing, roll back immediately.

If node exists, the transactor process could be alive but broken. Therefore, eventually timeout and rollback.

Modify worker to have only one thread find notifications

Currently each thread in worker looks for notifications

Migrate logging api to slf4j

We can still use log4j as implementation but slf4j allows users to pick different logging implementations (like logback). Also, fixes logging issue when running workers in YARN/Twill.

Implement Acccumulo's UtilWaitThread.sleep() in Accismus

Accismus depends on an Accumulo class called UtilWaitThead but this is not part of Accumulo's public API. The class could be moved/removed at any time. This class should be implemented in Accismus.

Also the sleep() method logs an InterruptedException error when shutting down worker. The worker should implement it's own try/catch and check to see if the worker is shutting down before logging the error.

Create cluster stress test suite

Need a cluster test suite that can verify correctness. If this test suite had the following operations.

Generate data to ingest
Initialize table via map reduce using generated data
Load generated data using load transactions
Transform loaded data
Verify data transformation

Then those operations could be combined to perform a test like the following

Generate Data Set D1
Use map reduce to initialize Accismus table using D1 as input
Generate Data Set D2
Load D2 using load transactions
Wait for load and observers to complete
Verify table is correct

It would be great if the test suite also tested performance. Otherwise, another issue should be created for a performance benchmark.

Improve AccismusFormatter

Right now, the AccismusFormatter outputs columns in the shell using the following format:

row cf cq vis type ts val

Readability would be improved by changing to a format similar to Accumulo:

row cf:cq [vis] ts-type val

Breakup README into sections

Break up instructions into building, deploying, configuring, running.

Use Curator to access zookeeper.

Transition code to using Curator to access zookeper

OracleClient should not be able to request a timestamp from an Oracle which is not the leader

This is a task for #37

Transaction caching layer with read ahead

Currently, transactions have no caching or read-ahead. Implement a caching layer with read-ahead to satisfy the following:

If the user usually read columns A and B, then when they read A, go ahead and fetch B too

If they read A twice, cache A for the second read

apache / fluo Goto Github PK

fluo's Introduction

Getting Started

fluo's People

Stargazers

Watchers

Forkers

fluo's Issues

Recommend Projects

Recommend Topics

Recommend Org