
Distributed, masterless, high performance, fault tolerant data processing

Home Page: http://www.onyxplatform.org

License: Eclipse Public License 1.0


onyx's Introduction

Onyx

What is it?

a masterless, cloud scale, fault tolerant, high performance distributed computation system
batch and stream hybrid processing model
exposes an information model for the description and construction of distributed workflows
Competes against Storm, Flink, Cascading, Cascalog, Spark, Map/Reduce, Sqoop, etc
written in pure Clojure

What would I use this for?

Realtime event stream processing
CQRS
Continuous computation
Extract, transform, load
Data transformation à la map-reduce
Data ingestion and storage medium transfer
Data cleaning

Installation

Available on Clojars:

[org.onyxplatform/onyx "0.14.6"]

Changelog

Changelog can be found at changes.md.

Quick Lookup Doc

A searchable set of documentation for the Onyx data model is available.

Project Template

A project template can be found at onyx-template.

Plugins and Libraries

Plugin Template

We provide a plugin template for use in building new plugins. This can be found at onyx-plugin.

Plugin Use

To use the supported plugins, please use version coordinates such as [org.onyxplatform/onyx-amazon-sqs "0.14.6.SNAPSHOT.0"], and read the READMEs on the 0.14.x branches linked above.

Build Status

Component	`release`	`unstable`
onyx core
onyx-local-rt
onyx-kafka
onyx-datomic
onyx-redis
onyx-sql
onyx-bookkeeper
onyx-amazon-sqs
onyx-amazon-s3
onyx-http
learn-onyx		`-`
onyx-examples
onyx-peer-http-query
lib-onyx
onyx-plugin
onyx-template

release: stable, released content
unstable: unreleased content

Unsupported plugins

Some plugins are currently unsupported in onyx 0.14.x. These are:

Companies Running Onyx in Production

Quick Start Guide

Feeling impatient? Hit the ground running ASAP with the onyx-starter repo and walkthrough. You can also boot into a preloaded Leiningen application template.

User Guide 0.14.6

Developer's Guide 0.14.6

API Docs 0.14.6

Code level API documentation can be found here.

Official plugin listing

Official plugins are vetted by Michael Drogalis. Ensure in your project that plugin versions directly correspond to the same Onyx version (e.g. onyx-kafka version 0.14.6.0-SNAPSHOT goes with onyx version 0.14.6). Fixes to plugins can be applied using a 4th versioning identifier (e.g. 0.14.6.1-SNAPSHOT).

Generate plugin templates through Leiningen with onyx-plugin.

3rd Party plugin listing

Unofficial plugins have not been vetted.

onyx-rethink

Need help?

Check out the Onyx Google Group.

Want the logo?

Feel free to use it anywhere. You can find a few different versions here.

Running the tests

A simple lein test will run the full suite for Onyx core.

Contributor list

Acknowledgements

Some code has been incorporated from the following projects:

License

Distributed under the Eclipse Public License, the same as Clojure.


onyx's Issues

Reintroduce task timeouts

In a very early version of Onyx, if a full batch of messages didn't accrue within a certain period of time, Onyx would stop reading and time out rather than keep waiting. This was removed due to a bug in HornetQ that didn't preserve sequential ordering. Timeouts are useful for sparse message streams, so I have found a workaround to add them back in.

Peer can deadlock on task completion

Reproduced in core.async plugin tests with a high number of virtual peers. Sometimes, closing a peer will block as it tries to flush its pipeline. The pipeline will block on reading from an ingress queue. This queue should always provide the sentinel value. Something is hanging on to the sentinel as a consumer and never committing it back to the queue, hence the hang.

Compress the user guide into a single page

Should enable better searchability of the docs.

Make job ID available in lifecycle context map

Key it under onyx.core/job-id.

Add a unit test for HornetQ JGroups configuration

I never got around to adding a test for this because I implemented JGroups before the test suite used embedded HornetQ. This is pretty straightforward to test with an embedded cluster.

Add :onyx/params to catalog entry

Allow value-level parameterization through the catalog.

[{...
  :my/param 42
  :my/other-param 44
  :onyx/params [:my/param :my/other-param]}]
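
A minimal sketch of how such parameters could be resolved, assuming the values are looked up in the catalog entry itself and passed ahead of the segment. The helper names `resolve-params`, `apply-with-params`, and `scale` are hypothetical and not part of Onyx:

```clojure
;; Hypothetical helpers, for illustration only.
(defn resolve-params
  "Looks up each key listed under :onyx/params in the catalog entry itself."
  [entry]
  (mapv #(get entry %) (:onyx/params entry)))

(defn apply-with-params
  "Calls f with the resolved parameters first and the segment last."
  [f entry segment]
  (apply f (conj (resolve-params entry) segment)))

(def entry
  {:onyx/name :scale
   :my/param 42
   :my/other-param 44
   :onyx/params [:my/param :my/other-param]})

(defn scale
  "Example task function: adds both parameters to the segment's :n value."
  [a b segment]
  (update segment :n #(+ a b %)))

(apply-with-params scale entry {:n 1})
;; => {:n 87}
```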

Why do virtual peers use so many threads and channels?

Hi, I wonder why a virtual peer uses 15 threads to process data. Are there other considerations?
(-> (fn1 event) fn2 fn3), isn't this simpler?

Change logging to be less chatty

Unsure of how I want it to look, but make it less verbose.

Alternate peer balancing strategy

As of 0.3.0, the only strategy for balancing peers across jobs and tasks is round robin/breadth-first, respectively. This ticket should break out the algorithms used for planning and coordination into functions behind multimethods, and allow for a greedy strategy. A greedy strategy will try to complete an entire job before moving on to the next.
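
One way the strategies could sit behind a multimethod is sketched below. The function `assign-peers`, its arguments, and the strategy keywords are assumptions made for illustration; they do not describe the Coordinator's actual planning code:

```clojure
;; Hypothetical dispatch on a strategy keyword.
(defmulti assign-peers
  "Returns a map of job id -> peer count for the ordered jobs."
  (fn [strategy jobs n-peers] strategy))

;; Spread peers evenly, giving leftovers to the earliest jobs.
(defmethod assign-peers :round-robin
  [_ jobs n-peers]
  (if (empty? jobs)
    {}
    (let [base  (quot n-peers (count jobs))
          extra (rem n-peers (count jobs))]
      (into {}
            (map-indexed (fn [i job] [job (+ base (if (< i extra) 1 0))])
                         jobs)))))

;; Give every peer to the first job so it completes before the next starts.
(defmethod assign-peers :greedy
  [_ jobs n-peers]
  (into {}
        (map-indexed (fn [i job] [job (if (zero? i) n-peers 0)])
                     jobs)))

(assign-peers :round-robin [:job-a :job-b] 5)
;; => {:job-a 3, :job-b 2}
(assign-peers :greedy [:job-a :job-b] 5)
;; => {:job-a 5, :job-b 0}
```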

Throw an exception on bad workflow format

Submitting a workflow that doesn't conform to the specification of the information model throws an unhelpful assertion inside the Coordinator. The workflow should be validated using a library like Schema to obtain helpful error messages.

Any task that follows a sequential task emits an exception

With any task preceded by a sequential task, the following exception will be thrown:

org.hornetq.api.core.HornetQInternalErrorException: HQ119000: ClientSession closed while creating session
    type: #<INTERNAL_ERROR>
org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal      ClientSessionFactoryImpl.java:  782
        org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSession      ClientSessionFactoryImpl.java:  366
                               sun.reflect.NativeMethodAccessorImpl.invoke0      NativeMethodAccessorImpl.java      
                                sun.reflect.NativeMethodAccessorImpl.invoke      NativeMethodAccessorImpl.java:   57
                            sun.reflect.DelegatingMethodAccessorImpl.invoke  DelegatingMethodAccessorImpl.java:   43
                                            java.lang.reflect.Method.invoke                        Method.java:  606
                                clojure.lang.Reflector.invokeMatchingMethod                     Reflector.java:   93
                           clojure.lang.Reflector.invokeNoArgInstanceMember                     Reflector.java:  313
                                            onyx.queue.hornetq/eval20966/fn                        hornetq.clj:  201
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                                       onyx.peer.operation/start-lifecycle?                      operation.clj:   55
                                           onyx.peer.transform/eval21156/fn                      transform.clj:   95
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                    onyx.peer.task-lifecycle-extensions/merge-api-levels/fn      task_lifecycle_extensions.clj:   19
                                             clojure.lang.ArrayChunk.reduce                    ArrayChunk.java:   63
                                                  clojure.core.protocols/fn                      protocols.clj:   98
                                                clojure.core.protocols/fn/G                      protocols.clj:   19

The virtual peer will shut down and instantly reboot, continuing as normal. This bug is mostly harmless. It is caused by the concurrency optimizations set for the HornetQ configuration. The Session Factory is swapped out in favor of a different factory, but the new factory doesn't "stick" for new tasks. Virtual peers reuse the old Session Factory, which has been closed, and the exception is thrown. After reboot, a fresh Session Factory is used.

Harmless, but annoying to see in the logs.

Maximum peers per task option

The catalog should offer an optional :onyx/max-peers parameter that takes an integer value representing the maximum number of peers that may be executing an instance of that task at any single point in time.
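
As a rough sketch, a catalog entry carrying this option and a cap applied during peer assignment might look like the following. The key spelling `:onyx/max-peers` and the helper `peers-allowed` are assumptions for illustration:

```clojure
;; Hypothetical catalog entry with the proposed option.
(def catalog-entry
  {:onyx/name :inc
   :onyx/max-peers 2})

(defn peers-allowed
  "Caps the number of available peers at :onyx/max-peers when it is set."
  [entry available]
  (min available (get entry :onyx/max-peers available)))

(peers-allowed catalog-entry 5)
;; => 2
(peers-allowed {:onyx/name :inc} 5)
;; => 5
```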

Java API for Onyx

As of release 0.3.0, Clojure is the only supported language for Onyx. Java users can use the APIs that Clojure offers to tap some of the Onyx functionality, but this becomes problematic for areas such as lifecycle extensions that rely on implementations of multimethods.

Furthermore, EDN isn't the friendliest cross-language data format to send catalogs and workflows through. Part of this issue should explore options that Java users have on this front.

Throw an exception on bad catalog format

Submitting a catalog that doesn't conform to the specification of the information model throws an unhelpful assertion inside the Coordinator. The catalog should be validated using a library like Schema to obtain helpful error messages.

Parameterize back-off timeout when peer fails to join cluster

Currently hard-coded to 250 ms.
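
A sketch of what a parameterized back-off could look like, with 250 ms kept as the default. The function `join-with-backoff`, the option keys, and the `try-join!` callback are hypothetical stand-ins, not the peer's real join logic:

```clojure
(defn join-with-backoff
  "Calls try-join! until it returns truthy, sleeping backoff-ms between
  attempts. Gives up after max-attempts tries."
  [try-join! {:keys [backoff-ms max-attempts]
              :or   {backoff-ms 250 max-attempts 5}}]
  (loop [attempt 1]
    (cond
      (try-join!)               :joined
      (>= attempt max-attempts) :gave-up
      :else (do (Thread/sleep (long backoff-ms))
                (recur (inc attempt))))))

;; Usage: a fake join that succeeds on the third attempt.
(let [attempts (atom 0)]
  (join-with-backoff #(>= (swap! attempts inc) 3) {:backoff-ms 10}))
;; => :joined
```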

Reimplement await-job-completion

This was removed when the underlying mechanism changed for 0.5.0. Reimplement this API function.

Grouping configurable through catalog entry

From the mailing list:

"Also, it'd be really nice to specify a group-by operation on the input queue of each function, because then you could make things like wordcount really easy--you'd be able to say "send all instances of the same word to the same downstream task", which would enable workflows that require an implicit sort/shuffle step."

https://groups.google.com/forum/#!topic/onyx-user/xniQcgCPEn8
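
The routing idea in the quote can be sketched by hashing the grouping key, so that equal keys always map to the same downstream slot. The function `route-segment` and the `:word` key are hypothetical; this is not how HornetQ grouping is implemented:

```clojure
(defn route-segment
  "Picks a downstream slot in [0, n-peers) from the hash of the grouping key."
  [group-key n-peers segment]
  (mod (hash (get segment group-key)) n-peers))

(def segments
  [{:word "onyx" :line 1}
   {:word "peer" :line 1}
   {:word "onyx" :line 2}])

;; Every segment with the same :word lands in the same bucket.
(group-by #(route-segment :word 3 %) segments)
```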

Job throwing an exception crashes the peer

If a task throws an exception, the Peer crashes and isn't able to service additional work.

Update log command API reference

The command section is mostly correct, but needs a final pass to ensure that it's up to date.

JGroups peer configuration lists hornetq.udp namespace for some config options

See here https://github.com/MichaelDrogalis/onyx/blob/0.4.x/doc/user-guide/coord-peer-config.md#jgroups-configuration.

I think :hornetq.udp/refresh-timeout should be :hornetq.jgroups/refresh-timeout (along with the others)

Coordinator logging

The Coordinator logs very infrequently as of 0.3.0. Logging using Dire should be implemented on events like job submission, task completion, peer birth/death, etc.

Rename `await-job-completion` API

Seems like this function should block, but actually returns a future.

Standardize naming of lifecycle event map

There's been some confusion around what the difference between the "event map", "lifecycle event map", and "context map" is. They are all the same thing. This should be fixed in the docs. I think I'd like to choose "lifecycle event" as the canonical term.

Plugin request: Kafka

A Kafka plugin should be created that offers both input and output functionality. Additionally, it should be capable of working with Kafka partitions.

inject-lifecycle-resources is invoked twice for every task

Seeing this happen exactly twice, every single time. Never noticed until now.

Throw an exception when a workflow keyword isn't in the catalog

This silently fails right now: the exception gets swallowed and everything halts.
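
A minimal sketch of such a check, assuming the vector-of-vectors workflow form and catalog entries named by `:onyx/name`. The helpers `missing-tasks` and `validate-workflow-names!` are hypothetical:

```clojure
(require '[clojure.set :as set])

(defn missing-tasks
  "Returns the set of workflow keywords that have no catalog entry."
  [workflow catalog]
  (set/difference (set (flatten workflow))
                  (set (map :onyx/name catalog))))

(defn validate-workflow-names!
  "Throws with the missing task names instead of failing silently."
  [workflow catalog]
  (let [missing (missing-tasks workflow catalog)]
    (when (seq missing)
      (throw (ex-info "Workflow references tasks that are not in the catalog"
                      {:missing missing})))
    true))

(missing-tasks [[:in :inc] [:inc :out]]
               [{:onyx/name :in} {:onyx/name :inc}])
;; => #{:out}
```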

Task that can complete when only 1 input has been exhausted

Considering a feature that will let a task complete when only one (not all) of its upstream inputs has pushed the sentinel onto the input stream. This would aid use cases where a privileged kill stream is utilized.

Just a placeholder, needs more thought.

Log Coordinator and Peer output to different files

If the Coordinator and Peer are running on the same machine, they'll log to the same file. This can be a little bit of a pain during development. Each should log to its own file.

Monitoring dashboard

This issue serves as a placeholder for the creation of another repository - onyx-dashboard. This dashboard will serve as a point of monitoring the status of what's happening inside Onyx by querying ZooKeeper. The data in ZooKeeper is immutable, and compressed with Fressian.

Emit an error if `:onyx/name` = `:onyx/type`

If both of these are the same, the same multimethod for lifecycle resources gets dispatched to. This is really confusing. See #36

submit-job hangs if catalog contains task maps with non-serializable values

If you supply a core.async channel (or other non-serializable object) in a task map within a catalog and call submit-job, it will hang silently.

Ideally there should be some kind of validation of the catalog for serializability.
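
One possible check is sketched below. It assumes that surviving an EDN print/read round trip is an acceptable proxy for serializability, which may not match the serialization Onyx actually uses. The helpers `edn-round-trips?` and `non-serializable-entries` are hypothetical:

```clojure
(require '[clojure.edn :as edn])

(defn edn-round-trips?
  "True when the value can be printed and read back as equal EDN."
  [value]
  (try
    (= value (edn/read-string (pr-str value)))
    (catch Exception _ false)))

(defn non-serializable-entries
  "Returns the catalog entries that fail the round-trip check."
  [catalog]
  (remove edn-round-trips? catalog))

;; A plain Java object stands in for a channel here.
(def bad-entry {:onyx/name :in :my/chan (Object.)})

(map :onyx/name
     (non-serializable-entries [{:onyx/name :inc :onyx/batch-size 10}
                                bad-entry]))
;; => (:in)
```

Running a check like this before submission would let the failure surface as an error rather than a silent hang.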

Validate workflow DAG

As mentioned in #2, the workflow should be validated such that only input tasks lack incoming edges, only output tasks lack outgoing edges, and the DAG has no cycles (dependency will throw an exception for you when creating the graph).
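
These three checks can be sketched over a workflow expressed as `[from to]` edges. The function names and the `types` map (task name to `:input`, `:function`, or `:output`) are assumptions for illustration, and the cycle check here is hand-rolled rather than delegated to the dependency library:

```clojure
(require '[clojure.set :as set])

(defn cyclic?
  "Repeatedly drops edges whose source has no incoming edge.
  Edges that can never be dropped belong to, or follow, a cycle."
  [workflow]
  (loop [edges (set workflow)]
    (let [targets   (set (map second edges))
          removable (remove (fn [[from _]] (contains? targets from)) edges)]
      (cond
        (empty? edges)     false
        (empty? removable) true
        :else (recur (reduce disj edges removable))))))

(defn boundary-violations
  "Finds tasks without incoming edges that are not inputs, and tasks
  without outgoing edges that are not outputs."
  [workflow types]
  (let [sources (set (map first workflow))
        targets (set (map second workflow))
        roots   (set/difference sources targets)
        leaves  (set/difference targets sources)]
    {:non-input-roots   (set (remove #(= :input (types %)) roots))
     :non-output-leaves (set (remove #(= :output (types %)) leaves))}))

(def types {:in :input :inc :function :out :output})

(cyclic? [[:in :inc] [:inc :out]])
;; => false
(cyclic? [[:a :b] [:b :a]])
;; => true
(boundary-violations [[:inc :out]] types)
;; => {:non-input-roots #{:inc}, :non-output-leaves #{}}
```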

Map-style workflows don't validate anymore

Messed this one up during the redesign. Fix this before shipping 0.5.0.

Optimize aggregators

Aggregators are at a disadvantage, performance-wise, compared to transformers and groupers. Aggregators can only hold a single session open, which needs to be reused across pipeline iterations in the peer. The reason for this is that if multiple sessions were used, all sessions would need to be read from at the same time, since groupers pin particular message ids to consumers. These sessions shouldn't be closed, otherwise the messages will be repinned. Further, once the sentinel is read, all other sessions will block indefinitely.

The goal of this issue is to speed up aggregators using an alternate design approach.

Implement Coverage Protection

Coverage Protection is described here: https://github.com/MichaelDrogalis/onyx/blob/12c72be61d056446b1b7fe0a54a33782bbedc03b/doc/design/masterless.md#partial-coverage-protection

Processor spikes after multiple Onyx start/stop cycles

Observing high processor load after starting and stopping Onyx many times in the same REPL session. Reproducing what @prasincs saw a few weeks ago.

Update masterless examples

Some of the examples are incorrect due to design changes, or require pictures for better explanation.

Use LMAX Disruptor for local execution

When we're running locally, it might be better for performance to use LMAX Disruptor instead of HornetQ for messaging. Requires benchmarking.

Plugin request: HDFS

This plugin should be capable of reading off the Hadoop file system and writing segments back to it. The point of input and partitioning should be a single file, and the partitioning will happen over the byte sequence representing the file distributed over blocks in the cluster.

Cast job-id to str in `await-job-completion`

This API will compare it to a UUID and fail. Do the cast inside the function so users don't need to worry about it.

Support full DAG workflows

In 0.4.0, we're going to move away from the tree/map based workflow to a vector-of-vectors. This will properly support multi input streams to any task, and continue to support multiple output streams. It will look like this:

[[:in-1 :inc]
 [:in-2 :inc]
 [:in-3 :inc]
 [:inc :out]]
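
For comparison, a nested-map workflow can be flattened into this edge form. The sketch below assumes the older format nests maps with keyword leaves, as in `{:in {:inc :out}}`; the helper `map->edges` is hypothetical:

```clojure
(defn map->edges
  "Flattens a nested map-style workflow into a vector of [from to] edges."
  [workflow]
  (vec
   (mapcat (fn [[from children]]
             (if (map? children)
               ;; One edge per child task, then recurse into the children.
               (concat (map (fn [to] [from to]) (keys children))
                       (map->edges children))
               ;; A keyword leaf is a single terminal edge.
               [[from children]]))
           workflow)))

(map->edges {:in {:inc :out}})
;; => [[:in :inc] [:inc :out]]
```

The map form cannot express a task with several parents without repeating it, which is the gap the edge vector closes.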

Tasks:

Onyx should throw a helpful error when lifecycle methods do not return maps

The exception right now is confusing. `start-lifecycle?`, `inject-lifecycle-resources`, `inject-temporal-resources`, `close-temporal-resources`, and `close-lifecycle-resources` all need a type check on their return values.
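
A sketch of such a type check, written as a wrapper around an arbitrary lifecycle function. The wrapper `checked-lifecycle` and its error shape are hypothetical and are not how Onyx dispatches its lifecycle multimethods:

```clojure
(defn checked-lifecycle
  "Wraps f so that a non-map return value raises a descriptive error."
  [fn-name f]
  (fn [& args]
    (let [result (apply f args)]
      (when-not (map? result)
        (throw (ex-info (str fn-name " must return a map, got " (pr-str result))
                        {:lifecycle-fn fn-name
                         :returned     result})))
      result)))

(def good (checked-lifecycle "inject-lifecycle-resources"
                             (fn [event] {:my/conn :opened})))
(def bad  (checked-lifecycle "inject-lifecycle-resources"
                             (fn [event] nil)))

(good {})
;; => {:my/conn :opened}
;; (bad {}) throws an ExceptionInfo naming the offending function
```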

Shutdown protocol needs to be reimplemented for dev-env and peer

Removed while construction took place on 0.5.0.

Run Jepsen tests against Onyx

It would be pretty great to Jepsen both HornetQ and Onyx itself. Partitioning virtual peers and coordinators fit the bill nicely.

Implement kill-job core API function

There needs to be an API function that takes a job ID and halts any peer execution of that job's tasks. The job's tasks will no longer be eligible for execution.

Restart peer on failure

When a peer dies, it should attempt to recover by rebooting itself. See core.async test for failure.

Virtual peers can be starved on Grouping tasks

This is a particularly rough edge case. If a virtual peer receives a grouping task, it's capable of being starved from receiving the sentinel segment off the queue due to the way that HornetQ pins messages as it groups. If a consumer closes out, it might not necessarily requeue the sentinel in a server node where other consumers can reach it. Hence, the other virtual peers may deadlock and wait forever. This only affects batch mode - streaming mode is fine.