Giter VIP home page Giter VIP logo

onyx's Introduction

Logo Onyx

Join the chat at https://gitter.im/onyx-platform/onyx

What is it?

  • a masterless, cloud scale, fault tolerant, high performance distributed computation system
  • batch and stream hybrid processing model
  • exposes an information model for the description and construction of distributed workflows
  • Competes against Storm, Flink, Cascading, Cascalog, Spark, Map/Reduce, Sqoop, etc
  • written in pure Clojure

What would I use this for?

  • Realtime event stream processing
  • CQRS
  • Continuous computation
  • Extract, transform, load
  • Data transformation à la map-reduce
  • Data ingestion and storage medium transfer
  • Data cleaning

Installation

Available on Clojars:

[org.onyxplatform/onyx "0.14.6"]

Changelog

Changelog can be found at changes.md.

Quick Lookup Doc

A searchable set of documentation for the Onyx data model is available.

Project Template

A project template can be found at onyx-template.

Plugins and Libraries

Plugin Template

We provide a plugin template for use in building new plugins. This can be found at onyx-plugin.

Plugin Use

To use the supported plugins, please use version coordinates such as [org.onyxplatform/onyx-amazon-sqs "0.14.6.SNAPSHOT.0"], and read the READMEs on the 0.14.x branches linked above.

Build Status

Component release unstable
onyx core Circle CI Circle CI
onyx-local-rt Circle CI Circle CI
onyx-kafka Circle CI Circle CI
onyx-datomic Circle CI Circle CI
onyx-redis Circle CI Circle CI
onyx-sql Circle CI Circle CI
onyx-bookkeeper Circle CI Circle CI
onyx-amazon-sqs Circle CI Circle CI
onyx-amazon-s3 Circle CI Circle CI
onyx-http Circle CI Circle CI
learn-onyx Circle CI -
onyx-examples Circle CI Circle CI
onyx-peer-http-query Circle CI Circle CI
lib-onyx Circle CI Circle CI
onyx-plugin Circle CI Circle CI
onyx-template Circle CI Circle CI
  • release: stable, released content
  • unstable: unreleased content

Unsupported plugins

Some plugins are currently unsupported in onyx 0.14.x. These are:

Companies Running Onyx in Production

LockedOn                                                                                               

Quick Start Guide

Feeling impatient? Hit the ground running ASAP with the onyx-starter repo and walkthrough. You can also boot into preloaded a Leiningen application template.

User Guide 0.14.6

Developer's Guide 0.14.6

API Docs 0.14.6

Code level API documentation can be found here.

Official plugin listing

Official plugins are vetted by Michael Drogalis. Ensure in your project that plugin versions directly correspond to the same Onyx version (e.g. onyx-kafka version 0.14.6.0-SNAPSHOT goes with onyx version 0.14.6). Fixes to plugins can be applied using a 4th versioning identifier (e.g. 0.14.6.1-SNAPSHOT).

Generate plugin templates through Leiningen with onyx-plugin.

3rd Party plugin listing

Unofficial plugins have not been vetted.

Need help?

Check out the Onyx Google Group.

Want the logo?

Feel free to use it anywhere. You can find a few different versions here.

Running the tests

A simple lein test will run the full suite for Onyx core.

Contributor list

Acknowledgements

Some code has been incorporated from the following projects:

License

Copyright © 2017 Michael Drogalis

Distributed under the Eclipse Public License, the same as Clojure.

onyx's People

Contributors

bamarco avatar brianh avatar bridgethillyer avatar codonnell avatar colinhicks avatar danielcompton avatar davidrupp avatar devth avatar gardnervickers avatar greywolve avatar jqmtor avatar lbradstreet avatar leathekd avatar malcolmsparks avatar malesch avatar mariusz-jachimowicz-83 avatar metasoarous avatar michaeldrogalis avatar mushketyk avatar nathants avatar oleschoenburg avatar owengalenjones avatar schmee avatar solatis avatar sundbry avatar superstevenz avatar the-alchemist avatar tvanhens avatar vijaykiran avatar yonatane avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

onyx's Issues

Reintroduce task timeouts

In a very early version of Onyx, if the batch size of messages didn't accrue within a certain period of time, Onyx wouldn't attempt to keep reading and time out. This was removed due to a bug in HornetQ that didn't preserve sequential ordering. This is useful for sparse message streams, so I have found a workaround to add this back in.

Peer can deadlock on task completion

Reproduced in core.async plugin tests with a high number of virtual peers. Sometimes, closing a peer will block as it tries to flush its pipeline. The pipeline will block on reading from an ingress queue. This queue should always provide the sentinel value. Something is hanging on to the sentinel as a consumer and never committing it back to the queue, hence the hang.

Add :onyx/params to catalog entry

Allow value-level parameterization through the catalog.

[{...
  :my/param 42
  :my/other-param 44
  :onyx/params [:my/param :my/other-param]}]

Alternate peer balancing strategy

As of 0.3.0, the only strategy for balancing peers across jobs and tasks is round robin/breadth-first, respectively. This ticket should break out the algorithms used for planning and coordination into functions behind multimethods, and allow for a greedy strategy. A greedy strategy will try to complete an entire job before moving on to the next.

Throw an exception on bad workflow format

Submitting a workflow that doesn't conform to the specification of the informational model throws an unhelpful assertion inside the Coordinator. The workflow should be validated using something a library like Schema to obtain helpful error messages.

Any task that follows a sequential task emits an exception

With any task proceeded by a sequential task, the following exception will be thrown:

org.hornetq.api.core.HornetQInternalErrorException: HQ119000: ClientSession closed while creating session
    type: #<INTERNAL_ERROR>
org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal      ClientSessionFactoryImpl.java:  782
        org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSession      ClientSessionFactoryImpl.java:  366
                               sun.reflect.NativeMethodAccessorImpl.invoke0      NativeMethodAccessorImpl.java      
                                sun.reflect.NativeMethodAccessorImpl.invoke      NativeMethodAccessorImpl.java:   57
                            sun.reflect.DelegatingMethodAccessorImpl.invoke  DelegatingMethodAccessorImpl.java:   43
                                            java.lang.reflect.Method.invoke                        Method.java:  606
                                clojure.lang.Reflector.invokeMatchingMethod                     Reflector.java:   93
                           clojure.lang.Reflector.invokeNoArgInstanceMember                     Reflector.java:  313
                                            onyx.queue.hornetq/eval20966/fn                        hornetq.clj:  201
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                                       onyx.peer.operation/start-lifecycle?                      operation.clj:   55
                                           onyx.peer.transform/eval21156/fn                      transform.clj:   95
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                    onyx.peer.task-lifecycle-extensions/merge-api-levels/fn      task_lifecycle_extensions.clj:   19
                                             clojure.lang.ArrayChunk.reduce                    ArrayChunk.java:   63
                                                  clojure.core.protocols/fn                      protocols.clj:   98
                                                clojure.core.protocols/fn/G                      protocols.clj:   19

The virtual peer will shutdown and instantly reboot, continuing as normal. This bug is mostly harmless. It is causes by the concurrent optimizations set for the HornetQ configuration. The Session Factory is swapped out in favor of a different factory, but the new factory doesn't "stick" for new tasks. Virtual peers reuse the old Session Factory that has been closed. The exception is thrown. After reboot, a fresh Session Factory is used.

Harmless, but annoying to see in the logs.

Maximum peers per task option

The catalog should off an optional onyx/max-peers parameter that takes an integer value representing the maximum number of peers that may be executing an instance of that task at any single point in time.

Java API for Onyx

As of release 0.3.0, Clojure is the only supported language for Onyx. Java users can use the APIs that Clojure offers to tap some of the Onyx functionality, but this becomes problematic for areas such as lifecycle extensions that rely on implementations of multimethods.

Furthermore, EDN isn't the friendliest cross-language data format to send catalogs and workflows through. Part of this issue should explore options that Java users have on this front.

Throw an exception on bad catalog format

Submitting a catalog that doesn't conform to the specification of the informational model throws an unhelpful assertion inside the Coordinator. The catalog should be validated using something a library like Schema to obtain helpful error messages.

Coordinator logging

The Coordinator logs very infrequently as of 0.3.0. Logging using Dire should be implemented on events like job submission, task completion, peer birth/death, etc.

Standardize naming of lifecycle event map

There's been some confusion around what the difference between the "event map", "lifecycle event map", and "context map" are. They are all the same thing. This should be fixed in the docs. I think I'd like to choose "lifecycle event" as the canonical term.

Plugin request: Kafka

A Kafka plugin should be created that offers both input and output functionality. Additionally, it should be capable of working with Kafka partitions.

Task that can complete when only 1 input has been exhausted

Considering a feature that will let a task complete when only one (not all) of its upstream inputs have pushed the sentinel onto the input stream. This would aid use cases where a privileged kill stream is utilized.

Just a placeholder, needs more thought.

Monitoring dashboard

This issue serves as a placeholder for the creation of another repository - onyx-dashboard. This dashboard will serve as a point of monitoring the status of what's happening inside Onyx by querying ZooKeeper. The data in ZooKeeper is immutable, and compressed with Fressian.

Validate workflow DAG

As mentioned in #2, workflow should be validated such that only input tasks are missing incoming edges, only output tasks should be missing output edges, and DAG should not have any cycles (dependency will throw an exception for you when creating the graph).

Optimize aggregators

Aggregators are at a disadvantage, performance-wise, to transformers and groupers. Aggregators can only hold a single session open which needs to be reused across pipeline iterations in the peer. The reason for this is that if multiple sessions were used, all sessions needs to be read from at the same time since groupers will pin particular message ids to consumers. These sessions shouldn't be closed, otherwise the messages will be repinned. Further, once the sentinel is read, all other sessions will block indefinitely.

The goal of this issue is to speed up this aggregators using an alternate design approach.

Update masterless examples

Some of the examples are incorrect due to design changes, or require pictures for better explanation.

Plugin request: HDFS

This plugin should be capable of reading off the Hadoop file system and writing segments back to it. The point of input and partitioning should be a single file, and the partitioning will happen over the byte sequence representing the file distributed over blocks in the cluster.

Support full DAG workflows

In 0.4.0, we're going to move away from the tree/map based workflow to a vector-of-vectors. This will properly support multi input streams to any task, and continue to support multiple output streams. It will look like this:

[[:in-1 :inc]
 [:in-2 :inc]
 [:in-3 :inc]
 [:inc :out]]

Tasks:

  • Function to convert map to DAG to preserve backward-compatibility
  • Read from HornetQ queue's using Go blocks
  • Reserve space in peer local atom to cache sentinel value's that have been seen
  • Expand ZooKeeper sentinel reduction process to work across multiple input queues
  • Alter planning to construct queue chains based off DAGs
  • Intercept map at job submission time and conver to DAG
  • Ignore input channels that are known to be exhausted
  • Alternate input priority with round robin
  • Change aggregate to read like transform does
  • Alter validation to allow for DAG workflow
  • Update documentation about map and vector workflows
  • Update documentation to remove ack thread
  • Update documentation to specify that read-batch must return a map
  • Remove status check from pipeline

Run Jepsen tests against Onyx

It would be pretty great to Jepsen both HornetQ and Onyx itself. Partitioning virtual peers and coordinators fit the bill nicely.

Implement kill-job core API function

There needs to be an API function that takes a job ID and halts any peer execution of that job's tasks. The job's tasks will no longer be eligible for execution.

Restart peer on failure

When a peer dies, it should attempt to recover by rebooting itself. See core.async test for failure.

Virtual peers can be starved on Grouping tasks

This is a particularly rough edge case. If a virtual peer receives a grouping task, it's capable of being starved from receiving the sentinel segment off the queue due to the way that HornetQ pins messages as it groups. If a consumer closes out, it might not necessarily requeue the sentinel in a server node where other consumers can reach it. Hence, the other virtual peers may deadlock and wait forever. This only affects batch mode - streaming mode is fine.

Virtual peers can fail to make progress at end of task

Reproduced with the grouping test in Onyx core by turning up the number of virtual peers. Observed that two peers can continually take the sentinel segment off the queue and re-enqueue it infinitely, neither of them able to complete the task.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.