groupon / backbeat
A workflow service for processing asynchronous tasks across distributed systems
License: BSD 3-Clause "New" or "Revised" License
The primary cause of #12 was that the client was signalling Activities with no subject and no parent context. As a result, the Server.create_workflow function found or created a Workflow with a nil subject (Thanks, Ruby), and from then on every subjectless signal was scheduled by that same Workflow.
I propose that if a Client signals without a subject, we should assume the signal is a one-off with no associated workflow, and instead of nil, give it a SecureRandom.uuid. The Principle of Least Surprise says that if we signal an activity without a subject, it should not surprise you by joining a list of other activities in a workflow.
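A minimal sketch of the proposal, assuming the subject arrives in a params hash. The helper name workflow_subject is hypothetical and not part of Backbeat's API; only SecureRandom.uuid comes from the proposal itself.

```ruby
require "securerandom"

# Hypothetical helper (not Backbeat's actual API): picks the subject used
# when finding or creating a workflow for an incoming signal.
def workflow_subject(params)
  subject = params[:subject]
  # A missing or blank subject gets a fresh UUID, so the signal becomes a
  # one-off workflow instead of joining the shared nil-subject workflow.
  return SecureRandom.uuid if subject.nil? || subject.to_s.strip.empty?

  subject
end

first  = workflow_subject({})
second = workflow_subject({})
puts first == second # two subjectless signals no longer share a subject
```

Signals that do carry a subject are passed through unchanged, so existing clients would see no difference.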
I'm currently working on adding some additional instrumentation to the project and am running into a problem figuring out how to extend or add instrumentation to the app. The kind of instrumentation I want to add (Zipkin, Steno, etc.) is not something I would consider globally applicable to the open-source community.
The first option is to just fork and modify the project.
While this would obviously be the most straightforward and easy way to get the changes I want, I don't see that as a sustainable way of using Backbeat and don't want to encourage that kind of behavior.
What I would actually prefer is a way of putting Backbeat into a container which I control, then using a public configuration API to add the instrumentation I need, just as Rails does.
I think this is a far better way forward with the project, forcing modularity and a clear API for modifications. Since Backbeat is already a Rack application, it is relatively easy to pack it up into a directory and then point a new Rackup file to the backbeat App.
To take this far enough to be viable, I think that we should consider repackaging Backbeat Server as a Gem in a manner similar to Rails, with a simple generator to produce the environment configuration files. Then we would be able to VCS the configuration privately and add Groupon-specific tweaks/tooling to it, while maintaining a clean and configurable project for the Open Source community.
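To illustrate the container idea, here is a rough sketch of private instrumentation wrapped around a Rack app. TimingMiddleware and the reporter callback are hypothetical names, and the lambda stands in for the Backbeat app, which in practice would be loaded from the proposed Gem.

```ruby
# Hypothetical timing middleware. Any Rack-compatible app can be wrapped,
# since a Rack app is just an object that responds to call(env).
class TimingMiddleware
  def initialize(app, reporter:)
    @app = app
    @reporter = reporter
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # Hand the measurement to whatever private tooling is configured.
    @reporter.call(env["PATH_INFO"], status, elapsed)
    [status, headers, body]
  end
end

# Stand-in for the Backbeat Rack app (assumption for this sketch only).
backbeat_app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }

events = []
reporter = ->(path, status, seconds) { events << [path, status, seconds] }
app = TimingMiddleware.new(backbeat_app, reporter: reporter)

status, _headers, body = app.call({ "PATH_INFO" => "/workflows" })
puts "#{status} #{body.first} (#{events.size} event recorded)"
```

In a privately versioned config.ru, the same wiring would be a `use TimingMiddleware, reporter: ...` line ahead of the `run` line for the Backbeat app, which keeps the instrumentation out of the open-source tree.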
@keithwiersema does the team still run the backbeat integration tests with the docker setup? Would this be something you could ask to be open sourced?
Thanks!
This is because there is no validation around node's status in the "/decisions" endpoint.
Our retry logic currently makes it impossible to have a task that retries at T+0, T+5, T+25, T+125. The minimum amount of time for the first retry is 4 minutes, and can be up to 60 minutes. The interval input makes almost no difference.
I've published a branch with a new backoff algorithm, but there's some concern about how it will affect existing activities.
In order to satisfy all clients, we will need to implement a backoff selection mechanism in the server. In addition to selecting the number of retries and the interval between, clients may now also choose a specific backoff algorithm, which will be implemented as a backoff strategy in the scheduler that inserts the next attempt on an activity into Sidekiq. In order to make it possible to calculate backoff factors, we will also need to add a max_retries column to the node state, because the node retries are decremented fields, and there is no way of telling otherwise how far from the origin number they are.
_IMPL_
Modify the Backbeat::Server class, where it creates the NodeDetail object for the node. Here, we will accept an additional parameter, backoff, which will be one of: [:legacy, :exponential, :constant].
When instantiating the NodeDetail object, automatically record the retries as max_retries.
Write a database migration to add the backoff string(32) column to the node_details table.
Write a database migration to add the max_retries int column to the node_details table.
Modify the Backbeat::Schedulers::ScheduleRetry singleton as follows:
Isolate the computation of the time, only update the node and return the time which should be used to schedule the async job. Delegate the backoff calculation to an instance of BackoffCalculator, which accepts a node.
Modify Backbeat::Node such that it delegates backoff to :node_detail.
Write a BackoffCalculator, which accepts a node and an optional time which defaults to Time.now. This class reads the node.backoff and selects a BackoffStrategy function based on a mapping of that value to these functions. If no backoff is specified, default to :legacy.
The BackoffStrategy always accepts a retry_number and returns a time. It uses the time that may be injected into BackoffCalculator.
Implement the three strategies:
:legacy => Same as is currently present in the ScheduleRetry block.
:exponential => Calculates retry as the product of three factors: the exponential factor 2^r, where r is the number of the retry; a stampede-reduction factor randomly chosen between 0.8 and 1.2; and finally the retry_interval.
:constant => Calculates retry as a simple addition, adding the retry_interval to the current time.
_AC_
Unit tests should be written for the changes.
Integration tests should be updated to use a variety of the backoff strategies.
Backbeat integration tests pass
Integration branch: master
BACKBEAT_SERVER_VERSION=exponential-backoff bundle exec rake docker:test
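The BackoffCalculator described in the implementation plan above could look roughly like the following. This is a sketch under several assumptions: the Node struct and its field names are stand-ins, retry_interval is treated as seconds, the retry number is taken as max_retries minus the remaining retries (so the first retry is r = 0), and the injectable rand is added only to make the jitter testable. The :legacy strategy is left as a placeholder because the existing ScheduleRetry logic is not reproduced here.

```ruby
# Hypothetical stand-in for a Backbeat node; not the real model.
Node = Struct.new(:backoff, :retries_remaining, :max_retries, :retry_interval,
                  keyword_init: true)

class BackoffCalculator
  STRATEGIES = {
    # Placeholder: the current ScheduleRetry computation would move here.
    legacy: ->(*) { raise NotImplementedError, "existing ScheduleRetry logic goes here" },
    # 2^r * jitter in [0.8, 1.2) * retry_interval, added to the current time.
    exponential: ->(retry_number, interval, now, rand) {
      jitter = 0.8 + (rand.call * 0.4)
      now + ((2**retry_number) * jitter * interval)
    },
    # Simple addition of the interval to the current time.
    constant: ->(_retry_number, interval, now, _rand) { now + interval }
  }.freeze

  def initialize(node, now: Time.now, rand: -> { Kernel.rand })
    @node = node
    @now = now
    @rand = rand
  end

  # Returns the Time at which the next attempt should be scheduled.
  def next_attempt_at
    strategy = STRATEGIES.fetch((@node.backoff || :legacy).to_sym)
    # Retries count down, so max_retries is needed to recover r.
    retry_number = @node.max_retries - @node.retries_remaining
    strategy.call(retry_number, @node.retry_interval, @now, @rand)
  end
end

node = Node.new(backoff: :exponential, retries_remaining: 2, max_retries: 4,
                retry_interval: 30)
puts BackoffCalculator.new(node).next_attempt_at
```

Keeping the strategies in a plain hash means adding a fourth algorithm later is a one-line change, and the calculator itself never touches Sidekiq, which matches the plan's goal of isolating the time computation from scheduling.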
This service looks pretty slick! Great job!
If I have a Rails app and the app needs a workflow, is there a way to directly include Backbeat server as part of that Rails app? Or do I have to run Backbeat server as its own service?
We had an issue today where a mistake caused blocking jobs to be queued onto the same workflow. When the size of this workflow grew, it caused a resource contention issue on the table, and on the JVM as GC began to intensify. In the end, the entire server was monopolized by repeated calls to ScheduleNextNode for the same workflow.
While we cannot prevent clients from behaving badly and creating massive backlogs of workflows, we can reduce the potential for one job to monopolize the worker pool.
A co-worker found: https://github.com/mhenrixon/sidekiq-unique-jobs. What do you think?
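For discussion, the core idea behind unique jobs can be sketched without the gem: refuse to enqueue a second job for a workflow that already has one pending. WorkflowJobGuard is a hypothetical, in-process illustration only; a real deployment would need a store shared across workers (sidekiq-unique-jobs keeps its locks in Redis), and this does not show that gem's API.

```ruby
require "set"

# Hypothetical in-process guard keyed by workflow id.
class WorkflowJobGuard
  def initialize
    @pending = Set.new
    @lock = Mutex.new
  end

  # Yields (to perform the real enqueue) and returns true when accepted;
  # returns false when a job for this workflow is already pending.
  def enqueue(workflow_id)
    accepted = @lock.synchronize { @pending.add?(workflow_id) }
    return false unless accepted

    yield workflow_id if block_given?
    true
  end

  # Called when the job finishes so the workflow can be scheduled again.
  def complete(workflow_id)
    @lock.synchronize { @pending.delete(workflow_id) }
  end
end

guard = WorkflowJobGuard.new
queued = []
3.times { guard.enqueue("workflow-1") { |id| queued << id } }
puts queued.size # repeated requests for the same workflow collapse into one job
```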
I think I'm misunderstanding the API.
It seems to me that a client must implement the /decision endpoint for the first activity in a workflow, because the /signal behavior of Server seen here defaults the legacy_type to decision, which causes the client to attempt to contact the decision callback.
What's the expected interaction pattern? Here, I'll post a big Gist with my test in it.
My issue is that in the integration test, the client fails to call the Java app because it would call the decision endpoint, because the node is legacy_type: decision, after using the signal API.