
Transfer Jobs

TransferJobs is a gem providing components and tasks to recover from failures of Redis-backed delayed job queues. We provide out-of-the-box support for common Sidekiq and Resque configurations, as well as the tools you need to assemble a more complex recovery flow.

Installation

To install transfer_jobs, add gem 'transfer_jobs', source: xxxx to your Gemfile. This provides access to all of the gem's classes and components.

To use the rake tasks you will need to include the relevant task in your Rakefile by adding:

Sidekiq:

import 'transfer_jobs/tasks/sidekiq.rake'

Resque:

# not yet implemented
import 'transfer_jobs/tasks/resque.rake'

The rake tasks have secondary dependencies implied by their names. They are detailed in the section relevant to your job system.

Optional Dependencies

There are several optional dependencies that enable quality-of-life features.

  • Progressrus: Including the progressrus gem enables progress tracking during your transfers.

Resque

Resque support is currently in the works.

Sidekiq

TransferJobs supports common Sidekiq job configurations. The provided classes and modules should enable the construction of more complex or atypical flows. The provided sidekiq:transfer rake task supports the following out-of-the-box:

  • sidekiq ~> 5.0
  • sidekiq-unique-jobs ~> 5.0.10

Caveats

Currently, we only support a limited subset of the features provided by Sidekiq and its ecosystem: locked jobs, delayed jobs, retried jobs, dead jobs, and normal jobs. Anything outside of that is not explicitly supported but may still function. We also explicitly do not work with Sidekiq's expiring jobs. If you are interested in support for additional features, please open an issue.

Usage

Transfers can be initiated by running the relevant rake task with the correct parameters.

For example, to transfer a Sidekiq application's tasks you would run:

bundle exec rake sidekiq:transfer SOURCE=redis://facebook-commerce.railgun/0 DEST=redis://facebook-commerce.railgun/1

This will transfer all jobs from db 0 to db 1 on the Redis host facebook-commerce.railgun.

Failure

Understanding how your job transfers fail is just as important as knowing how to run them; it is key to ensuring the consistency of your jobs.

To prevent race conditions and double performs / enqueues, transfer_jobs relies heavily on renaming objects in Redis. When we start transferring a job queue, we begin by performing a RENAME on that queue, appending :recovery to the name. This transforms a queue named normal into normal:recovery, which effectively hides those jobs from your job-processing system, allowing us to transfer them in peace.
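The hiding step can be sketched as follows. This is a hypothetical illustration, not the gem's actual API; a small in-memory FakeRedis stands in for a real client so the snippet runs standalone. The key property is that RENAME is atomic, so a worker either sees the original key or nothing, never a half-moved queue.

```ruby
# Hypothetical sketch of hiding a queue before transfer (not the gem's
# real code). Appending ":recovery" makes workers stop seeing the queue.
RECOVERY_SUFFIX = ":recovery".freeze

def recovery_name(queue_key)
  "#{queue_key}#{RECOVERY_SUFFIX}"
end

# `redis` is any object responding to #rename (a real Redis client in use).
def hide_queue(redis, queue_key)
  hidden = recovery_name(queue_key)
  redis.rename(queue_key, hidden) # atomic: no window where both keys exist
  hidden
end

# Minimal in-memory stand-in for the single Redis command the sketch uses,
# so the example is runnable without a server.
class FakeRedis
  def initialize(keys)
    @keys = keys
  end

  def rename(from, to)
    raise KeyError, from unless @keys.key?(from)
    @keys[to] = @keys.delete(from)
  end

  def keys
    @keys.keys
  end
end
```

For example, `hide_queue(FakeRedis.new("normal" => jobs), "normal")` returns "normal:recovery" and leaves no "normal" key behind.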

If a previous transfer was interrupted, a recovery queue will have been left behind. When transfer_jobs is run again, it will detect that existing recovery queue and resume the previous transfer. However, it will not rename the live queue again, so that run only drains the recovery queue; you will need to run transfer_jobs a further time to transfer any jobs that remain on, or have since been enqueued to, the live queue.

The other major risk is that a transfer is killed or interrupted midway through. Internally, we take measures to make transfers safe to interrupt, watching for signals that indicate an exit and cleaning things up. However, there is always the possibility of an uncontrolled exit. When this happens, the batch of jobs currently being transferred may be duplicated in the target datacenter.

Contributors

fbogsany, xldenis

transfer_jobs's Issues

Utility of Recovery Queues

As part of the transfer-jobs step of a failover, we rename each queue in the source region to {queue_name}:recovery. This has the following benefits:

  1. Workers stop popping jobs off those queues, because the recovery queues are not configured to be worked off.
  2. In the case of a failed job transfer, TJ knows to resume working from the recovery queues.

I'm wondering whether renaming these queues is really the simplest way of achieving (1) and (2):

  • Renaming queues is confusing, since some jobs in the delayed queue can still be scheduled back onto queues that were renamed (and Redis will recreate those queues in the passive region). This is possible because we transfer the delayed queue only after all the other job queues have been transferred.
  • This can make recovery from a failed transfer ineffective, since TJ then has to check both the recovery queue and the "live" queue to recover. Worse, a retried job can be picked up by a worker, since it's back on a "live" queue (though it won't be worked off, because we fail over MySQL before transferring jobs and workers skip jobs in pods that are read-only).

Rather, we could do the following:

  1. This is already achieved by the database healthcheck a worker performs before working off a job. If we can ensure this check is very fast, it is a viable safeguard against working off jobs in the passive region.
  2. TJ can simply resume work from the "real" queues when recovering from failures, since those queues are preserved by (1).

This means we can achieve the same functionality with simpler code. Alternatively, we could check at runtime whether a recovery queue exists when retrying a job, and enqueue the job on the recovery queue rather than the "live" queue, but this seems to introduce complexity rather than reduce it.

Am I missing any historical context on why we need the recovery queues? Can we do without them as presented above?

cc @Shopify/pods @Shopify/job-patterns

Improve feature detection mechanism

There are a couple of places where I've gated behaviour behind feature detection, i.e. if we detect that the Progressrus constant is defined we use it; otherwise we define a dummy Progressrus-like object. A similar thing is done for SidekiqUniqueJobs. I'd like to see if we can simplify this and make it more idiomatic.
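The gate described above could look something like this (a sketch under assumed names, not the gem's exact code): resolve the optional dependency once, and fall back to a no-op object that responds to the same small surface we use.

```ruby
# Sketch of constant-based feature detection with a dummy fallback.
# ProgressBackend and the Null methods are hypothetical names.
module ProgressBackend
  # No-op stand-in used when the optional progressrus gem is absent.
  class Null
    def tick(*); end
    def complete(*); end
  end

  def self.resolve
    # defined? returns nil when the constant was never loaded.
    defined?(::Progressrus) ? ::Progressrus : Null.new
  end
end
```

Callers then use `ProgressBackend.resolve` unconditionally, and the rest of the code never needs to branch on whether progressrus is installed.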

Add support for sidekiq-scheduler dynamic schedules

I'm not certain this is a frequently used feature, but it's always nice to support features fully. Sidekiq Scheduler supports 'dynamic' schedules, i.e. cron schedules that are mutable and stored in Redis. We should support transferring / synchronizing them.

Batched Lock Acquisition and Release

There needs to be support for batched lock acquisition and release. Some workloads rely heavily on locking, and we currently acquire and release locks sequentially, which takes a very long time.

Since acquiring and releasing locks typically relies on Lua scripts, we may need to write "MLOCK" and "MRELEASE" equivalent scripts.

Expand Sidekiq Test suite

We have an extensive test suite for Resque still in the Core repo; we should port it over to Sidekiq to expand our coverage.

Ensure resiliency

We need to do a resiliency pass on TJ and make sure it gracefully handles issues like network latency or losing connectivity to a Redis host.

Investigate usage of MIGRATE

TransferJobs currently has to stream the jobs down to a server running the Ruby script, which can add quite a bit of latency. If we could figure out a way to use MIGRATE instead, it would save quite a lot of network traffic.
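For reference, MIGRATE moves keys server-to-server, so payloads never pass through the Ruby process. A hypothetical helper could build the raw command for a batch of queue keys (Redis >= 3.0.6 supports the KEYS variant, where the single-key argument is the empty string):

```ruby
# Hypothetical helper (not part of TJ) that builds a raw batched MIGRATE
# command. With the KEYS variant, the positional key slot must be "".
def migrate_command(host:, port:, db:, timeout_ms:, keys:)
  ["MIGRATE", host, port.to_s,
   "",                       # empty key slot: the keys follow KEYS below
   db.to_s, timeout_ms.to_s,
   "REPLACE", "KEYS", *keys]
end
```

The resulting array could then be issued through a client's generic command interface. The open question for TJ is how to reconcile MIGRATE's key-level semantics with the batch-and-resume bookkeeping we do today.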

Future of this gem

This gem was meant to extract and abstract away the transfer-jobs step of AF. But as it stands, functionality is still fragmented between the gem and Shopify/shopify. We need to decide how to consolidate the code.

@kirs suggested that we move as much as possible of the "Transferring Redis data" concern to this gem, while leaving Podding concerns and the rake task itself in Shopify/shopify.

@Shopify/pods @Shopify/job-patterns Thoughts?

Build additional tooling on top of TJ

Using the classes and modules of TJ we should be able to build other tools. For example, I was thinking of a tool that could quarantine jobs matching a filter.

Some ideas of tools we could build:

  • Quarantine / Delete Jobs matching a filter
  • Reorganize queues (move matching jobs to front / end)
  • Collect statistics about jobs in queue

Make tasks work with rake -T

Currently, the tasks don't show up in rake -T. I believe this is because we check whether Sidekiq has been loaded, but rake -T doesn't load the environment. To fix this we should find a way to provide a 'dummy' task that is always loaded and is overwritten when we load the real task.

Extract an abstract QueueMover class

Shopify core's "queue mover" relies on internal modules, so we can't define it in the gem. We should fix an API for "queue movers" (like sidekiq_mover.rb) that allows us to keep defining application-specific ones when required.

This PR: #7 implements a sketch of the idea.

Tune batch size

We currently use a batch size of 1k, but that's not based on any thorough analysis; we should check whether it's suitable or whether we should be using a different value.
