
Transfer Jobs

TransferJobs is a gem providing components and tasks to recover from failures of Redis-backed delayed job queues. We provide out-of-the-box support for common Sidekiq and Resque configurations, as well as the tools you need to assemble a more complex recovery flow.

Installation

To install transfer_jobs, add gem 'transfer_jobs', source: xxxx to your Gemfile. This provides access to all of the gem's classes and components.

To use the rake tasks you will need to include the relevant task in your Rakefile by adding:

Sidekiq:

import 'transfer_jobs/tasks/sidekiq.rake'

Resque:

# not yet implemented
import 'transfer_jobs/tasks/resque.rake'

The rake tasks have secondary dependencies implied by their names. They are detailed in the section relevant to your job system.

Optional Dependencies

There are several optional dependencies that enable quality-of-life features.

  • Progressrus: Including the progressrus gem enables progress tracking during your transfers.

Resque

Resque support is currently in the works.

Sidekiq

TransferJobs supports common Sidekiq job configurations. The provided classes and modules should enable the construction of more complex or atypical flows. The provided sidekiq:transfer rake task supports the following out-of-the-box:

  • sidekiq ~> 5.0
  • sidekiq-unique-jobs ~> 5.0.10

Caveats

Currently, we only support a limited subset of the features provided by Sidekiq and its ecosystem: locked jobs, delayed jobs, retried jobs, dead jobs, and normal jobs. Anything outside of that is not explicitly supported but may still function. We also explicitly do not work with Sidekiq's expiring jobs. If you are interested in support for additional features, please open an issue.

Usage

Transfers can be initiated by running the relevant rake task with the correct parameters.

For example, to transfer a Sidekiq application's tasks you would run:

bundle exec rake sidekiq:transfer SOURCE=redis://facebook-commerce.railgun/0 DEST=redis://facebook-commerce.railgun/1

This will transfer all jobs from db 0 to db 1 on the Redis host facebook-commerce.railgun.

Failure

Understanding how your job transfers fail is just as important as knowing how to run them; it is key to ensuring the consistency of your jobs.

To prevent race conditions and double performs / enqueues, transfer_jobs relies heavily on renaming objects in Redis. When we start transferring a job queue, we begin by performing a RENAME on that queue, appending :recovery to the name. This transforms a queue named normal into normal:recovery, which effectively hides those jobs from your job-processing system, allowing us to transfer them in peace.
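The hiding step can be sketched as follows. This is a hypothetical illustration, not the gem's actual API; a small in-memory FakeRedis stands in for a real client so the snippet runs standalone. The key property is that RENAME is atomic, so a worker either sees the original key or nothing, never a half-moved queue.

```ruby
# Hypothetical sketch of hiding a queue before transfer (not the gem's
# real code). Appending ":recovery" makes workers stop seeing the queue.
RECOVERY_SUFFIX = ":recovery".freeze

def recovery_name(queue_key)
  "#{queue_key}#{RECOVERY_SUFFIX}"
end

# `redis` is any object responding to #rename (a real Redis client in use).
def hide_queue(redis, queue_key)
  hidden = recovery_name(queue_key)
  redis.rename(queue_key, hidden) # atomic: no window where both keys exist
  hidden
end

# Minimal in-memory stand-in for the single Redis command the sketch uses,
# so the example is runnable without a server.
class FakeRedis
  def initialize(keys)
    @keys = keys
  end

  def rename(from, to)
    raise KeyError, from unless @keys.key?(from)
    @keys[to] = @keys.delete(from)
  end

  def keys
    @keys.keys
  end
end
```

For example, `hide_queue(FakeRedis.new("normal" => jobs), "normal")` returns "normal:recovery" and leaves no "normal" key behind.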

If a previous transfer was interrupted, a recovery queue will have been left behind. When transfer_jobs is run again, it will detect that existing recovery queue and resume the previous transfer. However, it will not rename the live queue again, so that run only drains the recovery queue; you will need to run transfer_jobs a further time to transfer any jobs that remain on, or have since been enqueued to, the live queue.

The other major risk is that a transfer is killed or interrupted midway through. Internally, we take measures to make transfers safe to interrupt, watching for signals that indicate an exit and cleaning things up. However, there is always the possibility of an uncontrolled exit. When this happens, the batch of jobs currently being transferred may be duplicated in the target datacenter.

Contributors

fbogsany, xldenis

transfer_jobs's Issues

Utility of Recovery Queues

As part of the transfer-jobs step of a failover, we rename each queue in the source region to {queue_name}:recovery. This has the following benefits:

  1. Workers stop popping jobs off those queues, because the recovery queues are not configured to be worked off.
  2. In the case of a failed job transfer, TJ knows to resume working from the recovery queues.

I'm wondering whether renaming these queues is really the simplest way of achieving (1) and (2):

  • Renaming queues is confusing, since some jobs in the delayed queue can still be scheduled back onto queues that were renamed (and Redis will recreate those queues in the passive region). This is possible because we transfer the delayed queue only after all the other job queues have been transferred.
  • This can make recovery from a failed transfer ineffective, since TJ then has to check both the recovery queue and the "live" queue to recover. Worse, a retried job can be picked up by a worker, since it's back on a "live" queue (though it won't be worked off, because we fail over MySQL before transferring jobs and workers skip jobs in pods that are read-only).

Rather, we could do the following:

  1. This is already achieved by the database healthcheck a worker performs before working off a job. If we can ensure this check is very fast, it is a viable safeguard against working off jobs in the passive region.
  2. TJ can simply resume work from the "real" queues when recovering from failures, since those queues are preserved by (1).

This means we can achieve the same functionality with simpler code. Alternatively, we could check at runtime whether a recovery queue exists when retrying a job, and enqueue the job on the recovery queue rather than the "live" queue, but this seems to introduce complexity rather than reduce it.

Am I missing any historical context on why we need the recovery queues? Can we do without them as presented above?

cc @Shopify/pods @Shopify/job-patterns

Improve feature detection mechanism

There are a couple of places where I've gated behaviour behind feature detection, i.e. if we detect that the Progressrus constant is defined we use it; otherwise we define a dummy Progressrus-like object. A similar thing is done for SidekiqUniqueJobs. I'd like to see if we can simplify this and make it more idiomatic.
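The gate described above could look something like this (a sketch under assumed names, not the gem's exact code): resolve the optional dependency once, and fall back to a no-op object that responds to the same small surface we use.

```ruby
# Sketch of constant-based feature detection with a dummy fallback.
# ProgressBackend and the Null methods are hypothetical names.
module ProgressBackend
  # No-op stand-in used when the optional progressrus gem is absent.
  class Null
    def tick(*); end
    def complete(*); end
  end

  def self.resolve
    # defined? returns nil when the constant was never loaded.
    defined?(::Progressrus) ? ::Progressrus : Null.new
  end
end
```

Callers then use `ProgressBackend.resolve` unconditionally, and the rest of the code never needs to branch on whether progressrus is installed.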

Add support for sidekiq-scheduler dynamic schedules

I'm not certain this is a frequently used feature, but it's always nice to support features fully. Sidekiq Scheduler supports 'dynamic' schedules, i.e. cron schedules that are mutable and stored in Redis. We should support transferring / synchronizing them.

Batched Lock Acquisition and Release

There needs to be support for batched lock acquisition and release. Some workloads rely heavily on locking, and we currently acquire and release locks sequentially, which takes a very long time.

Since acquiring and releasing locks typically relies on Lua scripts, we may need to write "MLOCK" and "MRELEASE" equivalent scripts.

Expand Sidekiq Test suite

We have an extensive test suite for Resque still in the Core repo; we should port it over to Sidekiq to expand our coverage.

Ensure resiliency

We need to do a resiliency pass on TJ and make sure it gracefully handles issues like network latency or losing connectivity to a Redis host.

Investigate usage of MIGRATE

TransferJobs currently has to stream the jobs down to a server running the Ruby script, which can add quite a bit of latency. If we could figure out a way to use MIGRATE instead, it would save quite a lot of network traffic.
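For reference, MIGRATE moves keys server-to-server, so payloads never pass through the Ruby process. A hypothetical helper could build the raw command for a batch of queue keys (Redis >= 3.0.6 supports the KEYS variant, where the single-key argument is the empty string):

```ruby
# Hypothetical helper (not part of TJ) that builds a raw batched MIGRATE
# command. With the KEYS variant, the positional key slot must be "".
def migrate_command(host:, port:, db:, timeout_ms:, keys:)
  ["MIGRATE", host, port.to_s,
   "",                       # empty key slot: the keys follow KEYS below
   db.to_s, timeout_ms.to_s,
   "REPLACE", "KEYS", *keys]
end
```

The resulting array could then be issued through a client's generic command interface. The open question for TJ is how to reconcile MIGRATE's key-level semantics with the batch-and-resume bookkeeping we do today.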

Future of this gem

This gem was meant to extract and abstract away the transfer-jobs step of AF. But as it stands, functionality is still fragmented between the gem and Shopify/shopify. We need to decide how to consolidate the code.

@kirs suggested that we move as much as possible of the "Transferring Redis data" concern to this gem, while leaving Podding concerns and the rake task itself in Shopify/shopify.

@Shopify/pods @Shopify/job-patterns Thoughts?

Build additional tooling on top of TJ

Using the classes and modules of TJ we should be able to build other tools. For example, I was thinking of a tool that could quarantine jobs matching a filter.

Some ideas of tools we could build:

  • Quarantine / Delete Jobs matching a filter
  • Reorganize queues (move matching jobs to front / end)
  • Collect statistics about jobs in queue

Make tasks work with rake -T

Currently, the tasks don't show up in rake -T. I believe this is because we check whether Sidekiq has been loaded, but rake -T doesn't load the environment. To fix this we should find a way to provide a 'dummy' task that is always loaded and is overwritten when we load the real task.

Extract an abstract QueueMover class

Shopify core's "queue mover" relies on internal modules, so we can't define it in the gem. We should fix an API for "queue movers" (like sidekiq_mover.rb) that allows us to keep defining application-specific ones when required.

This PR: #7 implements a sketch of the idea.

Tune batch size

We currently use a batch size of 1k, but that's not based on any thorough analysis; we should check whether it's suitable or whether we should be using a different value.
