
routemaster's Introduction


Intro | Rationale | Installing | Configuration | API | Sources of inspiration

Routemaster is an opinionated event bus over HTTP, supporting event-driven / representational state notification architectures.

Routemaster aims to dispatch events with a median latency in the 50 - 100ms range, with no practical upper limit on throughput.

Routemaster comes with, and is automatically integration-tested against, a Ruby client: routemaster-client.

For advanced bus consumers, routemaster-drain can filter the event stream and preemptively cache resources.

The basics

Routemaster lets publishers push events into topics, and lets subscribers receive events from the topics they have subscribed to.

Pushing, receiving, and subscribing all happen over HTTP.

Events are (by default) delivered in ordered batches, i.e. a given HTTP request to a subscriber contains several events, drawn from all of its subscribed topics.

Rationale

We built Routemaster because existing buses for distributed architectures did not satisfy us: they were either too complex to host and maintain, lacked key features (such as persistence), or provided too much rope to hang ourselves with.

Remote procedure call as an antipattern

Routemaster is designed on purpose to not support RPC-style architectures, for instance by severely limiting payload contents.

It only supports notifying consumers about lifecycle (CRUD) changes to resources, and strongly suggests that consumers obtain resource representations (JSON) out-of-band, from the authoritative service.

The rationale is that, much like it's all too easy to add non-RESTful routes to a web application, it's all too easy to damage a resource-oriented architecture by spreading concerns across applications, thus coupling them too tightly.

Leverage HTTP to scale

In web environments, the one type of server that scales well and can scale automatically with little effort is an HTTP server. As such, Routemaster heavily relies on HTTP.

Don't call us, we'll call you: Inbound events are delivered over HTTP so that the bus itself can scale to easily process a higher (or lower) throughput of events with consistent latency.

Outbound events are delivered over HTTP so that subscribers can scale their processing of events as easily.

We believe the cost in latency in doing so (as compared to lower-level messaging systems such as the excellent RabbitMQ) is offset by easier maintenance, more sound architecture (standards-only, JSON over HTTP), and better scalability.

Future versions of Routemaster may support (backwards-compatible) long-polling HTTP sessions to cancel out the latency cost.

Persistence

The web is a harsh place. Subscribers may die, or be unreachable in many ways for various amounts of time.

Routemaster will keep buffering, and keep trying to push events to subscribers until they become reachable again.

Topics and Subscriptions

Topics are where the inbound events are sent. There should be one topic per domain concept, e.g. properties, bookings, users.

Only one client may publish/push to a topic (and it should be the authoritative application for the concept).

Each topic fans out to multiple subscriptions, which is where the outbound events pile up. Each pulling client (subscriber) has exactly one subscription queue, which aggregates events from all of its subscribed topics.

A subscriber can "catch up" even if it hasn't pulled events for a while, as events get buffered in subscription queues.


Installing & Configuring

In order to have Routemaster receive connections from a publisher or a subscriber, their API tokens ("uuid") must be registered.

Registration is performed using the /api_tokens API, authenticating with the ROUTEMASTER_ROOT_KEY API token, like so (the service name and host are examples):

curl -s -X POST -u <root-key>: -H 'Content-Type: application/json' -d '{"name":"my-service"}' https://bus.example.com/api_tokens

For further configuration options please check the provided .env files.

Development

To get this application up and running you will need the following tools:

  • redis
    • brew install redis
    • Just let it run with default settings
    • If you want to run it manually - redis-server

Set your Redis server's URL for Routemaster with ROUTEMASTER_REDIS_URL. If you want to read the URL from a different environment variable instead, set that variable and point REDIS_ENV_KEY at its name.
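For example (MY_REDIS_URL is an illustrative variable name):

ROUTEMASTER_REDIS_URL=redis://localhost:6379/0

# or, to read the URL from another variable:
MY_REDIS_URL=redis://localhost:6379/0
REDIS_ENV_KEY=MY_REDIS_URL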

Routemaster only accepts HTTPS calls. To get around this restriction in development, please install puma-dev.

Then proxy routemaster requests by running the following:

$ echo 17890 > ~/.puma-dev/routemaster

Now all your calls to https://routemaster.dev should correctly arrive at http://127.0.0.1:17890.

You will also need Routemaster to contact your app over HTTPS to deliver events. Follow the same steps above to proxy your app's requests; e.g. for a Rails app listening on port 3000 that would be

$ echo 3000 > ~/.puma-dev/myapp

To run the Routemaster application locally you can use the foreman tool:

foreman start

This will start both the web server and ancillary processes. Keep in mind that the default port the web process listens on is defined in the .env file. By default, Routemaster's log level is set to DEBUG; if this is too chatty, you can change it in the .env file.


Advanced configuration

Metrics

Routemaster can report various metrics to a third-party service by setting the METRIC_COLLECTION_SERVICE variable to one of:

  • print (will log metrics to standard output; the default)
  • datadog (requires the DATADOG_API_KEY and DATADOG_APP_KEY to be set)
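For example, to report to Datadog (keys are placeholders):

METRIC_COLLECTION_SERVICE=datadog
DATADOG_API_KEY=<your-datadog-api-key>
DATADOG_APP_KEY=<your-datadog-app-key>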

The following gauge metrics will be reported every 10 seconds:

  • subscriber.queue.batches (tagged by subscriber queue)
  • subscriber.queue.events (tagged by subscriber)
  • jobs.count (tagged by queue and status)
  • redis.bytes_used, .max_mem, .low_mark, and .high_mark (the latter 3 being the autodropper thresholds)
  • redis.used_cpu_user and .used_cpu_sys (cumulative CPU milliseconds used by the storage backend since boot)

as well as the following counter metrics:

  • events.published (tagged by topic)
  • events.autodropped (tagged by subscriber)
  • events.removed (idem)
  • events.added (idem)
  • delivery metrics, tagged by status ("success" or "failure") and by subscriber:
    • delivery.events (one count per event)
    • delivery.batches (one count per batch)
    • delivery.time (sum of delivery times in milliseconds)
    • delivery.time2 (sum of delivery times squared)
  • latency metrics, tagged by subscriber:
    • latency.batches.count (number of batch first delivery attempts)
    • latency.batches.first_attempt (sum of times from batch creation to first delivery attempt)
    • latency.batches.last_attempt (sum of times from batch creation to successful delivery attempt)
  • process (tagged with status:start or :stop, and type:web or :worker), incremented when processes boot or shut down (cleanly)

Exception reporting

Routemaster can send exception traces to a third-party service by setting the EXCEPTION_SERVICE variable to one of:

  • print (logs exceptions to standard output; the default)
  • sentry
  • honeybadger

For the latter two, you will need to provide the reporting endpoint in EXCEPTION_SERVICE_URL.
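For example, assuming a Sentry setup (the endpoint value is a placeholder):

EXCEPTION_SERVICE=sentry
EXCEPTION_SERVICE_URL=<your-sentry-reporting-endpoint>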

Note that event delivery failures will not normally be reported to the exception service, as they're not errors with Routemaster itself.

To check delivery failures, one can:

  • monitor the routemaster.delivery.batches metrics with status:failure.
  • inspect the logs for "failed to deliver".

Autodrop

Routemaster will, by default, permanently drop the oldest messages from queues when the amount of free Redis memory drops below a certain threshold. This guarantees that the bus will keep ingesting messages, and "clean up" behind the most lagging (stalest) listeners.

Autodrop is not intended to be "business as usual": it's an exceptional condition. It's there to address the case of the "dead subscriber". Say you remove a listening service from a federation but forget to unsubscribe: messages will pile up, and without autodrop the bus will eventually crash, bringing down the entire federation.

In a normal situation, this would be addressed earlier: an alert would be set on queue staleness, and queue size, and depending on the situation either the subscription would be removed or the Redis instance ramped up, for instance.

Set ROUTEMASTER_REDIS_MAX_MEM to the total amount of memory allocated to Redis, in bytes (100MB by default). This cannot typically be determined from a Redis client.

Set ROUTEMASTER_REDIS_MIN_FREE to the threshold, in bytes (10MB by default). If less than this value is free, the auto-dropper will remove messages until twice the threshold in free memory is available.
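For example, the defaults correspond to (values in bytes, assuming binary megabytes):

ROUTEMASTER_REDIS_MAX_MEM=104857600   # 100MB
ROUTEMASTER_REDIS_MIN_FREE=10485760   # 10MB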

The auto-dropper runs every 30 seconds.

Scaling Routemaster out

  1. Allowing Routemaster to receive more events:
    This requires scaling the HTTP frontend (the web process defined in the Procfile).
  2. Allowing Routemaster to deliver more events:
    This requires running multiple instances of the worker process. No auto-scaling mechanism is currently provided, so we recommend running the number of processes you'll require at peak.
    Note that event delivery is bounded by the ability of subscribers to process them. Poorly-written subscribers can cause timeouts in delivery, potentially causing buffering overflows.
  3. Allowing Routemaster to buffer more events:
    This requires scaling the underlying Redis server.

We recommend using HireFire to auto-scale the web and worker processes.

  • To scale the web processes, monitor the /pulse endpoint and scale up if it slows down beyond 50ms.
  • To scale the worker, we provide a special /pulse/scaling endpoint that will take 1s to respond when there are many queued jobs; we recommend scaling up when this endpoint is slow. See .env for configuration of thresholds.

Note that both endpoints require authentication.
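For example, a scaling check could probe them like so (host and token are placeholders):

curl -s -o /dev/null -w '%{http_code} %{time_total}\n' -u <client-token>: https://bus.example.com/pulse
curl -s -o /dev/null -w '%{http_code} %{time_total}\n' -u <client-token>: https://bus.example.com/pulse/scaling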


API

Authentication, security.

Note that it is preferable for all tokens to be prefixed with the owning service's name followed by a double hyphen. For example:

  • publishing-service-one--UUID1234XXX
  • subscribing-service-one--UUID1234XXX

Following that format of tokens will help ensure proper reporting of metrics.

All requests over non-SSL connections will be met with a 308 Permanent Redirect.

HTTP Basic is required for all requests. The password will be ignored, and the username should be a unique, per-client token.

All allowed clients are stored in Redis. A "root" token must be specified in the ROUTEMASTER_ROOT_KEY environment variable. This key has the permissions to add, delete, and list the client tokens. Other clients (publishers and subscribers) must use a token created by this user, at the /api_tokens endpoint (see below).

Listing allowed client tokens

Authenticating with the root key,

>> GET /api_tokens

<< [{ "name": <string>, "token": <string> }, ...]

Will return a 204 if no clients exist yet.
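For example (host is a placeholder):

curl -s -u <root-key>: https://bus.example.com/api_tokens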

Adding a client

Authenticating with the root key,

>> POST /api_tokens
>> Content-Type: application/json
>>
>> { "name": <string> }

<< { "name": <string>, "token": <string> }

Returns status 201, generates a new API token, and returns it.
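For example (the service name is illustrative; the token is generated by the bus):

curl -s -X POST -u <root-key>: \
     -H 'Content-Type: application/json' \
     -d '{"name":"publishing-service-one"}' \
     https://bus.example.com/api_tokens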

Deleting a client

Authenticating with the root key,

>> DELETE /api_tokens/:token

Always returns status 204, and deletes the API token if it exists. Note that this does not cause the corresponding subscriber (if any) to become unsubscribed.

Publication (creating topics)

There is no need to explicitly create topics; they will be created when the first event is pushed to them.

ONLY ONE CLIENT CAN PUSH EVENTS TO A TOPIC: all but the first client to push to a given topic will see their requests met with errors.

Pushing

>> POST /topics/:name
>> {
>>   type:      <string>,
>>   url:       <string>,
>>   timestamp: <integer>,
>>   data:      <anything>
>> }

:name is limited to 32 characters (lowercase letters and the underscore character).

<type> is one of create, update, delete, or noop.

The noop type is used to broadcast information about all entities of a concept, e.g. to newly created/connected subscribers. For instance, when connecting a new application for the first time, a typical use case is to perform an "initial sync". Given that create, update, and delete are only sent on changes in the lifecycle of an entity, this extra event type can be sent for all currently existing entities.

<url> is the authoritative URL for the entity corresponding to the event (maximum 1024 characters, must use HTTPS scheme).

<timestamp> (optional) is an integer number of milliseconds since the UNIX epoch and represents when the event occurred. If unspecified, it will be set by the bus on reception.

<data> (optional) is discouraged although not deprecated. It is intended for cases where the RESN paradigm becomes impractical to implement, e.g. small, very frequently-changing representations that can't reasonably be fetched from the source and are inconvenient to reify as changes in the domain (typically for storage reasons).

The response is always empty (no body). Possible statuses (besides authentication-related):

  • 204: Successfully pushed event
  • 400: Bad topic name, event type, invalid URL, or extra fields in the payload.
  • 403: Bad credentials, possibly another client is the publisher for this topic.
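For example, a publisher could announce the creation of an entity like so (topic name, URL, and token are illustrative):

curl -s -X POST -u <publisher-token>: \
     -H 'Content-Type: application/json' \
     -d '{"type":"create","url":"https://orders.example.com/orders/123"}' \
     https://bus.example.com/topics/orders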

Subscription

Subscription implicitly creates a queue for the client, which starts accumulating events.

From the client's perspective, the subscription is a singleton resource. A client can therefore only obtain events from their own subscription.

>> POST /subscription
>> {
>>   topics:   [<name>, ...],
>>   callback: <url>,
>>   uuid:     <token>,
>>   timeout:  <t>,
>>   max:      <n>
>> }

Subscribes the client to receive events from the named topics. When events are ready, they will be POSTed to the <url> (see below), at most every <t> milliseconds (default 500). At most <n> events will be sent in each batch (default 100). The <token> will be used as an HTTP Basic username (not password) to the client for authentication.

The response is always empty. Subscribing again to an already-subscribed topic has no effect. If a previously subscribed topic is not listed, it will be unsubscribed.

Possible statuses:

  • 204: Successfully subscribed to listed topics
  • 400: Bad callback, unknown topics, etc.
  • 404: No such topic
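For example (callback URL, topic, and tokens are illustrative):

curl -s -X POST -u <subscriber-token>: \
     -H 'Content-Type: application/json' \
     -d '{"topics":["orders"],"callback":"https://myapp.example.com/events","uuid":"<callback-token>","timeout":500,"max":100}' \
     https://bus.example.com/subscription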

Pulling

Clients receive an HTTPS request when new batches of events are available; they don't have to poll for them. If the request completes successfully, the events will be deleted from the subscription queue. Otherwise, they will be resent at the next interval.

>> POST <callback>
>>
>> [
>>   {
>>     topic: <string>,
>>     type:  <string>,
>>     url:   <string>,
>>     t:     <integer>,
>>     data:  <anything>
>>   },
>>   ...
>> ]

All field values are as described when publishing events, with the following caveats:

  • On delivery, the timestamp field is always present; and named t instead of timestamp.
  • The data field will be omitted if unspecified or null on publication.
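For instance, a delivered batch with a single event might look like this (values are illustrative; data is omitted because it was not set on publication):

[
  {
    "topic": "orders",
    "type":  "create",
    "url":   "https://orders.example.com/orders/123",
    "t":     1518000000000
  }
]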

Possible response statuses:

  • 200, 204: Event batch is acknowledged, and will be deleted from the subscription queue.
  • Anything else: failure, batch to be sent again later.

Note that if the subscriber doesn't respond to the HTTP request within ROUTEMASTER_TIMEOUT (or if the bus can't connect to it within ROUTEMASTER_CONNECT_TIMEOUT), the delivery will be considered to have failed and will be re-attempted.

Removing topics

Publishers can delete a topic they're responsible for:

>> DELETE /topic/:name

This will cause subscribers to become unsubscribed for this topic, but will not cause events related to the topic to be removed from the queue.

Unsubscribing

Subscribers can either unregister themselves altogether:

>> DELETE /subscriber

or just for one topic:

>> DELETE /subscriber/topics/:topic
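For example (topic name and host are illustrative):

curl -s -X DELETE -u <subscriber-token>: https://bus.example.com/subscriber/topics/orders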

Monitoring

Routemaster provides monitoring endpoints:

>> GET /topics
<< [
<<   {
<<     name:      <topic>,
<<     publisher: <username>,
<<     events:    <count>
<<   }, ...
<< ]

<count> is the total number of events ever sent on a given topic.

>> GET /subscriptions
<< [
<<   {
<<     subscriber:  <username>,
<<     callback:    <url>,
<<     max_events:  <value>,
<<     timeout:     <value>,
<<     topics:      [<name>, ...],
<<     events: {
<<       sent:       <sent_count>,
<<       queued:     <queue_size>,
<<       oldest:     <staleness>,
<<     }
<<   }, ...
<< ]
  • <name>: the names of all topics routed into this subscription's queue.
  • <sent_count>: total number of events ever sent on this topic.
  • <queue_size>: current number of events in the subscription queue.
  • <staleness>: timestamp (seconds since epoch) of the oldest pending event.

Monitoring resources can be queried by clients using a client token or the root token.
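For example (host and token are placeholders):

curl -s -u <client-token>: https://bus.example.com/topics
curl -s -u <client-token>: https://bus.example.com/subscriptions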

Routemaster does not, and will not, include a UI for monitoring, as that would complicate its codebase too much (it's a separate concern, really).


Post-MVP Roadmap

Latency improvements:

  • Option to push events to subscribers over routemaster-initiated long-polling requests
  • Option to push events to subscribers over client-initiated long-polling requests

Reliability improvements:

  • Ability for subscribers to specify retention period and/or max events retained.

Monitoring:

  • Separate monitoring application, with a UI, consuming the monitoring API and pushing to Statsd.

Data payloads:

  • Some use cases for transmitting (partial) representations over the event bus are valid (e.g. for audit trails, all intermediary representations must be known).

Support for sending-side autoscaling:

  • The watch process is currently single-threaded, and running it in parallel would lose the in-order delivery capability. We plan to address this with (optional) subscriber locking in the watch.
  • Support for HireFire-based autoscaling of watch processes.

Sources of inspiration

Docker

This project contains a Dockerfile, and a Docker image is built on every CI run to ensure a smoother transition to a Docker-based architecture. Normally this step does not require any manual input from you as a developer, but you may still want to check that your image builds, or test changes to the Dockerfile. Make sure you have Docker installed on your local machine and run the following command from the root of the project:

docker build -t routemaster .

If you want to get a shell on a Docker container built from this image, build the image first (see above), then run:

docker run --rm -it routemaster sh

To run routemaster locally with docker-compose, run:

docker-compose up

This will spin up Redis, the Routemaster background worker, and the Routemaster web worker on port 3000.


routemaster's Issues

How to upgrade Routemaster?

Add documentation on how to upgrade a live Routemaster instance.

Typically:

  • deploy a separate instance with the new version
  • subscribe all existing subscribers to the new instance*
  • change the bus address for all publishers
  • wait for the old instance to drain, then discard it

* this step can, and probably should be automated

Log all events to S3

Internally, add a special "sink" that logs all events to S3, to enable:

  • replay capabilities
  • digestion by other tools (e.g. data processing for analytics)

Up for debate - maybe this is something we want as a client of the bus itself.

Switch from Unicorn to Puma

We want robust concurrency for event ingestion.
Incidentally, the faster boot times provided by Puma will make our integration tests a bit faster!

This will require

  • a fairly thorough code review for thread-safety
  • Redis connection pooling

Auto-drop messages

When a subscriber stops accepting messages for any reason (it's not up yet; it's dead for a long period), messages queues can build up, kill the bus, and thus kill the entire infrastructure.

Introduce a new dyno process that removes the oldest messages across all queues until all of the following conditions are met:

At least 10MB of Redis memory is free (configurable). This will require calling INFO, and a configuration parameter that specifies what the total Redis memory is (as the CONFIG command is usually unavailable on hosted Redises).

In Redis style, it's okay to take shortcuts when implementing: instead of iterating through queues to find the oldest message and remove them one at a time, it's okay to find the stalest queue and remove a bunch of messages from it with a few heuristics (e.g. up to the point where the queue is the second stalest).

Note that because messages don't have timestamps (only events in messages do), it's probably sensible to introduce message timestamps here (probably the time of enqueuing). Options include making the list of message UIDs a ZSET scored by timestamp, or maintaining a hash of timestamps alongside the existing data structures.

More metrics

Add more metrics:

  • rate of message ingestion, per topic
  • rate of delivered messages, per queue
  • rate of delivery failures per queue
  • rate of auto-drop, per queue (useful to alert on)
  • free bytes of memory (according to autodrop's memory measurement)
  • count of non-ack'd / pending messages, per queue

Given we do not want to add statsd/dogstatsd as a runtime requirement (makes for huge buildpacks), we'll probably want to do our own aggregation for the "rate" metrics.

Payloads in events

There's a growing number of use cases where the "pure RESTful" approach of only sending events on the bus doesn't work (e.g. high-frequency updates where the subscribers might not manage to fetch intermediary representations).

Add support for Routemaster to accept and transmit resource representations in JSON.

Implementation notes:

  • Auto-compress event data over 5kB.

Zero-copy queues

Don't actually copy messages to multiple queues; instead, keep a single copy with a reference counter.

This will significantly improve memory usage for larger federations (2:1 if each topic has 2 subscribers, for instance).

Concurrent delivery

The watch process is currently single threaded, which

  • is inefficient (most of the time is spent waiting for delivery to complete)
  • causes timing issues (#32 is hard to implement, as the watch waits for the earliest subscriber to deliver next between runs, and attempts delivery to all subscribers at each iteration).

Make it multi-threaded and/or reactive, and trigger delivery in a timely fashion (according to time_to_next_run).

Fill batches greedily

Currently, we prefer to deliver partial batches immediately if they're stale, rather than delivering full batches.

Given that popping from queues is immensely faster than delivering in the general case, prefer filling batches first.

Catch delivery errors

When the Faraday adapter fails to deliver a message, the watch process may blow up.

Catch and log Faraday::Error::ClientError, and re-raise CantDeliver.

This should handle timeouts, SSL errors, DNS issues; and a test case should be provided for each.

Subscriber/subscription refactor

The internal "subscriber" and "subscription" terminology is confusing, and the Subscribers model (collection of subscriptions for a topic, currently) is unwieldy.

Particularly, it's inconvenient to both list topics from subscriptions and list subscriptions from topics.

  • Rename "subscription" to "subscriber" globally.
  • Replace "subscribers" with a model materialising subscriptions (Topic↔︎Subscriber relationship).
  • Replace adding/removing subscriptions (incl. the UpdateSubscriptionTopics service) with creation and deletions of the new model.

Fix "sent" metric

The GET /subscription endpoint uses Subscription#all_topics_count to measure events sent for a given queue, which incorrectly sums events sent for all registered topics. This may include events sent before subscription.

Fix to count events actually sent.

New Relic exception adapter

We can currently deliver exceptions to Sentry, Honeybadger, or logs.

Given we support New Relic for performance monitoring, it'd be sensible to also report exceptions there.

(Tracked internally as IP-282)

Datadog monitoring integration

Monitoring integration needs fixing.

  • It's currently broken.
  • It's designed to ping every 5 seconds, from every web worker, which is unnecessary (it should be a separate process for now).
  • It doesn't have any test coverage.

Suggestion / request: use Dependabot to keep dependencies up-to-date

First of all, thanks for Routemaster!

I've got a suggestion / request: would you be up for using Dependabot to automatically create dependency update PRs for this repo? I ran it against my fork and it generated these PRs. I'll port the ffi one across to this repo now because it's security related.

I built Dependabot, but I'm honestly only suggesting it because I hope it can save you some time. I'd love any feedback, and obviously having open source repos using Dependabot helps boost its profile, but if it's not helpful to you then it's not really worth anything.

You can install it from here or here if you decide to give it a try. It's been through GitHub's security testing (to be allowed in the GitHub Marketplace) and is used by a few thousand organisations, and the source code is here.

:octocat:

Update subscription API

The current subscription API is confusing, because it lets clients both edit the subscriber and update the list of its subscriptions (subscribed topics).

  • Deprecate POST /subscription
  • Add the following new endpoints:
    • PUT /subscriber (edits metadata only, e.g. callback URL)
    • GET /subscriber
    • GET /subscriber/topics
    • PUT /subscriber/topics/:name
    • DELETE /subscriber/topics/:name

Also update the client, to support an explicit add/remove topic interface.

Back off when a listener is dead

When delivery fails (on whichever error, e.g. 500 from the subscriber), we instantly retry (when the batching deadline has been reached).

Implement a backoff mechanism so that retries happen after 100ms, 200ms, 400ms, 800ms, 1600ms, 3200ms, 6400ms (no higher).

Not a bug per se (everything still works), but this keeps the watch process busy retrying, at 100% CPU. It does still service other subscribers.

How to use multiple Redis instances when high traffic elevates Redis CPU usage

Under various strain scenarios (such as failing subscribers or many events going through the bus), Routemaster experiences high Redis CPU usage. This indicates that engineers cannot keep using Routemaster if their projects are continuously growing.

Because Redis is single-threaded, adding more CPUs does not help - we need to figure out a way to use multiple Redis instances.

Looking at the code, the concept of a 'single Redis' seems to be tightly coupled with the rest of the Routemaster architecture. A topic based Redis sharding, for example, would involve a significant refactor. In fact, any kind of refactor in the application that involves multiple Redis instances would result in significant changes and, in my opinion, an untidy end-result as that is not what the application was originally designed for.

One idea is to place a proxy in front of multiple Routemaster applications, with each application handling a subset of topics. We can hand a separate Redis to 'high traffic' topics and reduce the Redis CPU usage. If a single topic ever becomes 'too high traffic', we could split it across multiple Redis instances. The caveat is that we would no longer guarantee the minimum batch size and minimum batch time to the subscribers.

Since this is an open-source project, @tompave-roo and I thought it best to discuss it here and invite the original author for comment.

Hi, @mezis. 🙂 Your thoughts on the above would be immensely appreciated.

Enforce delivery timeouts

Currently, if the delivery callback takes 30s to run, the watch process will block while waiting for receivers to process events.

By design, we want receivers to implement their own deferred processing as necessary: holding HTTP connections open for too long comes with many constraints that we do not want the bus to be responsible for (it's primarily a bus, not a queue, and we don't want to manage long queues, handle too-smart scrubbing if the processes die while waiting for slow deliveries, handle scaling concerns because of slow listeners, etc).

Set non-configurable timeouts to:

  • 2 seconds for opening an HTTP connection;
  • 3 seconds to receive a response.

(the intent is that the total 5 seconds be lower than the SIGTERM grace period.)

(Tracked internally as IP-284)

Scrubbing the nack queue

If the watch process dies uncleanly for any reason, it's possible that un-acked messages will still be in the "pending" queues.

Add a process that regularly scrubs "pending" queues (ie. auto-nacks the messages).

Given messages can stay "pending" for the subscription batching timeout (t1) + the event delivery HTTP timeout (t2, not yet enforced), it's reasonable to scrub any pending messages that are older than 4 × (t1 + t2).

(Tracked internally as IP-283)

Scalable event storage

Why

  • we want to allow subscribers to "catch up" long after the event if required
  • payloads in events (#45) means the stored data volume will increase
  • the throughput might cause too much load for a single Redis

What

Two approaches.

1. Preserving the ability for atomic queue updates (i.e. Lua atoms still possible)

  • Queue event lists time-partitioned
  • Master list of partitions/segments per queue
  • Subscribers locate the "oldest" partition and pop from there
  • Event data replicated with each partition (i.e. same Redis hash slot)

This increases storage requirements as the data doesn't get shared across queues (conflicts with #18).

2. Favouring storage

  • Queue event lists time-partitioned as above
  • Event payloads held in a partitioned store (with no copies, with reference counting)
  • Regular events data scrubbing to address the refcounting write-hole (doable efficiently with sorted sets, and removing any event older than all queues)
