semian's Issues

Feature Request: Percentage Based Error Thresholds

What

Currently, we express error thresholds as the number of failures (error_threshold) in a certain time period (error_timeout). After that threshold is reached, we open the circuit, and only close it again after a certain number of successful requests (success_threshold) are reached.

This requires intimate knowledge of your request patterns. A more flexible model is to use an error percentage threshold to determine when to open the circuit. Instead of saying 3 failures in 5 seconds, one might say over 10% of requests failed.

How

Either add a new parameter, error_percent_threshold or allow error_threshold to be expressed as a percentage (e.g. "10%").

Maintain either a large sliding window of successes and errors to compute percentages, or perhaps a set of counters to reduce the overall size of the windows.
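As a rough illustration of the sliding-window option, here is a minimal sketch in plain Ruby. The class name `ErrorRateWindow` and the `minimum_requests` guard are assumptions made for this example and are not part of Semian; `error_percent_threshold` follows the parameter name proposed above.

```ruby
# Hypothetical sketch, not Semian's API: a time-based sliding window that
# tracks successes and errors and reports the error percentage.
class ErrorRateWindow
  def initialize(window_seconds:, error_percent_threshold:, minimum_requests: 1)
    @window_seconds = window_seconds
    @error_percent_threshold = error_percent_threshold
    @minimum_requests = minimum_requests # assumed guard against tiny samples
    @events = [] # [timestamp, success?] pairs, oldest first
  end

  # Record one request outcome (true = success, false = error).
  def record(success, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    @events << [now, success]
    prune(now)
  end

  # Percentage of failed requests still inside the window.
  def error_percent(now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    prune(now)
    return 0.0 if @events.empty?

    errors = @events.count { |(_, success)| !success }
    100.0 * errors / @events.size
  end

  # True once the error percentage exceeds the threshold.
  def open_circuit?(now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    prune(now)
    @events.size >= @minimum_requests && error_percent(now: now) > @error_percent_threshold
  end

  private

  # Drop events that have fallen out of the window.
  def prune(now)
    cutoff = now - @window_seconds
    @events.shift while !@events.empty? && @events.first[0] <= cutoff
  end
end
```

Storing every event keeps the sketch simple; the counter-based variant mentioned above would trade some precision for a smaller memory footprint.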

Detect Mysql2 connection errors

We should increment the errors count of the circuit breaker for MySQL connection errors (similar to #37)

But Mysql2 has only one exception class, so it's slightly dirty. Maybe we could improve the situation upstream.

Ref: brianmario/mysql2#404

mysql2: Exceptions are "wrapped" infinitely

We are trying to use Semian for our application. The Net::HTTP adapter works perfectly and I also wrote a custom adapter for logstash-logger.

Unfortunately, the MySQL adapter is behaving strangely. I was able to reproduce it in a small script:

require 'active_record'
require 'semian'
require 'semian/mysql2'

db_config = {
  adapter: 'mysql2',
  pool: 8,
  timeout: 2,
  host: 'toxiproxy',
  port: 3307,
  database: '...',
  username: '...',
  password: '...',
  reconnect: true,
  connect_timeout: 10,
  read_timeout: 10,
  semian: {
    name: 'test-db',
    bulkhead: false,
    success_threshold: 1,
    error_threshold: 3,
    error_timeout: 10
  }
}

ActiveRecord::Base.establish_connection(db_config)

loop do
  begin
    sleep 1
    puts ActiveRecord::Base.connection.execute('select now()')
  rescue StandardError => exc
    puts exc
  end
end

The output is:

# ruby test-semian.rb 
[mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
I, [2018-12-07T09:59:32.641238 #65]  INFO -- : [Semian::CircuitBreaker] State transition from closed to open. success_count=0 error_count=3 success_count_threshold=1 error_count_threshold=3 error_timeout=10 error_last_at="1544173172"
[mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
[mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Semian::OpenCircuitError caused by [mysql_test-db] Can't connect to MySQL server on 'toxiproxy' (111)
^Ctest-semian.rb:35:in `sleep': Interrupt

The circuit-breaker opened nicely as configured, but then the exception message is getting longer and longer. This doesn't happen with the Net::HTTP adapter.

Is anything wrong with the configuration? Or is it because we are using ActiveRecord?

I am using the following versions:

ruby 2.3.6

semian (0.8.5)
mysql2 (0.5.1)
activerecord (5.1.6)

(I also tested with ruby 2.5.3 and mysql2 0.5.2 with the same result.)

How do I use Semian (for dummies)

I am having trouble understanding how to use the Semian gem to implement a circuit breaker around an HTTP call. I went through the README multiple times. As far as I understand, I need to use the NetHTTP adapter, but where do I put the code?

The HTTP urls I want to wrap in circuit breaker are in separate models like the following:

require 'net/http'

class Demo
  # Calls service 1 and returns the response body.
  def call_service_1
    Net::HTTP.get_response(URI("http://service1_url.com")).body
  end
end

I have multiple models like these calling service 1, 2, 3 etc. I just want to enable circuit breaker for the Urls, and not turn the whole classes to adapters. How can I achieve this?

Note: I am new to Ruby, so maybe you need to give me some more context.

Simulator

Getting good resiliency parameters is hard. I have some ideas on the maths here (esp for bulkheads), but writing something to simulate traffic + an architecture + failing components could be an interesting way to optimize. Or maybe just being better at maths than I am.

Disappearing semaphore array

I've run into an issue with a semaphore array disappearing from the system while the app is running. I can't find any details in the logs about what would cause it, but it started happening pretty much as we moved from ubuntu 14.04 to 18.04. There are no other changes that I could see that would be related here.

The system is running with ruby 2.5.5. The exception we get is:

Semian::SyscallError: semop() failed, errno: 22 (Invalid argument)

File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire_bulkhead
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 24 in block in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 38 in block in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 141 in maybe_with_half_open_resource_timeout
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 30 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 37 in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 23 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/adapter.rb line 34 in acquire_semian_resource
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/net_http.rb line 83 in connect

This is with the latest released semian.

The issue starts occurring a number of hours after the deployment, without any obvious pattern of traffic.

I tracked the call down to:

10574.300 ( 0.015 ms): ruby/21041 semtimedop(semid: 131072, tsops: 0x7ffff92379c2, nsops: 1, timeout: 0x7ffff9237aa8) = -1 EINVAL Invalid argument

where the semid: 131072 doesn't exist on the system (normally we have 2 semaphore arrays, but this system had only 1). This was validated using ipcs -s.

Please let me know if there's any more debugging information I can provide.

Toxiproxy defined twice as a dev dependency

Noticed that toxiproxy is defined twice in the Gemfile:

  • the gemspec as a dev dependency (current gem release)
  • under the development group (git release)

This may cause some side effects with version bumps moving forward.

Subscribe to 'resource timeout' exceptions

Heya,
I've implemented the instrumentation for keeping an eye on the status of Semian adapters. However, one thing that would be helpful to add is the 'timed out waiting for resource' exceptions, which are raised from the C extension. This would be handy when configuring ticket counts and trying to find the sweet spot for your application.

I'm not sure how best to achieve this (or even if it's possible), but if someone would like to point me in the right direction, I'm more than happy to take a swing at it.

Thanks!

Ruby 2.7 Warnings

If a project is running Ruby 2.7, there are some deprecation warnings. Here are some that pop up in my project for keyword parameters:

/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian/mysql2.rb:120: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian/adapter.rb:32: warning: The called method `acquire_semian_resource' is defined here
/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian.rb:251: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian.rb:290: warning: The called method `require_keys!' is defined here
/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian/simple_sliding_window.rb:48: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
/Users/michaelmenanno/.gem/ruby/2.7.1/gems/semian-0.10.1/lib/semian/simple_sliding_window.rb:15: warning: The called method `initialize' is defined here

Feature Request: Throttle half_open -> closed attempts

What

Currently, when the error_timeout expires, the next acquisition request for a circuit will cause a transition from open to half_open. In this state, workers will attempt to access the resource with a modified timeout of half_open_resource_timeout. The motivation here is that the modified timeout is much lower than the client timeout so if the resource is still unhealthy, it will fail fast(er).

In the current implementation, every available worker (subject to the bulkhead configuration) will attempt the half_open -> closed transition. This means that if the resource is still unhealthy, all the workers could potentially block for half_open_resource_timeout seconds, reducing overall node capacity.

Mathematically, this means that t[half-open] / (t[half-open] + t[error_timeout]) of our capacity will be spent attempting to re-close the circuit. If t[half-open] is 1.0s and t[error_timeout] is 5.0s (our MySQL defaults), then 16.7% of our capacity will go toward re-closing the circuit. If bulkheads are in place with a quota of 0.5, that number will be 8.3%.
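The arithmetic above can be checked with a few lines of Ruby. The method name `half_open_capacity_fraction` is made up for this illustration; the inputs are the 1.0s and 5.0s values quoted above.

```ruby
# Fraction of capacity spent probing an unhealthy resource while the circuit
# cycles between open (error_timeout) and half_open (half_open_timeout).
# `quota` scales the result when a bulkhead limits how many workers may probe.
def half_open_capacity_fraction(half_open_timeout:, error_timeout:, quota: 1.0)
  quota * half_open_timeout / (half_open_timeout + error_timeout)
end

no_bulkhead = half_open_capacity_fraction(half_open_timeout: 1.0, error_timeout: 5.0)
with_quota  = half_open_capacity_fraction(half_open_timeout: 1.0, error_timeout: 5.0, quota: 0.5)

puts format('%.1f%%', no_bulkhead * 100) # prints 16.7%
puts format('%.1f%%', with_quota * 100)  # prints 8.3%
```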

How

When a circuit opens, the number of available tickets should immediately drop to 1. This shields the rest of the workers from this unhealthy resource. This is marginally faster than the open circuit error, since bulkhead acquisition is attempted before circuit-breaker acquisition, but that's likely not a big deal.

When the transition happens from open to half_open, we can raise the number of available tickets to success_threshold, to allow parallel re-closing of the circuit. Once the circuit is finally re-closed, we can raise the number of available tickets back to the original tickets/quota value.
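The proposal boils down to a mapping from circuit state to ticket count, sketched below. The method name `tickets_for_state` is hypothetical, and capping the half-open count at the configured ticket count is an assumption of this sketch rather than something stated above.

```ruby
# Hypothetical sketch of the proposed ticket throttling per circuit state.
def tickets_for_state(state, configured_tickets:, success_threshold:)
  case state
  when :closed
    configured_tickets # normal operation: the original tickets/quota value
  when :open
    1 # shield the other workers from the unhealthy resource
  when :half_open
    # Allow enough parallel probes to re-close the circuit, but never more
    # than the configured tickets (assumption of this sketch).
    [success_threshold, configured_tickets].min
  else
    raise ArgumentError, "unknown circuit state: #{state.inspect}"
  end
end
```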

Drop openssl dependency

We can likely just use something built into libc to hash the resources. It'd be nice to have collision detection, but it's not really a must-have since it's so unlikely.
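To give a feel for how small a non-cryptographic hash can be, here is 32-bit FNV-1a written in Ruby. This is only an illustration of the idea: it is not what Semian uses, it is not a libc function, and the real change would live in the C extension.

```ruby
# 32-bit FNV-1a, shown as an example of a simple non-cryptographic hash that
# could map a resource name to an integer key. No collision detection.
FNV_OFFSET_BASIS = 0x811c9dc5
FNV_PRIME = 0x01000193

def fnv1a_32(name)
  name.each_byte.reduce(FNV_OFFSET_BASIS) do |hash, byte|
    ((hash ^ byte) * FNV_PRIME) & 0xffffffff # keep the result within 32 bits
  end
end
```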

Net::ResourceBusyError Permission denied

Hi,
I have a question about using Semian.
I configured a simple NetHTTP adapter (like in the README):
config/initializers/semian.rb

SEMIAN_PARAMETERS = { 
  tickets:           ENV['SEMIAN_TICKETS'].to_i,
  success_threshold: ENV['SEMIAN_SUCCESS_THRESHOLD'].to_i,
  error_threshold:   ENV['SEMIAN_ERROR_THRESHOLD'].to_i,
  error_timeout:     ENV['SEMIAN_ERROR_TIMEOUT'].to_i
}.freeze

Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]
Semian::NetHTTP.semian_configuration = proc do |host, _port|
  case host
  when 'site1.com'
    SEMIAN_PARAMETERS.merge(name: 'site_1')
  when 'site2.com'
    SEMIAN_PARAMETERS.merge(name: 'site_2')
  else
    nil
  end
end

I tested it in development, and everything looked good. But a problem showed up after the production deployment. When I try to execute an HTTP request from the Rails console, I get this error:

2.3.6 :090 > RestClient.get('https://site1.com')
Net::ResourceBusyError: [nethttp_site_1] semget() failed, errno: 13 (Permission denied)
	from /usr/local/rvm/gems/ruby-2.3.6/gems/semian-0.8.3/lib/semian/adapter.rb:40:in `rescue in acquire_semian_resource'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/semian-0.8.3/lib/semian/adapter.rb:32:in `acquire_semian_resource'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/semian-0.8.3/lib/semian/net_http.rb:83:in `connect'
	from /usr/local/rvm/rubies/ruby-2.3.6/lib/ruby/2.3.0/net/http.rb:863:in `do_start'
	from /usr/local/rvm/rubies/ruby-2.3.6/lib/ruby/2.3.0/net/http.rb:852:in `start'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/rest-client-2.0.2/lib/restclient.rb:67:in `get'
	from (irb):90
	from /usr/local/rvm/gems/ruby-2.3.6/gems/railties-4.2.5/lib/rails/commands/console.rb:110:in `start'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/railties-4.2.5/lib/rails/commands/console.rb:9:in `start'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/railties-4.2.5/lib/rails/commands/commands_tasks.rb:68:in `console'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/railties-4.2.5/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
	from /usr/local/rvm/gems/ruby-2.3.6/gems/railties-4.2.5/lib/rails/commands.rb:17:in `<top (required)>'
	from bin/rails:4:in `require'
	from bin/rails:4:in `<main>'

At this point, Semian.resources returns an empty hash.

Could you explain what is wrong? I don't understand this error.

Platform check not sufficiently permissive

@sirupsen @csfrancis I was trying to use semian on an x64 Ubuntu 14.10 installation and I got Semian is not supported on x86_64-linux-gnu - all operations will no-op

The current check is just end_with?('-linux') which clearly isn't sufficient, though I don't know enough about the format of RUBY_PLATFORM to say what else is needed.
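One possible shape for a more permissive check is sketched below. The method name `linux_platform?` is hypothetical, and the assumption is that Linux values of RUBY_PLATFORM contain `linux` as a dash-delimited component (for example `x86_64-linux` or `x86_64-linux-gnu`).

```ruby
# Hypothetical sketch: accept any RUBY_PLATFORM with "linux" as a
# dash-delimited component instead of requiring the string to end in "-linux".
def linux_platform?(platform = RUBY_PLATFORM)
  platform.match?(/(\A|-)linux(-|\z)/)
end
```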

Update instrumentation docs

Semian instruments state changes in circuit breakers, but the current documentation does not reflect this:

Semian.notify(:state_change, self, nil, nil, state: new_state)

# `event` is `success`, `busy`, `circuit_open`.
# `resource` is the `Semian::Resource` object
# `scope` is `connection` or `query` (others can be instrumented too from the adapter)
# `adapter` is the name of the adapter (mysql2, redis, ..)
Semian.subscribe do |event, resource, scope, adapter|
  StatsD.increment("semian.#{event}", 1, tags: {
    resource: resource.name,
    adapter: adapter,
    type: scope,
  })
end

I might prepare a PR as soon as I find some spare time.

Adapter to gem httprb

Adapter to httprb (https://github.com/httprb/http) gem

I made the adapter but couldn't submit a PR

require 'semian/adapter'
require 'http'

module Semian
  module HTTPrb
    include Semian::Adapter

    class SemianError < ::HTTP::Error
      def initialize(semian_identifier, *args)
        super(*args)
        @semian_identifier = semian_identifier
      end
    end

    class HTTPResponseError < ::HTTP::Error
      attr_reader :response

      def initialize(response)
        super("#{response.code} #{response.reason}")
        @response = response
      end
    end
    ResourceBusyError = Class.new(SemianError)
    CircuitOpenError = Class.new(SemianError)

    class SemianConfigurationChangedError < RuntimeError
      def initialize(msg = "Cannot re-initialize semian_configuration")
        super
      end
    end

    def semian_identifier
      "httprb_#{raw_semian_options[:name]}"
    end

    DEFAULT_ERRORS = [
        ::SocketError,
        ::HTTP::ConnectionError,
        ::HTTP::RequestError,
        ::HTTP::ResponseError,
        ::HTTP::StateError,
        ::HTTP::TimeoutError,
        ::HTTP::HeaderError,
        ::EOFError,
        ::IOError,
        ::SystemCallError, # includes ::Errno::EINVAL, ::Errno::ECONNRESET, ::Errno::ECONNREFUSED, ::Errno::ETIMEDOUT, and more
        Semian::HTTPrb::HTTPResponseError,
    ].freeze

    class << self
      attr_accessor :exceptions
      attr_reader :semian_configuration

      @uri = nil

      def semian_configuration=(configuration)
        raise Semian::HTTPrb::SemianConfigurationChangedError unless @semian_configuration.nil?
        @semian_configuration = configuration
      end

      def retrieve_semian_configuration(host, port)
        @semian_configuration.call(host, port) if @semian_configuration.respond_to?(:call)
      end

      def reset_exceptions
        self.exceptions = Semian::HTTPrb::DEFAULT_ERRORS.dup
      end
    end

    Semian::HTTPrb.reset_exceptions

    def raw_semian_options
      @raw_semian_options ||= begin
        uri_match = @uri.scan(URI::DEFAULT_PARSER.make_regexp)[0]
        host = uri_match[3]
        port = uri_match[4]
        path = uri_match[6]
        @raw_semian_options = Semian::HTTPrb.retrieve_semian_configuration("#{host}#{path}", port)
        @raw_semian_options = @raw_semian_options.dup unless @raw_semian_options.nil?
      end
    end

    def resource_exceptions
      Semian::HTTPrb.exceptions
    end

    def disabled?
      raw_semian_options.nil?
    end

    def request(verb, uri, opts = {})
      @uri = uri
      return super(verb, uri, opts) if disabled?
      begin
        acquire_semian_resource(adapter: :http, scope: :connection) do
          response = super(verb, uri, opts)
          raise HTTPResponseError.new(response) if response.status.server_error?
          response
        end
      end
    end

    private

    def handle_error_responses(result)
      if raw_semian_options.fetch(:open_circuit_server_errors, false)
        semian_resource.mark_failed(result) if result.is_a?(::HTTP::Error)
      end
      result
    end

  end
end

HTTP::Client.prepend(Semian::HTTPrb)

Throw exception when protected resource delegate is nil.

Because we can now disable bulkheads or circuit_breakers for a protected_resource, we should raise an exception trying to access methods of a nil delegate.

For example, with bulkhead: false, resource.count should raise a BulkheadDisabledError rather than a NoMethodError kinda thing.

We should also add something to the docs that recommends using resource.bulkhead.count over resource.count.
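A minimal sketch of the behaviour described above, using a stand-in class rather than Semian's real ProtectedResource. The class name `ProtectedResourceSketch` and the error message are assumptions; `BulkheadDisabledError` is the name suggested in this issue.

```ruby
# Raised when a bulkhead-only method is called on a resource whose bulkhead
# has been disabled (bulkhead: false).
class BulkheadDisabledError < StandardError
  def initialize(resource_name)
    super("bulkhead is disabled for resource '#{resource_name}'")
  end
end

# Stand-in for a protected resource that delegates to an optional bulkhead.
class ProtectedResourceSketch
  attr_reader :name, :bulkhead

  def initialize(name, bulkhead)
    @name = name
    @bulkhead = bulkhead # nil when the bulkhead is disabled
  end

  # Delegated method: fail with a descriptive error instead of NoMethodError.
  def count
    raise BulkheadDisabledError, name if bulkhead.nil?

    bulkhead.count
  end
end
```

Callers that use `resource.bulkhead.count` directly can still check `resource.bulkhead.nil?` first, which is why the sketch leaves the `bulkhead` reader returning nil.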

cc @sirupsen @csfrancis

New documentation

We're ready to write some good documentation now that we've extracted the useful parts from Shopify and use it in production.

  • Basic usage (#42)
  • How does Semian work?
  • Decorator pattern for rescuing exceptions
  • rdoc

cc @byroot

Semian sysv semaphores are not supported on x86_64-darwin21

While trying to figure out how to use Semian NetHTTP adapter, I found this INFO log.
INFO -- : Semian sysv semaphores are not supported on x86_64-darwin21 - all operations will no-op

I was not sure whether this is concerning, so I opened an issue.

Proxy

If Semian was a proxy it'd be simpler to use, and could be opt-out instead of opt-in for apps in an organization.

Some inspiration could be taken from: https://github.com/vektra/templar

It possibly would be re-implemented in something like Go. This also solves issues such as sharing state between containers.

Broken subscribe interface

Following the monitoring instructions in the readme:

# `event` is `success`, `busy`, `circuit_open`.
# `resource` is the `Semian::Resource` object
# `scope` is `connection` or `query` (others can be instrumented too from the adapter)
# `adapter` is the name of the adapter (mysql2, redis, ..)
Semian.subscribe do |event, resource, scope, adapter|
  StatsD.increment("Shopify.#{adapter}.semian.#{event}", 1, tags: [
    "resource:#{resource.name}",
    "total_tickets:#{resource.tickets}",
    "type:#{scope}",
  ])
end

Results in an Error:

NoMethodError: undefined method `tickets' for #<Semian::CircuitBreaker:0x0000558e2ec77ba0>

It seems the resource is a Semian::CircuitBreaker, and not a Semian::Resource!

Introduced by: #238

Semian Version: 0.8.9
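Until the interface is fixed, a subscriber could guard against objects that have no `tickets` method. The helper name `semian_tags` is hypothetical, and the sketch assumes that whatever object is passed as `resource` at least responds to `name`.

```ruby
# Hypothetical workaround: only emit the total_tickets tag when the object
# passed to the subscriber actually exposes #tickets.
def semian_tags(resource, scope)
  tags = ["resource:#{resource.name}", "type:#{scope}"]
  tags << "total_tickets:#{resource.tickets}" if resource.respond_to?(:tickets)
  tags
end
```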

Possible concurrency issues

@casperisfine @sirupsen

I was poking around and found what I think are a few (hypothetical) issues when some semian methods are called concurrently. It's possible I'm missing something in the architecture (e.g are adapter methods expected to always be mutexed by the parent driver the way redis is?) but if not then here are some things I noticed:

  • multiple concurrent calls to a new adapter can end up racing in retrieve_or_register creating multiple instances of the same circuit-breaker and protected resource
  • various issues in the circuit-breaker, e.g. concurrent calls to request_allowed? can "lose" successful responses by triggering multiple transitions to the half-open state
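For the first point, one common remedy is to serialize registration behind a mutex. The sketch below uses a stand-in `ResourceRegistry` class, not Semian's actual registry, to show the pattern.

```ruby
# Hypothetical sketch: a registry whose retrieve_or_register is safe to call
# from several threads at once. The block runs at most once per name.
class ResourceRegistry
  def initialize
    @mutex = Mutex.new
    @resources = {}
  end

  def retrieve_or_register(name)
    @mutex.synchronize do
      # The check and the assignment happen under the same lock, so two
      # threads can never both see "missing" and create duplicates.
      @resources[name] ||= yield
    end
  end
end
```

Holding the lock while the block runs keeps the example short; if creating a resource is slow, a per-name lock would reduce contention.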

Fault injection

I've been reading about fuse, a mature circuit breaker library for Erlang (a platform known for "resiliency by default").

In circuit breakers configuration, they have two fuse types (you can think of them similar to toxics in Toxiproxy):

  • Standard fuses, {standard, MaxR, MaxT}. These are fuses which tolerate MaxR melt attempts in a MaxT window, before they break down.
  • Fault injection fuses, {fault_injection, Rate, MaxR, MaxT}. This fuse type sets up a fault injection scheme where the fuse fails at rate Rate, a floating point value between 0.0–1.0. If you enter, say 1 / 500 then roughly every 500th request will see a blown fuse, even if the fuse is okay. This can be used to add noise to the system and verify that calling systems support the failure modes appropriately. The values MaxR and MaxT work as in a standard fuse.

IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

We should think about adopting this idea in Semian. The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness? Or should we do this only in production?
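To make the idea concrete, here is a minimal sketch of a rate-based fault injector in Ruby. The names `FaultInjector` and `InjectedFaultError` are invented for this example and do not exist in Semian; the random source is injectable so tests can seed it or set the rate to 0.0 to avoid flakiness.

```ruby
# Hypothetical sketch of a fault-injection wrapper in the spirit of fuse's
# fault_injection fuses. Not part of Semian.
class InjectedFaultError < StandardError; end

class FaultInjector
  def initialize(rate:, random: Random.new)
    raise ArgumentError, "rate must be within 0.0..1.0" unless (0.0..1.0).cover?(rate)

    @rate = rate
    @random = random # injectable so tests can pass a seeded Random
  end

  # Runs the block, except that roughly `rate` of the calls fail artificially.
  def call
    raise InjectedFaultError, "injected fault" if @random.rand < @rate

    yield
  end
end
```

With a rate of 1.0 / 500, roughly one call in 500 raises InjectedFaultError even though the protected resource is healthy.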

Exponential backoff

Something I've been looking into lately is how we can combat the stampeding herd effect we occasionally incur once a system has recovered and is able to receive traffic again. One approach I've explored is using exponential backoff, and I was looking to find out if this is something you'd consider adding to semian. I think semian is a sensible place to put this because it already has knowledge of the tickets/quotas and error rates, and could use its already available data to decide how far to push out the backoff without needing to query another resource.

Also open to hearing about how you've addressed this at Shopify if you've got a good handle on it in other ways 😄
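For reference, a common formulation is capped exponential backoff with random jitter, sketched below. The method names and the default `base` and `cap` values are assumptions chosen for illustration and do not correspond to any Semian option.

```ruby
# Upper bound on the delay for a given attempt: base * 2^attempt, capped.
def backoff_ceiling(attempt, base: 0.5, cap: 30.0)
  [cap, base * (2**attempt)].min
end

# "Full jitter": pick a random delay between 0 and the ceiling so that
# recovering clients do not all retry at the same instant.
def backoff_delay(attempt, base: 0.5, cap: 30.0, random: Random.new)
  random.rand * backoff_ceiling(attempt, base: base, cap: cap)
end
```

The jitter matters as much as the exponent here: without it, every worker that saw the outage would come back at the same moment.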

Semian library for nodejs

Is there a Semian-style circuit breaker library for Node.js that we could use together with Toxiproxy?

Configure tickets per worker, not statically

A big pain point of configuring semian is that you might have hosts with different numbers of processes.

It could be interesting to investigate a way to configure Semian to dynamically define the number of tickets based on the process count.

e.g. I want ceil(0.2 * process_count) tickets.
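The formula above is easy to express directly. The method name `tickets_for` and the lower bound of one ticket are assumptions of this sketch; the rounding step only guards against floating-point noise pushing a whole number up to the next integer.

```ruby
# Hypothetical sketch: derive the ticket count from the number of worker
# processes, i.e. ceil(quota * process_count), with at least one ticket.
def tickets_for(process_count, quota: 0.2)
  raise ArgumentError, "quota must be in (0, 1]" unless quota > 0 && quota <= 1

  # round(6) removes float noise such as 3.0000000000000004 before ceil.
  [(quota * process_count).round(6).ceil, 1].max
end
```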

cc @sirupsen

Updating the docs to specifically call out some common error messages

Hello all!

I've been looking through the docs and googling and couldn't figure out what these errors actually mean, so I wanted to just ask.

I'm happy to make a PR to update the readme if anyone has time to help me figure these out. I think it would help others as well!


Error: Net::ReadTimeout/Net::OpenTimeout

This one below, I'm fairly confident is caused by the connection exceeding the read_timeout or open_timeout options in net/http.

[nethttp_example.com_443] Semian::OpenCircuitError caused by Net::ReadTimeout

Error: timed out waiting for resource

This one is related to https://github.com/Shopify/semian/blob/master/ext/semian/resource.c#L60, but I'm not totally certain what EAGAIN actually means in this context or why this would be caused. https://stackoverflow.com/a/28868162.

My best guess is that when trying to acquire the semaphore, the underlying OS didn't give one in time so Semian gave up. Is this something that one can even fix?

[nethttp_example.com_443] Semian::OpenCircuitError caused by timed out waiting for resource 'nethttp_example.com_443'

Error: execution expired

Okay, this last one I'm super stumped by. It seems to be related to https://github.com/ruby/ruby/blob/master/lib/timeout.rb#L94, but I'm really struggling to track back to what could possibly cause this if it's not Net::HTTP. Unless maybe it's Rack timing out the entire Ruby process or something? Super open to ideas.

[nethttp_example.com_443] Semian::OpenCircuitError caused by execution expired

Thank you for any pointers or advice. Once I feel like I have a basic understanding, I'll be happy to make a PR to describe these and do all the word smithing! Thank you!

Failed to compile ext with multiple definition errors

$ uname -a
Linux orion.dev 5.5.7-200.fc31.ppc64le #1 SMP Fri Feb 28 17:07:46 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux

$ gcc --version
gcc (GCC) 10.0.1 20200216 (Red Hat 10.0.1-0.8)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ld --version
GNU ld version 2.34-2.fc32
Copyright (C) 2020 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

$ make --version
GNU Make 4.2.1
Built for powerpc64le-redhat-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ make
compiling semian.c
compiling tickets.c
linking shared-object semian/semian.so
/usr/bin/ld: semian.o:(.bss+0x60): multiple definition of `eSyscall'; resource.o:(.bss+0x28): first defined here
/usr/bin/ld: semian.o:(.bss+0x58): multiple definition of `eTimeout'; resource.o:(.bss+0x20): first defined here
/usr/bin/ld: semian.o:(.bss+0x50): multiple definition of `eInternal'; resource.o:(.bss+0x18): first defined here
/usr/bin/ld: semian.o:(.bss+0x48): multiple definition of `id_wait_time'; resource.o:(.bss+0x10): first defined here
/usr/bin/ld: semian.o:(.bss+0x40): multiple definition of `id_timeout'; resource.o:(.bss+0x8): first defined here
/usr/bin/ld: semian.o:(.bss+0x38): multiple definition of `system_max_semaphore_count'; resource.o:(.bss+0x0): first defined here
/usr/bin/ld: sysv_semaphores.o:(.bss+0x10): multiple definition of `eSyscall'; resource.o:(.bss+0x28): first defined here
/usr/bin/ld: sysv_semaphores.o:(.bss+0x0): multiple definition of `eInternal'; resource.o:(.bss+0x18): first defined here
/usr/bin/ld: sysv_semaphores.o:(.bss+0x8): multiple definition of `eTimeout'; resource.o:(.bss+0x20): first defined here
/usr/bin/ld: tickets.o:(.bss+0x10): multiple definition of `eSyscall'; resource.o:(.bss+0x28): first defined here
/usr/bin/ld: tickets.o:(.bss+0x8): multiple definition of `eTimeout'; resource.o:(.bss+0x20): first defined here
/usr/bin/ld: tickets.o:(.bss+0x0): multiple definition of `eInternal'; resource.o:(.bss+0x18): first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:261: semian.so] Error 1

The program could be compiled if I set LDFLAG explicitly with --allow-multiple-definition

Could the team resolve these multiple definition errors so that semian can be compiled without the --allow-multiple-definition flag?

Opt out of ticket count feature

What

It would be good to be able to configure semian without having to specify a ticket count, or to be able to explicitly disable the ticket count feature altogether.

Why

You might not want to limit the number of concurrent requests to a resource. This is the case in shopify-app-store, where we don't want to limit the number of concurrent requests to the Shopify API.

ENV['SEMIAN_SEMAPHORES_DISABLED'] and SEM_UNDO

Hi, I'm trying to get my head around the relationship between the "SEMIAN_SEMAPHORES_DISABLED" environment variable and whether semaphores are actually utilized by the library. I wholeheartedly admit that most of my confusion is due to my lack of understanding about Ruby's C extensions, so my apologies for that. Here's my question though...

If I set SEMIAN_SEMAPHORES_DISABLED=1 then the following if statement should execute the else block:

https://github.com/Shopify/semian/blob/master/lib/semian.rb#L177

if Semian.semaphores_enabled?
  require 'semian/semian'
else
  Semian::MAX_TICKETS = 0
end

If that's the case, does that mean that the C extension code is not pulled in? Assuming yes, does that mean that semaphore calls will not be performed?

Ultimately, I ask this to determine whether the SEM_UNDO threading issue would be avoided in this case. Am I barking up the wrong tree?

Dynamic semaphore initialization from ticket quota

The gist of this proposal is to eliminate the assumption that there is a fixed number of workers (resource consumers) on a particular host. In a more dynamic scheduling environment (think: Kubernetes), we cannot be certain of the number of resource consumers on a given host.

This is problematic, because under the current model we would have a ticket quota of a fixed size for a fixed number of workers. For illustration:

  • Assume we have a resource that permits 5 tickets (T) -> T = 5
  • Assume we have 10 workers (W) -> W = 10
  • In this case, only half of the workers may access the resource at a time, giving a quota (Q) of T / W -> Q = 0.5

So, since W is no longer static, we need T to react to changes in W so as to preserve Q at 0.5.
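
The arithmetic above can be sketched in Ruby. This is only an illustration: `tickets_for_quota` is a hypothetical helper and not part of semian, and rounding up (so that a small worker pool never ends up with zero tickets) is an assumption rather than something the proposal specifies.

```ruby
# Hypothetical helper, not part of semian's API.
# Given a quota Q (the fraction of workers allowed to hold a ticket at once)
# and the current worker count W, compute the ticket count T.
def tickets_for_quota(quota, workers)
  raise ArgumentError, "quota must be in (0, 1]" unless quota > 0 && quota <= 1
  raise ArgumentError, "workers must be positive" unless workers.positive?

  # Assumption: round up so that T is never zero for a non-empty worker pool.
  (quota * workers).ceil
end

puts tickets_for_quota(0.5, 10) # W = 10 -> T = 5, matching the example above
puts tickets_for_quota(0.5, 4)  # W = 4  -> T = 2
puts tickets_for_quota(0.5, 3)  # W = 3  -> T = 2 (1.5 rounded up)
```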

The proposed implementation is as follows:

  • Have a new semaphore, initialized to the maximum number of tickets, that tracks the tickets per worker (it needs to be unique per process or thread, probably keyed on the parent pid or pid_threadid). This would be the "quota semaphore", or "worker quota semaphore", tracking the number of worker tickets that have been issued. As new, unique workers are added, we decrement this value. As they are removed, we increment it.
  • The difference between the quota semaphore and the configured global maximum is the number of workers participating in the quota. This allows us to keep track of W.
  • As we update the quota semaphore, we can dynamically update the number of available RPC tickets (T), in order to maintain the desired quota (Q).

A nuance to this is:

When workers unregister themselves (they're killed or stop and SEM_UNDO does its thing), the worker count needs to be adjusted by something. We can cache the worker count in a semaphore in the semaphore set for the resource. On #acquire if it's different, we call update_ticket_count. It seems better for this reason to do it at #acquire time rather than #register time.
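
The bookkeeping described in this proposal can be modelled in memory as follows. This is a rough sketch only: a real implementation would use SysV semaphores with SEM_UNDO rather than Ruby instance variables, and the class `QuotaModel`, its method names, and the round-up rule for T are all hypothetical.

```ruby
# In-memory model of the proposal; every name here is hypothetical and
# plain integers stand in for the SysV semaphores.
class QuotaModel
  attr_reader :tickets

  def initialize(max_workers:, quota:)
    @max_workers = max_workers
    @quota = quota
    @quota_semaphore = max_workers # decremented once per registered worker
    @cached_workers = 0            # worker count seen at the last acquire
    @tickets = 0
  end

  # A new, unique worker joins: decrement the quota semaphore.
  def register_worker
    raise "no worker slots left" if @quota_semaphore.zero?
    @quota_semaphore -= 1
  end

  # A worker exits (in the proposal, SEM_UNDO would restore the value).
  def unregister_worker
    raise "no workers registered" if @quota_semaphore == @max_workers
    @quota_semaphore += 1
  end

  # W = configured maximum - current quota semaphore value.
  def workers
    @max_workers - @quota_semaphore
  end

  # On acquire, refresh T only if W changed since the last call.
  def acquire
    if workers != @cached_workers
      @cached_workers = workers
      @tickets = (@quota * workers).ceil # assumption: round up
    end
    @tickets
  end
end

model = QuotaModel.new(max_workers: 100, quota: 0.5)
10.times { model.register_worker }
puts model.acquire # 10 workers -> 5 tickets
4.times { model.unregister_worker }
puts model.acquire # 6 workers -> 3 tickets
```

Refreshing the ticket count inside `acquire` mirrors the nuance above: a worker that dies never gets to run any unregister logic itself, so the next caller to acquire is the one that notices the changed worker count.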

Support for error percentage?

Hi guys, love the library, but I wanted to see if you would be open/interested in adding support for error percent in addition to the current absolute error threshold? I find dealing with error rates in terms of percentages much more flexible than absolute values. If you're open to it I would be interested in trying to tackle it and send you a pull request. However, I don't want to maintain a fork of the project, so only want to go down this path if it is likely to get integrated back into the main code line.
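
To make the request concrete, here is a minimal sketch of what a percentage-based check might look like. It is not semian's API: the class `ErrorRateWindow`, its parameters, and the minimum-request guard (which keeps a single early failure from counting as a 100% error rate) are all assumptions made for illustration.

```ruby
# Hypothetical sketch, not semian's API: decide whether a circuit should open
# based on the error percentage over a sliding time window.
class ErrorRateWindow
  def initialize(error_percent_threshold:, window_seconds:, minimum_requests:)
    @threshold = error_percent_threshold
    @window = window_seconds
    @minimum = minimum_requests
    @events = [] # [timestamp, success?] pairs, oldest first
  end

  # Record the outcome of one request.
  def record(success, now = Time.now.to_f)
    @events << [now, success]
    prune(now)
  end

  # Percentage of failed requests among those still inside the window.
  def error_percent(now = Time.now.to_f)
    prune(now)
    return 0.0 if @events.empty?

    failures = @events.count { |_, ok| !ok }
    100.0 * failures / @events.size
  end

  # Open only when there is enough traffic and the error rate is too high.
  def open?(now = Time.now.to_f)
    prune(now)
    @events.size >= @minimum && error_percent(now) >= @threshold
  end

  private

  # Drop events that have aged out of the window.
  def prune(now)
    @events.shift while @events.any? && @events.first[0] <= now - @window
  end
end

window = ErrorRateWindow.new(error_percent_threshold: 50, window_seconds: 10, minimum_requests: 4)
3.times { |i| window.record(false, 100.0 + i) }
window.record(true, 103.0)
puts window.error_percent(103.0) # 3 failures out of 4 -> 75.0
puts window.open?(103.0)         # true
puts window.open?(200.0)         # false: all events have aged out
```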
