eapache / go-resiliency Goto Github PK

View Code? Open in Web Editor NEW

2.1K 28.0 145.0 74 KB

Resiliency patterns for golang

Home Page: https://godoc.org/github.com/eapache/go-resiliency

License: MIT License

Go 100.00%

resiliency go golang resiliency-patterns

go-resiliency's People

Contributors

Stargazers

Watchers

Forkers

hihus libliflin omeid scalingdata vonwenm winlinvip glkz nathanslaughtergs jmptrader houcy roger2000hk ateleshev robmurtha justvisiting h2non albertoleal linearregression todostreaming anirban-sap agend07 tomergolany kidtronnix whosyourban continuumllc gophersgang rasouli jessedearing dovys patricktoca ycaihua tanshungang dron22 sudersen yuanfeng0905 zhuquanji canv15 mailtruck oopsoutofmemory psimoesssimoes roscopecoltran 3dsinteractive forging2012 hohice www6v d7561985 weborama brainlabs aryanugroho sulthonzh tukejonny c1dx cubelite etherlabsio rgarcia joel-tkp satnami maniacs-ops neumachen aodubela arvancs awesomegolang seanlucif rajikaimal 5idu forkkit ismvarunsharma lihuawu 1046102779 yiduwangkai kelvinbenzali snedzad m-sters zhiruchen rizalgowandy gaoey giedriuss fstab tpaschalis kc596 genisysram dipjyotimetia ben-swit xielongdragon whiledoing dds klever-io kokizzu kyawmyintthein vtanrun lucasepe isgasho mikelevine thalesfsp gururajrkatti yabzhang huangmarco standardgalactic bboyrankingz j2deen baozixiaoxixi

go-resiliency's Issues

Please release again!

Missing a number of new fixes, including RunCtx :)

goroutine leak in Deadline.Run

if Run() returns ErrTimeOut, the work function would block on sending to channel result.

retrier: support context.Context

Greetings,
I'm a user of the retry package, mostly because it's a dependency of the shopify/sarama package - so it's already available in my vendor directory.

I'm interested to know if you'd be in favor of supporting the context package for retrier -- I imagine the interface to be

(r *Retrier) RunCtx(context.Context, func() error) error

This would allow the retrier to exit early if the provided context is cancelled prior to the retry wait period.
If the context is cancelled, it could return the error of the last failure, or ctx.Err() - I'm unsure which would be preferred.

I would imagine the retier struct would need to be extended to include a time.Timer that would be reset upon each call to Run/RunCtx, and each iteration of attempts.

Exiting early would me a simple matter of having a function such as

// sleep will return true if the time channel has a message
// sent on it before the context expires
//
// accepts a <-chan time.Time for more deterministic testing
// no need to rely on the runtime to fire the timer
func sleep(ctx context.Context, t <-chan time.Time) bool {
    select {
    case <-t:
        return true
    case <-ctx.Done():
        return false
    }
}

the call to time.Sleep() could then be replaced with

if !sleep(ctx, r.timer.C) {
    // return previous error?
    return ctx.Err()
}

The implementation of Run could then become

(r *Retrier) Run(fn func() error) error {
    return r.RunCtx(context.Background(), fn)
}

This approach is seen in the standard library, and is a nice way to maintain backwards compatibility.

Please let me know your thoughts - if this is something you'd approve of, I could likely find the time to do the implementation.

race condition error on batcher

Hello

we discover a possible race condition on batcher since version go-resiliency 1.4 and 1.5

Here is a minimal example with a test that trigger this issue

I tested with latest go (1.22) and go-resiliency 1.5

source code is available here https://github.com/peczenyj/example-race

package: https://github.com/peczenyj/example-race/blob/main/foo/foo.go

test: https://github.com/peczenyj/example-race/blob/main/foo/foo_test.go

$ go test -race ./foo/...
==================
WARNING: DATA RACE
Write at 0x00c00009eec8 by goroutine 7:
  github.com/eapache/go-resiliency/batcher.(*Batcher).submitWork()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:74 +0xed
  github.com/eapache/go-resiliency/batcher.(*Batcher).Run()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:55 +0x20e
  github.com/peczenyj/example-race/foo.(*Foo).Get()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/foo/foo.go:36 +0x2e7
  github.com/peczenyj/example-race/foo_test.TestBatcher()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/foo/foo_test.go:23 +0x2e8
  testing.tRunner()
      /usr/local/go1.22.0/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go1.22.0/src/testing/testing.go:1742 +0x44

Previous read at 0x00c00009eec8 by goroutine 8:
  github.com/eapache/go-resiliency/batcher.(*Batcher).batch()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:100 +0x3d8
  github.com/eapache/go-resiliency/batcher.(*Batcher).submitWork.gowrap2()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:76 +0x33

Goroutine 7 (running) created at:
  testing.(*T).Run()
      /usr/local/go1.22.0/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      /usr/local/go1.22.0/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      /usr/local/go1.22.0/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      /usr/local/go1.22.0/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      /usr/local/go1.22.0/src/testing/testing.go:2027 +0xf17
  main.main()
      _testmain.go:49 +0x2bd

Goroutine 8 (finished) created at:
  github.com/eapache/go-resiliency/batcher.(*Batcher).submitWork()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:76 +0x1ef
  github.com/eapache/go-resiliency/batcher.(*Batcher).Run()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/vendor/github.com/eapache/go-resiliency/batcher/batcher.go:55 +0x20e
  github.com/peczenyj/example-race/foo.(*Foo).Get()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/foo/foo.go:36 +0x283
  github.com/peczenyj/example-race/foo_test.TestBatcher()
      /home/weborama.office/tiago/work/go/src/github.com/peczenyj/example-race/foo/foo_test.go:22 +0x284
  testing.tRunner()
      /usr/local/go1.22.0/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go1.22.0/src/testing/testing.go:1742 +0x44
==================
--- FAIL: TestBatcher (2.01s)
    testing.go:1398: race detected during execution of test
FAIL
FAIL	github.com/peczenyj/example-race/foo	2.017s
FAIL

while using go-resiliency 1.3 there is no race

$ go test -race ./foo/...
ok  	github.com/peczenyj/example-race/foo	3.016s

regards

Please release again

The sole current release is missing years of fixes

Retry / Backoff

As the title says, it might be handy to have a package for retriable actions with different back-off strategies etc.

please tag and version this project

Hello,

Can you please tag and version this project?

I am the Debian Maintainer for go-resiliency and versioning would help Debian keep up with development.

Error unwrapping for `retrier/`

Hello, I've some feature request regarding the title

Basically, I propose we use errors.Is() from standard library to check for the blacklisted/whitelisted error types (the err == pass bit)

go-resiliency/retrier/classifier.go

Lines 40 to 44 in b98ce28

 for _, pass := range list { 

 if err == pass { 

 return Retry 

 } 

 }

The reason is that currently we need to specify exacly what the error kind is, without checking if an error actually contains ("wraps") another error which are actually listed

Say we've implements a custom type to better handle specific error usecase

type HTTPError struct {
    error
    host string
    path string
}

func (h HTTPError) Error() string { // implements Go's `error` type
    return "failed on HTTP request to " + h.host + h.path + " reason: " + h.error.Error()
}

func (h HTTPError) Unwrap() error { // will be called from standard library error unwrapping
    return h.error
}

And that the error type wraps context.Canceled, which is defined as a blacklist

backoff := // ...
class := retrier.BlacklistClassifier{
	context.Canceled,
}
retry := retrier.New(backoff, class)

// ...

retry.Run(func() {
    // ...
    return HTTPError{error: context.Cancelled, host: "...", path: "..."}
})

The blacklist check would fail, because err == pass won't know what the "internal" value of the error actually is

clarify deadline example

The deadline example says:
// check stopper function and give up if timed out
What is the stopper function? Stopper is a channel. Perhaps the example can be expanded to show how to do this?

feature request: retrier: reset number of retries/backoff time

Imagine that I have an application that is an event processor that connects to some event log and for every event that arrives I want to do something with it. Also imagine that I have a Retrier with exponential backoff (100ms initial backoff time) and 5 retries.

We can get errors because of network issues (server not ready to accept connections, server disconnected for some reason like server LoadBalancer downscaling instances) or errors processing events.

If the app fails 5 times to process an event the retry loop should end and return error. This is good.
If my app fails to connect 5 times in a row the same should happen.

But if my app connects to the server, happily processes messages for 2hours then gets a server disconnection. It will retry the connection and (if successful) continue processing events. In the next few hours we receive 4 more server disconnections, but after each one we're able to reconnect and continue processing messages for some time. It does not make sense to say "we've failed 5 times so let's just finish here and return an error".

So basically the Retrier should expose a Reset() method which would cause the retrier to "forget" that it encountered errors before so that the next time that it encounters an error, instead of sleeping for a long time due to exponential backoff it will just sleep for the initial 100ms (as per the example) and it could tolerate 5 more errors later.

Here's a very simple skeleton of code that shows how this feature could be used:

r := retrier.New(retrier.ExponentialBackoff(5, 100 * time.Milliseconds), nil)
err := r.Run(func() error {
  server, err := connect(address)
  if err != nil {
    return err
  }

  for {
    event, err := server.getNextEvent()
    if err != nil {
      return err
    }
    err = processEvent(event)
    if err != nil {
      return err
    }
    // Because we've been able to process an event it seems that all is working correctly
    // so it doesn't matter that in the past we had (for example 2 errors), the next time we
    // see an error should be like it was the first error we've seen, and the backoff time to
    // to wait until next try should be the initial one passed by the user.
    r.Reset()
  }
}

Reset implementation in CircuitBreaker

From what I see in this repository, this circuit breaker does not have a function to reset itself, so that the treshold all resets but without creating new CircuitBreaker. This help so much when the circuit breaker is intended to do multiple breaker on diffrent function in one file.

Support service oriented calls for retrier

I like to use retrier to wrap calls to external services one thing that is generating a bit of verbose code right now is that the work functions do not return any value in their signature (totally understandable) .

The idea here would be to create a new Run/RunCtx signature using generics and make the work function return the type that is passed to the function...

func (r *Retrier) RunR[R any](work func() (R, error)) error

func (r *Retrier) RunCtxR[R any](ctx context.Context, work func(ctx context.Context) (R, error)) error

This way the caller can just wrap their normal function without having to var the return...

I understand this would force the library to jump all the way to go 1.18 but just wondering if you would be open to me PRing this..

I can also try to use some type of interface{} approach so that there is no need to go go 1.18

Request for New Tag: Add Infinite Retry Option

Hello,
I recently made a change to the library and I would like new tag to be so I can rely on it, in my project. The specific change is the introduction of a new option called "WithInfiniteRetry()" allowing users to configure the retrier for infinite retry attempts.

Change Details:

New option: WithInfiniteRetry()
Users can now set this option to enable infinite retries for scenarios where continuous retry attempts are desired.

Why This Matters:

Infinite retries can be useful in scenarios where a user wants to continuously attempt an operation until success without a predefined limit. This change enhances the flexibility and adaptability of the retrier, catering to a wider range of use cases.
IMHO, it's supposed to be an alternativ to handle in a more proper way a panicOnError, in a transactional context.

Request:

Can you please create a new tag for this enhancement, making it easier for users to integrate this feature into their projects. I understand and appreciate your time and effort in maintaining this library. 🙏

Thank you for considering this request. If there are any additional details or steps required from my end, please let me know.

Thanks a lot !

Exponential backoff should have some jitter

This minimizes the risk of correlated retries. Compare to goback's jitter: https://github.com/carlescere/goback#jitter-backoff

add log for breaker error

I feel that you should add log when some occur in function func (b *Breaker) doWork(state uint32, work func() error) error, it convenient for us to find error

breaker_test.go:173: circuit breaker is open

Hello!

We are getting the following error in the Debian CI:

--- FAIL: TestBreakerAsyncStateTransitions (4.03s)
	breaker_test.go:173: circuit breaker is open
FAIL
exit status 1
FAIL	gopkg.in/eapache/go-resiliency.v1/breaker	14.081s
=== RUN   TestDeadline
--- FAIL: TestDeadline (0.04s)
	deadline_test.go:27: timed out waiting for function to finish
FAIL
exit status 1
FAIL	gopkg.in/eapache/go-resiliency.v1/deadline	0.080s
=== RUN   TestConstantBackoff

full log: https://tests.reproducible-builds.org/debian/logs/buster/amd64/golang-gopkg-eapache-go-resiliency.v1_1.0.0-1.build2.log.gz

Looks like it only happens some of the time.

retrier: RunCtx is not immediately cancellable

When ctx passed to RunCtx(ctx, func(ctx ... is cancelled, the RunCtx method does not return early, instead it waits until the current backoff time passes.

This is caused by the following code:

go-resiliency/retrier/retrier.go

Lines 98 to 103 in b912f08

 case <-ctx.Done(): 

 if !timer.Stop() { 

 <-timer.C 

 } 

 return ctx.Err()

While researching the proper API usage I stumbled upon a blog post that proposes such construct.

In my opinion this blog post is a bit misleading, because it applies a solution for time.After to a time.NewTimer usage, which makes no sense.

So I used sourcegraph.com/search to search for some of the usages of the time.NewTimer which will confirm my above conclusion.

So I found the following NewTimer usage in the net/http package of the standard library:

	timer := time.NewTimer(nextPollInterval())
	defer timer.Stop()
	for {
		if srv.closeIdleConns() {
			return lnerr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-timer.C:
			timer.Reset(nextPollInterval())
		}
	}

https://github.com/golang/go/blob/b2dbfbfc2315557815e1d5de12f28ed57f60958a/src/net/http/server.go#L3008-L3020

As shown above, there is no need to wait on timer.C when ctx is cancelled, we just need to ensure the timer is stopped when it goes out of scope.

So combining all the above, the issue I am reporting (that RunCtx does not return immadiately when ctx is cancel) could be fixed by the following patch:

diff --git a/retrier/retrier.go b/retrier/retrier.go
index e1b4349..9914a29 100644
--- a/retrier/retrier.go
+++ b/retrier/retrier.go
@@ -69,6 +69,7 @@ func (r *Retrier) RunCtx(ctx context.Context, work func(ctx context.Context) err
 // is returned to the caller regardless. The work function takes 2 args, the context and
 // the number of attempted retries.
 func (r *Retrier) RunFn(ctx context.Context, work func(ctx context.Context, retries int) error) error {
+       var timer *time.Timer
        retries := 0
        for {
                ret := work(ctx, retries)
@@ -81,7 +82,11 @@ func (r *Retrier) RunFn(ctx context.Context, work func(ctx context.Context, retr
                                return ret
                        }

-                       timer := time.NewTimer(r.calcSleep(retries))
+                       if d := r.calcSleep(retries); timer == nil {
+                               timer = time.NewTimer(d)
+                       } else {
+                               timer.Reset(d)
+                       }
                        if err := r.sleep(ctx, timer); err != nil {
                                return err
                        }
@@ -96,10 +101,7 @@ func (r *Retrier) sleep(ctx context.Context, timer *time.Timer) error {
        case <-timer.C:
                return nil
        case <-ctx.Done():
-               if !timer.Stop() {
-                       <-timer.C
-               }
-
+               timer.Stop()
                return ctx.Err()
        }
 }

@eapache If you agree with the above reasoning, would you want me to send the above as PR?

Potential memory leak

Usage of <-time.After(c.Timeout) here may lead to memory leak.

timeout := time.After(r.calcSleep(retries))

According to the official time package docs:

After waits for the duration to elapse and then sends the current time on the returned channel. It is equivalent to NewTimer(d).C. The underlying Timer is not recovered by the garbage collector until the timer fires. If efficiency is a concern, use NewTimer instead and call Timer.Stop if the timer is no longer needed.

Possible solution:

func (r *Retrier) RunCtx(ctx context.Context, work func(ctx context.Context) error) error {
	retries := 0
	for {
		ret := work(ctx)

		switch r.class.Classify(ret) {
		case Succeed, Fail:
			return ret
		case Retry:
			if retries >= len(r.backoff) {
				return ret
			}

			timer := time.NewTimer(r.calcSleep(retries))
			if err := r.sleep(ctx, timer); err != nil {
				return err
			}

			retries++
		}
	}
}

func (r *Retrier) sleep(ctx context.Context, timer *time.Timer) error {
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		if !timer.Stop() {
			<-timer.C
		}

		return ctx.Err()
	}
}

References:

	case <-ctx.Done():
	if !timer.Stop() {
	<-timer.C
	}

	return ctx.Err()