Giter VIP home page Giter VIP logo

cache-net-timeout's Introduction

README

Seems like there is a bug in Ruby implementation, when combining timeout.rb with tempfile

Sample rails application to reproduce a strange behaviour with Net::HTTP in combination with Tmpfile (used in Rails file store cache), leading to:

ERROR -- : Net::OpenTimeout:execution expired
ERROR -- : /home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/timeout.rb:104:in `timeout'
/home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/net/http.rb:935:in `connect'
/home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/net/http.rb:920:in `do_start'
/home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/net/http.rb:909:in `start'
/home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/net/http.rb:609:in `start'
/home/inecas/.rbenv/versions/2.5.1/lib/ruby/2.5.0/net/http.rb:485:in `get_response'

It has been reproduced on ruby 2.5.1 from rbenv, as well as Satellite 6.5 SCL.

Usage:

To see the failure, run

./fails.sh

Compare it to the version that works:

./works.sh

Running in SCL

When reproducing on Satellite 6.5 instance, you need to run the scripts in tfm scl + remove the Gemfile.lock:

scl enable tfm bash
rm Gemfile.lock
./fails.sh

Analysis

strace output has bee captured with strace -f bash ./fails.sh 2> >(tee -a strace-fail.log) and log file is available in this repository.

The interesting part of the strace-fail.log starts at line 160081:

[pid 20834] unlink("/home/inecas/tmp/cache-net-timeout/tmp/cache/.hello20190309-20191-i43a3h" <unfinished ...>
[pid 20191] futex(0x55d4cd95e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 20834] <... unlink resumed> )      = -1 ENOENT (No such file or directory)
[pid 20191] <... futex resumed> )       = 0
[pid 20834] getpid( <unfinished ...>
[pid 20191] fstat(8,  <unfinished ...>
[pid 20834] <... getpid resumed> )      = 20191
[pid 20191] <... fstat resumed> {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
[pid 20834] getpid()                    = 20191
[pid 20191] getpid( <unfinished ...>
[pid 20834] write(6, "!", 1 <unfinished ...>
[pid 20191] <... getpid resumed> )      = 20191
[pid 20834] <... write resumed> )       = 1
[pid 20191] getpid( <unfinished ...>
[pid 20834] futex(0x55d4cd95ea4c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 20191] <... getpid resumed> )      = 20191
[pid 20191] write(4, "!", 1 <unfinished ...>
[pid 20250] <... poll resumed> )        = 1 ([{fd=3, revents=POLLIN}])
[pid 20191] <... write resumed> )       = 1
[pid 20250] read(3,  <unfinished ...>
[pid 20191] getpid( <unfinished ...>
[pid 20250] <... read resumed> "!", 1024) = 1
[pid 20191] <... getpid resumed> )      = 20191
[pid 20250] read(3,  <unfinished ...>
[pid 20191] tgkill(20191, 20834, SIGVTALRM <unfinished ...>

It shows that the OS is switching between two processes, one trying to delete file (20834) while the other (20191) is setting up sleep.

The sleeps comes from timeout.rb, that is covering the OpenTimeout of Net::HTTP. The unlink is originated in tmpfile.rb.

When I modified the timeout.rb to give me more details, it turned out that it has not been sleeping for 60 seconds, but the sleep returned immediately:

y = Thread.start {
  Thread.current.name = from
  begin
    ret = sleep sec
    puts "===== sleep thread for #{sec} seconds, slept #{ret} instead"
  rescue => e
    x.raise e
  else
    x.raise exception, message
  end
}
return yield(sec)

# Produced
# ===== sleep thread for 60 seconds, slept 0 instead

This caused the Net::OpenTimeout to be raised. Once the timeouts are actually hitting, the whole new set of possibilities for additional bugs can occur, such as process hanging while cleaning the Tempfile, as our friends from ManageIQ have observed some time ago.

The important fact probably also is, that the tempfile code is run as part of ObjectSpace finalizer:

begin
  File.unlink(@tmpfile.path)
rescue Errno::ENOENT
end

So it looks like in the worse case scenario, the tempfile and timeout threads can influence each other in very unfortunate way, leading to very strange behavior.

Mitigation

Making sure the tmpfiles are not used frequently (chaning from Rails.cache.write to Rails.cache.fetch) made the issue go away. However, in case your code is using tmpfiles in combination with timeouts (might be hidden, like here with Rails.cache and Net::HTTP), you might be hitting this issue in other scenarios.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.