Giter VIP home page Giter VIP logo

Comments (16)

wch avatar wch commented on August 10, 2024

Do you know when it started happening? We released httpuv 1.6.7 on 2022-12-14, and previously we released 1.6.6 on 2022-09-08.

In 1.6.7, the only real change was a new implementation of timegm, and I doubt that has anything to do with the crash you're seeing.

In 1.6.6, we updated to libuv 1.43.0, and that conceivably could have changed the behavior.

Just looking at those stack traces, it appears that the crash is happening when the background thread is started, which happens the first time an httpuv server is started within an R session. If that is correct, I would expect code like this to be able to reproduce the crash:

for i in {1..100}; do
    Rscript -e "s <- httpuv::startServer('0.0.0.0', 8000, list()); cat('.')"
done

However, when I run it on my machine, I don't get any errors (with R 4.1.2, httpuv 1.6.7, on an M1 Mac). I see that the GHA runners use R 4.2.2 and x86_64, however, and it's possible that those differences matter.

If you are able to reliably reproduce the problem, that would really help for debugging it.

from httpuv.

jeroen avatar jeroen commented on August 10, 2024

I am sorry this is so difficult to reproduce, it happens very randomly.

However, when I run it on my machine, I don't get any errors (with R 4.1.2, httpuv 1.6.7, on an M1 Mac). I see that the GHA runners use R 4.2.2 and x86_64, however, and it's possible that those differences matter.

I have not been able to make it crash locally either but it keeps happening in my actions. Here is a GitHub action run with your test example from above and indeed it crashes, but not reliably: https://github.com/jeroen/httpuvtest/actions/runs/3883434668/jobs/6624740869

I hope this helps. For some reason this Rscript action does not generate MacOS crashreports in the same way that R CMD check does, I don't understand why. Therefore I haven't been able to get the backtrace, but I'm pretty sure it's the same issue. Let me know if you figure out how to do this.

from httpuv.

jeroen avatar jeroen commented on August 10, 2024

Actually your example seems to crash on GitHub actions on all versions of httpuv that I tested back to 1.6.0. So perhaps this is a long standing bug, and something has changed in GitHub actions that now triggers it.

from httpuv.

wch avatar wch commented on August 10, 2024

Just curious: what platforms did you use when testing locally?

Also, I suspect that the crashes are more likely when the system is slowed down by a heavy computational load, so it may "help" to run lots of other stuff in the background.

from httpuv.

jeroen avatar jeroen commented on August 10, 2024

I tested on my MacOS 11.7 (arm) MacOS 13.1 (intel), and in vmware MacOS 10.13.

from httpuv.

wch avatar wch commented on August 10, 2024

I tested on the following but wasn't able to reproduce the problem:

  • Intel, macOS 12.6, R 4.1.1, httpuv 1.6.7 (CRAN binary build)
  • M1, macOS 12.5.1, R 4.1.2, httpuv 1.6.7 (CRAN binary build)

On the Intel machine, I also ran yes > /dev/null & eight times to increase the CPU load, but the crash still didn't happen.

from httpuv.

wch avatar wch commented on August 10, 2024

Oh wait, I think I was just able to reproduce this on the M1 machine, by running the yes command many times before running the test script.

[winston@MBP16:~]$ yes > /dev/null &
[1] 9698

[winston@MBP16:~]$ yes > /dev/null &
[2] 9699

[winston@MBP16:~]$ yes > /dev/null &
[3] 9700

[winston@MBP16:~]$ yes > /dev/null &
[4] 9701

[winston@MBP16:~]$ yes > /dev/null &
[5] 9702

[winston@MBP16:~]$ yes > /dev/null &
[6] 9703

[winston@MBP16:~]$ yes > /dev/null &
[7] 9704

[winston@MBP16:~]$ yes > /dev/null &
[8] 9705

[winston@MBP16:~]$ yes > /dev/null &
[9] 9706

[winston@MBP16:~]$ yes > /dev/null &
[10] 9707

[winston@MBP16:~]$ yes > /dev/null &
[11] 9708

[winston@MBP16:~]$ yes > /dev/null &
[12] 9709

[winston@MBP16:~]$ yes > /dev/null &
[13] 9710

[winston@MBP16:~]$ yes > /dev/null &
[14] 9711

[winston@MBP16:~]$ yes > /dev/null &
[15] 9712

[winston@MBP16:~]$ for i in {1..100}; do
>     Rscript -e "s <- httpuv::startServer('0.0.0.0', 8000, list()); cat('.')"
> done
..........Abort trap: 6
....................................................................................Abort trap: 6
....

from httpuv.

wch avatar wch commented on August 10, 2024

I modified the code in libuv's thread.c to print out the error code from pthread_mutex_unlock.

The error code is EINVAL, which means that the mutex was not initialized.

from httpuv.

wch avatar wch commented on August 10, 2024

I think I understand what's happening.

The main thread calls ensure_io_thread(), which does a few things.

  • It creates a Barrier object (named blocker), which contains a CondWait object, which in turn contains a uv_mutex_t.
  • It also starts up the background thread, and on that thread it calls io_thread().

Both the main thread and background thread call blocker.wait() at some point; after both have called it, then both are allowed to continue.

The problem is that ensure_io_thread() (on the main thread) finishes and returns before blocker->wait() on the background thread returns. When ensure_io_thread() finishes, it triggers the destructor for the Barrier object, which in turns triggers the destructor for its CondWait object, which in turn calls uv_mutex_destroy(). That happens before the background thread returns from Barrier.wait(); when it does return from that function, it calls the destructor on a guard object (which was created with the mutex), and that calls uv_mutex_unlock. But the mutex was already destroyed, and so it results in an error.

Relevant code:

ensure_io_thread():

httpuv/src/httpuv.cpp

Lines 148 to 162 in a1211df

void ensure_io_thread() {
ASSERT_MAIN_THREAD()
if (io_thread_running.get()) {
return;
}
Barrier blocker(2);
int ret = uv_thread_create(&io_thread_id, io_thread, &blocker);
// Wait for io_loop to be initialized before continuing
blocker.wait();
if (ret != 0) {
Rcpp::stop(std::string("Error: ") + uv_strerror(ret));
}
}

io_thread():

httpuv/src/httpuv.cpp

Lines 112 to 127 in a1211df

void io_thread(void* data) {
register_background_thread();
Barrier* blocker = reinterpret_cast<Barrier*>(data);
io_thread_running.set(true);
io_loop.ensure_initialized();
background_queue = new CallbackQueue(io_loop.get());
// Set up async communication channels
uv_async_init(io_loop.get(), &async_stop_io_loop, stop_io_loop);
// Tell other thread that it can continue.
blocker->wait();

The guard, CondWait, and Barrier` classes:

httpuv/src/thread.h

Lines 30 to 130 in a1211df

class guard {
public:
guard(uv_mutex_t &mutex) {
_mutex = &mutex;
uv_mutex_lock(_mutex);
}
~guard() {
uv_mutex_unlock(_mutex);
}
private:
uv_mutex_t* _mutex;
};
// Container class for holding thread-safe values.
template <typename T>
class ThreadSafe {
public:
ThreadSafe(const T value) : _value(value) {
uv_mutex_init(&_mutex);
};
void set(const T value) {
guard guard(_mutex);
_value = value;
};
T get() {
guard guard(_mutex);
return _value;
};
private:
T _value;
uv_mutex_t _mutex;
};
// Wrapper class for mutex/condition variable pair.
class CondWait {
public:
CondWait() {
uv_mutex_init(&mutex);
uv_cond_init(&cond);
};
~CondWait() {
uv_mutex_destroy(&mutex);
uv_cond_destroy(&cond);
}
void lock() {
uv_mutex_lock(&mutex);
}
void unlock() {
uv_mutex_unlock(&mutex);
}
void signal() {
uv_cond_signal(&cond);
};
void wait() {
uv_cond_wait(&cond, &mutex);
}
uv_mutex_t mutex;
uv_cond_t cond;
};
// uv_barrier_t causes crashes on Windows (with libuv 1.15.0), so we have our
// own implementation here.
class Barrier {
public:
Barrier(int n) : n(n) {}
void wait() {
guard guard(condwait.mutex);
if (n == 0) {
return;
}
--n;
if (n == 0) {
condwait.signal();
}
while(n > 0) {
condwait.wait();
}
}
private:
int n;
CondWait condwait;
};

from httpuv.

jcheng5 avatar jcheng5 commented on August 10, 2024

What are you thinking for a fix? shared_ptr?

Also, a comment on the Barrier class says "uv_barrier_t causes crashes on Windows (with libuv 1.15.0), so we have our own implementation here." Is it possible that crash was caused by a similar race condition?

from httpuv.

wch avatar wch commented on August 10, 2024

The lifetime of the object used for synchronization will need to to be longer than the ensure_io_thread() call; the current way that it's implemented, where it gets allocated on the stack within ensure_io_thread(), isn't viable.

I don't recall the details of the crashes on Windows, unfortunately.

from httpuv.

wch avatar wch commented on August 10, 2024

Just a quick summary of how to reproduce the error:

for i in {1..16}; do
    yes > /dev/null &
done

for i in {1..50}; do
    Rscript -e "s <- httpuv::startServer('0.0.0.0', 8000, list()); cat('.')"
done

killall yes

from httpuv.

wch avatar wch commented on August 10, 2024

@jeroen I have a potential fix in #352. Can you try it out in GHA with your test repository?

from httpuv.

jeroen avatar jeroen commented on August 10, 2024

Thanks, looks like this fixes the problem.

from httpuv.

jeroen avatar jeroen commented on August 10, 2024

Thanks for getting this diagnosed and fixed so quickly.

from httpuv.

wch avatar wch commented on August 10, 2024

No problem, I'm glad you found it.

from httpuv.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.