
Comments (7)

vkuzmin-uber commented on May 18, 2024

If we need to confirm that this is a problem of propagating exceptions between threads, I can debug it. I just thought that someone might already be aware of it.


vkuzmin-uber commented on May 18, 2024

I think I see the code that causes it:

read_worker_(&IPCMessageQueue<UserPayloadType>::read_worker_loop, this)

  void IPCMessageQueue<UserPayloadType>::read_worker_loop()
...
          bool         successful_read =
              recv_queue_->timed_receive(received.get(), sizeof(WireFormat), received_size, priority, timeout_at);

          if (!successful_read)
          {
              // We timed out
              NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                             "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                             detail::MESSAGE_TIMEOUT_MS,
                             detail::HEARTBEAT_INTERVAL_MS);
          }

As a result, the exception is thrown on the read_worker_ thread. I think that instead it should put an EXCEPTION message into

              // This is a user-handled message
              out_queue_.emplace(std::move(received));

and this way the caller thread will detect it and throw. Let me know if that is correct, and I can put together a PR and re-test it.
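To illustrate the pattern being proposed (the reader thread converts its failure into a queue item, so the caller thread is the one that actually throws), here is a small self-contained sketch using only the standard library; none of these names are neuropod's:

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <stdexcept>
#include <string>

// Each queue item is either a payload or an error description.
struct Item
{
    std::string                payload;
    std::optional<std::string> error; // set when the read loop failed
};

class OutQueue
{
public:
    // Called on the reader thread; never throws across the thread boundary.
    void push(Item item)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        cv_.notify_one();
    }

    // Called on the caller thread; errors are rethrown here, where the
    // caller's try/catch can actually see them.
    Item pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        Item item = std::move(items_.front());
        items_.pop();
        if (item.error)
        {
            throw std::runtime_error(*item.error);
        }
        return item;
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    std::queue<Item>        items_;
};

// In a read loop like the one above, a timeout would then become
//     out_queue.push({"", std::string("Timed out waiting for a response from worker process")});
// instead of a NEUROPOD_ERROR on the reader thread.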


VivekPanyam commented on May 18, 2024

So when running with OPE, we do indeed propagate exceptions from the worker process:

catch (const std::exception &e)
{
    // Send the exception info back to the main process
    std::string msg = e.what();
    control_channel.send_message(EXCEPTION, msg);
}
catch (...)
{
    control_channel.send_message(EXCEPTION, "An unknown exception occurred during inference");
}

Timeouts are handled slightly differently, as you've noticed:

if (!successful_read)
{
    // We timed out
    NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                   "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                   detail::MESSAGE_TIMEOUT_MS,
                   detail::HEARTBEAT_INTERVAL_MS);
}

Note that this gets thrown if we don't have any message available within the timeout, not just heartbeats. It usually happens when the worker process segfaults or crashes in a way that doesn't trigger the try/catch above.
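For context on the heartbeat mentioned in that error message: the worker is expected to send a control message on a fixed interval, so a full MESSAGE_TIMEOUT_MS of silence implies the worker is dead or hung rather than merely busy. A minimal sketch of such a sender; the names and values here are placeholders, not neuropod's actual implementation:

#include <atomic>
#include <chrono>
#include <thread>

constexpr int HEARTBEAT_INTERVAL_MS = 1000; // placeholder value

std::atomic<bool> shutdown_requested{false};

// Stand-in for "put a HEARTBEAT control message on the queue to the main process".
void send_heartbeat() {}

void heartbeat_loop()
{
    // While the worker is alive, the main process receives at least one
    // message per HEARTBEAT_INTERVAL_MS, even if no inference is in flight.
    while (!shutdown_requested)
    {
        send_heartbeat();
        std::this_thread::sleep_for(std::chrono::milliseconds(HEARTBEAT_INTERVAL_MS));
    }
}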

If we don't handle timeouts correctly in the message reading thread, it could lead to a deadlock when sending new messages. For example, if the main process is trying to send a message to the worker process and the queue is full, it'll block until there's a spot freed up. However, no progress will be made if the worker process isn't alive and the main thread will block forever. There are solutions to this that don't involve throwing an exception on another thread, but unfortunately it isn't as straightforward as just treating a timeout as another exception.

I think we may be able to modify the message sending logic to handle the deadlock case when queues are full. This should let us remove the NEUROPOD_ERROR on the message reading thread while also not impacting performance.
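One way to make the sending side robust without throwing on another thread, sketched here under the assumption that the underlying queue is a boost::interprocess::message_queue (which the timed_receive call above suggests) and that the main process knows the worker's pid; the helper name and the 500ms retry interval are made up for illustration:

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/interprocess/ipc/message_queue.hpp>
#include <signal.h>
#include <stdexcept>
#include <sys/types.h>

// Enqueue a message with a bounded wait, checking between attempts whether
// the worker process still exists. This can never block forever on a full
// queue whose consumer has died.
void send_or_fail(boost::interprocess::message_queue &queue,
                  const void *msg,
                  size_t      size,
                  pid_t       worker_pid)
{
    namespace pt = boost::posix_time;
    for (;;)
    {
        const auto deadline = pt::microsec_clock::universal_time() + pt::milliseconds(500);
        if (queue.timed_send(msg, size, /* priority */ 0, deadline))
        {
            return; // enqueued successfully
        }

        // Queue is still full; only keep waiting if the worker is alive.
        // kill(pid, 0) sends no signal, it just checks for existence.
        if (kill(worker_pid, 0) != 0)
        {
            throw std::runtime_error("Worker process died while the send queue was full");
        }
    }
}

With something like this on the sending path, a dead worker is detected on the thread that is actually doing the send, where the caller can handle it, rather than via an exception thrown on the message reading thread.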

Can you consistently reproduce the timeout? As I mentioned above, it's not necessarily that there's so much load that the heartbeat can't be sent in time. It happens when no message has been received in 5000ms.


vkuzmin-uber commented on May 18, 2024

I found a way to reproduce it. Interestingly, it is related to how the client sends requests: under high load (20K messages), with 4 OPE instances and 4 concurrent client threads, I can reproduce it on my machine. But with 4 OPE instances and 1, 2, or 8 concurrent client threads there is no timeout. I don't understand the reason yet, and the process termination doesn't let me see whether neuropod can still serve requests or whether some deadlock happened.

I can try to patch my local copy of neuropod and see if it can perform the next inference when the process is not terminated.


vkuzmin-uber commented on May 18, 2024

I found that this happens because the master process gets a SEGV first, and then the worker throws an exception because of the timeout.

#397


vkuzmin-uber commented on May 18, 2024

@VivekPanyam

Last time we found a bug in neuropod and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

We experienced cases where the OPE worker died because of:

  • Incompatible backend: trying to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that had the old backend.
  • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes after that because of this issue.

In both cases, the service could make a smart decision and stay running. It makes sense to allow an unload if the model was loaded successfully once. The Neuropod core may close its IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.

What do you think? Let us know if you need help with a fix.
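A minimal sketch of the kind of liveness check and forced cleanup being suggested here, using plain POSIX calls; worker_pid and these helper names are hypothetical, not part of neuropod's API:

#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

// Returns true if the worker has exited (e.g. killed by the OOM killer),
// in which case the core could close its IPC objects and mark the model
// as unloaded instead of crashing the whole service.
bool worker_is_dead(pid_t worker_pid)
{
    int         status = 0;
    const pid_t result = waitpid(worker_pid, &status, WNOHANG);
    return result == worker_pid || (result == -1 && errno == ECHILD);
}

// Last resort: force the worker down so its resources can be reclaimed.
void force_kill_worker(pid_t worker_pid)
{
    kill(worker_pid, SIGKILL);
    waitpid(worker_pid, nullptr, 0); // reap to avoid leaving a zombie
}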


VivekPanyam commented on May 18, 2024

Last time we found a bug in neuropod and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

Based on your previous comment, we transitioned focus to #397, but it looks like we never resolved this one!

We experienced cases where the OPE worker died because of:

  • Incompatible backend: trying to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that had the old backend.
  • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes after that because of this issue.

In both cases, the service could make a smart decision and stay running. It makes sense to allow an unload if the model was loaded successfully once. The Neuropod core may close its IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.

What do you think? Let us know if you need help with a fix.

Makes sense, I'll take another look at this. As I said above, we need to be careful about deadlocks. I'll post another update here once I spend some more time on this today.

