
Comments (7)

vkuzmin-uber commented on May 18, 2024

If we need to confirm that this is a problem of propagating exceptions between threads, I can debug it. I just thought that someone might already be aware of it.


vkuzmin-uber commented on May 18, 2024

I think I see the code that causes it:

read_worker_(&IPCMessageQueue<UserPayloadType>::read_worker_loop, this)

  void IPCMessageQueue<UserPayloadType>::read_worker_loop()
...
          bool         successful_read =
              recv_queue_->timed_receive(received.get(), sizeof(WireFormat), received_size, priority, timeout_at);

          if (!successful_read)
          {
              // We timed out
              NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                             "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                             detail::MESSAGE_TIMEOUT_MS,
                             detail::HEARTBEAT_INTERVAL_MS);
          }

As a result, the exception is thrown on the read_worker_ thread. I think that instead it should put an EXCEPTION message into

              // This is a user-handled message
              out_queue_.emplace(std::move(received));

and this way the caller thread will detect it and throw. Let me know if that is correct, and I can put together a PR and re-test it.
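To illustrate the pattern being proposed (the reader thread converts its failure into a queue item, so the caller thread is the one that actually throws), here is a small self-contained sketch using only the standard library; none of these names are neuropod's:

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <stdexcept>
#include <string>

// Each queue item is either a payload or an error description.
struct Item
{
    std::string                payload;
    std::optional<std::string> error; // set when the read loop failed
};

class OutQueue
{
public:
    // Called on the reader thread; never throws across the thread boundary.
    void push(Item item)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        cv_.notify_one();
    }

    // Called on the caller thread; errors are rethrown here, where the
    // caller's try/catch can actually see them.
    Item pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        Item item = std::move(items_.front());
        items_.pop();
        if (item.error)
        {
            throw std::runtime_error(*item.error);
        }
        return item;
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    std::queue<Item>        items_;
};

// In a read loop like the one above, a timeout would then become
//     out_queue.push({"", std::string("Timed out waiting for a response from worker process")});
// instead of a NEUROPOD_ERROR on the reader thread.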


VivekPanyam commented on May 18, 2024

So when running with OPE, we do indeed propagate exceptions from the worker process:

catch (const std::exception &e)
{
    // Send the exception info back to the main process
    std::string msg = e.what();
    control_channel.send_message(EXCEPTION, msg);
}
catch (...)
{
    control_channel.send_message(EXCEPTION, "An unknown exception occurred during inference");
}

Timeouts are handled slightly differently, as you've noticed:

if (!successful_read)
{
    // We timed out
    NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                   "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                   detail::MESSAGE_TIMEOUT_MS,
                   detail::HEARTBEAT_INTERVAL_MS);
}

Note that this gets thrown if we don't have any message available within the timeout, not just heartbeats. It usually happens when the worker process segfaults or crashes in a way that doesn't trigger the try/catch above.
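For context on the heartbeat mentioned in that error message: the worker is expected to send a control message on a fixed interval, so a full MESSAGE_TIMEOUT_MS of silence implies the worker is dead or hung rather than merely busy. A minimal sketch of such a sender; the names and values here are placeholders, not neuropod's actual implementation:

#include <atomic>
#include <chrono>
#include <thread>

constexpr int HEARTBEAT_INTERVAL_MS = 1000; // placeholder value

std::atomic<bool> shutdown_requested{false};

// Stand-in for "put a HEARTBEAT control message on the queue to the main process".
void send_heartbeat() {}

void heartbeat_loop()
{
    // While the worker is alive, the main process receives at least one
    // message per HEARTBEAT_INTERVAL_MS, even if no inference is in flight.
    while (!shutdown_requested)
    {
        send_heartbeat();
        std::this_thread::sleep_for(std::chrono::milliseconds(HEARTBEAT_INTERVAL_MS));
    }
}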

If we don't handle timeouts correctly in the message reading thread, it could lead to a deadlock when sending new messages. For example, if the main process is trying to send a message to the worker process and the queue is full, it'll block until there's a spot freed up. However, no progress will be made if the worker process isn't alive and the main thread will block forever. There are solutions to this that don't involve throwing an exception on another thread, but unfortunately it isn't as straightforward as just treating a timeout as another exception.

I think we may be able to modify the message sending logic to handle the deadlock case when queues are full. This should let us remove the NEUROPOD_ERROR on the message reading thread while also not impacting performance.
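One way to make the sending side robust without throwing on another thread, sketched here under the assumption that the underlying queue is a boost::interprocess::message_queue (which the timed_receive call above suggests) and that the main process knows the worker's pid; the helper name and the 500ms retry interval are made up for illustration:

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/interprocess/ipc/message_queue.hpp>
#include <signal.h>
#include <stdexcept>
#include <sys/types.h>

// Enqueue a message with a bounded wait, checking between attempts whether
// the worker process still exists. This can never block forever on a full
// queue whose consumer has died.
void send_or_fail(boost::interprocess::message_queue &queue,
                  const void *msg,
                  size_t      size,
                  pid_t       worker_pid)
{
    namespace pt = boost::posix_time;
    for (;;)
    {
        const auto deadline = pt::microsec_clock::universal_time() + pt::milliseconds(500);
        if (queue.timed_send(msg, size, /* priority */ 0, deadline))
        {
            return; // enqueued successfully
        }

        // Queue is still full; only keep waiting if the worker is alive.
        // kill(pid, 0) sends no signal, it just checks for existence.
        if (kill(worker_pid, 0) != 0)
        {
            throw std::runtime_error("Worker process died while the send queue was full");
        }
    }
}

With something like this on the sending path, a dead worker is detected on the thread that is actually doing the send, where the caller can handle it, rather than via an exception thrown on the message reading thread.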

Can you consistently reproduce the timeout? As I mentioned above, it's not necessarily that there's so much load that the heartbeat can't be sent in time. It happens when no message has been received in 5000ms.


vkuzmin-uber commented on May 18, 2024

I found a way to reproduce it. Interestingly, it is related to how the client sends requests: under high load (20K messages), with 4 OPE instances and 4 concurrent client threads, I can reproduce it on my machine. But with 4 OPE instances and 1, 2, or 8 concurrent client threads there is no timeout. I don't understand the reason yet, and the process termination doesn't let me see whether neuropod can still serve requests or whether some deadlock happened.

I can try to patch my local copy of neuropod and see if it can perform the next inference when the process is not terminated.


vkuzmin-uber commented on May 18, 2024

I found that this happens because the master process gets a SEGV first, and then the worker throws an exception because of the timeout.

#397


vkuzmin-uber commented on May 18, 2024

@VivekPanyam

Last time we found a bug in neuropod and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

We experienced cases where the OPE worker died because of:

  • Incompatible backend: trying to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that had the old backend.
  • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes after that because of this issue.

In both cases, the service could make a smart decision and stay running. It makes sense to allow an unload if the model was loaded successfully once. The Neuropod core may close its IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.

What do you think? Let us know if you need help with a fix.
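A minimal sketch of the kind of liveness check and forced cleanup being suggested here, using plain POSIX calls; worker_pid and these helper names are hypothetical, not part of neuropod's API:

#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

// Returns true if the worker has exited (e.g. killed by the OOM killer),
// in which case the core could close its IPC objects and mark the model
// as unloaded instead of crashing the whole service.
bool worker_is_dead(pid_t worker_pid)
{
    int         status = 0;
    const pid_t result = waitpid(worker_pid, &status, WNOHANG);
    return result == worker_pid || (result == -1 && errno == ECHILD);
}

// Last resort: force the worker down so its resources can be reclaimed.
void force_kill_worker(pid_t worker_pid)
{
    kill(worker_pid, SIGKILL);
    waitpid(worker_pid, nullptr, 0); // reap to avoid leaving a zombie
}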


VivekPanyam commented on May 18, 2024

Last time we found a bug in neuropod and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

Based on your previous comment, we transitioned focus to #397, but it looks like we never resolved this one!

We experienced cases where the OPE worker died because of:

  • Incompatible backend: trying to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that had the old backend.
  • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes after that because of this issue.

In both cases, the service could make a smart decision and stay running. It makes sense to allow an unload if the model was loaded successfully once. The Neuropod core may close its IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.

What do you think? Let us know if you need help with a fix.

Makes sense, I'll take another look at this. As I said above, we need to be careful about deadlocks. I'll post another update here once I spend some more time on this today.

