Comments (7)
If you need confirmation that this is a problem of propagating exceptions between threads, I can debug it further. I just thought someone might already be aware of it.
from neuropod.
I think I see the code that causes it:
```cpp
// The constructor starts the reader thread:
read_worker_(&IPCMessageQueue<UserPayloadType>::read_worker_loop, this)
```

```cpp
void IPCMessageQueue<UserPayloadType>::read_worker_loop()
...
bool successful_read =
    recv_queue_->timed_receive(received.get(), sizeof(WireFormat), received_size, priority, timeout_at);
if (!successful_read)
{
    // We timed out
    NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                   "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                   detail::MESSAGE_TIMEOUT_MS,
                   detail::HEARTBEAT_INTERVAL_MS);
}
```
As a result, the exception is thrown on the `read_worker_` thread. I think that instead it should put an EXCEPTION message into the outgoing queue:

```cpp
// This is a user-handled message
out_queue_.emplace(std::move(received));
```

That way the caller thread will detect it and throw. Let me know if this is correct, and I can prepare a PR and re-test it.
from neuropod.
So when running with OPE, we do indeed propagate exceptions from the worker process:
neuropod/source/neuropod/multiprocess/multiprocess_worker.cc
Lines 122 to 131 in 895fcbe
Timeouts are handled slightly differently, as you've noticed:
neuropod/source/neuropod/multiprocess/mq/ipc_message_queue_impl.hh
Lines 76 to 83 in 895fcbe
Note that this gets thrown if we don't have any message available within the timeout, not just missed heartbeats. This usually happens when the worker process segfaults or crashes in a way that doesn't trigger the try/catch above.
If we don't handle timeouts correctly in the message reading thread, it could lead to a deadlock when sending new messages. For example, if the main process is trying to send a message to the worker process and the queue is full, it'll block until there's a spot freed up. However, no progress will be made if the worker process isn't alive and the main thread will block forever. There are solutions to this that don't involve throwing an exception on another thread, but unfortunately it isn't as straightforward as just treating a timeout as another exception.
I think we may be able to modify the message sending logic to handle the deadlock case when queues are full. This should let us remove the NEUROPOD_ERROR on the message reading thread while also not impacting performance.
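As a sketch of what "handling the deadlock case on the sending side" could look like (this is not Neuropod's actual API; `BoundedQueue` and `worker_alive` are illustrative names): instead of blocking indefinitely on a full queue, the sender wakes up periodically and gives up with an error on its own thread once the worker has been declared dead.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <stdexcept>
#include <string>

class BoundedQueue
{
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

    // Would be flipped by whatever detects the missed heartbeats.
    std::atomic<bool> worker_alive{true};

    void send(std::string msg)
    {
        std::unique_lock<std::mutex> lk(mu_);
        // Wake up periodically so a dead worker is noticed even if
        // nobody ever frees a slot in the queue.
        while (items_.size() >= capacity_)
        {
            if (!worker_alive.load())
            {
                // Fail on the *sending* thread instead of deadlocking.
                throw std::runtime_error("worker process is not responding");
            }
            cv_.wait_for(lk, std::chrono::milliseconds(100));
        }
        items_.push_back(std::move(msg));
    }

    size_t size()
    {
        std::lock_guard<std::mutex> lk(mu_);
        return items_.size();
    }

private:
    const size_t            capacity_;
    std::mutex              mu_;
    std::condition_variable cv_;
    std::deque<std::string> items_;
};
```

The same idea applies to `boost::interprocess::message_queue` by using `timed_send` in a loop and checking worker liveness between attempts, rather than a single blocking `send`.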
Can you consistently reproduce the timeout? As I mentioned above, it's not necessarily that there's so much load that the heartbeat can't be sent in time. It happens when no message has been received in 5000ms.
from neuropod.
I found a way to reproduce it. Interestingly, it depends on how the client sends requests: under high load (20K messages), with 4 OPE instances and 4 concurrent client threads, I can reproduce it on my machine; but with 4 OPE instances and 1, 2, or 8 concurrent client threads, there is no timeout. I don't understand the reason yet, and the process termination doesn't let me see whether neuropod can still serve requests or a deadlock has occurred.
I can try to fix my local copy of neuropod and see if it can perform the next inference when the process is not terminated.
from neuropod.
I found that this happens because the master process gets a SEGV first, and then the worker throws an exception because of the timeout.
from neuropod.
Last time we found a bug in neuropod, and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service hosting multiple models in OPE mode.
We experienced cases where the OPE worker died because of:
- An incompatible backend: we tried to load a Torchscript 1.7 model on a Torchscript 1.1 backend. This was related to a rollback to an old version that had the old backend.
- The OOM killer: in a containerized app with a memory quota, if the worker process reaches the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes afterwards because of this issue.
In both cases, the service could make a "smart" decision to stay running. It makes sense to allow unloading a model that was loaded successfully at least once. Neuropod core could close the IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We could even consider allowing the core to send a KILL signal to the worker.
What do you think? Let us know if you need help with a fix.
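For the "core sends KILL signal to worker" idea, a minimal POSIX sketch of what that could look like (the function name is hypothetical; Neuropod would use the worker pid it already tracks):

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forcibly terminate a worker process and reap it so the core can keep
// running. Returns 0 if the worker died from our signal, -1 on error.
int kill_and_reap_worker(pid_t worker_pid)
{
    // SIGKILL cannot be caught or ignored, so even a hung worker dies.
    if (kill(worker_pid, SIGKILL) != 0)
    {
        return -1; // e.g. the worker is already gone (ESRCH)
    }

    // Reap the child so it does not linger as a zombie.
    int status = 0;
    if (waitpid(worker_pid, &status, 0) != worker_pid)
    {
        return -1;
    }

    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 0 : -1;
}
```

After this, the core could close or recreate the IPC message queues and either reload the model in a fresh worker or mark it unloaded, instead of crashing the whole service.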
from neuropod.
> Last time we found a bug in neuropod, and it wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service hosting multiple models in OPE mode.

Based on your previous comment, we transitioned focus to #397, but it looks like we never resolved this one!
> We experienced cases where the OPE worker died because of:
> - An incompatible backend: we tried to load a Torchscript 1.7 model on a Torchscript 1.1 backend. This was related to a rollback to an old version that had the old backend.
> - The OOM killer: in a containerized app with a memory quota, if the worker process reaches the memory limit (huge model, under high load), the OOM killer kills the worker process, and the service crashes afterwards because of this issue.
> In both cases, the service could make a "smart" decision to stay running. It makes sense to allow unloading a model that was loaded successfully at least once. Neuropod core could close the IPC objects, and if the worker isn't actually dead, it will wake up, time out, release its resources, and exit too. We could even consider allowing the core to send a KILL signal to the worker.
> What do you think? Let us know if you need help with a fix.
Makes sense, I'll take another look at this. As I said above, we need to be careful about deadlocks. I'll post another update here once I've spent some more time on this today.
from neuropod.