Comments (6)
Some of the model configuration options are not shown in the describe model API response, for example maxRetryTimeoutInSec, but I'd like to confirm that including the configuration, say maxRetryTimeoutInSec: 100, in your model-config.yaml will apply the configuration to the model. Created a follow-up issue to track updating the describe model API response to include all model configuration values: #3037
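To see which configured options the describe model response leaves out, a small comparison like the following may help. It is only a sketch: the helper name `unreported_keys` is hypothetical, the `model_config` dict is an assumed stand-in for a parsed model-config.yaml, and the describe entry holds only the fields visible in this thread (the real response was truncated, so fields missing here are not necessarily missing from the API).

```python
def unreported_keys(model_config: dict, describe_entry: dict) -> list:
    """Return the config keys that do not appear in a describe-model entry."""
    return sorted(key for key in model_config if key not in describe_entry)


# Assumed contents of model-config.yaml; only maxRetryTimeoutInSec: 100
# is stated in the thread, the other keys are illustrative.
model_config = {
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "maxRetryTimeoutInSec": 100,
}

# Fields copied from the (truncated) describe response quoted in the thread.
describe_entry = {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": True,
}

missing = unreported_keys(model_config, describe_entry)
print(missing)
```

A key listed here is still applied to the model if it is present in model-config.yaml; it is just not echoed back by the endpoint.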
from serve.
@harshita-meena maxRetryTimeoutInSec is the duration for which attempts will be made to restart the worker. The delay between restart attempts made within maxRetryTimeoutInSec follows a backoff (in seconds) and is specified here:
Although maxRetryTimeoutInSec is set to 100 seconds, it is still possible that multiple attempts are made within that duration to restart the worker.
While debugging your handler, if you'd like the worker to be started only once and to give up on failure, you could set maxRetryTimeoutInSec to 0. This way, no retry attempts will be made. Note that this is not a recommended configuration for a production setting.
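The interaction between the timeout and the backoff can be illustrated with a simplified model. This is a sketch under stated assumptions, not TorchServe's actual logic: the function `retry_times` is hypothetical, the delay list `ASSUMED_BACKOFF_SECONDS` is a placeholder rather than the schedule TorchServe uses, and the model assumes a retry happens only if its delay still fits inside the timeout window.

```python
from itertools import count

# Placeholder backoff delays in seconds (an assumption for illustration only).
ASSUMED_BACKOFF_SECONDS = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]


def retry_times(max_retry_timeout_sec, backoff=ASSUMED_BACKOFF_SECONDS):
    """Return the elapsed times (seconds) at which retries would be attempted."""
    if not backoff or any(delay <= 0 for delay in backoff):
        raise ValueError("backoff delays must be positive")
    elapsed = 0
    times = []
    for attempt in count():
        # Reuse the last delay once the schedule is exhausted.
        delay = backoff[min(attempt, len(backoff) - 1)]
        if elapsed + delay > max_retry_timeout_sec:
            break  # the next retry would fall outside the timeout window
        elapsed += delay
        times.append(elapsed)
    return times


print(retry_times(100))  # several retries fit inside a 100 second window
print(retry_times(0))    # no retries when the timeout is 0
```

Under this model a timeout of 0 leaves no room for any retry, which matches the single-attempt behavior described above, while a timeout of 100 allows several attempts.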
from serve.
@namannandan Thank you! I tried setting the value to 0 and I see it did not restart the model. The only confusing thing was that when I query the model, I do not see it at that endpoint. It is part of the model config, though.
curl http://localhost:81/models/toy-ranker
[
{
"modelName": "toy-ranker",
"modelVersion": "2024-03-21-15:57",
"modelUrl": "toy-ranker.mar",
"runtime": "python",
"minWorkers": 4,
"maxWorkers": 4,
"batchSize": 1,
"maxBatchDelay": 100,
"loadedAtStartup": true,
......
But this resolves my question, and it makes sense not to use it in prod.
from serve.
@namannandan What would you recommend for specific cases of when to restart the server? For example, a very basic error that is just related to syntax (e.g. a missed import for BaseHandler, from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart will lead to a healthy worker. Is there a way to control when to restart?
from serve.
Although we could filter on specific errors to skip a worker restart, I believe it may be challenging to come up with a comprehensive list of errors to decide which ones should trigger a restart and which ones should be ignored.
The aim is to keep the implementation as simple as it currently is and address the core issue here. As you pointed out, on the first worker failure the entire traceback is printed, whereas for subsequent retries the full traceback does not show up in the logs, making it difficult to find the actual error. Here's a potential fix that logs the entire traceback on worker retries: #3036
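The idea of logging the whole traceback on every attempt, not just the first, can be sketched in plain Python. This is not the change in #3036 and not TorchServe code: `start_with_retries` and `broken_start` are hypothetical names, and the failing start function merely simulates the missed BaseHandler import mentioned earlier in the thread.

```python
import logging
import traceback

logger = logging.getLogger("worker_retry_sketch")


def start_with_retries(start_worker, max_attempts):
    """Call start_worker up to max_attempts times.

    Returns (result, tracebacks). The full traceback of every failed
    attempt is logged and collected, including on retries.
    """
    tracebacks = []
    for attempt in range(1, max_attempts + 1):
        try:
            return start_worker(), tracebacks
        except Exception:
            tb = traceback.format_exc()
            tracebacks.append(tb)
            logger.error("worker start attempt %d failed:\n%s", attempt, tb)
    return None, tracebacks


def broken_start():
    # Simulates a handler that forgot to import BaseHandler.
    raise NameError("name 'BaseHandler' is not defined")


result, tracebacks = start_with_retries(broken_start, max_attempts=3)
print(len(tracebacks))
```

Every retry then carries the same diagnostic detail as the first failure, so the root cause stays visible wherever in the log the reader lands.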
from serve.
Thank you so much for creating the issues, @namannandan! I understand the complexities of customizing error retries, and I agree that solving the core traceback issue should be good enough for debugging purposes. I will close this issue. Thank you again for your prompt replies.