🐛 Describe the bug Not able to find a way to disable or reduce ti

Reduce or remove worker retries for specific failures about serve HOT 6 CLOSED

harshita-meena commented on May 28, 2024

Reduce or remove worker retries for specific failures

from serve.

Comments (6)

namannandan commented on May 28, 2024

@namannandan Thank you! I tried setting the value to 0 and I see that it did not restart the model. The only confusing thing was that when I query the model, I do not see the setting at that endpoint, even though it is part of the model config.
curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
......
But this resolves my question, and it makes sense not to use it in prod.

Some of the model configuration options are not shown in the describe model API response, for example maxRetryTimeoutInSec. I'd like to confirm, though, that including the configuration (say maxRetryTimeoutInSec: 100) in your model-config.yaml will apply it to the model. I created a follow-up issue to track updating the describe model API response to include all model configuration values: #3037
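
As a quick way to see which settings the describe model response leaves out, here is a small Python sketch. It works on an abbreviated copy of the response quoted in this thread rather than calling a live server, and the helper name missing_keys is hypothetical. A key that is absent from the response is not necessarily unset on the model.

```python
import json

# Abbreviated copy of the describe model response quoted in this thread.
describe_response = json.loads("""
[
  {
    "modelName": "toy-ranker",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true
  }
]
""")


def missing_keys(response, expected):
    """Return the expected keys that the first model entry does not report."""
    entry = response[0]
    return [key for key in expected if key not in entry]


# Hypothetical list of settings we expect, taken from model-config.yaml.
expected = ["minWorkers", "batchSize", "maxRetryTimeoutInSec"]
print(missing_keys(describe_response, expected))
```

For anything reported as missing, check model-config.yaml directly instead of relying on the endpoint.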

namannandan commented on May 28, 2024

@harshita-meena maxRetryTimeoutInSec is the time window during which restarts of a failed worker are attempted. The delay between restart attempts within that window follows a backoff schedule (in seconds), which is specified here:

serve/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java

Lines 51 to 52 in 13d092c

    private static final int[] BACK_OFF = {
        0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597

Although maxRetryTimeoutInSec is set to 100 seconds, multiple attempts to restart the worker can still be made within that window.

While debugging your handler, if you'd like the worker to be started only once and to give up on failure, you can set maxRetryTimeoutInSec to 0. This way, no retry attempts will be made. Note that this is not a recommended configuration for a production setting.
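
To get a feel for how many restart attempts can fit inside the window, here is a rough Python sketch. It assumes a simplified model that is not taken from the TorchServe source: each retry waits for the next BACK_OFF delay, worker startup time is ignored, and retries stop once the cumulative wait reaches maxRetryTimeoutInSec. The function name retry_times is hypothetical.

```python
# Backoff delays in seconds, copied from the BACK_OFF array quoted above.
BACK_OFF = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597]


def retry_times(max_retry_timeout_sec, backoff=BACK_OFF):
    """Return the elapsed time (in seconds) at which each retry would start.

    Assumption: retries stop as soon as the cumulative backoff delay
    reaches the timeout. The real scheduling logic may differ.
    """
    times = []
    elapsed = 0
    for delay in backoff:
        elapsed += delay
        if elapsed >= max_retry_timeout_sec:
            break
        times.append(elapsed)
    return times


print(len(retry_times(100)))  # number of retries that fit in a 100 second window
print(retry_times(0))         # a timeout of 0 leaves no room for any retry
```

Under these assumptions a 100 second window still allows several retries, because the early delays are short, while a value of 0 allows none.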

harshita-meena commented on May 28, 2024

@namannandan Thank you! I tried setting the value to 0 and I see that it did not restart the model. The only confusing thing was that when I query the model, I do not see the setting at that endpoint, even though it is part of the model config.

curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
......

But this resolves my question, and it makes sense not to use it in prod.

harshita-meena commented on May 28, 2024

@namannandan What would you recommend for deciding when to restart the server in specific cases? For example, a very basic error in the handler code (e.g. a missing import such as from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?

namannandan commented on May 28, 2024

@namannandan What would you recommend for deciding when to restart the server in specific cases? For example, a very basic error in the handler code (e.g. a missing import such as from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?

Although we could filter on specific errors to skip a worker restart, I believe it may be challenging to come up with a comprehensive list of errors that decides which ones should trigger a restart and which ones should be ignored.

To keep the implementation as simple as it currently is, we can instead address the core issue here. As you pointed out, the entire traceback is printed on the first worker failure, whereas on subsequent retries the full traceback does not show up in the logs, which makes it difficult to find the actual error. Here's a potential fix that logs the entire traceback on worker retries: #3036
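
For illustration only, the error filtering idea discussed in this comment could look roughly like the Python sketch below. This is not something TorchServe provides: the NON_RETRYABLE tuple and the should_restart function are hypothetical, and the choice of exception types is an assumption that also shows why a complete list is hard to agree on.

```python
# Hypothetical policy: exception types that point to a bug in the handler
# code itself, which restarting the worker cannot fix.
NON_RETRYABLE = (ImportError, SyntaxError, NameError)


def should_restart(error: BaseException) -> bool:
    """Return True if a restart might plausibly produce a healthy worker."""
    return not isinstance(error, NON_RETRYABLE)


# A missing BaseHandler import surfaces as a NameError at class definition time.
print(should_restart(NameError("name 'BaseHandler' is not defined")))
# A crash with an unclear cause is treated as worth retrying.
print(should_restart(RuntimeError("worker died unexpectedly")))
```

Any error type left out of the tuple falls through to a restart, so the list would need ongoing maintenance to stay useful.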

harshita-meena commented on May 28, 2024

Thank you so much for creating the issues, @namannandan! I understand the complexities of customizing error retries, and I agree that solving the core traceback issue should be good enough for debugging purposes. I will close this issue. Thank you again for your prompt replies.
