
Comments (6)

namannandan commented on May 28, 2024

> @namannandan Thank you! I tried setting the value to 0 and I see it did not restart the model. The only confusing thing was that when I query the model, I do not see the setting at that endpoint, even though it is part of the model config.
>
> curl http://localhost:81/models/toy-ranker
> [
>   {
>     "modelName": "toy-ranker",
>     "modelVersion": "2024-03-21-15:57",
>     "modelUrl": "toy-ranker.mar",
>     "runtime": "python",
>     "minWorkers": 4,
>     "maxWorkers": 4,
>     "batchSize": 1,
>     "maxBatchDelay": 100,
>     "loadedAtStartup": true,
> ......
>
> But this resolves my question, and it makes sense not to use it in prod.

Some of the model configuration options, for example maxRetryTimeoutInSec, are not shown in the describe model API response. I can confirm, though, that including the setting (say, maxRetryTimeoutInSec: 100) in your model-config.yaml will apply it to the model. I created a follow-up issue to track updating the describe model API response to include all model configuration values: #3037

from serve.

namannandan commented on May 28, 2024

@harshita-meena maxRetryTimeoutInSec is the duration during which attempts will be made to restart the worker. The delay between restart attempts within that window follows a backoff schedule (in seconds), which is specified here:

private static final int[] BACK_OFF = {
    0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597
};

Although maxRetryTimeoutInSec is set to 100 seconds, multiple attempts to restart the worker can still be made within that duration.

While debugging your handler, if you'd like the worker to be started only once and to give up on failure, you could set maxRetryTimeoutInSec to 0. This way, no retry attempts will be made. Note that this is not a recommended configuration for a production setting.
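As a rough illustration of why a 100-second window still allows several attempts, the sketch below adds up the backoff delays until the window is exceeded. It is a simplified model under stated assumptions, not the server's actual retry code: the function name attempt_times is hypothetical, the leading 0 in BACK_OFF is treated as the initial start attempt, and the time each start attempt itself takes is ignored.

```python
# Backoff delays in seconds, copied from the BACK_OFF array quoted above.
BACK_OFF = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597]


def attempt_times(max_retry_timeout_sec, backoff=BACK_OFF):
    """Return the times (seconds since the first failure) at which start
    attempts would happen in this simplified model.

    Assumption: the first entry (delay 0) is the initial attempt, and a
    retry is only made if its scheduled time is still inside the window.
    """
    times = [backoff[0]]          # the initial attempt always happens
    elapsed = backoff[0]
    for delay in backoff[1:]:
        elapsed += delay
        if elapsed > max_retry_timeout_sec:
            break                 # next retry would fall outside the window
        times.append(elapsed)
    return times


print(attempt_times(100))  # [0, 1, 2, 4, 7, 12, 20, 33, 54, 88]
print(attempt_times(0))    # [0]
```

Under these assumptions, a 100-second window leaves room for nine retries after the initial attempt, while a window of 0 leaves only the initial attempt, which matches the behaviour described above.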


harshita-meena commented on May 28, 2024

@namannandan Thank you! I tried setting the value to 0 and I see it did not restart the model. The only confusing thing was that when I query the model, I do not see the setting at that endpoint, even though it is part of the model config.

curl http://localhost:81/models/toy-ranker
[
  {
    "modelName": "toy-ranker",
    "modelVersion": "2024-03-21-15:57",
    "modelUrl": "toy-ranker.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
......

But this resolves my question, and it makes sense not to use it in prod.


harshita-meena commented on May 28, 2024

@namannandan What would you recommend for deciding when to restart the server? For example, a very basic error that is just related to syntax (e.g. a missed import for the base handler, from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?


namannandan commented on May 28, 2024

> @namannandan What would you recommend for deciding when to restart the server? For example, a very basic error that is just related to syntax (e.g. a missed import for the base handler, from ts.torch_handler.base_handler import BaseHandler) might not need a restart, but I can see scenarios where a worker crashed for an unknown reason and a restart would lead to a healthy worker. Is there a way to control when to restart?

Although we could filter on specific errors to skip a worker restart, I believe it may be challenging to come up with a comprehensive list of which errors should trigger a restart and which should be ignored.
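To make the idea of filtering on specific errors concrete, here is a minimal sketch of what such a policy could look like. It is purely hypothetical and not something the server provides: the names NON_RETRYABLE and should_restart, and the choice of exception types, are assumptions made for illustration.

```python
# Hypothetical policy, not part of the server: treat errors that come from
# the handler's own source code as non-retryable, and retry everything else.
NON_RETRYABLE = (SyntaxError, ImportError, NameError)


def should_restart(error: BaseException) -> bool:
    """Return False for errors a restart is unlikely to fix, True otherwise."""
    return not isinstance(error, NON_RETRYABLE)


# A missing BaseHandler import surfaces as a NameError when the handler
# class is defined, so this policy would skip the restart.
print(should_restart(NameError("name 'BaseHandler' is not defined")))  # False
# An out-of-memory crash might succeed on a retry.
print(should_restart(MemoryError()))  # True
```

Even this small list shows the difficulty: an ImportError could also come from a dependency that only becomes available later, in which case a restart would have helped.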

To keep the implementation as simple as it is currently, we can instead address the core issue here. As you pointed out, on the first worker failure the entire traceback is printed, whereas on subsequent retries the full traceback does not show up in the logs, making it difficult to find the actual error. Here's a potential fix that logs the entire traceback on worker retries: #3036


harshita-meena commented on May 28, 2024

Thank you so much for creating the issues, @namannandan! I understand the complexities of customizing error retries, and I agree that solving the core issue of the traceback should be good enough for debugging purposes. I will close this issue. Thank you again for your prompt replies.

