
Comments (11)

fbbradheintz avatar fbbradheintz commented on May 14, 2024

This feature is looking well-behaved. Please close this issue after merging.

Test cases covered:

  • New Installation (i.e., no stored snapshots):
    • Start with no models specified, verify correct startup snapshot
    • Start with single model specified on command line, verify correct startup snapshot
    • Start with multiple models specified on command line, verify correct startup snapshot
  • Management API: Verify correct snapshot emitted & state restored for each
    • Register model, no workers
    • Register model, with workers
    • Register multiple models
    • Register model on GPU with number of workers > number of GPUs
    • Scale workers up and down
    • Unregister models
  • Verify correct shutdown snapshot emitted for:
    • No models, no workers
    • Multiple models, no workers
    • Multiple models, different numbers of workers
    • Model with number_gpus specified in worker scale-up
    • Multiple models, mix of 0 workers & > 0 workers across models
  • State restore, both with no snapshot specified (taking most recent), and with a snapshot file specified on the command line:
    • No models registered
    • Models registered, no workers
    • Models registered with workers
    • Model specified with batch settings
    • Multiple models, some with batch size > 1, others without
    • Model with number_gpus specified

from serve.

fbbradheintz avatar fbbradheintz commented on May 14, 2024

Marking this as launch blocker as it was included in p0 feedback from the last round.


mycpuorg avatar mycpuorg commented on May 14, 2024

Please review this PR for checkpoint and restart APIs: #52


fbbradheintz avatar fbbradheintz commented on May 14, 2024

We're looking for a much simpler UX for this revision. I'll be posting an issue with the desired UX laid out clearly later today.


fbbradheintz avatar fbbradheintz commented on May 14, 2024

Here's the output of our internal deliberations. Hopefully this gives clarity about what this feature entails. We've tried to keep it simple, and take advantage of existing tooling like server config files.

Please let us know

  • If there's any clarification required
  • If you think it's practical to execute this in the time we have.

TorchServe restart robustness - problem statement:

The intent is to preserve server runtime configuration across sessions such that a TorchServe instance experiencing either a planned or unplanned service stop can restore its state upon restart. State consists of:

  • Model archives, which contain: model code, learned weights, the model handler code, supporting code and data files, and model name and version information.
  • Server configuration, which comprises: which models are running, which versions of those models, and how many workers are active for each model.

The model archives are already stored in TorchServe's model store directory; the config may change frequently, based on usage of the Management API.

Additionally, it is anticipated that it may be something about the configuration that put the server in a bad state - e.g. model handler code that does something unintended that fills the disk and renders the server unresponsive. In that case, it may not be advisable to return to the most recent state, and instead you'd want to return to an earlier one. (This is akin to other session recovery behaviors in software - e.g., how Chrome will restart your session if it terminated happily, but asks if you really want to restore your session when it crashes.)

Thus, we're looking for:

  • A method of doing this state preservation that minimizes user interaction and implements smart defaults.
  • A way to roll back to an earlier server config if desired.

Proposed UX:

Since we already have a format for server configuration - the server config files - it seems straightforward to use these as a vehicle for saving server config snapshots. Model archives should need no further treatment, as they are stored in the model store directory.

The proposed "happy path" user experience looks like:

  1. The server starts for the first time with no saved state; only command line or human-generated config file input determines the server's runtime configuration.

  2. After successful startup, the server stores its current configuration in a timestamped snapshot file ./logs/configs/<YYYYMMDDHHmmSS>-startup.cfg

  3. If a user calls the Management API in a way that changes the server runtime config, another snapshot is saved to ./logs/configs/<YYYYMMDDHHmmSS>-snapshot.cfg

  4. The server experiences error-free operation. More Management API-driven changes to server runtime config may occur, each getting a snapshot.

  5. When the server is shut down intentionally with torchserve --stop, it saves a final snapshot to ./logs/configs/<YYYYMMDDHHmmSS>-shutdown.cfg

  6. On restart, TorchServe automatically picks up the last configuration file in ./logs/configs.
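The naming and pick-up rules in the steps above can be sketched roughly as follows. This is only an illustration of the proposal, not TorchServe's implementation: `snapshot_name`, `save_snapshot`, and `latest_snapshot` are hypothetical helper names, and the config contents are treated as an opaque string. It relies on the fixed-width YYYYMMDDHHmmSS prefix sorting chronologically when sorted as plain text.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

SNAPSHOT_DIR = Path("logs/configs")  # directory named in the proposal
KINDS = ("startup", "snapshot", "shutdown")  # suffixes named in the proposal


def snapshot_name(kind: str, now: datetime) -> str:
    """Return '<YYYYMMDDHHmmSS>-<kind>.cfg' for the given moment."""
    if kind not in KINDS:
        raise ValueError(f"unknown snapshot kind: {kind!r}")
    return f"{now.strftime('%Y%m%d%H%M%S')}-{kind}.cfg"


def save_snapshot(config_text: str, kind: str,
                  directory: Path = SNAPSHOT_DIR,
                  now: Optional[datetime] = None) -> Path:
    """Write the config text to a timestamped file (hypothetical helper)."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / snapshot_name(kind, now or datetime.now())
    path.write_text(config_text)
    return path


def latest_snapshot(directory: Path = SNAPSHOT_DIR) -> Optional[Path]:
    """Return the newest snapshot, or None if there is no saved state."""
    if not directory.is_dir():
        return None
    # Fixed-width timestamps sort chronologically as plain strings.
    candidates = sorted(directory.glob("*.cfg"))
    return candidates[-1] if candidates else None
```

Two snapshots written within the same second would tie on the timestamp and fall back to ordering by suffix, so a real implementation would likely need a finer-grained or sequence-based tiebreak.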

If the server experiences an error requiring restart:

  1. The user restarts the server normally, and TorchServe automatically picks up the last configuration file in ./logs/configs.

If the server experiences an error related to the configuration (e.g., resource contention that renders the server unresponsive):

  1. Upon verifying that the server config may be related to the issue, the user can look in ./logs/configs for the most recent config from before the error. Because of the YYYYMMDDHHmmSS timestamp, the config files will automatically be in ascending chronological order in ls or most file browsers, making recent files easy to find.

  2. The user restarts the server specifying this config file: torchserve --start --model-store <model store> --ts-config <known good config snapshot> - just as they would if they were intentionally starting the server with a human-generated config file.
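For illustration, the manual lookup in step 1 could also be scripted. The sketch below assumes the file naming proposed above; `snapshot_time` and `last_snapshot_before` are hypothetical helpers, not part of TorchServe. It returns the newest snapshot taken strictly before a given error time and ignores files that do not carry the timestamp prefix.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

STAMP_FORMAT = "%Y%m%d%H%M%S"  # YYYYMMDDHHmmSS, as in the proposal


def snapshot_time(path: Path) -> Optional[datetime]:
    """Parse the leading timestamp of a snapshot file name, or return None."""
    stamp = path.name.split("-", 1)[0]
    if len(stamp) != 14 or not stamp.isdigit():
        return None
    try:
        return datetime.strptime(stamp, STAMP_FORMAT)
    except ValueError:  # 14 digits that are not a valid date/time
        return None


def last_snapshot_before(directory: Path, error_time: datetime) -> Optional[Path]:
    """Return the newest snapshot taken strictly before error_time."""
    dated = []
    for path in directory.glob("*.cfg"):
        taken = snapshot_time(path)
        if taken is not None and taken < error_time:
            dated.append((taken, path.name, path))
    return max(dated)[2] if dated else None
```

The returned path would then be passed as the `--ts-config` argument in the restart command from step 2. Whether that snapshot is actually a known good config is still for the user to judge.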

If the user wishes to start without this resiliency feature:

  1. The user starts the server with torchserve --start --model-store <model store> --no-config-snapshots. This prevents the server from storing config snapshot files.


mycpuorg avatar mycpuorg commented on May 14, 2024

@harshbafna Let's discuss this feature in detail offline.


fbbradheintz avatar fbbradheintz commented on May 14, 2024

We did, on the weekly call.


mycpuorg avatar mycpuorg commented on May 14, 2024

@harshbafna while I review these changes can you confirm if this requires commits from checkpoint_api branch in any way?


mycpuorg avatar mycpuorg commented on May 14, 2024

FWIW I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 Please revisit them.


harshbafna avatar harshbafna commented on May 14, 2024

> @harshbafna while I review these changes can you confirm if this requires commits from checkpoint_api branch in any way?

No, we have ported whatever we could reuse from the checkpoint_api branch to the snapshot branch.

> FWIW I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 Please revisit them.

Added comments on all the queries, will make the required changes for the remaining review comments.


mycpuorg avatar mycpuorg commented on May 14, 2024

PR #171 merged to master
