Comments (11)
This feature is looking well-behaved. Please close this issue after merging.
Test cases covered:
- New Installation (i.e., no stored snapshots):
  - Start with no models specified, verify correct startup snapshot
  - Start with single model specified on command line, verify correct startup snapshot
  - Start with multiple models specified on command line, verify correct startup snapshot
- Management API: Verify correct snapshot emitted & state restored for each of:
  - Register model, no workers
  - Register model, with workers
  - Register multiple models
  - Register model on GPU with number of workers > number of GPUs
  - Scale workers up and down
  - Unregister models
- Verify correct shutdown snapshot emitted for:
  - No models, no workers
  - Multiple models, no workers
  - Multiple models, different numbers of workers
  - Model with number_gpus specified in worker scale-up
  - Multiple models, mix of 0 workers & > 0 workers across models
- State restore, both with no snapshot specified (taking most recent), and with a snapshot file specified on the command line:
  - No models registered
  - Models registered, no workers
  - Models registered with workers
  - Model specified with batch settings
  - Multiple models, some with batch size > 1, others without
  - Model with number_gpus specified
Marking this as a launch blocker, as it was included in P0 feedback from the last round.
Please review this PR for checkpoint and restart APIs: #52
We're looking for a much simpler UX for this revision. I'll be posting an issue with the desired UX laid out clearly later today.
Here's the output of our internal deliberations. Hopefully this gives clarity about what this feature entails. We've tried to keep it simple and take advantage of existing tooling like server config files.
Please let us know:
- If there's any clarification required
- If you think it's practical to execute this in the time we have
TorchServe restart robustness - problem statement:
The intent is to preserve server runtime configuration across sessions such that a TorchServe instance experiencing either a planned or unplanned service stop can restore its state upon restart. State consists of:
- Model archives, which contain the model code, model weights, the model handler code, supporting code and data files, and model name and version information.
- Server configuration, which comprises which models are running, which versions of those models, and how many workers are active for each model.
The model archives are already stored in TorchServe's model store directory; the config may change frequently, based on usage of the Management API.
Additionally, it is anticipated that the configuration itself may be what put the server in a bad state - e.g., model handler code that does something unintended, fills the disk, and renders the server unresponsive. In that case, it may not be advisable to return to the most recent state; instead you'd want to return to an earlier one. (This is akin to other session recovery behaviors in software - e.g., how Chrome will restore your session if it exited cleanly, but asks whether you really want to restore your session after a crash.)
Thus, we're looking for:
- A method of doing this state preservation that minimizes user interaction and implements smart defaults.
- A way to roll back to an earlier server config if desired.
Proposed UX:
Since we already have a format for server configuration - the server config files - it seems straightforward to use these as a vehicle for saving server config snapshots. Model archives should need no further treatment, as they are stored in the model store directory.
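To make that concrete, a snapshot could simply be a generated file in the existing config.properties format. The keys below are purely illustrative assumptions, not a committed schema:

```properties
# Illustrative only -- the actual snapshot keys/format are to be decided
model_store=/home/user/model_store
load_models=mnist=mnist_v1.mar,resnet=resnet18_v2.mar
# hypothetical per-model worker counts captured at snapshot time
models.mnist.minWorkers=2
models.resnet.minWorkers=4
```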
The proposed "happy path" user experience looks like:
- The server is started for the first time, with no saved state; only command line or human-generated config file input determines the server's runtime configuration.
- After successful startup, the server stores its current configuration in a timestamped snapshot file: `./logs/configs/<YYYYMMDDHHmmSS>-startup.cfg`
- If a user calls the Management API in a way that changes the server runtime config, another snapshot is saved to `./logs/configs/<YYYYMMDDHHmmSS>-snapshot.cfg`
- The server experiences error-free operation. More Management API-driven changes to the server runtime config may occur, each getting a snapshot.
- When the server is shut down intentionally with `torchserve --stop`, it saves a final snapshot to `./logs/configs/<YYYYMMDDHHmmSS>-shutdown.cfg`
- On restart, TorchServe automatically picks up the last configuration file in `./logs/configs`.
If the server experiences an error requiring restart:
- The user restarts the server normally, and TorchServe automatically picks up the last configuration file in `./logs/configs`.
If the server experiences an error related to the configuration (e.g., resource contention that renders the server unresponsive):
- Upon verifying that the server config may be related to the issue, the user can look in `./logs/configs` for the most recent config from before the error. Because of the `YYYYMMDDHHmmSS` timestamp, the config files will automatically be in ascending chronological order in `ls` or most file browsers, making recent files easy to find.
- The user restarts the server specifying this config file: `torchserve --start --model-store <model store> --ts-config <known good config snapshot>`, just as they would if they were intentionally starting the server with a human-generated config file.
If the user wishes to start without this resiliency feature:
- The user starts the server with `torchserve --start --model-store <model store> --no-config-snapshots`. This prevents the server from storing config snapshot files.
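The rollback workflow leans entirely on the timestamp prefix keeping filenames in chronological order. A quick sanity check of that property (filenames and dates invented for illustration; `snapshot_name` is a hypothetical helper, not TorchServe code):

```python
from datetime import datetime

def snapshot_name(kind: str, when: datetime) -> str:
    # <YYYYMMDDHHmmSS>-<kind>.cfg, following the naming scheme proposed above
    return f"{when.strftime('%Y%m%d%H%M%S')}-{kind}.cfg"

names = [
    snapshot_name("startup",  datetime(2020, 4, 15, 9, 30, 0)),
    snapshot_name("snapshot", datetime(2020, 4, 15, 10, 5, 12)),
    snapshot_name("shutdown", datetime(2020, 4, 15, 18, 0, 1)),
]
# Lexicographic order (what `ls` shows) matches chronological order.
assert names == sorted(names)
```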
@harshbafna Let's discuss this in detail offline about this feature.
We did, on the weekly call.
@harshbafna while I review these changes, can you confirm if this requires commits from the `checkpoint_api` branch in any way?
FWIW, I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 - please revisit them.
> @harshbafna while I review these changes can you confirm if this requires commits from the `checkpoint_api` branch in any way?

No, we have ported whatever we could reuse from the `checkpoint_api` branch to the `snapshot` branch.

> FWIW I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 Please revisit them.

Added comments on all the queries; will make the required changes for the remaining review comments.
PR #171 merged to master