Comments (11)
This feature is looking well-behaved. Please close this issue after merging.
Test cases covered:
- New Installation (i.e., no stored snapshots):
  - Start with no models specified, verify correct startup snapshot
  - Start with single model specified on command line, verify correct startup snapshot
  - Start with multiple models specified on command line, verify correct startup snapshot
- Management API: Verify correct snapshot emitted & state restored for each of:
  - Register model, no workers
  - Register model, with workers
  - Register multiple models
  - Register model on GPU with number of workers > number of GPUs
  - Scale workers up and down
  - Unregister models
- Verify correct shutdown snapshot emitted for:
  - No models, no workers
  - Multiple models, no workers
  - Multiple models, different numbers of workers
  - Model with number_gpus specified in worker scale-up
  - Multiple models, mix of 0 workers & > 0 workers across models
- State restore, both with no snapshot specified (taking most recent), and with a snapshot file specified on the command line:
  - No models registered
  - Models registered, no workers
  - Models registered with workers
  - Model specified with batch settings
  - Multiple models, some with batch size > 1, others without
  - Model with number_gpus specified
Marking this as a launch blocker, as it was included in P0 feedback from the last round.
Please review this PR for checkpoint and restart APIs: #52
We're looking for a much simpler UX for this revision. I'll be posting an issue with the desired UX laid out clearly later today.
Here's the output of our internal deliberations. Hopefully this gives clarity about what this feature entails. We've tried to keep it simple and take advantage of existing tooling like server config files.
Please let us know:
- If there's any clarification required
- If you think it's practical to execute this in the time we have
TorchServe restart robustness - problem statement:
The intent is to preserve server runtime configuration across sessions such that a TorchServe instance experiencing either a planned or unplanned service stop can restore its state upon restart. State consists of:
- Model archives, which contain the model code, model weights, the model handler code, supporting code and data files, and model name and version information.
- Server configuration, which comprises which models are running, which versions of those models, and how many workers are active for each model.
The model archives are already stored in TorchServe's model store directory; the config may change frequently, based on usage of the Management API.
Additionally, it is anticipated that the configuration itself may be what put the server in a bad state - e.g., model handler code that does something unintended, fills the disk, and renders the server unresponsive. In that case, it may not be advisable to return to the most recent state; instead you'd want to return to an earlier one. (This is akin to other session recovery behaviors in software - e.g., how Chrome will restore your session if it exited cleanly, but asks whether you really want to restore your session after a crash.)
Thus, we're looking for:
- A method of doing this state preservation that minimizes user interaction and implements smart defaults.
- A way to roll back to an earlier server config if desired.
Proposed UX:
Since we already have a format for server configuration - the server config files - it seems straightforward to use these as a vehicle for saving server config snapshots. Model archives should need no further treatment, as they are stored in the model store directory.
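To make that concrete, a snapshot could simply be a generated file in the existing config.properties format. The keys below are purely illustrative assumptions, not a committed schema:

```properties
# Illustrative only -- the actual snapshot keys/format are to be decided
model_store=/home/user/model_store
load_models=mnist=mnist_v1.mar,resnet=resnet18_v2.mar
# hypothetical per-model worker counts captured at snapshot time
models.mnist.minWorkers=2
models.resnet.minWorkers=4
```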
The proposed "happy path" user experience looks like:
- The server is started for the first time, with no saved state; only command line or human-generated config file input determines the server's runtime configuration.
- After successful startup, the server stores its current configuration in a timestamped snapshot file: `./logs/configs/<YYYYMMDDHHmmSS>-startup.cfg`
- If a user calls the Management API in a way that changes the server runtime config, another snapshot is saved to `./logs/configs/<YYYYMMDDHHmmSS>-snapshot.cfg`
- The server experiences error-free operation. More Management API-driven changes to the server runtime config may occur, each getting a snapshot.
- When the server is shut down intentionally with `torchserve --stop`, it saves a final snapshot to `./logs/configs/<YYYYMMDDHHmmSS>-shutdown.cfg`
- On restart, TorchServe automatically picks up the last configuration file in `./logs/configs`.
If the server experiences an error requiring restart:
- The user restarts the server normally, and TorchServe automatically picks up the last configuration file in `./logs/configs`.
If the server experiences an error related to the configuration (e.g., resource contention that renders the server unresponsive):
- Upon verifying that the server config may be related to the issue, the user can look in `./logs/configs` for the most recent config from before the error. Because of the `YYYYMMDDHHmmSS` timestamp, the config files will automatically be in ascending chronological order in `ls` or most file browsers, making recent files easy to find.
- The user restarts the server specifying this config file: `torchserve --start --model-store <model store> --ts-config <known good config snapshot>`, just as they would if they were intentionally starting the server with a human-generated config file.
If the user wishes to start without this resiliency feature:
- The user starts the server with `torchserve --start --model-store <model store> --no-config-snapshots`. This prevents the server from storing config snapshot files.
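The rollback workflow leans entirely on the timestamp prefix keeping filenames in chronological order. A quick sanity check of that property (filenames and dates invented for illustration; `snapshot_name` is a hypothetical helper, not TorchServe code):

```python
from datetime import datetime

def snapshot_name(kind: str, when: datetime) -> str:
    # <YYYYMMDDHHmmSS>-<kind>.cfg, following the naming scheme proposed above
    return f"{when.strftime('%Y%m%d%H%M%S')}-{kind}.cfg"

names = [
    snapshot_name("startup",  datetime(2020, 4, 15, 9, 30, 0)),
    snapshot_name("snapshot", datetime(2020, 4, 15, 10, 5, 12)),
    snapshot_name("shutdown", datetime(2020, 4, 15, 18, 0, 1)),
]
# Lexicographic order (what `ls` shows) matches chronological order.
assert names == sorted(names)
```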
@harshbafna Let's discuss this in detail offline about this feature.
We did, on the weekly call.
@harshbafna while I review these changes, can you confirm if this requires commits from the `checkpoint_api` branch in any way?
FWIW, I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 - please revisit them.
> @harshbafna while I review these changes can you confirm if this requires commits from the `checkpoint_api` branch in any way?

No, we have ported whatever we could reuse from the `checkpoint_api` branch to the `snapshot` branch.

> FWIW I have left comments on https://github.com/pytorch/serve/pull/111/files/a2a8d17552664161a31b7fa2a54abc9607215039 Please revisit them.

Added comments on all the queries; will make the required changes for the remaining review comments.
PR #171 merged to master