Comments (11)
Filed a JIT ticket for potential improvements: pytorch/pytorch#33354
from serve.
Thanks to @nairbv for the tandem diagnosis on this.
The issue only shows on the model's first forward pass. There's a bunch of precompilation that needs to happen for TorchScript to execute an inference. After that happens, things get much faster. I'll verify that once a worker is hit once, perf improves.
At this time, the only way to kick off this precompilation is to perform a forward pass. We discussed different ways to accommodate this:
- One way is for TorchServe to have a way to generate a valid or valid-looking input to a model. This would require a bunch of new code and config around what constitutes a valid input to a model. Seems like a heavy lift.
- Another way would be for the user to (optionally) include a serialized PyTorch tensor that contains valid input. If such a tensor exists, TorchServe could load it and pass it to the model as part of initialization.
The latter could be exposed as a single, optional flag on torchserve, something like:
torchserve --start --blahblah --sample_input=valid_input.pt
For the time being, this doesn't need to block launch, but we should make a plan to improve this in future revs.
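To make the second option concrete, here is a minimal sketch of a warm-up step that runs a few forward passes at initialization so the one-time precompilation cost is paid before the first real request. The names `warm_up` and `LazyModel` are hypothetical and not part of TorchServe; `LazyModel` is only a stand-in so the snippet runs without PyTorch. With a real scripted model, one would presumably pass the result of `torch.jit.load(...)` as `model` and the tensor from `torch.load("valid_input.pt")` as `sample_input`, and run the call under `torch.no_grad()`.

```python
import time
from typing import Any, Callable, List


def warm_up(model: Callable[[Any], Any], sample_input: Any, passes: int = 3) -> List[float]:
    """Run `passes` forward passes and return each pass's duration in seconds.

    Hypothetical helper: the idea is to call this once per worker right after
    the model is loaded, before the worker starts accepting requests.
    """
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        model(sample_input)  # output is discarded; only the side effect matters
        durations.append(time.perf_counter() - start)
    return durations


class LazyModel:
    """Stand-in for a scripted model that does one-time setup on its first call."""

    def __init__(self) -> None:
        self.compiled = False
        self.setup_calls = 0

    def __call__(self, x):
        if not self.compiled:
            # Simulates the precompilation that only the first forward pass pays for.
            self.setup_calls += 1
            self.compiled = True
        return [v * 2 for v in x]


if __name__ == "__main__":
    model = LazyModel()
    print(warm_up(model, [1.0, 2.0, 3.0]))
    print("one-time setup ran", model.setup_calls, "time(s)")
```

Returning the per-pass durations makes it easy to log whether the first pass really was the slow one on a given worker.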
from serve.
Validated this on the latest master with PT 1.7 on a p3.8xlarge instance with 4 model workers, each loaded on a different GPU device. Response time is about 1.4 seconds:
ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg
{
"tiger_cat": 0.46933576464653015,
"tabby": 0.463387668132782,
"Egyptian_cat": 0.06456146389245987,
"lynx": 0.0012828221078962088,
"plastic_bag": 0.00023323048662859946
}
real 0m1.344s
user 0m0.000s
sys 0m0.006s
ubuntu@ip-172-31-73-130:~$
ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg
{
"tiger_cat": 0.46933576464653015,
"tabby": 0.463387668132782,
"Egyptian_cat": 0.06456146389245987,
"lynx": 0.0012828221078962088,
"plastic_bag": 0.00023323048662859946
}
real 0m1.347s
user 0m0.000s
sys 0m0.006s
ubuntu@ip-172-31-73-130:~$
ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg
{
"tiger_cat": 0.46933576464653015,
"tabby": 0.463387668132782,
"Egyptian_cat": 0.06456146389245987,
"lynx": 0.0012828221078962088,
"plastic_bag": 0.00023323048662859946
}
real 0m1.394s
user 0m0.000s
sys 0m0.006s
ubuntu@ip-172-31-73-130:~$
ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg
{
"tiger_cat": 0.46933576464653015,
"tabby": 0.463387668132782,
"Egyptian_cat": 0.06456146389245987,
"lynx": 0.0012828221078962088,
"plastic_bag": 0.00023323048662859946
}
real 0m1.374s
user 0m0.000s
sys 0m0.006s
ubuntu@ip-172-31-73-130:~$
Closing the ticket.
from serve.
Sorry, ignore this. I didn't notice that the model was getting deployed onto the GPU without any further setup, so that overhead is probably due to the model being on GPU (some CUDA cache coldness). There still seems to be a slight lag in first calls on CPU, though it is probably negligible.
Thanks!
from serve.
Note that this isn't a "lag time loading the model" issue - repeated attempts give similar results.
I'll try it with some other models as well, to see how consistent the issue is.
from serve.
@fbbradheintz This seems like a PyTorch-specific issue.
Please find attached sample prediction code for densenet161 model in eager and torchscript mode.
test_torchscript.txt
test_eager.txt
- TorchScript mode execution time
(base) USL07109 harsh_bafna$ time python test_torchscript.py
['n02123045', 'tabby']
--- 114.89286518096924 seconds ---
real 1m56.721s
user 1m56.270s
sys 0m0.950s
- Eager mode execution time
(base) USL07109 harsh_bafna$ time python test_eager.py
['n02123045', 'tabby']
--- 1.1276381015777588 seconds ---
real 0m1.974s
user 0m1.975s
sys 0m0.409s
We also found the following open PyTorch issue related to performance in TorchScript mode:
pytorch/pytorch#30365
from serve.
Ah - I hadn't seen that issue. Will investigate from my side. I'll keep this issue open in the meantime.
from serve.
@harshbafna Can you share the scripts you used for that test?
from serve.
Confirmed fix in the latest 1.7 RC thanks to the fix in pytorch/pytorch#33354.
Followed steps in the original issue description to create a torchscripted densenet model and served it with TS.
time curl -X POST http://127.0.0.1:8080/predictions/tsd161 -T kitten.jpg
{
"282": 0.4693361222743988,
"281": 0.4633875787258148,
"285": 0.06456127017736435,
"287": 0.0012828144244849682,
"728": 0.00023322943889070302
}
real 0m0.496s
user 0m0.004s
sys 0m0.004s
time curl -X POST http://127.0.0.1:8080/predictions/tsd161 -T kitten.jpg
{
"282": 0.4693361222743988,
"281": 0.4633875787258148,
"285": 0.06456127017736435,
"287": 0.0012828144244849682,
"728": 0.00023322943889070302
}
real 0m0.049s
user 0m0.008s
sys 0m0.000s
@chauhang, can we close this issue now or wait until 1.7 is out?
from serve.
Hi,
sorry to bring this up but I thought that this may be the right place.
I'm also having a similar issue with torchserve-nightly, but interestingly with an eager model. After launching the server, the forward pass for the very first HTTP request takes around 320ms whereas the subsequent ones take around 9ms. I've measured times of different snippets and indeed, it's the forward call that takes 99% of this time.
Do you have any ideas?
from serve.
@ozancaglayan That sounds like a distinct issue, so might want to file a separate one. Is this difference torchserve specific, or is it something you can reproduce without torchserve?
from serve.