I have two identical models, one in code + weights, the other in Torch. Doing in

Long wait times for first request from TorchScript model about serve HOT 11 CLOSED

pytorch commented on May 15, 2024

Long wait times for first request from TorchScript model

from serve.

Comments (11)

nairbv commented on May 15, 2024 4

filed JIT ticket for potential improvements: pytorch/pytorch#33354

from serve.

fbbradheintz commented on May 15, 2024 1

Thanks to @nairbv for the tandem diagnosis on this.

The issue only shows on the model's first forward pass. There's a bunch of precompilation that needs to happen for TorchScript to execute an inference. After that happens, things get much faster. I'll verify that once a worker is hit once, perf improves.

At this time, the only way to kick off this precompilation is to perform a forward pass. We discussed different ways to accommodate this:

One way is for TorchServe to have a way to generate a valid or valid-looking input to a model. This would require a bunch of new code and config around what constitutes a valid input to a model. Seems like a heavy lift.
Another way would be for the user to (optionally) include a serialized PyTorch tensor that contains valid input. If such a tensor exists, TorchServe could load it and pass it to the model as part of initialization.

The latter could be exposed as a single, optional flag on torchserve, something like:

torchserve --start --blahblah --sample_input=valid_input.pt

For the time being, this doesn't need to block launch, but we should make a plan to improve this in future revs.
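To make the second option concrete, here is a minimal sketch of what a warm-up step could look like. It is not TorchServe code: `warm_up` and `load_and_warm_up` are hypothetical names, and the file paths are placeholders. The only PyTorch calls it relies on are `torch.jit.load`, `torch.load`, and `torch.no_grad`.

```python
from typing import Any, Callable, Optional


def warm_up(model: Callable[[Any], Any], sample_input: Optional[Any], passes: int = 1) -> int:
    """Run `passes` forward passes on `sample_input`; return how many were run.

    Returns 0 without touching the model when no sample input was supplied,
    which mirrors the optional nature of the proposed flag.
    """
    if sample_input is None:
        return 0
    for _ in range(passes):
        model(sample_input)  # output is discarded; the call only triggers first-pass work
    return passes


def load_and_warm_up(model_path: str, sample_input_path: str) -> int:
    """Hypothetical initialization hook: load a TorchScript model and a saved tensor, then warm up."""
    import torch  # imported lazily so warm_up() itself has no PyTorch dependency

    model = torch.jit.load(model_path)
    model.eval()
    sample_input = torch.load(sample_input_path)  # e.g. a tensor saved earlier with torch.save
    with torch.no_grad():
        return warm_up(model, sample_input)
```

With something like this in place, the cost of the first forward pass would be paid at worker startup rather than on the first user request.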

from serve.

harshbafna commented on May 15, 2024 1

Validated this on the latest master with PT 1.7 on a p3.8xlarge instance, with 4 model workers each loaded on a different GPU device. Response time is about 1.4 seconds:

ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg 
{
  "tiger_cat": 0.46933576464653015,
  "tabby": 0.463387668132782,
  "Egyptian_cat": 0.06456146389245987,
  "lynx": 0.0012828221078962088,
  "plastic_bag": 0.00023323048662859946
}
real	0m1.344s
user	0m0.000s
sys	0m0.006s
ubuntu@ip-172-31-73-130:~$

ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg 
{
  "tiger_cat": 0.46933576464653015,
  "tabby": 0.463387668132782,
  "Egyptian_cat": 0.06456146389245987,
  "lynx": 0.0012828221078962088,
  "plastic_bag": 0.00023323048662859946
}
real	0m1.347s
user	0m0.000s
sys	0m0.006s
ubuntu@ip-172-31-73-130:~$

ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg 
{
  "tiger_cat": 0.46933576464653015,
  "tabby": 0.463387668132782,
  "Egyptian_cat": 0.06456146389245987,
  "lynx": 0.0012828221078962088,
  "plastic_bag": 0.00023323048662859946
}
real	0m1.394s
user	0m0.000s
sys	0m0.006s
ubuntu@ip-172-31-73-130:~$

ubuntu@ip-172-31-73-130:~$ time curl -X POST http://localhost:8080/predictions/densenet161_scripted -T serve/examples/image_classifier/kitten.jpg 
{
  "tiger_cat": 0.46933576464653015,
  "tabby": 0.463387668132782,
  "Egyptian_cat": 0.06456146389245987,
  "lynx": 0.0012828221078962088,
  "plastic_bag": 0.00023323048662859946
}
real	0m1.374s
user	0m0.000s
sys	0m0.006s
ubuntu@ip-172-31-73-130:~$

Closing the ticket.

from serve.

ozancaglayan commented on May 15, 2024 1

Sorry, ignore this. I didn't notice that the model was getting deployed onto the GPU without any further setup, so that overhead is probably due to the model being on the GPU (some CUDA cache coldness). There still seems to be a slight lag in first calls on CPU, though it is probably negligible.

Thanks!

from serve.

fbbradheintz commented on May 15, 2024

Note that this isn't a "lag time loading the model" issue - repeated attempts give similar results.

I'll try it with some other models as well, to see how consistent the issue is.

from serve.

harshbafna commented on May 15, 2024

@fbbradheintz This seems like a PyTorch-specific issue.

Please find attached sample prediction code for the densenet161 model in eager and TorchScript mode.

test_torchscript.txt
test_eager.txt

TorchScript mode execution time

(base) USL07109 harsh_bafna$ time python test_torchscript.py 
['n02123045', 'tabby']
--- 114.89286518096924 seconds ---

real	1m56.721s
user	1m56.270s
sys	0m0.950s

Eager mode execution time

(base) USL07109 harsh_bafna$ time python test_eager.py 
['n02123045', 'tabby']
--- 1.1276381015777588 seconds ---

real	0m1.974s
user	0m1.975s
sys	0m0.409s

We also found the following open PyTorch issue related to performance in TorchScript mode:
pytorch/pytorch#30365
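The attached scripts are not reproduced in this thread. For anyone who wants to check the same effect without TorchServe, a rough sketch of such a comparison follows. It is not the attached code: `time_calls` and `compare_eager_and_scripted` are hypothetical helpers, the model uses random weights (enough for timing), and the input shape is an assumption.

```python
import time
from typing import Any, Callable, Dict, List


def time_calls(fn: Callable[[Any], Any], arg: Any, n: int = 3) -> List[float]:
    """Call fn(arg) n times and return the wall-clock duration of each call in seconds."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fn(arg)
        timings.append(time.perf_counter() - start)
    return timings


def compare_eager_and_scripted(n: int = 3) -> Dict[str, List[float]]:
    """Time the first n forward passes of densenet161 in eager and TorchScript mode."""
    import torch  # lazy imports keep time_calls() usable without PyTorch installed
    import torchvision

    eager = torchvision.models.densenet161().eval()  # random weights, no download
    scripted = torch.jit.script(eager)
    x = torch.rand(1, 3, 224, 224)  # assumed input shape: one 224x224 RGB image
    with torch.no_grad():
        return {
            "eager": time_calls(eager, x, n),
            "torchscript": time_calls(scripted, x, n),
        }
```

If the first entry of the `"torchscript"` list is far larger than the rest, the slowdown is the first-pass cost discussed above rather than anything specific to TorchServe.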

from serve.

fbbradheintz commented on May 15, 2024

Ah - I hadn't seen that issue. Will investigate from my side. I'll keep this issue open in the meantime.

from serve.

fbbradheintz commented on May 15, 2024

@harshbafna Can you share the scripts you used for that test?

from serve.

jeremiahschung commented on May 15, 2024

Confirmed fixed in the latest 1.7 RC, thanks to the fix in pytorch/pytorch#33354.

I followed the steps in the original issue description to create a TorchScripted densenet model and served it with TS:

time curl -X POST http://127.0.0.1:8080/predictions/tsd161 -T kitten.jpg
{
  "282": 0.4693361222743988,
  "281": 0.4633875787258148,
  "285": 0.06456127017736435,
  "287": 0.0012828144244849682,
  "728": 0.00023322943889070302
}
real	0m0.496s
user	0m0.004s
sys	0m0.004s

time curl -X POST http://127.0.0.1:8080/predictions/tsd161 -T kitten.jpg
{
  "282": 0.4693361222743988,
  "281": 0.4633875787258148,
  "285": 0.06456127017736435,
  "287": 0.0012828144244849682,
  "728": 0.00023322943889070302
}
real	0m0.049s
user	0m0.008s
sys	0m0.000s

@chauhang, can we close this issue now, or should we wait until 1.7 is out?

from serve.

ozancaglayan commented on May 15, 2024

Hi,

Sorry to bring this up, but I thought this might be the right place.

I'm also having a similar issue with torchserve-nightly, but interestingly with an eager model. After launching the server, the forward pass for the very first HTTP request takes around 320ms, whereas subsequent ones take around 9ms. I've timed the different snippets, and it is indeed the forward call that takes 99% of this time.

Do you have any ideas?

from serve.

nairbv commented on May 15, 2024

@ozancaglayan That sounds like a distinct issue, so you might want to file a separate one. Is this difference TorchServe-specific, or is it something you can reproduce without TorchServe?

from serve.
