Comments (3)
You may experience improved speed if you use `SpanMarkerModel.from_pretrained(..., torch_dtype=torch.float16)` or `torch_dtype=torch.bfloat16`. See e.g.:
```python
import time

import torch
from span_marker import SpanMarkerModel

# Load in bfloat16; remove `torch_dtype` to load in the default float32 instead
model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-roberta-large-fewnerd-fine-super",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
text = [
    "Leonardo da Vinci recently published a scientific paper on combatting Mitocromulent disease. Leonardo da Vinci painted the most famous painting in existence: the Mona Lisa.",
    "Leonardo da Vinci scored a critical goal towards the end of the second half. Leonardo da Vinci controversially veto'd a bill regarding public health care last friday. Leonardo da Vinci was promoted to Sergeant after his outstanding work in the war.",
]
BS = 64  # batch size
N = 500  # number of times to repeat the two sentences

# Warmup run, so one-time setup costs don't skew the measurement
model.predict(text * 50, batch_size=BS)

start_t = time.time()
model.predict(text * N, batch_size=BS)
print(f"{time.time() - start_t:8f}s for {N * 2} samples with batch_size={BS} and torch_dtype={model.dtype}.")
```
This gave me:

```
20.745640s for 1000 samples with batch_size=64 and torch_dtype=torch.float16.
16.534876s for 1000 samples with batch_size=64 and torch_dtype=torch.bfloat16.
```

and

```
39.655506s for 1000 samples with batch_size=64 and torch_dtype=torch.float32.
```
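In throughput terms, those three timings work out as follows (simple arithmetic over the numbers above; a quick sanity check, not part of the original benchmark):

```python
# Samples per second implied by the timings above (1000 samples per run)
timings = {"float32": 39.655506, "float16": 20.745640, "bfloat16": 16.534876}
for dtype, seconds in timings.items():
    print(f"{dtype}: {1000 / seconds:.1f} samples/s")
# float32 comes out around 25 samples/s, float16 around 48, bfloat16 around 60
```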
Note that float16 is not available on CPU, though! I'm not sure about bfloat16.
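If you're unsure, a quick way to probe whether a dtype is usable on your CPU is plain PyTorch, no SpanMarker model needed (`dtype_works_on_cpu` is just a hypothetical helper name for this sketch):

```python
import torch

def dtype_works_on_cpu(dtype: torch.dtype) -> bool:
    """Try a representative op (matmul) in the given dtype on CPU."""
    try:
        x = torch.randn(8, 8).to(dtype)
        torch.matmul(x, x)
        return True
    except RuntimeError:
        return False

# Whether these print True depends on your PyTorch version and build
print(dtype_works_on_cpu(torch.float16))
print(dtype_works_on_cpu(torch.bfloat16))
```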
If you have a Linux (or Mac?) device, then you can also use `load_in_8bit=True` or `load_in_4bit=True` by installing `bitsandbytes`, but I don't know whether that improves inference speed - and these options are CUDA-only as well.
Beyond that, the steps to increase inference speed become pretty challenging. Hope this helps a bit. Also, you can already process about 8 sentences per second on CPU and about 110 sentences per second on GPU - is that not sufficiently fast yet?
- Tom Aarsen
Thank you, @tomaarsen. Using torch.float16 worked for me. It would be excellent if the operation could complete in less than one second with a batch size of 256.
| Batch Size | Average Inference Time (ms) | New Inference Time (ms) |
|---|---|---|
| 16 | 0.14945 | 0.09211015701 |
| 32 | 0.28 | 0.1645913124 |
| 64 | 0.51582 | 0.2973537445 |
| 128 | 1.10669 | 0.6381671429 |
| 256 | 2.24729 | 1.238643169 |
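Those numbers correspond to roughly a 1.6-1.8x speedup across batch sizes (quick arithmetic on the table above, not an extra measurement):

```python
# (old, new) inference times per batch size, copied from the table above
timings = {
    16: (0.14945, 0.09211015701),
    32: (0.28, 0.1645913124),
    64: (0.51582, 0.2973537445),
    128: (1.10669, 0.6381671429),
    256: (2.24729, 1.238643169),
}
for bs, (old, new) in timings.items():
    print(f"batch_size={bs}: {old / new:.2f}x faster")
# Ranges from about 1.62x (batch 16) to 1.81x (batch 256)
```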
@polodealvarado started working on ONNX support here: #26 (comment)
If we can make it work, perhaps we can then improve the speed even further. Until then, it will be hard to get even faster results. Less than a second for a batch size of 256 equals more than 256 sentences per second, which is already quite efficient.
- Tom Aarsen