Comments (8)
Hello, I think the data transfer from host to device is more optimized than from device to host. Additionally, the time for the model transfer also includes the time to allocate the data. As far as I know, allocation on the CPU takes more time than on the GPU, so it is not surprising that the unloading process takes longer.
I found some tests here
from ctranslate2.
@minhthuc2502 thanks for the reply.
Since posting the issue I studied the code a bit and found something that can solve the issue in my use case.
I noticed that every call to Translator.load_model or Translator.unload_model performs _cached_models.clear().
If I understand correctly, a non-empty _cached_models is what makes loading the translator to the host quick.
I thought that if clearing _cached_models were made optional (for example via an additional clear_cache argument), the following would become possible:
- During initialization: Translator.unload_model(to_cpu=True). This will still be slow, which is OK.
- During a request, before inference: Translator.load_model(clear_cache=False). This will be optimized because _cached_models is non-empty, and it will not clear _cached_models.
- During a request, after inference: Translator.unload_model(clear_cache=False). This will unload the model fast and keep the cache, so the next load is fast too.
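The intended semantics can be sketched with a toy Python model. Everything here is hypothetical: the clear_cache argument is the one proposed above, _cached_models stands in for the real C++ member, and the lists only simulate where model copies live; this is not the real ctranslate2 API.

```python
class ToyTranslator:
    """Toy sketch of the proposed clear_cache semantics (not the real API)."""

    def __init__(self):
        # The constructor loads the model straight to the device,
        # like the real Translator; nothing is cached on the host yet.
        self._cached_models = []               # stands in for the C++ _cached_models
        self._on_device = ["model-from-disk"]
        self.last_load_was_slow = None

    def load_model(self, clear_cache=True):
        if self._cached_models:
            self._on_device = list(self._cached_models)  # fast path: reuse the cache
            self.last_load_was_slow = False
        else:
            self._on_device = ["model-from-disk"]        # slow path: read from disk
            self.last_load_was_slow = True
        if clear_cache:
            self._cached_models = []

    def unload_model(self, to_cpu=False, clear_cache=True):
        if to_cpu:
            self._cached_models = list(self._on_device)  # keep a host copy
        self._on_device = []
        if clear_cache and not to_cpu:
            self._cached_models = []

# The proposed flow: one slow cycle during initialization, fast ever after.
t = ToyTranslator()
t.unload_model(to_cpu=True)        # populates the cache
t.load_model(clear_cache=False)    # fast, and the cache survives
assert t.last_load_was_slow is False
t.unload_model(clear_cache=False)  # fast, cache survives
t.load_model(clear_cache=False)    # fast again
assert t.last_load_was_slow is False
```

The point of the sketch is that once clear_cache=False keeps the cache alive across an unload, every subsequent load can take the fast path.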
Here are the relevant lines in the code:
CTranslate2/python/cpp/translator.cc
Line 299 in bfa0cb3
If I understood the code correctly, it doesn't matter that the transfer from host to device is less optimized, since it will happen only once during initialization; after that we won't need to transfer the model again because it will already exist in _cached_models.
Yes, in your use case you can keep the unloaded model in the cache so you do not have to pay the full unload cost every time in the future. Only loading the model is needed.
It was a bit tougher than I thought, but I think I got it working.
I had to write a new method to clone the cached models instead of moving them.
So now it works in the following way:
load_model:
- loads the models to GPU
- moves the models to CPU
- clones the models while they are on CPU
- moves the clones to GPU
unload_model:
- never moves the models
- if called with to_cpu=True, just clears the cached models on GPU; otherwise clears both the CPU and GPU caches
It still works a bit funny though. These are the steps I now need to do in order to initialize the model:
translator = Translator(model_path, "cuda", compute_type)  # takes ~6 seconds
translator.unload_model(to_cpu=True)  # takes a few ms
translator.load_model()  # takes ~14 seconds
And from here on it works as I intended, no matter how many times I do it (which is what I wanted 😁):
translator.unload_model(to_cpu=True)  # takes a few ms
translator.load_model()  # takes ~1.5 seconds
And it unloads fully when I call translator.unload_model().
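This clone-based flow can be sketched as a toy Python model. The names (_cpu_cached_models, the dict standing in for model weights, the disk_loads counter) are illustrative only; the counter just tracks how often the loader would touch disk.

```python
import copy

class TwoCacheTranslator:
    """Toy sketch of the clone-based design above (not the real ctranslate2 API)."""

    def __init__(self):
        # The constructor behaves like the stock Translator: the model goes
        # straight to the device, nothing is cached on the host yet.
        self._cpu_cached_models = []
        self._on_device = [{"weights": "from-disk"}]
        self.disk_loads = 1

    def unload_model(self, to_cpu=False):
        # Never moves models; only drops references.
        self._on_device = []
        if not to_cpu:
            self._cpu_cached_models = []  # full unload drops the host cache too

    def load_model(self):
        if not self._cpu_cached_models:
            # Slow path: load from disk twice, once to populate the host
            # cache and once for the device copies (mirrors the ~14 s step).
            self._cpu_cached_models = [{"weights": "from-disk"}]
            self._on_device = [{"weights": "from-disk"}]
            self.disk_loads += 2
        else:
            # Fast path: clone the host copies and move the clones to device.
            self._on_device = copy.deepcopy(self._cpu_cached_models)

t = TwoCacheTranslator()     # ~6 s in the real setup
t.unload_model(to_cpu=True)  # a few ms: drops the device copy
t.load_model()               # ~14 s, once: populates the host cache
for _ in range(3):           # every later cycle clones the host cache
    t.unload_model(to_cpu=True)
    t.load_model()           # ~1.5 s: no disk I/O
assert t.disk_loads == 3
```

The sketch shows why the third initialization step is slow exactly once: after it, the host cache is never emptied by unload_model(to_cpu=True), so every later load is a clone.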
I'm a bit rusty with my C++, so the code might not be suitable for everyone, but these are the changes I introduced in the module python/cpp/translator.cc:
  void unload_model(const bool to_cpu) {
    if (to_cpu && _device == Device::CPU)
      return;

    // Do not unload the model if some batches are still being processed.
    if (_pool->num_active_batches() > 0)
      return;

    // If the lock is not acquired immediately it means the model is being used
    // in another thread and we can't unload it at this time.
    std::unique_lock lock(_mutex, std::try_to_lock);
    if (!lock || !_model_is_loaded)
      return;

    _gpu_cached_models = _pool->detach_models();
    if (to_cpu) {
      // Keep the CPU cache so the next load_model() can clone from it;
      // only drop the detached GPU copies.
      _gpu_cached_models.clear();
    } else {
      // Full unload: drop both the CPU and GPU caches.
      _cpu_cached_models.clear();
      _gpu_cached_models.clear();
    }

    // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
    if (_device == Device::CUDA)
      _pool->clear_cache();

    _model_is_loaded = false;
  }

  void load_model() {
    std::unique_lock lock(_mutex);
    if (_model_is_loaded)
      return;

    if (_cpu_cached_models.empty()) {
      // First load: read the model from disk twice, once to populate the
      // CPU cache and once for the copies that go to the device.
      _cpu_cached_models = _model_loader.load();
      move_cached_models(Device::CPU, std::vector<int>(_cpu_cached_models.size(), 0), _cpu_cached_models);
      _gpu_cached_models = _model_loader.load();
    } else {
      // Subsequent loads: clone the CPU-cached models onto the device.
      _gpu_cached_models = copy_cached_models(_device, _device_index, _num_replicas_per_device);
    }

    _pool->set_models(_gpu_cached_models);
    _gpu_cached_models.clear();
    _model_is_loaded = true;
  }

private:
  const Device _device;
  const std::vector<int>& _device_index;
  const size_t _num_replicas_per_device;
  std::vector<std::shared_ptr<const models::Model>> _cpu_cached_models;
  std::vector<std::shared_ptr<const models::Model>> _gpu_cached_models;
  bool _model_is_loaded;

  // Use a shared mutex to protect the model state (loaded/unloaded).
  // Multiple threads can read the model at the same time, but a single thread can change
  // the model state (e.g. load or unload the model).
  std::shared_mutex _mutex;

  void assert_model_is_ready() const {
    if (!_model_is_loaded)
      throw std::runtime_error("The model for this translator was unloaded");
  }

  void move_cached_models(Device device,
                          const std::vector<int>& device_index,
                          const std::vector<std::shared_ptr<const models::Model>>& cached_models,
                          size_t num_models_per_device = 1) {
    for (size_t i = 0; i < cached_models.size(); ++i) {
      // The models are held through const pointers; cast away constness to move them.
      auto& model = const_cast<models::Model&>(*cached_models[i]);
      model.set_device(device, device_index[i / num_models_per_device]);
    }
  }

  std::vector<std::shared_ptr<const models::Model>> copy_cached_models(Device device,
                                                                       const std::vector<int>& device_index,
                                                                       size_t num_models_per_device = 1) {
    std::vector<std::shared_ptr<const models::Model>> copied_models;
    for (size_t i = 0; i < _cpu_cached_models.size(); ++i) {
      auto& model = const_cast<models::Model&>(*_cpu_cached_models[i]);
      auto copied_model = model.copy_to(device, device_index[i / num_models_per_device]);
      copied_models.push_back(copied_model);
    }
    return copied_models;
  }
};
I would be happy if you could suggest how I can avoid having to call translator.load_model() directly after invoking the Translator constructor. I couldn't figure out which code is executed when the constructor is called, or how to make it perform the same actions as my updated load_model method.
Also, if you have any other suggestions I would be happy to hear them 😁.
I improved the code quality and also implemented it in a way that preserves the old behavior by adding a persistent_cpu_cache flag.
I opened a PR for this change: #1645
Hello, I think this PR is specific to your use case only, and the behavior becomes more complex. The parameter persist_cpu_cache is not relevant when a Translator is created; it makes more sense when loading or unloading a model. As far as I understand, you can just avoid clearing the cache when loading the model, something like this:
void load_model(bool keep_cache) {
  std::unique_lock lock(_mutex);
  if (_model_is_loaded)
    return;

  if (_cached_models.empty()) {
    _cached_models = _model_loader.load();
  } else {
    move_cached_models(_device, _device_index, _num_replicas_per_device);
  }

  _pool->set_models(_cached_models);
  if (!keep_cache)
    _cached_models.clear();
  _model_is_loaded = true;
}

void unload_model(const bool to_cpu) {
  if (to_cpu && _device == Device::CPU)
    return;

  // Do not unload the model if some batches are still being processed.
  if (_pool->num_active_batches() > 0)
    return;

  // If the lock is not acquired immediately it means the model is being used
  // in another thread and we can't unload it at this time.
  std::unique_lock lock(_mutex, std::try_to_lock);
  if (!lock || !_model_is_loaded)
    return;

  if (_cached_models.empty()) {
    _cached_models = _pool->detach_models();
    if (to_cpu)
      move_cached_models(Device::CPU, std::vector<int>(_cached_models.size(), 0));
    else
      _cached_models.clear();
  }

  // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
  if (_device == Device::CUDA)
    _pool->clear_cache();

  _model_is_loaded = false;
}
I think you can keep the changes specific to your own application, and if there is real interest in frequent model loading and unloading in the future, we could consider optimizing it.
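The single-cache keep_cache suggestion can also be mocked in Python to see how it differs. All names here are illustrative (the disk_loads counter in particular is not part of any real API); this is a sketch of the semantics, not an implementation.

```python
class KeepCacheTranslator:
    """Toy sketch of the single-cache keep_cache idea (not the real API)."""

    def __init__(self):
        self._cached_models = []
        self._on_device = []
        self._loaded = False
        self.disk_loads = 0

    def load_model(self, keep_cache=False):
        if self._loaded:
            return
        if not self._cached_models:
            self._cached_models = ["from-disk"]  # slow path: read from disk
            self.disk_loads += 1
        # else: the models are already on the host and are "moved" back
        self._on_device = list(self._cached_models)
        if not keep_cache:
            self._cached_models = []  # old behavior: cache is dropped
        self._loaded = True

    def unload_model(self, to_cpu=False):
        if not self._loaded:
            return
        if not self._cached_models:
            # Detach from the pool; either keep a host copy or drop everything.
            self._cached_models = list(self._on_device) if to_cpu else []
        self._on_device = []
        self._loaded = False

t = KeepCacheTranslator()
t.load_model(keep_cache=True)  # slow once; the cache survives the load
t.unload_model(to_cpu=True)    # cache already holds the host copies
t.load_model(keep_cache=True)  # fast: no disk load
assert t.disk_loads == 1
```

Compared with the two-cache design, a single cache is reused for both directions, and the new flag lives on load_model rather than on the constructor.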
Thank you for the reply. :)
Creating a Translator instance with persist_cpu_cache=True avoids the need to call Translator.unload_model(to_cpu=True) followed by Translator.load_model just to ensure that every subsequent call to these methods is optimized.
But maybe my code really favors my use case too heavily.
I will check whether your idea and code work as intended and consider changing the PR.
@minhthuc2502 I changed the PR according to your suggestion. WDYT?
#1645