Comments (8)
Hello, I think the data transfer from host to device is more optimized than from device to host. Additionally, the time for the model transfer also includes the time to allocate the data. As far as I know, allocation on the CPU takes more time than on the GPU, so it is not surprising that the unloading process takes longer.
I found some tests here
from ctranslate2.
@minhthuc2502 thanks for the reply.
Since posting the issue I studied the code a bit and found something that can solve the issue in my use case.
I noticed that every call to Translator.load_model or Translator.unload_model performs _cached_models.clear().
If I understand correctly, a non-empty _cached_models is what makes loading the translator to the host quick.
I thought that if clearing _cached_models were made optional (for example via an additional clear_cache argument), the following would become possible:
- During initialization: Translator.unload_model(to_cpu=True). This will still be slow, which is OK.
- During a request, before inference: Translator.load_model(clear_cache=False). This will be optimized because _cached_models is non-empty, and it will not clear _cached_models.
- During a request, after inference: Translator.unload_model(clear_cache=False). This will unload the model fast and keep the cache, so the next load is fast too.
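The intended semantics can be sketched with a toy Python model. Everything here is hypothetical: the clear_cache argument is the one proposed above, _cached_models stands in for the real C++ member, and the lists only simulate where model copies live; this is not the real ctranslate2 API.

```python
class ToyTranslator:
    """Toy sketch of the proposed clear_cache semantics (not the real API)."""

    def __init__(self):
        # The constructor loads the model straight to the device,
        # like the real Translator; nothing is cached on the host yet.
        self._cached_models = []               # stands in for the C++ _cached_models
        self._on_device = ["model-from-disk"]
        self.last_load_was_slow = None

    def load_model(self, clear_cache=True):
        if self._cached_models:
            self._on_device = list(self._cached_models)  # fast path: reuse the cache
            self.last_load_was_slow = False
        else:
            self._on_device = ["model-from-disk"]        # slow path: read from disk
            self.last_load_was_slow = True
        if clear_cache:
            self._cached_models = []

    def unload_model(self, to_cpu=False, clear_cache=True):
        if to_cpu:
            self._cached_models = list(self._on_device)  # keep a host copy
        self._on_device = []
        if clear_cache and not to_cpu:
            self._cached_models = []

# The proposed flow: one slow cycle during initialization, fast ever after.
t = ToyTranslator()
t.unload_model(to_cpu=True)        # populates the cache
t.load_model(clear_cache=False)    # fast, and the cache survives
assert t.last_load_was_slow is False
t.unload_model(clear_cache=False)  # fast, cache survives
t.load_model(clear_cache=False)    # fast again
assert t.last_load_was_slow is False
```

The point of the sketch is that once clear_cache=False keeps the cache alive across an unload, every subsequent load can take the fast path.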
Here are the relevant lines in the code:
CTranslate2/python/cpp/translator.cc
Line 299 in bfa0cb3
If I understood the code correctly, it doesn't matter that the transfer from host to device is less optimized, since it will happen only once during initialization; after that we won't need to transfer the model again because it will already exist in _cached_models.
Yes, in your use case you can keep the unloaded model in the cache so you do not have to pay the full unload cost every time in the future. Only loading the model is needed.
It was a bit tougher than I thought, but I think I got it working.
I had to write a new method to clone the cached models instead of moving them.
So now it works in the following way:
load_model:
- loads the models to GPU
- moves the models to CPU
- clones the models while they are on CPU
- moves the clones to GPU
unload_model:
- never moves the models
- if called with to_cpu=True, just clears the cached models on GPU; otherwise clears both the CPU and GPU caches
It still works a bit funny though. These are the steps I now need to do in order to initialize the model:
translator = Translator(model_path, "cuda", compute_type)  # takes ~6 seconds
translator.unload_model(to_cpu=True)  # takes a few ms
translator.load_model()  # takes ~14 seconds
And from here on it works as I intended, no matter how many times I do it (which is what I wanted 😁):
translator.unload_model(to_cpu=True)  # takes a few ms
translator.load_model()  # takes ~1.5 seconds
And it unloads fully when I call translator.unload_model().
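This clone-based flow can be sketched as a toy Python model. The names (_cpu_cached_models, the dict standing in for model weights, the disk_loads counter) are illustrative only; the counter just tracks how often the loader would touch disk.

```python
import copy

class TwoCacheTranslator:
    """Toy sketch of the clone-based design above (not the real ctranslate2 API)."""

    def __init__(self):
        # The constructor behaves like the stock Translator: the model goes
        # straight to the device, nothing is cached on the host yet.
        self._cpu_cached_models = []
        self._on_device = [{"weights": "from-disk"}]
        self.disk_loads = 1

    def unload_model(self, to_cpu=False):
        # Never moves models; only drops references.
        self._on_device = []
        if not to_cpu:
            self._cpu_cached_models = []  # full unload drops the host cache too

    def load_model(self):
        if not self._cpu_cached_models:
            # Slow path: load from disk twice, once to populate the host
            # cache and once for the device copies (mirrors the ~14 s step).
            self._cpu_cached_models = [{"weights": "from-disk"}]
            self._on_device = [{"weights": "from-disk"}]
            self.disk_loads += 2
        else:
            # Fast path: clone the host copies and move the clones to device.
            self._on_device = copy.deepcopy(self._cpu_cached_models)

t = TwoCacheTranslator()     # ~6 s in the real setup
t.unload_model(to_cpu=True)  # a few ms: drops the device copy
t.load_model()               # ~14 s, once: populates the host cache
for _ in range(3):           # every later cycle clones the host cache
    t.unload_model(to_cpu=True)
    t.load_model()           # ~1.5 s: no disk I/O
assert t.disk_loads == 3
```

The sketch shows why the third initialization step is slow exactly once: after it, the host cache is never emptied by unload_model(to_cpu=True), so every later load is a clone.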
I'm a bit rusty with my C++, so the code might not be suitable for everyone, but these are the changes I introduced in the module python/cpp/translator.cc:
  void unload_model(const bool to_cpu) {
    if (to_cpu && _device == Device::CPU)
      return;

    // Do not unload the model if some batches are still being processed.
    if (_pool->num_active_batches() > 0)
      return;

    // If the lock is not acquired immediately it means the model is being used
    // in another thread and we can't unload it at this time.
    std::unique_lock lock(_mutex, std::try_to_lock);
    if (!lock || !_model_is_loaded)
      return;

    _gpu_cached_models = _pool->detach_models();
    if (to_cpu) {
      // Keep the CPU cache so the next load_model() can clone from it;
      // only drop the detached GPU copies.
      _gpu_cached_models.clear();
    } else {
      // Full unload: drop both the CPU and GPU caches.
      _cpu_cached_models.clear();
      _gpu_cached_models.clear();
    }

    // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
    if (_device == Device::CUDA)
      _pool->clear_cache();

    _model_is_loaded = false;
  }

  void load_model() {
    std::unique_lock lock(_mutex);
    if (_model_is_loaded)
      return;

    if (_cpu_cached_models.empty()) {
      // First load: read the model from disk twice, once to populate the
      // CPU cache and once for the copies that go to the device.
      _cpu_cached_models = _model_loader.load();
      move_cached_models(Device::CPU, std::vector<int>(_cpu_cached_models.size(), 0), _cpu_cached_models);
      _gpu_cached_models = _model_loader.load();
    } else {
      // Subsequent loads: clone the CPU-cached models onto the device.
      _gpu_cached_models = copy_cached_models(_device, _device_index, _num_replicas_per_device);
    }

    _pool->set_models(_gpu_cached_models);
    _gpu_cached_models.clear();
    _model_is_loaded = true;
  }

private:
  const Device _device;
  const std::vector<int>& _device_index;
  const size_t _num_replicas_per_device;
  std::vector<std::shared_ptr<const models::Model>> _cpu_cached_models;
  std::vector<std::shared_ptr<const models::Model>> _gpu_cached_models;
  bool _model_is_loaded;

  // Use a shared mutex to protect the model state (loaded/unloaded).
  // Multiple threads can read the model at the same time, but a single thread can change
  // the model state (e.g. load or unload the model).
  std::shared_mutex _mutex;

  void assert_model_is_ready() const {
    if (!_model_is_loaded)
      throw std::runtime_error("The model for this translator was unloaded");
  }

  void move_cached_models(Device device,
                          const std::vector<int>& device_index,
                          const std::vector<std::shared_ptr<const models::Model>>& cached_models,
                          size_t num_models_per_device = 1) {
    for (size_t i = 0; i < cached_models.size(); ++i) {
      // The models are held through const pointers; cast away constness to move them.
      auto& model = const_cast<models::Model&>(*cached_models[i]);
      model.set_device(device, device_index[i / num_models_per_device]);
    }
  }

  std::vector<std::shared_ptr<const models::Model>> copy_cached_models(Device device,
                                                                       const std::vector<int>& device_index,
                                                                       size_t num_models_per_device = 1) {
    std::vector<std::shared_ptr<const models::Model>> copied_models;
    for (size_t i = 0; i < _cpu_cached_models.size(); ++i) {
      auto& model = const_cast<models::Model&>(*_cpu_cached_models[i]);
      auto copied_model = model.copy_to(device, device_index[i / num_models_per_device]);
      copied_models.push_back(copied_model);
    }
    return copied_models;
  }
};
I would be happy if you could suggest how I can avoid having to call translator.load_model() directly after invoking the Translator constructor. I couldn't figure out which code is executed when the constructor is called, or how to make it perform the same actions as my updated load_model method.
Also, if you have any other suggestions I would be happy to hear them 😁.
I improved the code quality and also implemented it in a way that preserves the old behavior by adding a persistent_cpu_cache flag.
I opened a PR for this change: #1645
Hello, I think this PR is specific to your use case only, and the behavior becomes more complex. The parameter persist_cpu_cache is not relevant when a Translator is created; it makes more sense when loading or unloading a model. As far as I understand, you can just avoid clearing the cache when loading the model, something like this:
void load_model(bool keep_cache) {
  std::unique_lock lock(_mutex);
  if (_model_is_loaded)
    return;

  if (_cached_models.empty()) {
    _cached_models = _model_loader.load();
  } else {
    move_cached_models(_device, _device_index, _num_replicas_per_device);
  }

  _pool->set_models(_cached_models);
  if (!keep_cache)
    _cached_models.clear();
  _model_is_loaded = true;
}

void unload_model(const bool to_cpu) {
  if (to_cpu && _device == Device::CPU)
    return;

  // Do not unload the model if some batches are still being processed.
  if (_pool->num_active_batches() > 0)
    return;

  // If the lock is not acquired immediately it means the model is being used
  // in another thread and we can't unload it at this time.
  std::unique_lock lock(_mutex, std::try_to_lock);
  if (!lock || !_model_is_loaded)
    return;

  if (_cached_models.empty()) {
    _cached_models = _pool->detach_models();
    if (to_cpu)
      move_cached_models(Device::CPU, std::vector<int>(_cached_models.size(), 0));
    else
      _cached_models.clear();
  }

  // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
  if (_device == Device::CUDA)
    _pool->clear_cache();

  _model_is_loaded = false;
}
I think you can keep the changes specific to your own application, and if there is real interest in frequent model loading and unloading in the future, we could consider optimizing it.
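The single-cache keep_cache suggestion can also be mocked in Python to see how it differs. All names here are illustrative (the disk_loads counter in particular is not part of any real API); this is a sketch of the semantics, not an implementation.

```python
class KeepCacheTranslator:
    """Toy sketch of the single-cache keep_cache idea (not the real API)."""

    def __init__(self):
        self._cached_models = []
        self._on_device = []
        self._loaded = False
        self.disk_loads = 0

    def load_model(self, keep_cache=False):
        if self._loaded:
            return
        if not self._cached_models:
            self._cached_models = ["from-disk"]  # slow path: read from disk
            self.disk_loads += 1
        # else: the models are already on the host and are "moved" back
        self._on_device = list(self._cached_models)
        if not keep_cache:
            self._cached_models = []  # old behavior: cache is dropped
        self._loaded = True

    def unload_model(self, to_cpu=False):
        if not self._loaded:
            return
        if not self._cached_models:
            # Detach from the pool; either keep a host copy or drop everything.
            self._cached_models = list(self._on_device) if to_cpu else []
        self._on_device = []
        self._loaded = False

t = KeepCacheTranslator()
t.load_model(keep_cache=True)  # slow once; the cache survives the load
t.unload_model(to_cpu=True)    # cache already holds the host copies
t.load_model(keep_cache=True)  # fast: no disk load
assert t.disk_loads == 1
```

Compared with the two-cache design, a single cache is reused for both directions, and the new flag lives on load_model rather than on the constructor.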
Thank you for the reply. :)
Creating a Translator instance with persist_cpu_cache=True avoids the need to call Translator.unload_model(to_cpu=True) followed by Translator.load_model just to ensure that every subsequent call to these methods is optimized.
But maybe my code really favors my use case too heavily.
I will check whether your idea and code work as intended and consider changing the PR.
@minhthuc2502 I changed the PR according to your suggestion. WDYT?
#1645