
Comments (8)

minhthuc2502 commented on June 7, 2024

Hello, I think the data transfer from host to device is more optimized than from device to host. Additionally, the model transfer time also includes the time to allocate memory, and as far as I know allocation on the CPU takes more time than on the GPU, so it is not surprising that the unloading process takes longer.
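A minimal timing sketch to check this on your own setup, assuming the standard ctranslate2 Python API (model_path and the compute type are placeholders):

    import time
    import ctranslate2

    model_path = "/path/to/ct2_model"  # placeholder
    translator = ctranslate2.Translator(model_path, device="cuda", compute_type="float16")

    t0 = time.perf_counter()
    translator.unload_model(to_cpu=True)   # device -> host transfer plus CPU allocations
    t1 = time.perf_counter()
    translator.load_model()                # host -> device transfer
    t2 = time.perf_counter()
    print(f"unload to CPU: {t1 - t0:.2f}s, reload to GPU: {t2 - t1:.2f}s")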

I found some tests here


anterart commented on June 7, 2024

@minhthuc2502 thanks for the reply.
Since posting the issue I have studied the code a bit and found something that could solve the issue in my use case.
I noticed that every call to Translator.load_model or Translator.unload_model performs _cached_models.clear().
If I understand correctly, a non-empty _cached_models is what makes loading the translator to the host quick.
I thought that if clearing _cached_models were made optional (maybe by adding an optional clear_cache argument), the following would become possible (sketched in Python after the list):

  1. During initialization: Translator.unload_model(to_cpu=True) - this will still be slow, which is OK.
  2. During a request, before inference: Translator.load_model(clear_cache=False) - this will be fast because _cached_models is non-empty, and it will not clear _cached_models.
  3. During a request, after inference: Translator.unload_model(clear_cache=False) - this will unload the model quickly while keeping the cache, so the next load is fast.
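A Python sketch of that proposed workflow (clear_cache is the hypothetical argument suggested above and does not exist in the current API; model_path and the batch are placeholders):

    import ctranslate2

    model_path = "/path/to/ct2_model"  # placeholder
    translator = ctranslate2.Translator(model_path, device="cuda", compute_type="float16")

    # 1. Initialization: slow unload to CPU, acceptable because it happens once.
    translator.unload_model(to_cpu=True)

    # 2. Per request, before inference: reload quickly from the non-empty cache
    #    without clearing it (hypothetical clear_cache argument).
    translator.load_model(clear_cache=False)
    results = translator.translate_batch([["▁Hello", "▁world"]])  # placeholder batch

    # 3. Per request, after inference: unload quickly and keep the cache.
    translator.unload_model(clear_cache=False)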

Here are the relevant lines in the code:

void unload_model(const bool to_cpu) {

If I understood the code correctly, it doesn't matter that the transfer from host to device is less optimized, since it only happens once during initialization; after that we won't need to transfer the model again because it will exist in _cached_models.


minhthuc2502 commented on June 7, 2024

Yes, in your use case you can keep the unloaded model in the cache to avoid unloading it every time in the future; only loading the model is needed.


anterart commented on June 7, 2024

It was a bit tougher than I thought, but I think I got it working.
I had to write a new method to clone the cached models instead of moving them.
So now it works in the following way:
load_model:

  1. loads the models to the GPU
  2. moves the models to the CPU
  3. clones the models while they are on the CPU
  4. moves the clones to the GPU

unload_model:
It never moves the models.
If it is called with to_cpu=True, it just clears the cached models on the GPU; otherwise it clears both the CPU and GPU caches.

It still works a bit funny though. These are now the steps I need to do in order to initialize the model:

translator = Translator(model_path, "cuda", compute_type) #  (takes ~6 seconds)
translator.unload_model(to_cpu=True) # (takes a few ms)
translator.load_model() # (takes ~14 seconds)

And from here on it works as I intended, no matter how many times I do it (which is what I wanted 😁):

translator.unload_model(to_cpu=True) # (takes a few ms)
translator.load_model() # (takes ~1.5 seconds)

And it unloads fully when I call: translator.unload_model()

I'm a bit rusty with my C++, so the code might not be suitable for everyone, but these are the changes I introduced in the module python/cpp/translator.cc:

      void unload_model(const bool to_cpu) {
        if (to_cpu && _device == Device::CPU)
          return;

        // Do not unload the model if some batches are still being processed.
        if (_pool->num_active_batches() > 0)
          return;

        // If the lock is not acquired immediately it means the model is being used
        // in another thread and we can't unload it at this time.
        std::unique_lock lock(_mutex, std::try_to_lock);
        if (!lock || !_model_is_loaded)
          return;

        // Detach the replicas from the pool and drop the GPU copies so the device
        // memory can be freed; keep the CPU cache only when unloading to the CPU.
        _gpu_cached_models = _pool->detach_models();
        if (!to_cpu)
          _cpu_cached_models.clear();
        _gpu_cached_models.clear();

        // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
        if (_device == Device::CUDA)
          _pool->clear_cache();

        _model_is_loaded = false;
      }

      void load_model() {
        std::unique_lock lock(_mutex);
        if (_model_is_loaded)
          return;

        if (_cpu_cached_models.empty()) {
          // First load: build a CPU-resident cache, then load a second set of
          // replicas that will be attached to the pool on the GPU.
          _cpu_cached_models = _model_loader.load();
          move_cached_models(Device::CPU, std::vector<int>(_cpu_cached_models.size(), 0), _cpu_cached_models);
          _gpu_cached_models = _model_loader.load();
        } else {
          // Subsequent loads: clone the CPU cache and move the clones to the GPU.
          _gpu_cached_models = copy_cached_models(_device, _device_index);
          move_cached_models(_device, _device_index, _gpu_cached_models, _num_replicas_per_device);
        }
        }

        _pool->set_models(_gpu_cached_models);
        _gpu_cached_models.clear();
        _model_is_loaded = true;
      }

    private:
      const Device _device;
      const std::vector<int>& _device_index;
      const size_t _num_replicas_per_device;
      std::vector<std::shared_ptr<const models::Model>> _cpu_cached_models;
      std::vector<std::shared_ptr<const models::Model>> _gpu_cached_models;
      bool _model_is_loaded;

      // Use a shared mutex to protect the model state (loaded/unloaded).
      // Multiple threads can read the model at the same time, but a single thread can change
      // the model state (e.g. load or unload the model).
      std::shared_mutex _mutex;

      void assert_model_is_ready() const {
        if (!_model_is_loaded)
          throw std::runtime_error("The model for this translator was unloaded");
      }

      void move_cached_models(Device device,
                              const std::vector<int>& device_index,
                              const std::vector<std::shared_ptr<const models::Model>>& cached_models,
                              size_t num_models_per_device = 1) {
        for (size_t i = 0; i < cached_models.size(); ++i) {
          auto& model = const_cast<models::Model&>(*cached_models[i]);
          model.set_device(device, device_index[i / num_models_per_device]);
        }
      }

      std::vector<std::shared_ptr<const models::Model>> copy_cached_models(Device device,
                                                                           const std::vector<int>& device_index,
                                                                           size_t num_models_per_device = 1) {
        std::vector<std::shared_ptr<const models::Model>> copied_models;
        for (size_t i = 0; i < _cpu_cached_models.size(); ++i) {
          auto& model = const_cast<models::Model&>(*_cpu_cached_models[i]);
          auto copied_model = model.copy_to(device, device_index[i / num_models_per_device]);
          copied_models.push_back(copied_model);
        }
        return copied_models;
      }
    };

I would be happy if you could suggest how I can avoid calling translator.load_model() directly after invoking the Translator constructor. I couldn't figure out which code is executed when the constructor is called, or how to make it perform the same actions as my updated load_model method.
Also, if you have any other suggestions, I would be happy to hear them 😁.


anterart commented on June 7, 2024

I improved the code quality and also implemented it in a way that preserves the old behavior by adding a persist_cpu_cache flag.
I opened a PR for this change.
#1645


minhthuc2502 commented on June 7, 2024

Hello, I think this PR is specific to your use case only, and the behavior becomes more complex. The parameter persist_cpu_cache is not relevant when a Translator is created; it makes more sense when loading or unloading a model. As far as I understand, you can just avoid clearing the cache when loading the model, something like this:

      void load_model(bool keep_cache) {
        std::unique_lock lock(_mutex);
        if (_model_is_loaded)
          return;

        if (_cached_models.empty()) {
          _cached_models = _model_loader.load();
        } else {
          move_cached_models(_device, _device_index, _num_replicas_per_device);
        }

        _pool->set_models(_cached_models);
        if (!keep_cache)
          _cached_models.clear();
        _model_is_loaded = true;
      }

      void unload_model(const bool to_cpu) {
        if (to_cpu && _device == Device::CPU)
          return;

        // Do not unload the model if some batches are still being processed.
        if (_pool->num_active_batches() > 0)
          return;

        // If the lock is not acquired immediately it means the model is being used
        // in another thread and we can't unload it at this time.
        std::unique_lock lock(_mutex, std::try_to_lock);
        if (!lock || !_model_is_loaded)
          return;
       
        if (_cached_models.empty()) {
          _cached_models = _pool->detach_models();
          if (to_cpu)
            move_cached_models(Device::CPU, std::vector<int>(_cached_models.size(), 0));
          else
            _cached_models.clear();
        }

        // We clear the CUDA allocator cache to further reduce the memory after unloading the model.
        if (_device == Device::CUDA)
          _pool->clear_cache();

        _model_is_loaded = false;
      }

I think you can keep the changes specific to your own application, and if there is real interest in frequent model loading and unloading in the future, we could consider optimizing it.
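From Python, this suggestion might look roughly like the following (keep_cache is a hypothetical binding of the new C++ parameter above, not an existing argument; model_path is a placeholder):

    import ctranslate2

    model_path = "/path/to/ct2_model"  # placeholder
    translator = ctranslate2.Translator(model_path, device="cuda")

    translator.unload_model(to_cpu=True)    # move the replicas to the CPU cache once
    translator.load_model(keep_cache=True)  # reload to the GPU without clearing the cache
    # Later unload/load cycles can then reuse the cached CPU copies.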


anterart commented on June 7, 2024

Thank you for the reply. :)
Creating a Translator instance with persist_cpu_cache=True avoids having to call Translator.unload_model(to_cpu=True) followed by Translator.load_model just to ensure that every subsequent call to these methods is fast.
But maybe my code really favors my use case too heavily.
I will check whether your idea and code work as intended and consider changing the PR.
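For comparison, the PR's approach would look roughly like this (persist_cpu_cache is the flag proposed in the PR, hypothetical until merged; model_path is a placeholder):

    import ctranslate2

    model_path = "/path/to/ct2_model"  # placeholder
    translator = ctranslate2.Translator(model_path, device="cuda", persist_cpu_cache=True)

    # No warm-up calls needed: the CPU cache is kept automatically, so
    # unload/load cycles are fast from the first request onwards.
    translator.unload_model(to_cpu=True)
    translator.load_model()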


anterart commented on June 7, 2024

@minhthuc2502 I changed the PR according to your suggestion. WDYT?
#1645

