
Comments (6)

tomtung commented on August 27, 2024

Hi, first please make sure that you're using the binary compiled from Rust directly instead of the Python wrapper. Currently to keep the Python API ergonomic, a redundant copy of the dataset is kept in memory, which is fine for smaller datasets, but could be problematic for really large ones.

I'm a bit surprised that you couldn't even finish initialization. I added some additional logging to facilitate debugging; could you try installing from the latest source and see how far it gets? You can do so by running cargo install -f --git https://github.com/tomtung/omikuji.git --features cli.

Once initialization passes, you can try the new --train_trees_1_by_1 flag I just added. Additionally, you can consider limiting --n_threads; by default the program utilizes all available cores, but higher parallelism can also cause higher memory usage.
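For reference, a full invocation might look something like the sketch below. The dataset and model paths are placeholders, and other than the two flags discussed above, the option names are assumptions about the train subcommand; adjust to your setup.

```shell
# Hypothetical sketch: train with the new flag on a placeholder dataset.
# --train_trees_1_by_1 avoids training all trees in parallel, lowering peak memory;
# --n_threads caps the worker threads (the default uses every available core).
train_omikuji() {
    omikuji train ./train.txt \
        --model_path ./model \
        --train_trees_1_by_1 \
        --n_threads 8
}
# train_omikuji  # uncomment once the binary is installed
```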

from omikuji.

klimentij commented on August 27, 2024

Thank you for the quick reply!

you're using the binary compiled from Rust directly instead of the Python wrapper

Yes, I use the CLI compiled from source. I use the Python binding just for inference on new data.
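Roughly, my inference path looks like the sketch below. The model path is a placeholder; it assumes the Python wrapper's Model.load and predict, with the input given as a sparse vector of (feature_index, value) pairs.

```python
# Sketch of the inference path (assumes `pip install omikuji`; paths are placeholders).
try:
    import omikuji
except ImportError:  # keep the sketch importable without the package installed
    omikuji = None

def as_sparse_features(pairs):
    """The wrapper takes a sparse vector as (feature_index, value) pairs, sorted by index."""
    return sorted(pairs, key=lambda p: p[0])

features = as_sparse_features([(42, 1.0), (7, 0.5)])

if omikuji is not None:
    model = omikuji.Model.load("./model")    # the slow load happens here
    print(model.predict(features, top_k=5))  # list of (label, score) pairs
```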

I'm a bit surprised that you couldn't even finish initialization

I'm not 100% sure it happened because of Omikuji: I was connected through SSH and the only output I saw was Broken pipe. So it could be some SSH-related timeout.

Yesterday I re-ran the training, this time on a 3.4TB RAM instance, and executed the command as a background bash job (with &). It finished tree initialization, but that took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.

Right now it's been running for 26 hours, and it's at 90% of forest training.

Thank you for the update you've made: I'll reinstall your tool before the next training, try the --train_trees_1_by_1 flag, and play with --n_threads.

Also, just as a suggestion. Do you think it would be possible/feasible to implement checkpoints? Right now the cost of training on my dataset is about $600 with this instance, and if something happens before the model is saved, that's a painful waste of resources.


tomtung commented on August 27, 2024

It finished tree initialization, but it took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.

Yep, you might have noticed that this has already been added. When I get around to it, I could also try improving parallelization in this part; I didn't expect it to take as long as 6 hours.

Do you think it would be possible/feasible to implement checkpoints?

In principle it should be possible, but would require some fairly significant refactoring & redesign. I guess we could also aim for something simpler, e.g. saving the initialized trainer, or saving each tree immediately after it's trained, but that would still require some refactoring. I probably can't get to it at the moment, but you're welcome to take a stab at it :)


klimentij commented on August 27, 2024

One last question regarding usage on large datasets. Yesterday's training process finished successfully in 26 hours, and I'm very happy with the precision@5 I got on my test set.

But there's a new interesting issue: the resulting model consists of 3 trees of 120GB each, and it takes 70 minutes to load the model. In fact, it requires over 624GB of memory to finish loading the model for inference, since it failed to load on a 624GB instance.

I tried loading the model into Python, calling model.densify_weights(0.05), and saving the model, but that didn't seem to reduce the model size (I even ended up with a bigger model for some reason, >140GB per tree).

Also, I understand I can leave only one tree in the model folder, but it's still 120GB, and there's quite a performance drop when I do that (tested only on a smaller 2M dataset).

Is model.densify_weights the right place to start if I want to optimize the model? Do you have any tips and tricks for decreasing memory usage and loading time for inference in production?


tomtung commented on August 27, 2024

Unfortunately, for now I can't really think of any way to further speed up model loading... I guess we could first load the entire files into memory and then parallelize deserializing the individual trees, but that would probably make the memory usage problem even worse.

Calling densify_weights would indeed only increase the model size (often with the benefit of faster prediction).

You could try increasing --linear.weight_threshold during training to prune weights more aggressively, but this might cause a noticeable performance drop. I can also try adding support for pruning trained models if you think that would be useful.
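Concretely, that would be something along these lines. The threshold value is purely illustrative, the paths are placeholders, and --model_path is an assumption about the train subcommand's options.

```shell
# Sketch: retrain, dropping linear weights whose magnitude falls below the threshold.
# The value 0.3 is only illustrative; tune it against held-out precision@k,
# since a higher threshold shrinks the model but can hurt accuracy.
retrain_pruned() {
    omikuji train ./train.txt \
        --model_path ./model-pruned \
        --linear.weight_threshold 0.3
}
# retrain_pruned  # uncomment to run a real training job
```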

I might eventually try supporting a sparsity-inducing loss function like L1-regularized SVM, but that will take time and might be tricky too. (E.g., according to Babbar & Schölkopf 2019, the LibLinear solver underfits, and they suggest using a proximal gradient method instead, which I suspect could be much slower.)

Out of curiosity, could you tell me a bit more about your use case? In particular, do you need to retrain the model regularly? If so, I could try prioritizing speedups to the initialization process, as I assume shaving 6 hours off a 26-hour run would be quite significant.


tomtung commented on August 27, 2024

Closing for now due to inactivity, feel free to re-open if you have more questions.

