
Comments (6)

tomtung commented on August 27, 2024

Hi, first please make sure that you're using the binary compiled from Rust directly instead of the Python wrapper. Currently to keep the Python API ergonomic, a redundant copy of the dataset is kept in memory, which is fine for smaller datasets, but could be problematic for really large ones.

I'm a bit surprised that you couldn't even finish initialization. I added some additional logging to facilitate debugging; could you try installing from the latest source and see how far it gets? You can do so by running cargo install -f --git https://github.com/tomtung/omikuji.git --features cli.

Once initialization passes, you can try the new --train_trees_1_by_1 flag I just added. Additionally, you can consider limiting --n_threads; by default the program utilizes all available cores, but higher parallelism can also cause higher memory usage.
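For reference, a full invocation might look something like the sketch below. The dataset and model paths are placeholders, and other than the two flags discussed above, the option names are assumptions about the train subcommand; adjust to your setup.

```shell
# Hypothetical sketch: train with the new flag on a placeholder dataset.
# --train_trees_1_by_1 avoids training all trees in parallel, lowering peak memory;
# --n_threads caps the worker threads (the default uses every available core).
train_omikuji() {
    omikuji train ./train.txt \
        --model_path ./model \
        --train_trees_1_by_1 \
        --n_threads 8
}
# train_omikuji  # uncomment once the binary is installed
```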

from omikuji.

klimentij commented on August 27, 2024

Thank you for the quick reply!

you're using the binary compiled from Rust directly instead of the Python wrapper

Yes, I use the CLI compiled from source. I use the Python binding just for inference on new data.
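Roughly, my inference path looks like the sketch below. The model path is a placeholder; it assumes the Python wrapper's Model.load and predict, with the input given as a sparse vector of (feature_index, value) pairs.

```python
# Sketch of the inference path (assumes `pip install omikuji`; paths are placeholders).
try:
    import omikuji
except ImportError:  # keep the sketch importable without the package installed
    omikuji = None

def as_sparse_features(pairs):
    """The wrapper takes a sparse vector as (feature_index, value) pairs, sorted by index."""
    return sorted(pairs, key=lambda p: p[0])

features = as_sparse_features([(42, 1.0), (7, 0.5)])

if omikuji is not None:
    model = omikuji.Model.load("./model")    # the slow load happens here
    print(model.predict(features, top_k=5))  # list of (label, score) pairs
```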

I'm a bit surprised that you couldn't even finish initialization

I'm not 100% sure it happened because of Omikuji: I was connected through SSH and the only output I saw was Broken pipe. So it could be some SSH-related timeout.

Yesterday I re-ran the training, this time on a 3.4TB RAM instance, and executed the command as a background bash job (with &). It finished tree initialization, but that took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.

Right now it's been running for 26 hours, and it's at 90% of forest training.

Thank you for the update you've made: I'll reinstall your tool before the next training, try the --train_trees_1_by_1 flag, and play with --n_threads.

Also, just as a suggestion. Do you think it would be possible/feasible to implement checkpoints? Right now the cost of training on my dataset is about $600 with this instance, and if something happens before the model is saved, that's a painful waste of resources.


tomtung commented on August 27, 2024

It finished tree initialization, but it took 6 hours. By the way, it would be awesome to have a progress bar at this stage as well.

Yep, you might have noticed that this has already been added. When I get around to it, I could also try improving parallelization in this part; I didn't expect it to take as long as 6 hours.

Do you think it would be possible/feasible to implement checkpoints?

In principle it should be possible, but would require some fairly significant refactoring & redesign. I guess we could also aim for something simpler, e.g. saving the initialized trainer, or saving each tree immediately after it's trained, but that would still require some refactoring. I probably can't get to it at the moment, but you're welcome to take a stab at it :)


klimentij commented on August 27, 2024

One last question regarding usage on large datasets. Yesterday's training process finished successfully in 26 hours, and I'm very happy with the precision@5 I got on my test set.

But there's a new interesting issue: the resulting model consists of 3 trees of 120GB each, and it takes 70 minutes to load the model. In fact, it requires over 624GB of memory to finish loading the model for inference, since it failed to load on a 624GB instance.

I tried loading the model into Python, calling model.densify_weights(0.05), and saving the model, but that didn't seem to reduce the model size (I even ended up with a bigger model for some reason, >140GB per tree).

Also, I understand I can leave only one tree in the model folder, but it's still 120GB, and there's quite a performance drop when I do that (tested only on a smaller 2M dataset).

Is model.densify_weights the right place to start if I want to optimize the model? Do you have any tips and tricks for decreasing memory usage and loading time for inference in production?


tomtung commented on August 27, 2024

Unfortunately, for now I can't really think of any way to further speed up model loading... I guess we could first load the entire files into memory and then parallelize deserializing the individual trees, but that would probably make the memory usage problem even worse.

Calling densify_weights would indeed only increase the model size (often with the benefit of faster prediction).

You could try increasing --linear.weight_threshold during training to prune weights more aggressively, but this might cause a noticeable performance drop. I can also try adding support for pruning trained models if you think that would be useful.
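Concretely, that would be something along these lines. The threshold value is purely illustrative, the paths are placeholders, and --model_path is an assumption about the train subcommand's options.

```shell
# Sketch: retrain, dropping linear weights whose magnitude falls below the threshold.
# The value 0.3 is only illustrative; tune it against held-out precision@k,
# since a higher threshold shrinks the model but can hurt accuracy.
retrain_pruned() {
    omikuji train ./train.txt \
        --model_path ./model-pruned \
        --linear.weight_threshold 0.3
}
# retrain_pruned  # uncomment to run a real training job
```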

I might eventually try supporting a sparsity-inducing loss function like L1-regularized SVM, but that will take time and might be tricky too. (E.g., according to Babbar & Schölkopf 2019, the LibLinear solver underfits, and they suggest using a proximal gradient method instead, which I suspect could be much slower.)

Out of curiosity, could you tell me a bit more about your use case? In particular, do you need to retrain the model regularly? If so, I could try prioritizing speedups to the initialization process, as I assume shaving 6 hours off a 26-hour run would be quite significant.


tomtung commented on August 27, 2024

Closing for now due to inactivity, feel free to re-open if you have more questions.

