Comments (8)
You can use a smaller batch size to train the model; maybe batch-size=32 is too big.
from gkt.
I set batch-size=16, but I got this log starting from the 59th batch:
...
batch idx: 57 loss kt: 0.6516776084899902 auc: 0.5851154181184669 acc: 0.6264667535853976 cost time: 6.809619903564453
batch idx: 58 loss kt: 0.6162782907485962 auc: 0.5378101714226197 acc: 0.7085820127598271 cost time: 11.871150255203247
batch idx: 59 loss kt: nan auc: -1 acc: -1 cost time: 5.788291692733765
batch idx: 60 loss kt: nan auc: -1 acc: -1 cost time: 4.6798787117004395
...
batch idx: 112 loss kt: nan auc: -1 acc: -1 cost time: 1.3512508869171143
batch idx: 113 loss kt: nan auc: -1 acc: -1 cost time: 6.595600843429565
(Looks like it's never going to end.)
Is there something wrong with the code? I am still working on this.
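For anyone debugging the same thing: once the loss is nan it stays nan for every later batch, and auc/acc fall back to the -1 sentinel, so the useful information is the index of the *first* bad batch. A minimal sketch (not part of the repo) for locating it from the logged losses:

```python
import math

def first_nan_batch(losses):
    """Return the index of the first nan loss in a sequence, or -1 if none.

    `losses` stands in for the per-batch `loss kt` values in the log above.
    """
    for idx, loss in enumerate(losses):
        if math.isnan(loss):
            return idx
    return -1

# Mirroring the log: finite losses for batches 57-58, nan from batch 59 on.
log_losses = [0.6516776, 0.6162783, float("nan"), float("nan")]
print(first_nan_batch(log_losses))  # prints 2
```

Inspecting the inputs and gradients of that one batch (e.g. under `torch.autograd.set_detect_anomaly(True)`) is usually much faster than staring at the full run.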
I want to add your WeChat, because I also encountered this problem.
Great! It's "waves99".
The kt loss value usually becomes nan because:
- the learning rate is too big, causing the gradients to explode
- there is something wrong with the training data
Did you run this code on the dataset we provide?
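To make the learning-rate point concrete: with plain gradient descent, a step size that is too large makes every update overshoot the minimum, so the iterate grows geometrically until it overflows to inf and then nan. A toy illustration on f(x) = x², not the GKT code:

```python
def gradient_descent(lr, steps=100, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; the gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2*lr)
    return x

print(abs(gradient_descent(lr=0.1)))  # shrinks toward 0
print(abs(gradient_descent(lr=1.5)))  # |x| doubles every step and blows up
```

The same mechanism in a deep model is what gradient clipping (or simply lowering `lr`) is meant to prevent.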
Yes, I did use the dataset you provided.
And these are my hyper-parameters (the learning rate is 0.01):
nohup: ignoring input
Namespace(attn_dim=32, batch_size=16, bias=True, binary=True, cuda=True, data_dir='data', data_file='skill_builder_data.csv', dkt_graph='dkt_graph.txt', dkt_graph_dir='dkt-graph', dropout=0, edge_types=2, emb_dim=32, epochs=50, factor=True, gamma=0.5, graph_save_dir='graphs', graph_type='Dense', hard=False, hid_dim=32, load_dir='', lr=0.01, lr_decay=200, model='GKT', no_cuda=False, no_factor=False, prior=False, result_type=12, save_dir='logs', seed=42, shuffle=True, temp=0.5, test=False, test_model_dir='logs/expDKT', train_ratio=0.6, vae_decoder_dim=32, vae_encoder_dim=32, val_ratio=0.2, var=1)
max seq_len: 6157
student num: 4047
feature_dim: 246
question_dim: 123
train_size: 2428 val_size: 809 test_size: 810
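On the "something wrong in the training data" hypothesis: it may be worth scanning the CSV for rows that could produce nan, e.g. missing skill ids or answers outside {0, 1}. A rough sketch with pandas; the column names `skill_id` and `correct` are guesses for the ASSISTments skill-builder file and may need adjusting:

```python
import pandas as pd

def dataset_problems(df, skill_col="skill_id", answer_col="correct"):
    """Count rows that could poison training: missing skill ids, missing
    answers, and answers that are not 0/1. Column names are assumptions."""
    return {
        "missing_skill": int(df[skill_col].isna().sum()),
        "missing_answer": int(df[answer_col].isna().sum()),
        "non_binary_answer": int((~df[answer_col].dropna().isin([0, 1])).sum()),
    }

# Tiny demo frame; for the real data, load it with pd.read_csv(...) instead.
demo = pd.DataFrame({"skill_id": [1, None, 2], "correct": [0, 1, 2]})
print(dataset_problems(demo))  # {'missing_skill': 1, 'missing_answer': 0, 'non_binary_answer': 1}
```

If any count is nonzero, dropping or fixing those rows before training would rule the data out as the cause.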
That's strange. I've run the code with the provided dataset, and it ran smoothly. Maybe you can check the versions of your Python dependencies?
Here are my dependencies:
torch 1.7.0+cu110
scikit-learn 1.0.1
scipy 1.7.3
pandas 1.2.2
numpy 1.18.5
And yours:
pip3 install numpy==1.17.4 pandas==1.1.2 scipy==1.5.2 scikit-learn==0.23.2 torch==1.4.0
By the way, GKT works fine with the dataset "assistment_test15":
Namespace(attn_dim=32, batch_size=64, bias=True, binary=True, cuda=True, data_dir='data', data_file='assistment_test15.csv', dkt_graph='dkt_graph.txt', dkt_graph_dir='dkt-graph', dropout=0, edge_types=2, emb_dim=32, epochs=50, factor=True, gamma=0.5, graph_save_dir='graphs', graph_type='Dense', hard=False, hid_dim=32, load_dir='', lr=0.001, lr_decay=200, model='GKT', no_cuda=False, no_factor=False, prior=False, result_type=12, save_dir='logs', seed=42, shuffle=True, temp=0.5, test=False, test_model_dir='logs/expDKT', train_ratio=0.6, vae_decoder_dim=32, vae_encoder_dim=32, val_ratio=0.2, var=1)
max seq_len: 368
student num: 15
feature_dim: 148
question_dim: 74
train_size: 9 val_size: 3 test_size: 3
……
……
Best Epoch: 0047
--------------------------------
--------Testing-----------------
--------------------------------
loss_test: 0.6181263328 auc_test: 0.5657202216 acc_test: 0.6813819578
Looks like there's something wrong with the dataset "skill_builder".
Maybe you can use my Python library versions, especially numpy, pandas and scipy.
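To compare environments quickly, something like this (plain stdlib, Python ≥ 3.8) reports every installed version that differs from the pins given in the earlier reply:

```python
from importlib.metadata import version, PackageNotFoundError

# The versions reported as working in the maintainer's pip3 install line above.
PINNED = {
    "numpy": "1.17.4",
    "pandas": "1.1.2",
    "scipy": "1.5.2",
    "scikit-learn": "0.23.2",
    "torch": "1.4.0",
}

def mismatches(pinned):
    """Return {package: (installed, expected)} for every package that is
    missing or whose installed version differs from the pin."""
    out = {}
    for pkg, want in pinned.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None
        if have != want:
            out[pkg] = (have, want)
    return out

print(mismatches(PINNED))  # empty dict means the environment matches the pins
```

Given the numpy 1.18.5 vs 1.17.4 and torch 1.7.0 vs 1.4.0 gaps above, this would flag several packages in the reporter's environment.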