- Temperature (T): the logits are divided by T before softmax; a larger T gives a softer distribution
Z = f(X)
O = softmax(Z / T)
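As a rough illustration of the temperature scaling above, here is a minimal standard-library sketch. The function name `softmax_with_temperature` and the example logits are hypothetical and are not taken from this repository's code.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T), computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, for illustration only.
logits = [2.0, 1.0, 0.1]
probs_t1 = softmax_with_temperature(logits, T=1.0)  # sharper distribution
probs_t4 = softmax_with_temperature(logits, T=4.0)  # softer distribution
```

With T = 4 the largest probability is smaller than with T = 1, while the ranking of the classes stays the same.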
- Alpha: weight between the soft (teacher) loss and the hard (label) loss
Given: data X, labels Y
---------------------------------------------------------
Z_teacher = teacher(X)
Z_student = student(X)
O_teacher = softmax(Z_teacher / T)
O_student = softmax(Z_student / T)
---------------------------------------------------------
# soft loss: -sum(O_teacher * log(O_student))
loss1 = CrossEntropy(O_student, O_teacher)
# hard loss: -sum(Y * log(softmax(Z_student))), computed at T = 1
loss2 = CrossEntropy(softmax(Z_student / 1), Y)
---------------------------------------------------------
loss = alpha * (T ** 2) * loss1 + (1 - alpha) * loss2
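The combined loss above can be sketched for a single example in plain Python. This is an illustrative sketch under stated assumptions, not the code used by the training scripts: the helper names (`softmax`, `cross_entropy`, `distillation_loss`), the default values of `T` and `alpha`, and the example logits are all hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """softmax(logits / T), numerically stable."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred, target, eps=1e-12):
    """-sum(target * log(pred)); target is a probability vector."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def distillation_loss(z_student, z_teacher, y_onehot, T=4.0, alpha=0.9):
    """alpha * T^2 * soft loss + (1 - alpha) * hard loss (T and alpha are assumed defaults)."""
    o_teacher = softmax(z_teacher, T)
    o_student = softmax(z_student, T)
    loss1 = cross_entropy(o_student, o_teacher)               # soft targets at temperature T
    loss2 = cross_entropy(softmax(z_student, 1.0), y_onehot)  # hard labels at T = 1
    return alpha * (T ** 2) * loss1 + (1 - alpha) * loss2

# Hypothetical logits and a one-hot label for a 3-class problem.
z_teacher = [5.0, 1.0, -2.0]
z_student = [2.0, 1.5, 0.0]
y = [1.0, 0.0, 0.0]
loss = distillation_loss(z_student, z_teacher, y, T=4.0, alpha=0.9)
```

The `T ** 2` factor is commonly included because gradients from the soft targets scale roughly as 1 / T^2, so multiplying by T^2 keeps the two terms on a comparable scale when T changes.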
- Train Teacher
python teacher_train.py --batch-size 64 --lr 0.01 --num-epochs 10 --device 'cuda:0'
- Train Student