An unofficial PyTorch implementation of Attention Augmented Convolutional Networks
The network structure is shown below:
The results reported in the paper are shown below:
The paper states "a minimum of 20 dimensions per head for the keys", so the per-head key dimension depends on dk and the number of heads.
Before adding this module, the total parameter count was 35.8M; with it, the total is 36.3M. This seems consistent, but it has not been verified against the paper.
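As a rough sketch of the core idea, the layer splits its output channels between a standard convolution and multi-head self-attention over the feature map, then concatenates the two. The module below is a hypothetical, minimal illustration (class name `AugmentedConv2d` and its signature are assumptions, not this repo's API), and it omits the relative positional embeddings used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AugmentedConv2d(nn.Module):
    """Minimal sketch: (out_channels - dv) channels from a regular conv,
    dv channels from multi-head self-attention, concatenated.
    dk / Nh and dv / Nh are the per-head key and value dimensions.
    NOTE: relative positional embeddings from the paper are omitted here.
    """

    def __init__(self, in_channels, out_channels, kernel_size, dk, dv, Nh):
        super().__init__()
        assert dk % Nh == 0 and dv % Nh == 0, "dk and dv must divide by Nh"
        self.dk, self.dv, self.Nh = dk, dv, Nh
        self.conv = nn.Conv2d(in_channels, out_channels - dv,
                              kernel_size, padding=kernel_size // 2)
        # A single 1x1 conv produces queries, keys, and values.
        self.qkv = nn.Conv2d(in_channels, 2 * dk + dv, 1)
        self.attn_out = nn.Conv2d(dv, dv, 1)

    def forward(self, x):
        B, _, H, W = x.shape
        conv_out = self.conv(x)
        q, k, v = torch.split(self.qkv(x), [self.dk, self.dk, self.dv], dim=1)

        # Reshape (B, d, H, W) -> (B, Nh, d // Nh, H * W) for per-head attention.
        def split_heads(t, d):
            return t.reshape(B, self.Nh, d // self.Nh, H * W)

        q = split_heads(q, self.dk) * (self.dk // self.Nh) ** -0.5
        k = split_heads(k, self.dk)
        v = split_heads(v, self.dv)
        logits = torch.einsum('bhdn,bhdm->bhnm', q, k)      # (B, Nh, HW, HW)
        weights = F.softmax(logits, dim=-1)
        attn = torch.einsum('bhnm,bhdm->bhdn', weights, v)  # (B, Nh, dv/Nh, HW)
        attn = attn.reshape(B, self.dv, H, W)
        return torch.cat([conv_out, self.attn_out(attn)], dim=1)
```

For example, `AugmentedConv2d(3, 16, 3, dk=8, dv=4, Nh=2)` applied to a `(2, 3, 8, 8)` input yields a `(2, 16, 8, 8)` output, with 12 channels from the convolution and 4 from attention. Per the paper's guideline, a real configuration would want dk / Nh of at least 20.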