Triplet Attention

Authors - Diganta Misra ^1†, Trikay Nalamada ^1,2†, Ajay Uppili Arasanipalai ^1,3†, Qibin Hou ⁴

1 - Landskape 2. IIT Guwahati 3. University of Illinois, Urbana Champaign 4. National University of Singapore

† - Denotes Equal Contribution

Abstract - Benefiting from the capability of building inter-dependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently. In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies by the rotation operation followed by residual transformations and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient and can be easily plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks including image classification on ImageNet-1k and object detection on MSCOCO and PASCAL VOC datasets. Furthermore, we provide extensive in-sight into the performance of triplet attention by visually inspecting the GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition on the importance of capturing dependencies across dimensions when computing attention weights.

Figure 1. (a). Squeeze Excitation Block. (b). Convolution Block Attention Module (CBAM) (Note - GMP denotes - Global Max Pooling). (c). Global Context (GC) block. (d). Triplet Attention (ours).

Figure 2. GradCAM and GradCAM++ comparisons for ResNet-50 based on sample images from ImageNet dataset.

For generating GradCAM and GradCAM++ results, please follow the code on this repository.

Pretrained Models:

ImageNet:

Model	Parameters	GFLOPs	Top-1 Error	Top-5 Error	Weights
ResNet-18 + Triplet Attention (k = 3)	11.69 M	1.823	29.67%	10.42%	Google Drive
ResNet-18 + Triplet Attention (k = 7)	11.69 M	1.825	28.91%	10.01%	Google Drive
ResNet-50 + Triplet Attention (k = 7)	25.56 M	4.169	22.52%	6.326%	Google Drive
ResNet-50 + Triplet Attention (k = 3)	25.56 M	4.131	23.88%	6.938%	Google Drive
MobileNet v2 + Triplet Attention (k = 3)	3.506 M	0.322	27.38%	9.23%	Google Drive
MobileNet v2 + Triplet Attention (k = 7)	3.51 M	0.327	28.01%	9.516%	Google Drive

MS-COCO:

All models are trained with 1x learning schedule.

Detectron2:

Object Detection:

Backbone	Detectors	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	Weights
ResNet-50 + Triplet Attention (k = 7)	Faster R-CNN	39.2	60.8	42.3	23.3	42.5	50.3	Google Drive
ResNet-50 + Triplet Attention (k = 7)	RetinaNet	38.2	58.5	40.4	23.4	42.1	48.7	Google Drive
ResNet-50 + Triplet Attention (k = 7)	Mask RCNN	39.8	61.6	42.8	24.3	42.9	51.3	Google Drive

Instance Segmentation

Backbone	Detectors	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	Weights
ResNet-50 + Triplet Attention (k = 7)	Mask RCNN	35.8	57.8	38.1	18	38.1	50.7	Google Drive

Person Keypoint Detection

Backbone	Detectors	AP	AP₅₀	AP₇₅	AP_M	AP_L	Weights
ResNet-50 + Triplet Attention (k = 7)	Keypoint RCNN	64.7	85.9	70.4	60.3	73.1	Google Drive

BBox AP results using Keypoint RCNN:

Backbone	Detectors	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	Weights
ResNet-50 + Triplet Attention (k = 7)	Keypoint RCNN	54.8	83.1	59.9	37.4	61.9	72.1	Google Drive

MMDetection:

Object Detection:

Backbone	Detectors	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	Weights
ResNet-50 + Triplet Attention (k = 7)	Faster R-CNN	39.3	60.8	42.7	23.4	42.8	50.3	Google Drive
ResNet-50 + Triplet Attention (k = 7)	RetinaNet	37.6	57.3	40.0	21.7	41.1	49.7	Google Drive

Cite our work:

@misc{misra2020rotate,
      title={Rotate to Attend: Convolutional Triplet Attention Module}, 
      author={Diganta Misra and Trikay Nalamada and Ajay Uppili Arasanipalai and Qibin Hou},
      year={2020},
      eprint={2010.03045},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

mymuli / triplet-attention Goto Github PK