I have uploaded the Occlusion- and Pose-RAF-DB lists; you can find them in RAF_DB_dir. Thank you for your patience.

Our manuscript has been accepted by Transactions on Image Processing as a REGULAR paper! link

Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition

                              Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao
                          Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
                                     {kai.wang, xj.peng, db.meng, yu.qiao}@siat.ac.cn

Abstract

Occlusion and pose variations, which can change facial appearance significantly, are two major obstacles for automatic Facial Expression Recognition (FER). Though automatic FER has made substantial progress in the past few decades, the occlusion-robust and pose-invariant issues of FER have received relatively less attention, especially in real-world scenarios. This paper addresses the real-world pose and occlusion robust FER problem with three-fold contributions. First, to stimulate research on FER under real-world occlusions and variant poses, we build several in-the-wild facial expression datasets with manual annotations for the community. Second, we propose a novel Region Attention Network (RAN) to adaptively capture the importance of facial regions for occlusion and pose variant FER. The RAN aggregates and embeds a varied number of region features produced by a backbone convolutional neural network into a compact fixed-length representation. Last, inspired by the fact that facial expressions are mainly defined by facial action units, we propose a region biased loss to encourage high attention weights for the most important regions. We examine our RAN and region biased loss on both our built test datasets and four popular datasets: FERPlus, AffectNet, RAF-DB, and SFEW. Extensive experiments show that our RAN and region biased loss largely improve the performance of FER with occlusion and variant pose. Our methods also achieve state-of-the-art results on FERPlus, AffectNet, RAF-DB, and SFEW.

Region Attention Network

We propose the Region Attention Network (RAN) to capture the importance of facial regions for occlusion and pose robust FER. The RAN comprises a feature extraction module, a self-attention module, and a relation-attention module. The two attention modules work in two stages. In the first stage, the self-attention module coarsely calculates the importance of each region by applying an FC layer to that region's own feature. In the second stage, the relation-attention module seeks more accurate attention weights by modeling the relation between each region feature and the aggregated content representation from the first stage. In other words, the first stage learns coarse attention weights and the second refines them with global context. Given a number of facial regions, our RAN learns attention weights for each region in an end-to-end manner and aggregates their CNN-based features into a compact fixed-length representation. Besides, the RAN model has two auxiliary effects on the face images. On one hand, cropping regions enlarges the training data, which is important for challenging samples that are otherwise insufficient. On the other hand, rescaling the regions to the size of the original images highlights fine-grained facial features.
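The two-stage aggregation described above can be sketched in plain Python. This is only an illustrative sketch, not the repository's implementation: the function names, the sigmoid activation, the use of a single weight vector in place of each FC layer, and the concatenation of each region feature with the first-stage summary are all assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_mean(vectors, weights):
    """Weighted average of equal-length vectors (weights need not sum to 1)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dim)]


def region_attention(features, q_self, q_rel):
    """Aggregate any number of region features into one fixed-length vector.

    features: list of region feature vectors, each of length D
    q_self:   hypothetical FC weights of length D (stage 1)
    q_rel:    hypothetical FC weights of length 2 * D (stage 2)
    """
    # Stage 1 (self-attention): a coarse weight from each region's own feature.
    mu = [sigmoid(dot(f, q_self)) for f in features]
    summary = weighted_mean(features, mu)

    # Stage 2 (relation-attention): relate each region to the stage-1 summary.
    # Concatenation is an assumed way of modeling that relation.
    joint = [f + summary for f in features]
    nu = [sigmoid(dot(j, q_rel)) for j in joint]

    # Combine both weights; the output length is 2 * D for any region count.
    combined = [m * n for m, n in zip(mu, nu)]
    return weighted_mean(joint, combined), mu, nu
```

The point of the sketch is that the output length depends only on the feature dimension, so a face with three crops and a face with six crops yield representations of the same size.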

Region Biased Loss

Inspired by the observation that different facial expressions are mainly defined by different facial regions, we place a straightforward constraint on the attention weights of the self-attention module, i.e. the region biased loss (RB-Loss). This constraint enforces that one of the attention weights from the facial crops should be larger than that of the original face image by a margin. Formally, the RB-Loss is defined as

$$\mathcal{L}_{RB} = \max\{0,\ \alpha - (\mu_{max} - \mu_{0})\},$$

where $\alpha$ is a hyper-parameter that serves as a margin, $\mu_{0}$ is the attention weight of the copy face image, and $\mu_{max}$ denotes the maximum weight of all facial crops.
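As a minimal sketch of this margin constraint, the loss can be written as a hinge on the gap between the largest crop weight and the weight of the original face. The function name, argument names, and the default margin value are hypothetical choices for illustration, not values taken from this repository.

```python
def region_biased_loss(weight_original, crop_weights, margin=0.02):
    """Hinge-style penalty on self-attention weights.

    weight_original: attention weight of the original (copy) face image
    crop_weights:    attention weights of the facial crops
    margin:          assumed default; treat it as a tunable hyper-parameter
    """
    # The loss is zero once the best crop beats the original face by the margin.
    gap = max(crop_weights) - weight_original
    return max(0.0, margin - gap)
```

For example, when the best crop weight exceeds the original face's weight by more than the margin, the loss vanishes; when every crop has a lower weight than the original face, the loss is the margin plus the shortfall.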

Region Generation

Confusion Matrices

The confusion matrices of baseline methods and our RAN on the Occlusion- and Pose-FERPlus test sets.

The confusion matrices of baseline methods and our RAN on the Occlusion- and Pose-AffectNet test sets.

What is learned for occlusion and pose variant faces?

Illustration of learned attention weights for different regions along with the original faces. The function annotated in the figure denotes the softmax function. Red-filled boxes indicate the highest weights, while blue-filled ones indicate the lowest. From left to right, the columns show the original faces followed by the cropped regions. Note that the left and right figures show the weights obtained with and without the RB-Loss, respectively.