faceit

A deep convolutional face detector in PyTorch.

1. Overview

Faceit is a joint model trained on UMDFaces. It takes 128x128 color images and predicts:

1. whether the image contains a face

2. gender

3. pose (yaw, pitch, roll)

4. face bounding box

5. eye locations.

2. Architecture

The model feeds the input image through a long central trunk of blocks with skip connections, then into five branches, one per output. The trunk shrinks the image with (sometimes strided) convolutions, flattens the result, and applies a series of affine transformations to produce an embedding vector.

        self.features = nn.Sequential(
            conv_bn_pool(3, k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            IsoConvBlock(k),
            IsoConvBlock(k),
            IsoConvBlock(k),

            ReduceBlock(k),

            Flatten(),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),
        )

        self.is_face = nn.Sequential(
            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            nn.Linear(k, 1),
            nn.Sigmoid(),
        )

        self.is_male = nn.Sequential(
            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            nn.Linear(k, 1),
            nn.Sigmoid(),
        )

        self.get_pose = nn.Sequential(
            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            nn.Linear(k, 3),
            Degrees(),
        )

        self.get_face_bbox = nn.Sequential(
            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            nn.Linear(k, 4),
        )

        self.get_keypoints = nn.Sequential(
            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            IsoDenseBlock(k),
            IsoDenseBlock(k),
            IsoDenseBlock(k),

            nn.Linear(k, 4),
        )
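To make the data flow concrete, here is a minimal sketch of how a trunk and five branches could be wired together. It is not the faceit code: the custom blocks are replaced by plain stand-in layers, `JointHeadsSketch` is a hypothetical name, and `Degrees()` is omitted. It also assumes that `conv_bn_pool` and each `ReduceBlock` halve the spatial resolution, so that a 128x128 input shrinks to 1x1 after the stem and six reductions (128 / 2^7 = 1) and flattens to a vector of length k.

```python
import torch
import torch.nn as nn


class JointHeadsSketch(nn.Module):
    """Stand-in for the trunk/branch layout; not the real faceit blocks."""

    def __init__(self, k=32):
        super().__init__()
        # Stem: assumed to halve the resolution (128 -> 64).
        layers = [
            nn.Conv2d(3, k, kernel_size=3, padding=1),
            nn.BatchNorm2d(k),
            nn.ReLU(),
            nn.MaxPool2d(2),
        ]
        # Six stand-ins for ReduceBlock, each assumed to halve H and W (64 -> 1).
        layers += [nn.AvgPool2d(2) for _ in range(6)]
        layers.append(nn.Flatten())  # (N, k, 1, 1) -> (N, k)
        self.features = nn.Sequential(*layers)

        # One branch per output, each reduced to its final layer.
        self.is_face = nn.Sequential(nn.Linear(k, 1), nn.Sigmoid())
        self.is_male = nn.Sequential(nn.Linear(k, 1), nn.Sigmoid())
        self.get_pose = nn.Linear(k, 3)       # yaw, pitch, roll
        self.get_face_bbox = nn.Linear(k, 4)  # four box coordinates
        self.get_keypoints = nn.Linear(k, 4)  # two eyes, (x, y) each

    def forward(self, x):
        emb = self.features(x)  # shared embedding vector
        return (
            self.is_face(emb),
            self.is_male(emb),
            self.get_pose(emb),
            self.get_face_bbox(emb),
            self.get_keypoints(emb),
        )


model = JointHeadsSketch(k=32).eval()
with torch.no_grad():
    outputs = model(torch.zeros(2, 3, 128, 128))
print([tuple(o.shape) for o in outputs])
```

Every branch reads the same embedding, so the trunk is computed once per image regardless of how many outputs are predicted.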

3. Blocks

A block is a collection of pathways, each paired with a learned multiplier (a "switch" or "gate"). The isomorphic blocks have skip connections and the reduce blocks have pooling pathways, which lets you stack them as deep as you want.

What I think is neat about this design is that the pathways have different levels of complexity. You can monitor the switches during training to see how much the model relies on the slower-to-learn complex pathways versus the skip connections, which lets you tune architecture depth. This could be automated to grow a network from scratch, adding blocks when the switches show that the model needs more capacity. Switch monitoring also works in the other direction: retrain the network with one block removed at a time, with precise information about how capacity is holding up, to make the model faster in tight environments. I really think this direction should be explored, but I have no time for it now.
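The idea of pathways paired with learned switches can be sketched as a weighted sum of parallel modules. This is an illustration under assumptions, not the repository's implementation: `SwitchedBlock` and `switch_report` are hypothetical names, and the switches are assumed to be plain unconstrained scalars that multiply each pathway's output.

```python
import torch
import torch.nn as nn


class SwitchedBlock(nn.Module):
    """Weighted sum of parallel pathways; the weights ("switches") are learned."""

    def __init__(self, pathways, init_weights):
        super().__init__()
        assert len(pathways) == len(init_weights)
        self.pathways = nn.ModuleList(pathways)
        self.switches = nn.Parameter(torch.tensor(init_weights, dtype=torch.float32))

    def forward(self, x):
        # Every pathway must return a tensor of the same shape.
        return sum(w * path(x) for w, path in zip(self.switches, self.pathways))


def switch_report(model):
    """Collect the current switch values of every SwitchedBlock for logging."""
    return {
        name: module.switches.detach().tolist()
        for name, module in model.named_modules()
        if isinstance(module, SwitchedBlock)
    }


k = 8
conv_path = nn.Sequential(nn.Conv2d(k, k, 3, padding=1), nn.BatchNorm2d(k), nn.ReLU())
# Skip connection switched on, convolution switched off at initialization.
model = nn.Sequential(SwitchedBlock([nn.Identity(), conv_path], [1.0, 0.0]))
print(switch_report(model))
```

Calling `switch_report` every few hundred steps would show whether the weight on the convolution pathway grows away from zero, which is the signal the paragraph above proposes using to decide on depth.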

Three kinds of blocks:

Convolutional isomorphic (initialized with weights [1, 0, 0]):

Skip connection
Convolution -- conv2d, batch norm, relu, optionally dropout
Gated convolution (a primitive form of attention -- see Gated Convolutional Networks paper)

Fully connected isomorphic (initialized with weights [1, 0, 0, 0]):

Skip connection
Affine transformation -- linear, batch norm, relu, dropout
Gated affine (affine equivalent of gated conv)

Convolutional reduce (initialized with weights [0.5, 0.5, 0, 0]):

Average pooling
Max pooling (in theory you might worry about overfitting, but in practice I removed dropout)
Strided convolution
Gated strided convolution
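As an illustration of the last list, a reduce block with these four pathways might look like the sketch below. `ReduceBlockSketch` is a hypothetical name, and several details are assumptions: the kernel sizes, the 2x reduction factor, and the gating itself, which is written here as a sigmoid-gated product in the style of gated linear units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReduceBlockSketch(nn.Module):
    """Halves H and W through four switched pathways (assumed layout)."""

    def __init__(self, k):
        super().__init__()
        # Strided convolution pathway: conv2d, batch norm, relu.
        self.conv = nn.Sequential(
            nn.Conv2d(k, k, 3, stride=2, padding=1),
            nn.BatchNorm2d(k),
            nn.ReLU(),
        )
        # Gated strided convolution: one conv for values, one for the gate.
        self.gated_value = nn.Conv2d(k, k, 3, stride=2, padding=1)
        self.gated_gate = nn.Conv2d(k, k, 3, stride=2, padding=1)
        # Start as an even mix of average and max pooling.
        self.switches = nn.Parameter(torch.tensor([0.5, 0.5, 0.0, 0.0]))

    def forward(self, x):
        avg = F.avg_pool2d(x, 2)
        mx = F.max_pool2d(x, 2)
        conv = self.conv(x)
        gated = self.gated_value(x) * torch.sigmoid(self.gated_gate(x))
        w = self.switches
        return w[0] * avg + w[1] * mx + w[2] * conv + w[3] * gated
```

With this initialization the block starts out as a parameter-free pooling layer, and the two convolutional pathways only contribute once training moves their switches away from zero.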

I like this design because you can invent new pathways and see very quickly how much the model relies on them, or not.

Other pathways I experimented with:

Fully connected block: two outputs; multiply their sqrt of n + 1 into one output. The scaling is there to fight NaNs and gradient explosion. Is this more general or powerful than sigmoid-multiply gating? Perhaps requiring two outputs to coincide in this way is akin to dropout and has similar properties?
Iso conv block: do global max/average pooling, then apply an affine transform to the result (dimensionality: num filters x num filters). The model loved this information but did not converge faster. The dimensionality may have been too low; further experiments are needed.
...

4. Losses

Losses were selected and balanced empirically.

Whether face: binary cross-entropy / 4
Gender: binary cross-entropy / 4
Pose: mean absolute error / 32 per float
Face bounding box: clamp the predicted coordinates to avoid gradient explosion, then take the Euclidean distance / 32. It was originally fancier.
Eyes: same as the face bounding box loss.
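The weighting above can be sketched as follows. Several details are assumptions made for illustration and are not stated in this README: the clamp range, the treatment of the four box or eye floats as two (x, y) points, the plain sum of the five terms, and the dictionary keys. `joint_loss` and `point_distance` are hypothetical names. A real training loop would probably also mask the non-face images out of every term except the first.

```python
import torch
import torch.nn.functional as F


def point_distance(pred, target):
    """Mean Euclidean distance between predicted and target (x, y) points."""
    diff = (pred - target).reshape(-1, 2)
    return diff.norm(dim=1).mean()


def joint_loss(pred, target, clamp_range=(-128.0, 256.0)):
    """Return each scaled loss term plus their sum (assumed combination)."""
    lo, hi = clamp_range  # assumed bounds; the original range is not given
    parts = {
        "is_face": F.binary_cross_entropy(pred["is_face"], target["is_face"]) / 4,
        "is_male": F.binary_cross_entropy(pred["is_male"], target["is_male"]) / 4,
        # l1_loss averages over every float, i.e. mean absolute error per float.
        "pose": F.l1_loss(pred["pose"], target["pose"]) / 32,
        "face_bbox": point_distance(pred["face_bbox"].clamp(lo, hi), target["face_bbox"]) / 32,
        "eyes": point_distance(pred["eyes"].clamp(lo, hi), target["eyes"]) / 32,
    }
    parts["total"] = sum(parts.values())
    return parts
```

Returning the individual terms alongside the total makes it easy to log each one and rebalance the divisors empirically.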

Gender prediction accuracy and the mean bounding box and eye distances in pixels are also tracked during training. On validation, the model gets down to a mean error of about 3 pixels for eyes and about 4 pixels for the face bounding box fairly quickly on a GPU. I was satisfied with that performance from looking at demo images during training, and I don't have the resources to experiment further. Because of the stacked gating, it is easier to squeeze out more performance by adding more layers, but it is doubtful how well additional tweaking on the same dataset would generalize, so I didn't bother.

5. Dataset

The excellent UMDFaces dataset was used, without data augmentation for lack of time. Experimenting on a harder set of videos of people driving cars, I learned to randomly darken input images during training to improve performance at nighttime. The effects of this were not quantified due to time constraints. To improve performance with occlusions such as glasses, I also tried retraining with the Specs on Faces face detection dataset, although it's a bit small.
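Random darkening can be sketched as a per-image brightness scale applied to a batch. The function name `random_darken`, the scale range, and the probability are all hypothetical; the README does not say how the darkening was actually done.

```python
import torch


def random_darken(images, min_scale=0.3, p=0.5, generator=None):
    """Multiply a random subset of images by a brightness factor in [min_scale, 1].

    images: float tensor of shape (N, C, H, W) with non-negative pixel values.
    """
    n = images.shape[0]
    # One brightness factor per image, shared by all of its pixels.
    scale = min_scale + (1.0 - min_scale) * torch.rand(n, generator=generator)
    # Leave roughly (1 - p) of the images untouched.
    apply = torch.rand(n, generator=generator) < p
    scale = torch.where(apply, scale, torch.ones_like(scale))
    return images * scale.view(n, 1, 1, 1).to(images)
```

Because only pixel intensities change, the bounding box, eye, and pose targets stay valid without any adjustment.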
