Face Recognition

Face recognition involves identifying or verifying a person from a digital image or video frame and is still one of the most challenging tasks in computer vision today. The conventional face recognition pipeline consists of face detection, face alignment, feature extraction, and classification. This page further explains three exemplary state-of-the-art architectures: DeepID3 (6), FaceNet (9), and Sparse ConvNet (11).


Introduction
The task of face recognition involves identifying or verifying a person from a digital image or video frame. Computer applications capable of performing this task, known as facial recognition systems, have been around for decades. The general idea of face recognition is to extract facial landmarks from an image and then compare them with those of other images by matching the extracted features.
However, face recognition is still one of the most relevant and challenging research areas in computer vision and pattern recognition due to variations in facial expressions, poses, and illumination. (1)

Overview
The conventional face recognition pipeline consists of four stages: face detection, face alignment (or preprocessing), feature extraction (or face representation), and classification, as illustrated in figure 1.
A milestone in face detection was the contribution by Viola & Jones in 2001 (2), which provided an object detection framework that operated in real time and was well suited to human faces. The feature extraction stage is often considered the most challenging and important of all, since any matching algorithm is limited by the quality of the underlying features.

Notable networks
There is a variety of successful architectures. This section focuses on three different models and explains their idiosyncrasies. Evaluations of face recognition approaches are almost always performed on the Labeled Faces in the Wild (LFW) data set (12), with face verification accuracy as the most common metric. In the verification task, given a pair of face images, the goal is to determine whether they come from a single subject or not.
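The verification metric described above can be illustrated with a minimal sketch. It assumes that each face has already been mapped to an embedding vector and that a pair is declared "same subject" when the Euclidean distance falls below a threshold. The function names, the toy embeddings, and the threshold of 0.5 are hypothetical and are not taken from any of the cited publications.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verification_accuracy(pairs, threshold):
    """Fraction of pairs whose same/different decision is correct.

    pairs: list of (embedding_a, embedding_b, same_subject) tuples.
    threshold: hypothetical decision threshold on the distance.
    """
    correct = 0
    for emb_a, emb_b, same_subject in pairs:
        predicted_same = euclidean(emb_a, emb_b) < threshold
        if predicted_same == same_subject:
            correct += 1
    return correct / len(pairs)


# Toy 2-D embeddings (illustrative values only, not real face data).
pairs = [
    ([0.0, 1.0], [0.1, 0.9], True),   # close pair, same subject
    ([1.0, 0.0], [0.9, 0.2], True),   # close pair, same subject
    ([0.0, 1.0], [1.0, 0.0], False),  # distant pair, different subjects
    ([1.0, 0.0], [0.8, 0.3], False),  # close pair, but different subjects
]

accuracy = verification_accuracy(pairs, threshold=0.5)
print(f"verification accuracy: {accuracy:.2f}")
```

In this toy example the last pair is misclassified because two different subjects happen to lie close together, which is exactly the kind of error the metric counts.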

DeepID3
DeepID3 is the third generation of the DeepID architecture, which was one of the first publications to propose learning discriminative deep face representations (DFR) through large-scale face identity classification. The second generation proposed DFR by joint face identification-verification, which finally brought the networks up to human performance.
In this third approach (shown in figure 2), Sun et al. (6) tried to use insights from the most successful architectures of the ImageNet challenge in 2014: the inception layers of GoogLeNet (7) and the stacked convolutions of VGG (8). They also added joint identification-verification supervisory signals to multiple layers to further reduce the intra-personal variance of the representation. The publication shows that very deep neural networks achieve state-of-the-art performance on face recognition tasks and slightly outperform their shallower counterparts. Exposing the architectures to large-scale training data is expected to increase their effectiveness further.

FaceNet
Model structure. This network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training. Source: (9)
The architecture is a combination of the multiple interleaved layers of convolutions of Zeiler & Fergus (10) and the inception model of GoogLeNet (7). These models are interwoven into a deep architecture, which is symbolized as a black box in figure 4. The most important part of the approach lies in the end-to-end learning of the whole system. As a loss function, the Triplet Loss was used, which is explained and shown in figure 5.
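The L2 normalization and the triplet loss mentioned above can be sketched in a few lines. This is a minimal illustration of the commonly used formulation, max(0, ||a - p||^2 - ||a - n||^2 + margin), for a single triplet of anchor, positive (same identity), and negative (different identity) embeddings. The function names, the toy vectors, and the margin value are assumptions made for illustration, and the deep CNN that would produce the embeddings is left out entirely.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (the embedding step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet.

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (an illustrative value).
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy raw outputs standing in for CNN activations (hypothetical values).
anchor = l2_normalize([1.0, 0.0])
positive = l2_normalize([0.9, 0.1])
negative = l2_normalize([0.0, 1.0])

easy = triplet_loss(anchor, positive, negative)  # well-separated triplet
hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated
print(f"easy triplet loss: {easy:.3f}, violated triplet loss: {hard:.3f}")
```

Only triplets that violate the margin contribute a non-zero loss, which is why training pushes embeddings of the same identity together and those of different identities apart.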

Sparse ConvNet
In this recent publication, Sun et al. (11) tried to improve further on their achievements with DeepID3 by taking a trained, dense CNN, sparsifying its connections, and training it further to improve performance. This architecture increases the baseline performance of DeepID3 from 98.95% to 99.30%, which implies an error rate reduction of 33%. It is important to note that even though it did not achieve better performance than FaceNet (9), it only required 300,000 training samples and can thereby be considered more efficient.
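The 33% figure follows from comparing error rates rather than accuracies: the error drops from 1.05% to 0.70%. A short sketch of the arithmetic, using only the two accuracy values quoted above (the function name is hypothetical):

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction of the error rate, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(98.95, 99.30)
print(f"relative error reduction: {reduction:.1%}")
```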