Review on A Paper that Combines Gabor Filter and Convolutional Neural Networks for Face Detection

Posted by Mohamad Ivan Fanany

This post summarizes and reviews a paper that combines Gabor filters and convolutional neural networks: Face Detection Using Convolutional Neural Networks and Gabor Filters, which detects facial regions in images of arbitrary size.

Motivation:

  • Detecting and locating human faces in an image or video has many applications:
    • human-computer interaction,
    • model based coding of video at very low bitrates
    • content-based video indexing.
  • Why Gabor filter?
    • Biologically motivated, since it models the response of human visual cortical cells [3],
    • Removes most of the variation in lighting and contrast,
    • Reduces intrapersonal variation,
    • Robust against small shifts and small object deformations,
    • Allows analysis of signals at different scales or resolutions,
    • Accommodates frequency and position simultaneously.
  • Why convolutional neural networks [6]?
    • Incorporates prior knowledge about the input signal and its distortions into its architecture.
    • Specifically designed to cope with the variability of 2D shapes to be recognized.
    • Combines local receptive fields and shared weights,
    • Utilizes spatial subsampling to ensure some level of shift, scale, and deformation invariance.
    • Using local receptive fields, the neurons can extract simple visual features such as corners and end-points. These elementary features are then combined by the succeeding layers to detect more complicated features.

Purpose:

  • Build a method for detecting facial regions in the image of arbitrary size.

Key Ideas:

  • Combining a Gabor filter and a convolutional neural network.
  • Uses Gabor filter-based features instead of the raw gray values as the input for a convolutional neural network.

Stages:

  1. Apply Gabor filters, which extract intrinsic facial features. This transformation yields four subimages.
  2. Apply the convolutional neural network to the four subimages.

Challenges:

  • In complex scenes, human faces may appear in different
    • scales,
    • orientations,
    • head poses.
  • Human face appearance can change considerably due to changes in
    • lighting condition,
    • facial expressions,
    • shadows,
    • presence of glasses.

Previous Works:

  • Facial regions detection:
    • Support vector machines [10],
    • Bayesian classifiers [10],
    • Neural networks [7][5].
  • Face knowledge-based detector [7]
  • Finding frontal faces [8]
  • Gabor filter-based features for face recognition [10], but without a proposed method for face detection.
  • Face detection in static images using CNNs [5][2].

Features Extraction:

  • Two different orientations and two different wavelengths are utilized.
  • Different facial features are selected, depending on the response of each filter.
  • In a frontal or near-frontal face image, the eyes and mouth are oriented horizontally, while the nose has a vertical orientation.
  • The Gabor wavelet is capable of selecting localized variation in image intensity (a sketch of the filter bank follows this list).
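
A minimal sketch of this filter bank in Python, assuming OpenCV: four Gabor kernels (two orientations × two wavelengths) are applied to a 20×20 grayscale patch to produce the four subimages fed to the network. The kernel size, sigma, and wavelength values are illustrative assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def gabor_subimages(patch):
    """Return the four Gabor responses (two orientations x two wavelengths)."""
    subimages = []
    for theta in (0.0, np.pi / 2):      # horizontal- and vertical-oriented filters
        for lambd in (4.0, 8.0):        # two wavelengths (assumed values)
            kernel = cv2.getGaborKernel((9, 9), sigma=2.0, theta=theta,
                                        lambd=lambd, gamma=0.5, psi=0.0)
            subimages.append(cv2.filter2D(patch.astype(np.float32), -1, kernel))
    return np.stack(subimages)          # shape (4, 20, 20) for a 20x20 patch
```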

Convolutional Neural Algorithm:

  • Contains a set of layers each of which consists of one or more planes.
  • Each unit in the plane is connected to a local neighborhood in the previous layer.
  • The unit can be seen as a local feature detector whose activation characteristic is determined in the learning stage.
  • The outputs of such a set of units constitute a feature map.
  • Units in a feature map are constrained to perform the same operation on different parts of the input image or previous feature maps, so a single map extracts the same feature at every location; different feature maps extract different features from the same image.
  • A feature map can be obtained in a sequential manner through scanning the input image.
  • The scanning operation is equivalent to a convolution with a small kernel.
  • The feature map can be treated as a plane of units that share weights.
  • The subsampling layers introduce a certain level of invariance to distortions and translations.
  • Features detected by the units in successive layers have:
    • decreasing spatial resolution,
    • increasing complexity,
    • increasing globality.
  • The network is trained in a supervised manner using the back-propagation algorithm, adapted for convolutional neural networks.
  • The partial derivatives of the activation function with respect to each connection are computed as if the network were a typical multi-layer one.
  • The partial derivatives of all connections that share the same parameter are then summed to construct the derivative with respect to that shared parameter (a toy illustration follows this list).
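
A toy numpy illustration of this shared-weight gradient rule, using a 1-D convolution for brevity: the derivative with respect to a shared kernel weight is the sum of the per-position derivatives, computed as if each position had its own weight.

```python
import numpy as np

def conv_weight_grad(x, upstream, ksize=3):
    """Gradient of the loss w.r.t. a shared 1-D convolution kernel."""
    grad = np.zeros(ksize)
    for i, g in enumerate(upstream):     # each output position contributes...
        grad += g * x[i:i + ksize]       # ...its local derivative, summed up
    return grad

x = np.arange(6, dtype=float)            # toy input signal
upstream = np.ones(6 - 3 + 1)            # dLoss/dOutput at each of 4 positions
print(conv_weight_grad(x, upstream))     # [ 6. 10. 14.]
```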

Convolutional Neural Architecture:

  • 6 layers.
  • Layer C1 (convolutional layer 1):
    • Performs a convolution on the Gabor filtered images using an adaptive mask.
    • The weights in the convolution mask are shared by all the neurons of the same feature map.
    • The receptive fields of neighboring units overlap.
    • The size of the scanning windows was chosen to be 20×20 pixels.
    • The size of the mask is 5×5, so each feature map of this layer is 16×16 (20 − 5 + 1 = 16).
    • The layer has 104 trainable parameters (four feature maps, each with 5×5 shared weights plus one bias: 4 × 26 = 104).
  • Layer S1 (subsampling layer 1):
    • The averaging/subsampling layer: it consists of four planes, one per feature map of C1 (the 16×16 maps become 8×8 after 2-to-1 subsampling).
    • Each unit in one of these planes receives four inputs from the corresponding plane in C1.
    • Receptive fields do not overlap and all the weights are equal within a single unit; therefore, this layer performs a local averaging and 2-to-1 subsampling.
    • The number of trainable parameters in this layer is 8.
    • Once a feature has been extracted by the first two layers, its exact location in the image is less important; its spatial relations with other features are more relevant.
    • S1 is only partially connected to the following layer, C2.
  • Layer C2 (convolutional layer 2):
    • Composed of 14 feature maps; its task is to discover relationships between the different features.
    • Each unit contains one or two receptive fields of size 3×3 which operate at identical positions within the S1 maps.
    • The first eight feature maps use single receptive fields and form two independent groups of units responsible for distinguishing between face and non-face patterns.
    • The remaining six feature maps take inputs from every contiguous subset of two feature maps in S1.
    • This layer has 140 free parameters.
  • Layer S2 (subsampling layer 2):
    • Plays the same role as layer S1.
    • It is constructed of 14 feature maps and has 28 free parameters.
  • In the next layer, each of 14 units is connected only to the corresponding feature map of the S2 layer; it has 140 free parameters.
  • The output layer has one node that is fully connected to all the nodes of the previous layer.
  • The network contains many connections but relatively few free trainable parameters due to weight sharing (a sketch of the whole architecture follows this list).
  • Weight sharing:
    • Considerably reduces the number of free parameters,
    • Improves the generalization capability.
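
A minimal PyTorch sketch of this architecture, under stated assumptions: plain average pooling stands in for the paper's trainable subsampling units, and the partial S1-to-C2 connectivity is approximated by a full connection, so the parameter counts differ from the paper's. It is meant only to make the layer shapes concrete.

```python
import torch
import torch.nn as nn

class GaborCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(4, 4, kernel_size=5)   # 4 Gabor subimages: 20x20 -> 16x16
        self.s1 = nn.AvgPool2d(2)                  # 2-to-1 subsampling: 16x16 -> 8x8
        self.c2 = nn.Conv2d(4, 14, kernel_size=3)  # 14 feature maps: 8x8 -> 6x6
        self.s2 = nn.AvgPool2d(2)                  # 2-to-1 subsampling: 6x6 -> 3x3
        self.f1 = nn.Conv2d(14, 14, kernel_size=3, groups=14)  # one unit per S2 map
        self.out = nn.Linear(14, 1)                # single fully connected output node
        self.act = nn.Tanh()                       # the paper uses hyperbolic tangent

    def forward(self, x):                          # x: (N, 4, 20, 20) Gabor responses
        x = self.act(self.c1(x))
        x = self.s1(x)
        x = self.act(self.c2(x))
        x = self.s2(x)
        x = self.act(self.f1(x)).flatten(1)        # (N, 14)
        return torch.tanh(self.out(x))             # face/non-face score in [-1, 1]
```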

Validation:

  • The recognition performance is dependent on the size and quality of the training set.
  • The face detector was trained on 3000 non-face patches collected from about 1500 images, and on 1500 faces covering out-of-plane rotation in the range [−20°, 20°].
  • All faces were manually aligned by eyes position.
  • For each face example, synthesized faces were generated by random in-plane rotation in the range [−10°, 10°], random scaling of about ±10%, random shifting of up to ±1 pixel, and mirroring (see the augmentation sketch after this list).
  • All faces were cropped and rescaled to windows of size 20×20 pixels while preserving their aspect ratio.
  • Such a window size is considered in the literature to be the minimal resolution usable without losing critical information from the face pattern.
  • The training collection also contains images acquired from the authors' video cameras.
  • Most of the training images, which were obtained from the Web, are of very good quality.
  • The images obtained from the cameras are of lower quality.
  • To provide more false examples, training was performed with bootstrapping [7].
  • Using bootstrapping, examples close to the boundaries of the face and non-face clusters were iteratively gathered in the early stages of training.
  • The activation function in the network was a hyperbolic tangent.
  • Training the face detector took around 60 hours on a 2.4 GHz Pentium IV-based PC.
  • There was no overlap between the training and test images.
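
A hedged sketch of the face-synthesis step described above, assuming OpenCV; the paper gives the transformation ranges but no implementation details, so the interpolation and border handling here are assumptions.

```python
import cv2
import numpy as np

def synthesize_face(face20x20, rng=np.random.default_rng()):
    """Generate one synthetic variant of a 20x20 face patch."""
    angle = rng.uniform(-10.0, 10.0)            # random in-plane rotation [deg]
    scale = rng.uniform(0.9, 1.1)               # random scaling of about +/-10%
    tx, ty = rng.uniform(-1.0, 1.0, size=2)     # random shift of up to +/-1 pixel
    M = cv2.getRotationMatrix2D((10.0, 10.0), angle, scale)
    M[:, 2] += (tx, ty)                         # add the translation
    warped = cv2.warpAffine(face20x20, M, (20, 20), flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REPLICATE)
    if rng.random() < 0.5:                      # random mirroring
        warped = cv2.flip(warped, 1)
    return warped
```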

Testing Experiments:

  • Camera sensor: binocular Megapixel Stereo Head (for testing).
  • A skin color detector is the first classifier in the system [4].
  • To find faces, the detector moves a scanning subwindow by a predetermined number of pixels, but only within skin-like regions (a sketch follows this list).
  • The output of the face detector is then used to initialize the authors' face/head tracker [4].
  • The detector operates on images of size 320×240 and can process 2-5 images per second depending on the image structure.
  • To estimate the recognition performance, the paper used only static gray images.
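
A hedged sketch of this scanning strategy: the 20×20 subwindow slides over the image with a fixed stride, but the detector is evaluated only where the skin-color mask fires. The `detector` callable, `skin_mask` array, and the 50% skin-coverage threshold are assumed stand-ins for the paper's components.

```python
import numpy as np

def scan(image, skin_mask, detector, stride=2, win=20):
    """Run the face detector on skin-like 20x20 windows; return hit boxes."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            if skin_mask[y:y + win, x:x + win].mean() < 0.5:
                continue                             # skip windows with little skin
            if detector(image[y:y + win, x:x + win]) > 0:
                detections.append((x, y, win, win))  # tanh output > 0 => face
    return detections
```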

Results:

  • Test data set containing 1000 face samples and 10000 non-face samples.
  • The obtained detection rate is 87.5%.
  • Using only the convolutional network, the detection rate is only 79%.
  • The network structure used is simple enough to provide real-time face detection with the available computational resources.
  • It is much easier to train a convolutional neural network on Gabor-filtered input images than on raw images or histogram-equalized images.

Conclusions:

  • The experimental results are promising in both detection rate and processing speed.
  • The Gabor filter extracts effective features for a convolutional neural network.
  • Recognition performance is much better than with the convolutional neural network alone.
  • High face detection rates and real-time performance are achieved because there is no exhaustive search over the whole image.

My review:

  • Even though quite old (ICANN 2005), the paper is interesting because it is the first paper to use Gabor wavelet filter responses as the input to convolutional neural networks.
  • The paper states that recognition with Gabor-filtered input is much better than with convolutional neural networks alone.
  • Steffan Duffner's dissertation (2007), however, states on page 120:

Our experimental results show that using the image intensities as input of the CNN yields to the best results compared to gradient images and Gabor wavelet filter responses.

  • Thus, the results seem to contradict the findings of Duffner's dissertation.
  • This naturally raises further questions:
    • What is the effect of using a stereo camera during testing? What if a monocular camera is used?
    • Could the Gabor filter cause overfitting?
  • The overhead of computing the Gabor filter for every training and testing image seems to hinder the wide use of this system. One breakthrough may be achieved if we could compute the Gabor filters directly inside the CNN, inline with the training process; we can hope that such a mechanism would also maintain the system's good results (a speculative sketch follows below).
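
A speculative PyTorch sketch of this idea: bake the Gabor filtering into the network as its first convolution layer, initialized with Gabor kernels and either frozen or left trainable for fine-tuning. The kernel parameters are illustrative assumptions, not values from the paper.

```python
import math
import torch
import torch.nn as nn

def gabor_kernel(ksize, sigma, theta, lambd, psi=0.0, gamma=0.5):
    """Build a single real Gabor kernel of size ksize x ksize."""
    half = ksize // 2
    ys, xs = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                            torch.arange(-half, half + 1, dtype=torch.float32),
                            indexing="ij")
    xr = xs * math.cos(theta) + ys * math.sin(theta)    # rotate coordinates
    yr = -xs * math.sin(theta) + ys * math.cos(theta)
    return torch.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2)) \
        * torch.cos(2 * math.pi * xr / lambd + psi)

# First layer: raw grayscale in, four Gabor responses out, inline with training.
gabor = nn.Conv2d(1, 4, kernel_size=9, padding=4, bias=False)
with torch.no_grad():
    kernels = [gabor_kernel(9, 2.0, th, lm)
               for th in (0.0, math.pi / 2) for lm in (4.0, 8.0)]
    gabor.weight.copy_(torch.stack(kernels).unsqueeze(1))
gabor.weight.requires_grad_(False)   # freeze, or leave trainable to fine-tune
```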