

Fig. 8.5 Block diagram showing typical processing steps in a 3D face recognition system. There are several possible reorderings of this pipeline, depending on the input data quality and the performance priorities of the application

gallery, along with a larger set of the general public. If a probe matches any one of these criminal identities reasonably well, such that its match score is ranked in the top ten, this can trigger an alarm for manual inspection of the probe image and the top ten gallery matches. In this case, the rank-10 identification rate is important.

In practice, curves are generated by recording the rank of the correct match for each probe and then computing the percentage of probes whose rank is less than or equal to r, where r is the allowable rank. The allowable rank starts at 1 and is incremented until 100 % recognition is attained for some r, or the graph may be terminated earlier, for example at r = 100. Plotting such a Cumulative Match Curve (CMC) of percentage identification against rank allows us to compare systems at a range of possible operating points, although rank-1 identification is the most important of these.
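As an illustration, a minimal sketch (not from the chapter; the function and array names are our own) of how a CMC can be computed from the per-probe ranks of the correct match:

    import numpy as np

    def cumulative_match_curve(ranks, max_rank=100):
        """CMC from the rank of the correct gallery match for each probe.

        ranks    : 1-D array of 1-based ranks, one per probe.
        max_rank : largest allowable rank r to evaluate.
        Returns an array whose entry r-1 is the fraction of probes whose
        correct match is ranked at or below r.
        """
        ranks = np.asarray(ranks)
        return np.array([(ranks <= r).mean() for r in range(1, max_rank + 1)])

    # Example: six probes; rank-1 and rank-10 identification rates.
    cmc = cumulative_match_curve([1, 1, 3, 2, 1, 12], max_rank=20)
    print(f"rank-1: {cmc[0]:.2f}, rank-10: {cmc[9]:.2f}")  # rank-1: 0.50, rank-10: 0.83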

At the time of writing, high-performance 3D face recognition systems typically report rank-1 identification rates ranging from around 96 % to 98.5 % [51, 64, 76] on the FRGC v2 3D face dataset. We note that FRGC v2 does not contain significant pose variations, and the performance of identification systems may fall as more challenging large-scale datasets that do contain such variations are developed.

8.5 Processing Stages in 3D Face Recognition

When developing a 3D face recognition system, one has to understand what information is provided by the camera or the dataset, in what format it is presented, and what imperfections are likely to exist. The raw data obtained from even the most accurate scanners is imperfect: it contains spikes, holes and noise. Preprocessing stages are usually tailored to the form and quality of this raw data. Often, the face scans must be normalized with respect to pose (e.g. for holistic approaches) and spatial sampling before features are extracted for 3D face recognition. Although the 3D face processing pipeline shown in Fig. 8.5 is typical, many variations on it exist; in particular, there are some possible reorderings, and not all of the preprocessing and pose normalization stages are always necessary. With this understanding, we discuss all of the stages of the pipeline in the following subsections.

8.5.1 Face Detection and Segmentation

Images acquired with a 3D sensor usually contain a larger area than the face alone, and it is often desirable to segment the face and crop away this extraneous data as early as possible in the processing pipeline, in order to speed up the downstream stages. This face detection and cropping process, which yields a 3D face segmentation, can be done on the basis of the camera's 3D range data, its 2D texture image, or a combination of both.

In the case of 2D images, face detection is a mature field (particularly for frontal poses) and popular approaches include skin detection, face templates, eigenfaces, neural networks, support vector machines and hidden Markov models. A survey of face detection in images is given by Yang et al. [92]. A seminal approach to real-time face detection by Viola and Jones [88] is based on Haar wavelets and adaptive boosting (AdaBoost) and is implemented in the Open Computer Vision (OpenCV) library.
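Since this detector ships with OpenCV, a minimal usage sketch (the image path is a placeholder):

    import cv2

    # Load the pretrained frontal-face Haar cascade bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("texture_channel.png")        # placeholder path
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Returns a list of (x, y, w, h) face bounding boxes.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)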

However, some face recognition systems prefer not to rely on the existence of a 2D texture channel in the 3D camera data and crop the face on the basis of 3D information alone. The use of 3D information is also sometimes preferred for more accurate localization of the face. If some pose assumptions are made, very basic techniques can be applied. For example, one could take the uppermost vertex (largest y value), assume that it is near the top of the head, and crop a sufficient distance downwards from this point to include the largest faces likely to be encountered. Note that this can fail if the uppermost vertex is on a hat, another head accessory, or some types of hair style. Alternatively, for cooperative subjects in frontal poses, one can assume that the nose tip is the closest point to the camera and crop a spherical region around this point to segment the facial area. However, the chin, forehead or hair is occasionally closer. Thus, particularly in the presence of depth spikes, this kind of approach can fail, and it may be better to move the cropping process further down the processing pipeline so that it follows a spike filtering stage.
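A minimal sketch of this closest-point cropping heuristic, assuming an N x 3 vertex array in millimeters with the camera looking along the +z axis (the function name and the 100 mm radius are illustrative choices):

    import numpy as np

    def crop_face_by_nose_tip(points, radius=100.0):
        """Crop a spherical region around the assumed nose tip.

        points : (N, 3) vertex array in mm; with the camera looking along
                 +z, the nose tip of a frontal face has the smallest z.
        radius : crop radius in mm; ~100 mm covers most adult faces.
        Note: as discussed above, a depth spike closer to the camera than
        the nose defeats this heuristic, so filter spikes first.
        """
        nose_tip = points[np.argmin(points[:, 2])]
        dist = np.linalg.norm(points - nose_tip, axis=1)
        return points[dist <= radius]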

If the system's computational power is such that it is acceptable to move cropping even further down the pipeline, more sophisticated cropping approaches can be applied, based on facial feature localization and some of the techniques described earlier for 2D face detection. The nose is perhaps the most prominent feature; it has been used alone [64], or in combination with the inner eye corners [22], for face region segmentation. The latter approach uses the principal curvatures to detect candidate noses and eyes; each candidate triplet (a nose tip and two inner eye corners) is then verified by a PCA-based classifier for face detection.
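Such detectors rest on estimating the principal curvatures. For data in range image form Z(x, y), these can be sketched via the standard Monge-patch formulas (the smoothing scale sigma is an illustrative assumption; curvature estimates are very sensitive to noise):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def principal_curvatures(Z, sigma=2.0):
        """Principal curvatures k1 >= k2 of a range image Z(x, y)."""
        Zs = gaussian_filter(Z, sigma)         # differentiation amplifies noise
        Zy, Zx = np.gradient(Zs)               # first derivatives (rows = y)
        Zxy, Zxx = np.gradient(Zx)
        Zyy, _ = np.gradient(Zy)

        g = 1.0 + Zx**2 + Zy**2
        K = (Zxx * Zyy - Zxy**2) / g**2        # Gaussian curvature
        H = ((1 + Zy**2) * Zxx - 2 * Zx * Zy * Zxy
             + (1 + Zx**2) * Zyy) / (2 * g**1.5)  # mean curvature

        disc = np.sqrt(np.maximum(H**2 - K, 0.0))
        return H + disc, H - disc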


8.5.2 Removal of Spikes

Spikes are caused mainly by specular regions. In the case of faces, the eyes, the nose tip and the teeth are the three main regions where spikes are likely to occur. The eye lens sometimes forms a real image in front of the face, causing a positive spike. Similarly, specular reflection from the eye forms an image of the laser behind the eye, causing a negative spike. Shiny teeth appear to bulge out in 3D scans, and a small spike can sometimes form on the nose tip. Glossy facial makeup or oily skin can also cause spikes at other regions of the face. In medical applications such as craniofacial anthropometry, the face is powdered to make its surface Lambertian and the teeth are painted before scanning. Some scanners, such as the Swiss Ranger [84], also provide a confidence map along with the range and grayscale images. Removing points with low confidence will generally remove spikes, but will result in larger regions of missing data, as points that are not spikes may also be removed. Spike detection works on the principle that surfaces, and faces in particular, are generally smooth.

One simple approach to filtering spikes is to examine a small neighborhood of each point in the mesh or range image and replace its depth (Z-coordinate value) by the median of this neighborhood. This is a standard median filter which, although effective, can attenuate fine surface detail. Another approach is to threshold the absolute difference between the point's depth and the median of the depths of its neighbors. Only if the threshold is exceeded is the point's depth replaced with the median, or deleted to be filled later by a more sophisticated scheme. These approaches work well on high resolution data but, on sufficiently low resolution data, problems may occur where the facial surface is steep relative to the viewing angle, such as the sides of the nose in frontal views. In this case, we can detect spikes relative to the local surface orientation, but this requires surface normals to be computed, and these are themselves corrupted by the spikes. It is possible to adopt an iterative procedure in which surface normals are computed and spikes removed in alternating cycles, yielding a clean, uncorrupted set of surface normals even for relatively low resolution data [72]. Although this works well for training data, where processing time is non-critical, it may be too computationally expensive for live test data.
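A minimal sketch of the thresholded median filter just described, for data in depth map form (the neighborhood size and the 20 mm threshold are illustrative assumptions):

    import numpy as np
    from scipy.ndimage import median_filter

    def remove_spikes(depth, size=5, threshold=20.0):
        """Replace only depths that deviate strongly from the local median.

        Unlike a plain median filter, points within `threshold` (mm) of the
        median of their size x size neighborhood are left untouched, so fine
        surface detail is preserved away from the spikes.
        """
        med = median_filter(depth, size=size)
        spikes = np.abs(depth - med) > threshold
        out = depth.copy()
        out[spikes] = med[spikes]   # or np.nan, for a later hole-filling stage
        return out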

8.5.3 Filling of Holes and Missing Data

In addition to the holes resulting from spike removal, 3D data contains many other missing points due to occlusions, such as the nose occluding the cheek when, for example, the head pose is sufficiently rotated (in yaw angle) relative to the 3D camera. Obviously, areas of the scene that are not visible to either the camera or the projected light cannot be acquired. Similarly, dark regions that do not reflect sufficient projected light are not sensed by the 3D camera. Both can cause large regions of missing data, which are often referred to as missing parts.

In the case of cooperative subject applications (e.g. a typical verification application), frontal face images are acquired and occlusion is not a major issue. However, dark regions such as the eyebrows and facial hair are usually not acquired. Moreover, for laser-based projection, the power cannot be increased to acquire dark regions of the face, for eye-safety reasons.

Thus the only option is to fill the missing regions using an interpolation technique, such as nearest neighbor, linear or polynomial interpolation. For small holes, linear interpolation gives reasonable results; however, bicubic interpolation has been shown to give better results [64]. Alternatively, one can use implicit surface representations for interpolation, such as those provided by radial basis functions [72].
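A minimal sketch of such interpolation-based hole filling on a range image, assuming missing points are marked with NaN (the function name is our own):

    import numpy as np
    from scipy.interpolate import griddata

    def fill_holes(depth, method="cubic"):
        """Fill NaN-marked holes in a depth map by 2D interpolation.

        method : 'nearest', 'linear' or 'cubic'; as noted above, bicubic
                 interpolation tends to outperform linear on small holes.
        """
        ys, xs = np.indices(depth.shape)
        valid = ~np.isnan(depth)
        filled = griddata((ys[valid], xs[valid]), depth[valid],
                          (ys, xs), method=method)
        # 'linear'/'cubic' cannot extrapolate outside the convex hull of the
        # valid samples; fall back to nearest neighbor for any remaining NaNs.
        rest = np.isnan(filled)
        filled[rest] = griddata((ys[valid], xs[valid]), depth[valid],
                                (ys[rest], xs[rest]), method="nearest")
        return filled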

For larger holes, symmetry-based or PCA-based approaches can be used, although these require a localized symmetry plane and a full 6-DOF rigid alignment, respectively, and hence would have to be moved further downstream in the processing pipeline. Alternatively, a model-based approach can be used to morph a model until it best fits the data points [95]. This approach learns the 3D face space offline and requires a significantly large training set in order to generalize to unseen faces. However, it may be very useful if the face being scanned has previously been seen, and this is obviously a common scenario in 3D face recognition.
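For the symmetry-based idea, a minimal sketch might mirror the acquired points about the facial symmetry plane and keep only those mirrored points that land in holes. This assumes the scan is already pose-normalized so that the symmetry plane is x = 0 (the 2 mm gap threshold is an illustrative assumption):

    import numpy as np
    from scipy.spatial import cKDTree

    def mirror_fill(points, min_gap=2.0):
        """Fill self-occluded regions of a pose-normalized face scan.

        Reflects the (N, 3) point set about the plane x = 0 and keeps a
        mirrored point only if no original sample lies within min_gap (mm)
        of it, i.e. only where it genuinely fills a hole.
        """
        mirrored = points * np.array([-1.0, 1.0, 1.0])
        dists, _ = cKDTree(points).query(mirrored, k=1)
        return np.vstack([points, mirrored[dists > min_gap]])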

For better generalization to unseen faces, an anthropometrically correct [32] annotated face model (AFM) [51] can be used. The AFM is based on an average 3D face constructed using statistical data, and anthropometric landmarks are associated with its vertices. The AFM is then fitted to the raw data from the scanner using a deformable model framework. Blanz et al. [12] also used a morphable model to fit the 3D scan, filling in missing regions in addition to other preprocessing steps, in a unified framework.

8.5.4 Removal of Noise

For face scans acquired by laser scanners, noise can be attributed to optical components such as the lens and the mirror, to the mechanical components that drive the mirror, or to the CCD itself. Scanning and imaging conditions such as ambient light, laser intensity, surface orientation, texture and distance from the scanner can also affect the noise level in the scan. Sun et al. [89] give a detailed analysis of noise in the Minolta Vivid scanner, and noise in active sensors is discussed in detail in Chap. 3 of this book.

We have already mentioned the median filter as a mechanism for removing spikes (impulsive noise). More difficult is the removal of surface noise, which is less easily differentiated from the underlying object geometry, without removing fine surface detail or otherwise distorting the underlying shape, for example through volumetric shrinkage. Clearly, there is a tradeoff to be made, and an optimal level of filtering is sought in order to give the best recognition performance. Removal of surface noise is particularly important as a preprocessing step for methods that extract the differential properties of the surface, such as normals and curvatures.
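To illustrate why denoising precedes the estimation of differential properties, the following sketch computes per-pixel surface normals from a depth map after mild Gaussian smoothing (the smoothing scale is an assumption, to be tuned against the tradeoff discussed above):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def depth_map_normals(depth, sigma=1.0):
        """Unit surface normals of a depth map, after mild smoothing.

        Differentiation amplifies high-frequency noise, so even light
        smoothing stabilizes the normals; too much erases the fine detail
        the recognizer needs.
        """
        z = gaussian_filter(depth, sigma)
        dzdy, dzdx = np.gradient(z)
        # The surface (x, y, z(x, y)) has normal (-dz/dx, -dz/dy, 1).
        n = np.dstack([-dzdx, -dzdy, np.ones_like(z)])
        return n / np.linalg.norm(n, axis=2, keepdims=True)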

If the 3D data is in range image or depth map form, there are many methods available from the standard 2D image filtering domain, such as convolution with