October 2018

Volume 33 Number 10

GPU Programming - Face Detection Using the Eigenfaces Algorithm on the GPU

By Kishore Mulchandani

Computer vision is an area of computer science that involves the identification or labeling of regions in an image. Images are large 2-D arrays of pixels that can be operated upon in parallel. GPUs are ideally suited for accelerating algorithms that entail large amounts of data parallelism, a kind of parallelism in which different portions of a dataset can be operated upon at the same time on multiple processors. Therefore, it should be possible to apply GPUs to the task of implementing efficient computer vision algorithms. In this article I’ll use the GPU to implement facial detection, a popular problem in computer vision.

Face detection, as the name suggests, discovers areas in a photo or image that correspond to a human face. This is a critical prerequisite for applications that apply a filter or edit to a face in an image. Automatic blurring, masking, highlighting and enhancing are examples that require knowing which pixels should be modified to complete the operation.

There are several algorithms currently in use to detect faces, but in this article, I’m going to utilize the eigenfaces method, one of the earliest published face detection methods. It’s also computationally intensive. My goal is to demonstrate the use of GPUs to accelerate a computationally demanding algorithm, and to present a high-level introduction to programming on the GPU.

The article assumes some familiarity with C-like languages and general concepts of linear algebra and image processing.


From linear algebra you know that if you have a collection of n-dimensional points, you can find the principal components, or eigenvectors, of this collection. These eigenvectors form a basis for the space of these points. The most important eigenvector is the one with the largest eigenvalue. Any new point drawn from the same collection will tend to have a large projection onto this most important eigenvector.

The eigenfaces method proceeds by converting a collection of faces into n-dimensional points. This is done by converting each two-dimensional image into a one-dimensional vector. The entire collection is packed into a large matrix, where each column is the 1-D vector. Eigenvectors and eigenvalues are found for the square matrix formed by multiplying this matrix with its transpose. The discovery of eigenvectors for a collection of points is also called principal component analysis, and the eigenvectors are called principal components.

The starting point for any face detection algorithm is a good database of faces. There are a few available on the Internet, but for this article I made use of the publicly available “Labeled Faces in the Wild” collection (vis-www.cs.umass.edu/lfw). A derivative of the collection converted to grayscale is available for download at conradsanderson.id.au/lfwcrop and is also provided with the download accompanying this article. This collection consists of 13,233 64x64 grayscale images of faces. Figure 1 shows a subset of the images. Note that all the images have been cropped to include only the face.

Figure 1 The First 100 Images from the Labeled Faces in the Wild Dataset

A packed matrix, in which each column in the matrix is a 1-D face image from the dataset, is shown in Figure 2. Notice the high level of intensity match along the rows of this image.

Figure 2 Scaled Representation of the Big Matrix with 4,096 Rows and 13,233 Columns

After the eigenvectors and eigenvalues are found, the most important eigenvector when reshaped as a 2-D image looks like the blurry face in Figure 3.

Figure 3 First Eigenface of the Big Matrix

Once the first eigenface has been calculated, the next step is to search the test images for regions where the projection with the eigenface reaches a large value. This large value indicates the test image region and the eigenface are near each other in the n-dimensional space of faces. The search regions will be 64x64 pixel areas starting at every pixel of the test image. Each search is independent of the search at other starting pixels, meaning there’s a high level of data parallelism.
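As a reference for what each of these independent searches does, the per-pixel projection can be sketched in plain C. This is a hypothetical helper of my own, using a 4x4 window instead of 64x64 to keep it small, and subtracting the window mean as the intensity factor:

```c
#define W 4   /* window size; the article uses 64 */

/* Project the WxW window starting at (row, col) of a grayscale image
   onto a WxW eigenface after subtracting the window's mean intensity. */
static float projectWindow(const float *img, int cols,
                           int row, int col, const float *eigenface)
{
    float patch[W * W];
    float sum = 0.0f;
    for (int i = 0; i < W; i++)
        for (int j = 0; j < W; j++) {
            patch[i * W + j] = img[(row + i) * cols + (col + j)];
            sum += patch[i * W + j];
        }
    float factor = sum / (float)(W * W);  /* the window mean */
    float dot = 0.0f;
    for (int k = 0; k < W * W; k++)
        dot += (patch[k] - factor) * eigenface[k];
    return dot;
}

/* Compute the projection at every valid starting pixel.
   On the GPU, each iteration of this double loop becomes one work-item. */
static void searchImage(const float *img, int rows, int cols,
                        const float *eigenface, float *proj)
{
    for (int r = 0; r <= rows - W; r++)
        for (int c = 0; c <= cols - W; c++)
            proj[r * cols + c] = projectWindow(img, cols, r, c, eigenface);
}
```

Every call to projectWindow is independent of the others, which is exactly the data parallelism the GPU exploits.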

As noted earlier, GPUs excel at accelerating parallel computations, but that doesn’t mean every algorithm should be implemented on the GPU. The system architecture still has a relatively slow connection between system memory and GPU memory. The data to be operated on must cross a PCI Express bus, which even at PCIe Gen 4 transfers only 16 GT/s per lane, a much slower rate than that between the CPU and system memory. This means that applications that can’t tolerate high latency won’t run effectively on the GPU. The dataset must also be large enough to justify the overhead of copying the data from system memory to GPU memory and back. If an application has a significantly large workload in terms of data size, the operations to be performed are similar in nature, and all the data needed to perform the task can fit in the GPU memory, then the GPU is ideally suited to the application. As I’ll explain, in the case of face detection, the GPU is ideal for part of the processing.

The face detection algorithm has two distinct phases. The first phase consists of creating an eigenface matrix for a collection of faces. Because this collection doesn’t typically change, the computation for the eigenface matrix calculation needs to be performed only once per collection. The second phase is the projection phase, in which a test region is projected on the first eigenface to see how large the projection is. This phase tends to happen quite frequently. For example, in a security camera application this phase may occur at 30 frames per second, producing large amounts of data. The resolution of the images could also be very large as even the cheapest webcams nowadays support HD resolutions. Moreover, face detection programs are often running with other graphics operations, such as blurring or highlighting, and the results of the face detection are being consumed on the GPU by another part of the program.

Thus, phase one is not a good candidate for GPU implementation, given the low frequency with which it needs to be performed, while phase two is an excellent match for a GPU implementation as it satisfies all the criteria.

In the following sections I’ll show you how to implement phase two of face detection, but first let me expand on how programming the GPU differs from programming the CPU.

Programming the GPU

The paradigm of programming the GPU is quite different from that of the CPU. To the OS a GPU appears as a device, and its installed drivers manage the execution of work on this device. The OS and the GPU drivers work together to complete the tasks. At the lowest level, a sequence of commands is sent to the GPU device through its command queues. The commands may involve setting state, copying data between CPU memory and GPU memory, and running code. GPUs have their own instruction sets, quite different from those of CPUs. Luckily, that’s not something you need to know unless you’re developing GPU compilers. Most programmers use one or more APIs that GPU vendors implement in their drivers, such as Direct3D, OpenCL and OpenGL. These APIs hide the low-level complexity while providing portability and compatibility across different vendors and families of architectures.

For this article I’ve chosen to use OpenCL, a standard that’s supported by all the major CPU and GPU vendors. OpenCL has been around since 2009, has had several versions, the latest being version 2.2, and it continues to evolve (khronos.org/opencl). A very important fact to note about OpenCL is that it can target not only the GPU, but also the CPU. This is a very big plus for OpenCL as it means you can make use of all the compute resources on a system. For this reason, OpenCL is known as a parallel programming API for heterogeneous systems. An OpenCL host program can explore the availability and capabilities of the computational devices before deciding which ones to use.

Fortunately, the programming model and abstractions are very similar among the various APIs. This means you should be able to map the concepts learned in OpenCL to other APIs without difficulty. Let’s now dive into an OpenCL implementation for face detection.

Setting Up the Environment

To develop with OpenCL, in addition to the runtime support in the driver, you need the SDK from the appropriate vendor. See the documentation in the digital download accompanying this article.

Because OpenCL is an open standard, any vendor can provide their own version of the runtime optimized for their hardware.

The OpenCL programming vocabulary consists of platforms, devices and command queues. A platform is synonymous with an implementation available on a system. A device refers to a piece of hardware capable of being scheduled independently from other devices. Every computer has at least one CPU device. Most workstations or gaming-class computers will have a discrete GPU that will be listed as an additional device. High-end engineering workstations often have multiple GPUs, with one attached to the display while the others are reserved for computational use. Each device can have multiple command queues into which a program will insert commands.

Every OpenCL program, running on a host, begins by discovering the capabilities of the system on which it’s running. The host program tries to identify the platforms available on the system, then it enumerates the devices. The program can specifically request a device type, for instance a CPU device type or a GPU device type. Once a device has been chosen, you create a context and a command queue for each device you intend to use. Here are some of the calls made to perform the aforementioned operations (for clarity I’ve excluded the parameters required for the calls):

clGetPlatformIDs(...); // Gets the ids of the platforms available
clGetDeviceIDs(...); // Gets the devices with certain capabilities
clCreateContext(...); // Creates a context on the device
clCreateCommandQueueWithProperties(...); // Creates a command queue with
                                         // certain properties

Once the command queue is set up, the host program creates buffers on the device that can be read-only, write-only or read-write. The buffers must be initialized with data from system memory before they can be used to do anything meaningful:

clCreateBuffer(...);  // Creates a read-only buffer on the device
clEnqueueWriteImage(...);  // Queues an image copy from the system to
                           // device memory

The data tends to be large buffers or arrays of integer and floating-point values, and is usually copied asynchronously. The buffers where the computational results are to be stored are similarly created on the device. OpenCL also has special data types called images. Many compute-intensive tasks originate as images or 3-D datasets, where these special data types come in handy. They support various sampling operations, such as nearest-sample or averaging.

The programs, or kernels as they’re referred to in OpenCL terminology, must be compiled and linked and queued in the command queue. I’ll elaborate on this aspect later, as a CPU programmer might wonder why the kernels are compiled at run time:

clCreateProgramWithSource(...); // Create a program from kernel source
clBuildProgram(...); // The program is compiled; errors from compilation
                     // are reported
clCreateKernel(...); // The kernel is created from the program

Once the kernel has been created, the arguments for the kernel are set up, and the kernel is then queued for execution in the command queue:

clSetKernelArg(...); // Set the value of a particular argument
clEnqueueNDRangeKernel(...); // Queue the kernel

There are two kernels in the accompanying source code, one that does the projection of a region on the eigenface, and the other that finds local peaks from the results of the projection. The projection operation is essentially a vector dot product.

Large values for the dot product imply more similarity between the region in the image and the eigenface. These large values, or peaks, are found using the second kernel. Because the results of the first kernel must be available before the second kernel can use them, OpenCL events are used to synchronize. The host program can wait on one or more events using the following API call:

clWaitForEvents(...); // Blocks until all the events in the list have completed

Once both the kernels have been executed on the device, the results of the second kernel will be in a device buffer that’s then copied back to the host memory where it can be used further.

Now let’s look at the kernel shown in Figure 4, which is launched for every pixel in the test image. It first finds the sum of the 4,096-dimensional vector (64x64) of the region starting at that pixel, then subtracts a factor of that sum from each pixel of the region. It then computes the dot product of this region with the eigenface, which is passed into the kernel as an argument.

Figure 4 The OpenCL Kernel for Creating an Eigenface Projection on a Region in the Test Image

__kernel void eigenFaceSearch(
    __read_only image2d_t testImage,        // Input test image
    __write_only image2d_t projectionImage, // Output projection image
    int rows,                               // Number of rows in the test image
    int cols,                               // Number of columns in the test image
    __constant float* eigenFace,            // The first eigenface buffer
    sampler_t sampler)                      // Sampler for reading the image data
{
  int myCol = get_global_id(0);           // Column index of test image pixel
  int myRow = get_global_id(1);           // Row index of test image pixel
  float4 dotp = {0.0f, 0.0f, 0.0f, 0.0f}; // Resulting dot product variable
  float4 pixel = {0.0f, 0.0f, 0.0f, 0.0f};// Variable for reading the pixel
  float sum = 0.f;                        // Sum of the pixels in the test region
  float meanVal = 0.f;
  float patch[4096];                      // Local buffer to hold the mean-
                                          // subtracted region
  int2 myCoords;                          // Current pixel's coords
  myCoords.x = myCol;
  myCoords.y = myRow;
  if (!((myCol > cols - 64) || (myRow > rows - 64))) { // Bounds check
    for (int i = 0; i < 64; i++) {        // Loop over every pixel in test region
      int2 coords;
      coords.y = myRow + i;
      for (int j = 0; j < 64; j++) {
        coords.x = myCol + j;
        // Read the pixel at coords from the test image using the sampler
        pixel = read_imagef(testImage, sampler, coords);
        patch[i * 64 + j] = pixel.x;      // Take the first component as it is a
                                          // single float buffer
        sum += patch[i * 64 + j];         // Accumulate the sum
      }
    }
    meanVal = sum / 256.0f;               // An intensity factor proportional
                                          // to the sum
    for (int i = 0; i < 4096; i++) {      // Remove it from each pixel
      patch[i] -= meanVal;                // Per-component subtract of the factor
    }
    // Perform the dot product
    for (int i = 0; i < 4096; i++) {
      dotp.x += patch[i] * eigenFace[i];  // Component-wise dot product
    }
    // Write the dot product into the projection image
    write_imagef(projectionImage, myCoords, dotp);
  }
}

Compiling the OpenCL Kernel

As mentioned previously, the OpenCL compiler is invoked at run time by the host program. The compiler is supplied with the entire source of the kernel as a string. Frequently, kernels are rather tiny pieces of code that compile quite fast, and unless your application is sensitive to every extra millisecond it takes to get the results back, you probably won’t care about it. But for complex applications with hundreds of kernels, the delay could be noticeable. Fortunately, intermediate language (IL) options exist that can reduce the overall compilation times. (IL also helps in protecting the intellectual property as embedded strings of kernel code might give away proprietary information.) The benefit of the runtime compilation of kernels is that it makes the application resilient to changes in hardware. It’s quite common for PC users to upgrade their GPUs during the lifespan of their computers. A new GPU with a new architecture might have a completely new instruction set, leaving precompiled binaries from an older GPU unable to work.

Piecing It Together

The eigenface image is precomputed and supplied as a binary file containing single-precision float data in a row-major order. The test images are also converted to floating-point grayscale images. The test images have the following format:

Height:           4-byte unsigned integer    // The number of rows in the image
Width:            4-byte unsigned integer    // The number of columns in the image
Intensity Buffer: Height * Width single-precision floats  // Data buffer
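
Assuming native byte order, a loader for this format might look like the following plain-C sketch. The function name loadTestImage is mine, not taken from the download:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Load a test image in the simple format above:
   4-byte height, 4-byte width, then height*width single-precision floats. */
static float *loadTestImage(const char *path, uint32_t *height, uint32_t *width)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    float *buf = NULL;
    if (fread(height, sizeof *height, 1, f) == 1 &&
        fread(width,  sizeof *width,  1, f) == 1) {
        size_t n = (size_t)*height * *width;
        buf = malloc(n * sizeof *buf);
        if (buf && fread(buf, sizeof *buf, n, f) != n) {
            free(buf);   /* truncated file: discard the partial read */
            buf = NULL;
        }
    }
    fclose(f);
    return buf;   /* caller frees */
}
```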

With such a simple format, it should be straightforward to generate test images that can be consumed by my program. The results of the computation are also saved out as buffers of floating-point data.  Additionally, a text file with coordinates of the top-left corners of all rectangles containing faces is generated. This data can be imported into any program. I save the data for consumption in MATLAB (bit.ly/2isajNJ).

The program can be easily modified to run on the CPU rather than the GPU.

Results of the Run

I ran the program on several test images. Because the code currently searches for only 64x64 regions, if the test images have much larger or smaller faces, I wouldn’t expect it to find good matches. However, when the test images do have approximately the same size as the eigenface, unless the faces are rotated, it should do well. Results of my run on one famous image are shown in Figure 5. Faces were identified using the eigenfaces method with unnormalized projection of region on the first eigenface.

Figure 5 A Photograph Taken at the Fifth Solvay Conference held in 1927

I got quite a few false positives and false negatives. Several faces were missed, and many non-face areas were highlighted. Some of this is a result of the local maxima search method I used to detect the high values (peaks) in the projection image. I simply searched the 97x97-pixel neighborhood of each value to see if it contained any higher value. As a result, several areas with no faces were highlighted simply because their values exceeded everything in their local neighborhoods. This method could be improved upon to reduce the false positives.
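
The peak test I describe can be sketched in plain C. This is a hypothetical CPU helper of my own, not the second OpenCL kernel from the download:

```c
/* Return 1 if proj[r*cols + c] is strictly greater than every other value
   in the surrounding (2*radius+1)^2 neighborhood. A 97x97 search window,
   as used in the article, corresponds to radius 48. */
static int isLocalPeak(const float *proj, int rows, int cols,
                       int r, int c, int radius)
{
    float v = proj[r * cols + c];
    for (int i = r - radius; i <= r + radius; i++) {
        if (i < 0 || i >= rows) continue;           /* clip to image bounds */
        for (int j = c - radius; j <= c + radius; j++) {
            if (j < 0 || j >= cols || (i == r && j == c)) continue;
            if (proj[i * cols + j] >= v) return 0;  /* a neighbor is as large */
        }
    }
    return 1;
}
```

Note that this test says nothing about the absolute size of the peak, which is exactly why low but locally dominant values produce false positives; adding a global threshold would be one simple improvement.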

Keep in mind that the quality of results depends on the quality of the example images, which includes things such as the image resolution, the lighting conditions and the variability in the intensities.

OpenCL on GPU vs. CPU

There are several parallel processing technologies on the CPU front, from SIMD vectorization, to explicit multithreading, to compiler pragmas in the OpenMP style parallelization of loops. To truly compare what’s faster or has higher throughput would be very difficult unless every possible combination of technologies was tested with extensively optimized implementations. That said, given how easy it is to switch from one device type to another in code, I decided to give it a try.

I ran the program on both device types, the CPU and the GPU, and from two different “platforms.” The system I ran this on was a gaming laptop with an Intel Core i7-8750H CPU @2.20 GHz with 16GB RAM with an NVIDIA GeForce GTX 1060 GPU. I could’ve chosen a CPU device on any of the three platforms, or a GPU device type on one from NVIDIA. Note that a GPU vendor might supply a CPU implementation along with its GPU implementation, in which case it’s less likely to be optimized.

As expected, the GPU performed better, with the CPU version taking 12 to 21 times longer, as shown in Figure 6. The profiling didn’t include any I/O but did include the time to copy the buffers from the system to the device memory. All times are in milliseconds.

Figure 6 Comparison of GPU and CPU Eigenface Search Implementation Times Using OpenCL

The kernels were written with the intent to learn and not to get the best possible performance, so take these comparisons with a grain of salt. Indeed, the performance tuning of kernels is an involved process, and an important step prior to deploying a program. Pipelining buffer copies, reducing the number of registers in kernels and keeping memory accesses coherent are some techniques employed to optimize the kernels.

The fact that the same OpenCL host program can simultaneously queue work for all the different devices available on a system means you can further increase throughput by employing all these devices. If you’re running face detection on video security footage, for example, you could divide each frame or subframe among the different device queues proportionate to their throughputs. The system memory and system buses are still shared and could become bottlenecks, so, unfortunately, you can’t scale the performance indefinitely by adding more devices.
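
The proportional split could be sketched as follows. This is my own illustration, with a hypothetical splitRows helper; throughputs would come from profiling each device:

```c
/* Split totalRows of a frame across devices in proportion to their
   measured throughputs (e.g., rows processed per second), giving any
   rounding remainder to the last device. */
static void splitRows(const double *throughput, int nDevices,
                      int totalRows, int *rowsPerDevice)
{
    double total = 0.0;
    for (int d = 0; d < nDevices; d++) total += throughput[d];
    int assigned = 0;
    for (int d = 0; d < nDevices - 1; d++) {
        rowsPerDevice[d] = (int)(totalRows * throughput[d] / total);
        assigned += rowsPerDevice[d];
    }
    rowsPerDevice[nDevices - 1] = totalRows - assigned;
}
```

Each device's share of rows would then be enqueued to that device's command queue.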

Eigen Anything

Though this article focused on detecting the faces of people, there’s nothing that prevents this approach from being applied to detecting other entities or objects. The trick is to collect a good number of example images, do some preprocessing to make them all the same size, register the location of some salient feature, and then calculate the eigenvectors. Once the eigenvectors have been created, the search code remains unchanged. So, if you’re developing code to detect hands for a nail polish-visualizing application, you could take a database of hands, generate the “eigenhands” and search an image for where the hands might be. With augmented reality applications on the rise, these types of applications are likely to appear.

Wrapping Up

For a programmer delving into GPU development for the first time, there could be a steep learning curve, but the rewards can be well worth the investment. GPUs have significant amounts of compute power that can be harnessed with programming languages such as OpenCL. This article serves as an introduction to GPU programming while demonstrating the popular real-world problem of finding faces in an image using the eigenfaces approach. This approach can be extended to other shapes and objects.

Kishore Mulchandani develops 3-D graphics applications and does performance improvements in parallel code running on CPUs and GPUs. High-performance computing, benchmark development for graphics, 3-D shape reconstruction from 2-D image data using computer vision, and virtual reality edutainment applications are some of his latest interests and endeavors. He can be reached at Kishore@vanishinglines.com.

Thanks to the following Microsoft technical experts for reviewing this article: James McCaffrey, Ravi Shankar Kolli
