Tuesday, January 15, 2013

C++ : Facial Detection

Making a facial detection engine for a freelancing job.  Very interesting stuff.  I'll try to make this as readable and not-technical as possible, but keep in mind it's still fairly involved.

The following is based on the paper "Scale Invariant Face Recognition Method Using Spectral Features of Log-Polar Image" (which I'll refer to as the Hiroshima paper from now on, because of the URL).

Problem Statement:
Given a sample of huge yearbook photos (I mean 5184 px by 3456 px), detect the location of the face and then crop an area around it for the yearbook.  Also provided are finished samples that can be used for learning.

Criteria:
  • The program should be reasonably fast, as there are very many pictures to process.
  • Accuracy should be paramount even at the sacrifice of speed.
  • Each photo has one face and one face only; therefore the program only needs to identify where the face is.
  • The program should be able to rotate the image if necessary.
  • Lighting conditions are generally standardized; the program is not responsible for editing apart from cropping and rotating.

Algorithm:

The Hiroshima paper describes an algorithm that's actually very straightforward.

  1. Convert each image into Log-Polar form for better representation.
    1. For each output pixel (x, y), sample the original image at the polar coordinate (exp(x), y); in other words, the x-axis holds the (log of the) radius while the y-axis holds the angle.  This ensures that any change of scale in the original image shows up only as a shift along the x-axis, while any rotation shows up only as a shift along the y-axis (a sketch of the mapping follows after this list).
  2. Process each Log-Polar image for spectral features in the horizontal direction.
    1. In essence, you take the autocorrelation of each row, treating each pixel as a cross-correlation with every other pixel in the row.  This is what buys scale-invariance: the autocorrelation depends only on the relative offsets between pixels, so shifting a row sideways (which is exactly what a change of scale becomes after the Log-Polar transform) leaves it unchanged.
      Note: This is a property of autocorrelations, but not something I really understand.
  3. Construct a "mean face": the average of many grayscale photos of faces aligned on top of each other.  Convert it to Log-Polar and extract its spectral features with the same process.
  4. Get the total covariance between the spectral features of the mean face and those of a sliding window of the same size moved over the image.
    1. Get the size of the mean face window.
    2. Move it around to every possible position inside of the original image.  Crop this area and get the spectral features via the process above.
    3. Get the covariance of each pixel between the two feature images and total it up (a rough sketch of this scoring follows after this list).  This step is something I came up with, with no real knowledge of statistics, and it's probably completely bogus.  It is not outlined in the Hiroshima paper, which opted for Linear Discriminant Analysis instead.  That makes sense, since the paper is more interested in detecting whether there's a face at all than in where the face actually is.
  5. Lowest covariance means best match.  If necessary, use a mean-shift algorithm to converge to the best point.
    1. Again, this is completely bogus and should be taken with a grain of salt.  I don't know if one is supposed to add covariances like that.
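
To make Step 1 concrete, here's a minimal sketch of the Log-Polar mapping in C++.  The Image struct and toLogPolar name are just scaffolding made up for this post, not code from the Hiroshima paper or my actual implementation, and a real version would want proper interpolation instead of the nearest-neighbour sampling below.

    #include <cmath>
    #include <vector>

    // Hypothetical grayscale image buffer, row-major.
    struct Image {
        int width, height;
        std::vector<float> pixels;
        float at(int x, int y) const { return pixels[y * width + x]; }
    };

    // Resample the source so the output x-axis is log(radius) and the
    // y-axis is angle, both measured around the image centre.  A change
    // of scale in the source then shows up as a shift along x, and a
    // rotation as a cyclic shift along y.
    Image toLogPolar(const Image& src, int radialBins, int angularBins) {
        Image dst{radialBins, angularBins,
                  std::vector<float>(radialBins * angularBins, 0.0f)};
        const float pi = 3.14159265f;
        const float cx = src.width / 2.0f, cy = src.height / 2.0f;
        const float logMax = std::log(std::hypot(cx, cy));

        for (int a = 0; a < angularBins; ++a) {
            const float theta = 2.0f * pi * a / angularBins;
            for (int r = 0; r < radialBins; ++r) {
                // Radius grows exponentially along the x-axis.
                const float radius = std::exp(logMax * r / (radialBins - 1));
                const int sx = static_cast<int>(cx + radius * std::cos(theta));
                const int sy = static_cast<int>(cy + radius * std::sin(theta));
                if (sx >= 0 && sx < src.width && sy >= 0 && sy < src.height)
                    dst.pixels[a * radialBins + r] = src.at(sx, sy);
            }
        }
        return dst;
    }

And a sketch of the scoring from Steps 4 and 5: sum up the per-pixel covariance terms between the mean face's spectral features and the current window's.  As said above, this part is my own ad-hoc idea, so treat it as an illustration of the loop rather than a sound statistic.

    #include <cstddef>
    #include <vector>

    // Total covariance between two equally-sized feature maps, flattened
    // into vectors.  Lower totals are treated as better matches in the
    // current (unvetted) scheme.
    double covarianceScore(const std::vector<float>& a,
                           const std::vector<float>& b) {
        if (a.empty() || a.size() != b.size()) return 0.0;  // sketch-level guard
        const std::size_t n = a.size();
        double meanA = 0.0, meanB = 0.0;
        for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            total += (a[i] - meanA) * (b[i] - meanB);
        return total;
    }
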
Problems:
  1. Computing the spectral features of an image is O(n^2) the naive way, which is already slow.  Running it for every possible window position in Step 4 is therefore completely insane, and the whole algorithm crawls.
  2. Not actually sure whether the covariances help determine better matches; it seems to be hit-and-miss in my results.
  3. Searching for the lowest covariance without looking at the surrounding neighborhood is very bad, especially if there's a huge false positive off in the corner of the picture or something.  The mean-shift helps to compensate but not completely.
  4. Not rotation-invariant.
Optimizations:
  1. Instead of finding the autocorrelations for Step 4 the naive way (which is O(n^2)), one can use a property of Fourier transforms to speed them up to O(n log n).  The optimization is based on the Wiener-Khinchin theorem, which states that the autocorrelation of a signal is the inverse Fourier transform of its power spectrum, i.e. of the absolute square of the signal's Fourier transform.  The wiki article explains it better; a sketch follows below.
  2. Scale the huge photos down to something manageable before searching, then map the detected location back up to the full-resolution originals.
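
Here's a hedged sketch of that FFT shortcut (the function names are mine, not the paper's; in practice a real FFT library would be both faster and less error-prone).  It follows Wiener-Khinchin directly: forward-transform the zero-padded row, take the power spectrum, and inverse-transform to get the autocorrelation.

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cd = std::complex<double>;

    // Recursive radix-2 Cooley-Tukey FFT; a.size() must be a power of two.
    void fft(std::vector<cd>& a, bool invert) {
        const std::size_t n = a.size();
        if (n == 1) return;
        std::vector<cd> even(n / 2), odd(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i) {
            even[i] = a[2 * i];
            odd[i] = a[2 * i + 1];
        }
        fft(even, invert);
        fft(odd, invert);
        const double ang = 2.0 * 3.14159265358979 / n * (invert ? -1.0 : 1.0);
        for (std::size_t i = 0; i < n / 2; ++i) {
            const cd w = std::polar(1.0, ang * static_cast<double>(i));
            a[i] = even[i] + w * odd[i];
            a[i + n / 2] = even[i] - w * odd[i];
            if (invert) { a[i] /= 2.0; a[i + n / 2] /= 2.0; }
        }
    }

    // Autocorrelation of one Log-Polar row via Wiener-Khinchin:
    // autocorr(x) = inverse FFT of |FFT(x)|^2 (the power spectrum).
    std::vector<double> autocorrelate(const std::vector<double>& row) {
        std::size_t n = 1;
        while (n < 2 * row.size()) n <<= 1;   // zero-pad: avoids circular wrap-around
        std::vector<cd> buf(n);
        for (std::size_t i = 0; i < row.size(); ++i) buf[i] = row[i];

        fft(buf, false);
        for (cd& v : buf) v *= std::conj(v);  // power spectrum
        fft(buf, true);                       // back to the spatial domain

        std::vector<double> out(row.size());
        for (std::size_t i = 0; i < row.size(); ++i) out[i] = buf[i].real();
        return out;                           // out[k] ~ sum_j x[j] * x[j + k]
    }

Comparing this against the naive O(n^2) autocorrelation on a few test rows is the obvious sanity check for the kind of mismatch mentioned in the results below.
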
Results:

Haven't had a chance to test extensively, but it got one sample spot-on and hit the eyebrows on another.  My implementation of the Wiener-Khinchin optimization is off, as I'm not getting the same results.  Needs work.

I'd post pictures if I were allowed to (I don't have permission from the client).  Might end up dumping this approach and using a different algorithm instead.

Sorry, I'm really geeking out today.

2 comments:

  1. You can probably guess (January 17, 2013 at 6:57 PM)

    Yep. Just going to sit here, cry for my future, and wish I were doing something half as neat.

    1. What, no! Facial recognition and programming are two of the most boring things ever; I'd never wish this upon anyone, I'm just trapped in my hobby. I think the stuff you do is neat. Not even trying to be nice, I'm super-jealous of your art.

      Hope you're doing well, btw.
