Tuesday, January 22, 2013

C++ : Facial Detection - Viola-Jones

So I completely abandoned what I had earlier because it wasn't great; here's another attempt using the well-known Viola-Jones paper.

The algorithm is fairly straightforward.  Imagine you have a few hundred images of faces (24 px by 24 px) and a thousand times more images of non-faces (also 24 px by 24 px).  The program first tries to "learn" a classifier based on Haar-like rectangular features.  A "feature" is essentially all of the pixels in one area of the image added together, minus the pixels in another area added together.  The "weak classifier" takes this value and classifies it as "face" or "not face" based on a threshold and a polarity.  So, if your threshold is 50, and your polarity is "greater than," the weak classifier will classify an image as a face if the sum of one area minus the sum of another is "greater than 50."

Image by Viola-Jones, examples of Haar-like features
In the above image, you'd sum up the pixels in the white rectangle minus the pixels in the black rectangles for your feature value.  The weak classifier returns a 1 or a 0, depending on whether this value is greater than or less than (depending on your polarity) the threshold for that classifier (1 represents "face," 0 represents "not face").

You can see in general how it works; the eyebrows are typically darker than the cheeks, and the bridge of the nose is typically brighter than the two sides of the face around it.  The classifier is considered "weak" because it just so happens that no single classifier is very good at recognizing faces.  The "strong classifier" comes from combining a lot of weak classifiers together based on their weighted errors (a somewhat complex process that you'll want to look in the paper for).

In order to find the weak classifiers, the algorithm uses AdaBoost, which essentially brute-forces all possible feature/polarity/threshold combinations for the one that mis-classifies the smallest number of images.  This is sadly an O(AB) process, where A is the number of possible features (160,000+) and B is the number of face/non-face images (at least 20000).  (Of course, this process is also repeated for the number of weak classifiers you are finding, so it might as well be O(ABC), where C is the number of classifiers).  This is something like 3,200,000,000 operations, not counting K.

There are some optimizations one can make.  Finding the sum of the pixels in any rectangle can be done in constant time (O(1)) if you use an integral image.  That is, create an integral image where each pixel represents the sum of all the pixels to the top and left of itself.

Image by Chuan-Heng Hsiao and Amy Dai
The sum of the rectangle ABCD, above, is then just C - B - D + A.  This is given by the fact that the rectangle at point C is all of the area from the top-left corner to C.  Cutting off D and B (same principle) means we subtract the area represented by A twice; we compensate by adding A at the end.  Hence, it becomes very easy to find the sum of any rectangle.  (Constructing the integral image is another story -- you have to iterate through every pixel -- although since rectangles overlap there are some optimizations you can make there as well).

In a nutshell:

  1. Find a bunch of rectangular features that determine face-ly-ness of an image by brute-force en masse.
  2. Combine the features and their weighted errors to make a strong classifier.

Horribly slow for the aformentioned reason (on the magnitude of days to weeks to train).

There's an implementation in OpenCV for it but I might GitHub my own implementation.  Granted, I'm not willing to test it given the long computational time.  It's probably all wrong.

Images and papers credit given where due (Viola and Jones for Viola-Jones (surprise!) and Freund, Schapire for AdaBoost).

No comments:

Post a Comment