Tuesday, January 22, 2013

C++ : Facial Detection - Viola-Jones

So I completely abandoned what I had earlier because it wasn't great; here's another attempt using the well-known Viola-Jones paper.

The algorithm is fairly straightforward.  Imagine you have a few hundred images of faces (24 px by 24 px) and a thousand times more images of non-faces (also 24 px by 24 px).  The program first tries to "learn" a classifier based on Haar-like rectangular features.  A "feature" is essentially all of the pixels in one area of the image added together, minus the pixels in another area added together.  The "weak classifier" takes this value and classifies it as "face" or "not face" based on a threshold and a polarity.  So, if your threshold is 50, and your polarity is "greater than," the weak classifier will classify an image as a face if the sum of one area minus the sum of another is "greater than 50."

Image by Viola-Jones, examples of Haar-like features
In the above image, you'd sum up the pixels in the white rectangle minus the pixels in the black rectangles for your feature value.  The weak classifier returns a 1 or a 0, depending on whether this value is greater than or less than (depending on your polarity) the threshold for that classifier (1 represents "face," 0 represents "not face").

You can see in general how it works; the eyebrows are typically darker than the cheeks, and the bridge of the nose is typically brighter than the two sides of the face around it.  The classifier is considered "weak" because it just so happens that no single classifier is very good at recognizing faces.  The "strong classifier" comes from combining a lot of weak classifiers together based on their weighted errors (a somewhat complex process that you'll want to look in the paper for).
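
To make that concrete, here's a rough Python sketch of a single two-rectangle feature and its weak classifier (the function names are my own, and I'm summing pixels naively here rather than with the integral-image trick):

```python
def pixel_sum(img, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y)."""
    return sum(img[r][c] for r in range(y, y + h) for c in range(x, x + w))

def two_rect_feature(img, x, y, w, h):
    """White half minus black half: a vertical two-rectangle feature."""
    half = w // 2
    return pixel_sum(img, x, y, half, h) - pixel_sum(img, x + half, y, half, h)

def weak_classify(feature_value, threshold, polarity):
    """1 ("face") or 0 ("not face"), depending on which side of the threshold
    the feature value lands; polarity = -1 flips the inequality."""
    return 1 if polarity * feature_value > polarity * threshold else 0
```

So with threshold 50 and polarity +1, an image whose white-minus-black sum comes out to 800 gets classified as a face.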

In order to find the weak classifiers, the algorithm uses AdaBoost, which essentially brute-forces all possible feature/polarity/threshold combinations for the one that misclassifies the smallest number of images.  This is sadly an O(AB) process, where A is the number of possible features (160,000+) and B is the number of face/non-face images (at least 20,000).  (Of course, this process is also repeated for the number of weak classifiers you are finding, so it might as well be O(ABC), where C is the number of classifiers.)  That's something like 3,200,000,000 operations per classifier, not counting C.
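
To see where the cost comes from, here's a toy Python sketch of the inner search for a single feature; the weights are AdaBoost's per-image weights, which it re-adjusts each boosting round (names and structure are mine, not the paper's):

```python
def best_stump(values, labels, weights):
    """Brute-force the best (threshold, polarity) pair for one feature.
    values[i] is the feature value on image i, labels[i] is 1 for a face and
    0 for a non-face, and weights[i] is AdaBoost's current weight for that
    image.  Run once per feature, per boosting round: hence O(ABC)."""
    best = (float('inf'), None, None)   # (weighted error, threshold, polarity)
    for thr in values:                  # candidate thresholds: the observed values
        for pol in (1, -1):             # polarity flips the inequality
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if (1 if pol * v > pol * thr else 0) != y)
            if err < best[0]:
                best = (err, thr, pol)
    return best
```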

There are some optimizations one can make.  Finding the sum of the pixels in any rectangle can be done in constant time (O(1)) if you use an integral image.  That is, create an integral image where each pixel represents the sum of all the pixels to the top and left of itself.

Image by Chuan-Heng Hsiao and Amy Dai
The sum of the rectangle ABCD, above, is then just C - B - D + A.  This follows from the fact that the value at point C is the sum of all the area from the top-left corner to C.  Cutting off D and B (same principle) means we subtract the area represented by A twice; we compensate by adding A back at the end.  Hence, it becomes very easy to find the sum of any rectangle.  (Constructing the integral image still means iterating through every pixel, but a single pass suffices, since each entry follows from its already-computed neighbors: ii(x, y) = i(x, y) + ii(x-1, y) + ii(x, y-1) - ii(x-1, y-1).)
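
Both halves fit in a few lines of Python (the zero-padding of the table is my own choice, to avoid edge cases in the recurrence):

```python
def integral_image(img):
    """Summed-area table in one pass:
    ii[y][x] = img[y][x] + ii[y-1][x] + ii[y][x-1] - ii[y-1][x-1],
    padded with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """O(1) sum of the rectangle with top-left (x, y): the C - B - D + A
    identity from the figure."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```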

In a nutshell:

  1. Find a bunch of rectangular features that determine face-ly-ness of an image by brute-force en masse.
  2. Combine the features and their weighted errors to make a strong classifier.

Horribly slow for the aforementioned reason (on the order of days to weeks to train).

OpenCV ships an implementation of it, but I might put my own up on GitHub.  Granted, I'm not willing to test it given the long training time.  It's probably all wrong.

Image and paper credits given where due (Viola and Jones for Viola-Jones (surprise!), and Freund and Schapire for AdaBoost).

Sunday, January 20, 2013

Verizon App Challenge

A few friends and I are doing this Verizon Innovative App Challenge thing, where we think of something communities need and pitch the idea for an app, or something.  So I made this video.  Enjoy, I think.

Tuesday, January 15, 2013

C++ : Facial Detection

Making a facial detection engine for a freelancing job.  Very interesting stuff.  I'll try to make this as readable and not-technical as possible, but keep in mind it's still fairly involved.

The following is based on the paper "Scale Invariant Face Recognition Method Using Spectral Features of Log-Polar Image" (which I'll refer to as the Hiroshima paper from now on, because of the URL).

Problem Statement:
Given a sample of huge yearbook photos (I mean 5184 px by 3456 px), detect the location of the face and then crop an area around it for the yearbook.  Also provided are finished samples that can be used for learning.

  • The program should be reasonably fast, as there are very many pictures to process.
  • Accuracy should be paramount even at the sacrifice of speed.
  • Each photo has one face and one face only; therefore the program only needs to identify where the face is.
  • The program should be able to rotate the image if necessary.
  • Lighting conditions are generally standardized; the program is not responsible for editing apart from cropping and rotating.


The Hiroshima paper describes an algorithm that's actually very straightforward.

  1. Convert each image into Log-Polar form for better representation.
    1. Take each pixel and find the corresponding polar coordinate at (exp(radius), angle).  The radius should be the x-axis while the angle should be the y-axis.  This ensures that any changes in scaling with the original image will only be represented along the x-axis, while any changes in rotation will go along the y-axis.
  2. Process each Log-Polar image for spectral features in the horizontal direction.
    1. In essence, you're looking at the autocorrelation of each row, processing each pixel as a cross-correlation with every other pixel.  This provides scale-invariance: a change of scale in the original image becomes a horizontal shift in the log-polar image, and autocorrelation is invariant to such shifts.
      Note: This is a property of autocorrelations, but not something I really understand.
  3. Construct a "mean face"; that is, many photos of faces aligned on top of each other in grayscale.  Convert to Log-Polar and spectral features with the same process.
  4. Get the total covariance between the spectral features of the mean face and a sliding window of the same size within the image.
    1. Get the size of the mean face window.
    2. Move it around to every possible position inside of the original image.  Crop this area and get the spectral features via the process above.
    3. Get the covariance of each pixel between the images, totalling them up.  This step is something I came up with, with no real knowledge of statistics, and it's probably completely bogus.  It is not outlined in the Hiroshima paper, which opted for Linear Discriminant Analysis instead.  Makes sense, since the paper is interested more in detecting whether or not there's a face and less of where the face actually is.
  5. Lowest covariance means best match.  If necessary, use a mean-shift algorithm to converge to the best point.
    1. Again, this is completely bogus and should be taken with a grain of salt.  I don't know if one is supposed to add covariances like that.
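
Step 1's log-polar mapping can be sketched like so in Python (nearest-neighbour sampling; the output size and zero fill value are my own choices):

```python
import math

def log_polar(img, rows, cols):
    """Log-polar resampling about the image centre.  The column index maps to
    log-radius, so scaling the input becomes a horizontal shift; the row index
    maps to angle, so rotating the input becomes a vertical shift.
    Samples falling outside the image are filled with 0."""
    h, w = len(img), len(img[0])
    cy, cx = h / 2.0, w / 2.0
    max_log_r = math.log(max(math.hypot(cx, cy), 1.0))
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):                       # angle axis (y)
        theta = 2.0 * math.pi * i / rows
        for j in range(cols):                   # log-radius axis (x)
            r = math.exp(max_log_r * (j + 1) / cols)
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            if 0 <= x < w and 0 <= y < h:
                out[i][j] = img[y][x]
    return out
```
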
Issues:
  1. Finding the spectral image of an image is an O(n^2) process the naive way.  It's very slow.  Hence, running it for every possible window in Step 4 is completely insane, and the algorithm is very slow.
  2. Not actually sure if covariances help to determine better matches or not; seems to be hit-and-miss with my results.
  3. Searching for the lowest covariance without looking at the surrounding neighborhood is very bad, especially if there's a huge false positive off in the corner of the picture or something.  The mean-shift helps to compensate, but not completely.
  4. Not rotation-invariant.
Optimizations:
  1. Instead of finding autocorrelations in Step 4 the naive way (which is O(n^2)), one can use a property of Fourier transforms to speed it up to O(n log n).  The optimization is based on the Wiener-Khinchin theorem, which states that the autocorrelation of a signal is the inverse Fourier transform of its power spectrum (the absolute square of its Fourier transform).  The wiki article explains it better.
  2. Scale down your images to something manageable first; you can map the detected location back up to the full resolution afterwards.
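
The Wiener-Khinchin shortcut is easy to sanity-check in Python.  This sketch uses a naive O(n^2) DFT purely for illustration; a real implementation would substitute an FFT to actually get the O(n log n) speedup:

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def autocorr_wk(x):
    """Circular autocorrelation via Wiener-Khinchin: the inverse transform
    of the power spectrum |DFT(x)|^2."""
    power = [abs(X) ** 2 for X in dft(x)]
    return [c.real for c in idft(power)]

def autocorr_naive(x):
    """Direct circular autocorrelation, O(n^2), for comparison."""
    n = len(x)
    return [sum(x[k] * x[(k + lag) % n] for k in range(n)) for lag in range(n)]
```

The two functions should agree up to floating-point error on any real-valued input.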

Haven't had a chance to test extensively, but it got one sample spot-on and it hit the eyebrows of another.  My implementation of the Wiener-Khinchin theorem is off as I'm not getting the same results.  Needs work.

I'd post pictures if I was allowed to (don't have permission from the client).  Might end up dumping this one and using a different algorithm instead.

Sorry, I'm really geeking out today.

Tuesday, January 8, 2013

Python : Awful Jokes Bot

Made a Twitter app to generate and post awful jokes via Python.


Was a pretty interesting experiment, thought I'd share how it works:

The bot has a giant nouns.txt file stored locally, where it picks out two random nouns.  The two nouns are translated into ARPABET using a lookup table (O(n) time; a binary search can cut it down to O(log n) but I'm lazy).
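
For what it's worth, the binary-search version is only a few lines with Python's bisect module (the two-entry table here is a toy stand-in of my own, not real dictionary data):

```python
import bisect

# Toy stand-in for a sorted ARPABET lookup table; real entries would come
# from a pronunciation dictionary such as CMUdict.
TABLE = sorted([
    ("instrument", ["IH", "N", "S", "T", "R", "AH", "M", "AH", "N", "T"]),
    ("napkin", ["N", "AE", "P", "K", "IH", "N"]),
])
WORDS = [w for w, _ in TABLE]

def phonemes(word):
    """O(log n) lookup via binary search instead of a linear scan."""
    i = bisect.bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return TABLE[i][1]
    return None
```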

The first and last few phonemes are matched between the two words (the prefix of one and the suffix of the other), and if there's a match, combine the two words.  E.g., 'napkin' and 'instrument' make 'napkininstrument.'

In order to remove the shared phonemes, the program estimates how many consonants to remove by counting consonants in the phonemes.  It's very sketchy and a syllabic approach would be much better.

You end up getting 'napkinstrument' if all goes well.
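
The overlap-and-splice step, at the phoneme level, looks roughly like this in Python (this skips the messy consonant-counting needed to get back from phonemes to letters):

```python
def overlap(a, b):
    """Length of the longest suffix of phoneme list a that is also a
    prefix of phoneme list b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge(a, b):
    """Splice the two phoneme sequences at their shared sounds, or return
    None if they share nothing."""
    k = overlap(a, b)
    return a + b[k:] if k else None
```

With 'napkin' as roughly N AE P K IH N and 'instrument' as IH N S T R AH M AH N T, the shared IH N gives the splice behind 'napkinstrument.'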

Now, to generate the joke (yes, it goes backwards), the program looks up the Wikipedia articles of the two words and tries to find the most frequent nouns on the page (which will hopefully be keywords).  Oftentimes, the word will be the same (one of the most common words on the "napkin" wiki page happens to be "napkin").

In this case, the program generated "napkin" and "castanet."

Putting it together:

What do you get when you cross a castanet and a napkin?  A napkinstrument!

The darker side:

The bot sucks.  A lot.  Sometimes it just happens to work very well, but there are a bunch of issues with it.

  1. Skimming Wikipedia for keywords is not reliable.  Oftentimes you get words that aren't even remotely related.  It might be interesting, instead of using an arbitrary threshold like I do now (take the five most frequent nouns in the article and pick one randomly), to take the frequencies of the nouns into account and use the more frequent ones more often.
  2. Word truncation based on consonants in phonemes is not a good idea, especially because of blends.  This should be intuitive enough; everything about it smells of hack.  A better way would be to count syllables, or somehow convert phonemes into words without relying on a dictionary.
  3. Certain words appear too often.  The bot has a "may" fetish, for example, because "may" is both a noun and a modal verb, and so when it skims Wiki articles, it tends to choose it a lot.
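
The frequency-weighted pick from issue 1 is nearly a one-liner with random.choices; a sketch, where noun_counts is assumed to come from the article skim:

```python
import random

def pick_keyword(noun_counts, rng=random):
    """Frequency-weighted choice: instead of picking uniformly among the top
    five nouns, sample every candidate noun with probability proportional to
    how often it appears in the article."""
    nouns = list(noun_counts)
    weights = [noun_counts[n] for n in nouns]
    return rng.choices(nouns, weights=weights, k=1)[0]
```
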
That's all, folks.  The bot currently uses someone else's code for the Wiki search; if I ever replace that I'll GitHub the entire thing and you can generate awful jokes right on your silly machines.