The Viola-Jones Object Detection Framework

The Viola-Jones Object Detection Framework is an efficient supervised machine learning approach to object detection. It locates objects by searching for distinctive visual features; in the algorithm, these features are large differences in pixel intensity that correspond to lines and edges.

The algorithm is particularly fast because it uses an integral image to calculate features (see stage 1). Its speed allows it to run in real time without advanced hardware, making it an ideal candidate for object detection on the Ethoscope platform [2].

Training the classifier

The classifier is trained on a bank of positive images (images containing the object) and negative images (images containing only the background). Training involves three stages, described below [2].

1. Detecting distinguishing features in an object

Features are identified using a feature window - a sliding window that passes over the image to test for regions of differing pixel intensity. Example two-, three- and four-feature windows are shown below (Figure 1.a). For each window, the sum of pixel intensities within the white region is subtracted from that within the grey region.
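As a rough illustration (not from the source), a two-feature window of this kind can be evaluated directly with NumPy. The function name and the left/right split are assumptions for the sketch:

```python
import numpy as np

def two_feature_value(img, x, y, w, h):
    """Evaluate a vertical two-feature window whose top-left pixel is
    (x, y). The w x h window is split into a white left half and a grey
    right half; the value is the grey sum minus the white sum."""
    half = w // 2
    white = img[y:y + h, x:x + half].sum()     # left (white) region
    grey = img[y:y + h, x + half:x + w].sum()  # right (grey) region
    return grey - white
```

A strong vertical edge inside the window gives a value of large magnitude, while a flat region gives a value near zero; this is what makes the feature a useful edge detector.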

The Viola-Jones algorithm generates all possible two-, three- and four-feature windows and applies them to each of the positive and negative images. This requires a very large number of calculations - a 24x24 image alone requires over 180,000 individual feature calculations. To keep this tractable, the algorithm uses an Integral Image that makes each region sum cheap to compute.

The Integral Image is a transformation of the original image in which each pixel's value is the sum of the intensity values above and to the left of it (inclusive). I.e. for a pixel at location (x, y):

ii(x, y) = \sum_{x' \leq x,\; y' \leq y} i(x', y')

where ii is the integral image and i is the original image.

This vastly simplifies finding the sum of intensities within a region: instead of calculating a series of column sums, only the four integral-image values at the corners of the rectangular region need to be read (Figure 1.b) [2].

Figure 1.a. Example two-, three- and four-feature windows. Windows A and B are used to detect the presence of vertical and horizontal edges. Window C is used to detect vertical lines while Window D is used to detect diagonal lines.

Figure 1.b. Demonstrating the efficiency of the integral image. The sum of intensities in area D can be found using only the corner pixels bounding the region in the integral image (positions 1, 2, 3 & 4).
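A minimal sketch of both ideas in NumPy, assuming an image stored as a 2-D array indexed [row, column]; the function names are illustrative:

```python
import numpy as np

def integral_image(img):
    """ii(x, y): sum of all pixels above and to the left of (x, y),
    inclusive - just a pair of cumulative sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of intensities in the w x h rectangle with top-left pixel
    (x, y), using only four corner look-ups (positions 1-4 in
    Figure 1.b) instead of summing every pixel."""
    top_left  = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    top_right = ii[y - 1, x + w - 1] if y > 0 else 0
    bot_left  = ii[y + h - 1, x - 1] if x > 0 else 0
    bot_right = ii[y + h - 1, x + w - 1]
    return bot_right - top_right - bot_left + top_left

# Self-check on a random 24x24 image: the four-corner formula agrees
# with a direct pixel sum over the same region.
img = np.random.randint(0, 256, (24, 24))
ii = integral_image(img)
assert rect_sum(ii, 3, 4, 5, 6) == img[4:10, 3:8].sum()
```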

2. Construction of weak classifiers from identified features

This stage searches the features calculated in the previous step and selects those that are most important for object classification. This is achieved by finding, for each feature, the threshold that best separates the positive and negative images. The best-performing features can then be used as ‘weak’ classifiers [2].
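The source does not spell out the search procedure, but a common formulation is an AdaBoost-style decision stump: try candidate thresholds for one feature and keep the one with the lowest weighted error. A sketch under those assumptions:

```python
import numpy as np

def best_stump(feature_values, labels, weights):
    """Find the threshold and polarity that best separate positive
    (label +1) from negative (label -1) training images for one feature.

    feature_values: this feature's value on every training image.
    weights: per-image weights (uniform initially; boosting re-weights
    them between rounds). Returns (threshold, polarity, error)."""
    best = (None, 1, np.inf)
    for threshold in np.unique(feature_values):
        for polarity in (1, -1):
            # Predict positive when polarity * value < polarity * threshold.
            pred = np.where(polarity * feature_values < polarity * threshold, 1, -1)
            error = weights[pred != labels].sum()  # weighted misclassification
            if error < best[2]:
                best = (threshold, polarity, error)
    return best
```

Any stump with error noticeably below 0.5 is useful: individually it is only a ‘weak’ classifier, but boosting combines many of them into a strong one.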

3. Creating a cascade of classifiers

The strength of classification is greatly improved when multiple weak classifiers are used in conjunction - i.e. to classify an image, the conditions of multiple classifiers must be met. A form of decision tree known as a cascade is constructed that strings together multiple classifiers: a candidate region must pass every stage to be classified as the object [2].
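A sketch of how such a cascade might be evaluated on one candidate window; the data layout and names here are assumptions, but the early-rejection behaviour is the defining property:

```python
def cascade_classify(window, stages):
    """Pass one candidate window through the cascade.

    stages: list of (weak_classifiers, stage_threshold) pairs, where
    weak_classifiers is a list of (classifier, alpha) pairs and each
    classifier maps a window to 1 (object-like) or 0 (background)."""
    for weak_classifiers, stage_threshold in stages:
        score = sum(alpha * clf(window) for clf, alpha in weak_classifiers)
        if score < stage_threshold:
            return False  # rejected early: most background windows stop here
    return True           # accepted by every stage: report a detection
```

Because the early stages contain only a handful of cheap features, the vast majority of background windows are discarded after very little work, which is a large part of the framework's speed.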

Applying the framework to a test image

Once the cascade is trained and constructed, it can be applied to an image using a sliding 24x24 window. At each position, the window is tested against the combinations of features dictated by the cascade [2].
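In practice the framework is rarely re-implemented from scratch; OpenCV ships trained Haar cascades together with a detector that handles the window scan (and rescaling, so objects larger than 24x24 are still found). A sketch using that API, where frame.png and the bundled face model are stand-ins for an actual image and cascade:

```python
import cv2

# Load a trained cascade; an application-specific cascade trained as
# above would be loaded from its own XML file in the same way.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.png")
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the detection window over the image at multiple scales.
for (x, y, w, h) in cascade.detectMultiScale(grey, scaleFactor=1.1,
                                             minNeighbors=5):
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```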
