The camera is a Sony EVI-D30 with a Kenko 0.42X Wide Conversion lens. The camera provides serial control of pan, tilt, and zoom. For the purposes of the foveation feature of the live video page we want to be able to orient the camera such that the selected object is at the center of the frame. This allows us to zoom the camera and obtain a tight shot of the object of interest.
The task is to create a mapping between image coordinates (x,y) and pan-tilt settings (P,T), so that given an image location (x,y) we can drive the camera to the configuration (P,T) and obtain an image where the image patch previously at (x,y) is now at (320,240), the center of the frame.
The first experiment was to gather data about this mapping to determine its character. This could have been done by hand in this way:
Step 2 was automated using normalized correlation tracking. A location was picked in step 1 and the image patch around that location, I, was saved. The camera was moved a small amount and a new image patch, J, was saved. Normalized correlation was used to determine the position in J that most resembled the original patch I. If this location was closer to the center of the image then it became the new target. The process continued until the target was at the center of the image (a success), or an iteration bound was reached (a failure).
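The matching step can be illustrated with a minimal sketch. This is not the normcorr source; it assumes the zero-mean form of normalized correlation and an exhaustive search, and the names `normalized_correlation` and `best_match` are made up for this example.

```python
import math

def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equal-sized patches (lists of rows)."""
    va = [v for row in a for v in row]
    vb = [v for row in b for v in row]
    ma = sum(va) / len(va)
    mb = sum(vb) / len(vb)
    num = sum((p - ma) * (q - mb) for p, q in zip(va, vb))
    den = math.sqrt(sum((p - ma) ** 2 for p in va) * sum((q - mb) ** 2 for q in vb))
    # A perfectly flat patch has no structure to correlate; score it as 0.
    return num / den if den > 0 else 0.0

def best_match(template, image):
    """Slide the template over the image; return ((row, col), score) of the best window."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_pos, best_score = None, -2.0  # scores lie in [-1, 1]
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_correlation(template, window)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score

# Toy data (invented): a 2x2 pattern embedded in an otherwise flat image.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 9, 1, 0],
    [0, 0, 0, 2, 7, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# The saved patch, seen under a different gain and offset (2*v + 1).
template = [[19, 3], [5, 15]]
pos, score = best_match(template, image)
```

Because the score is normalized, the patch is still found at row 2, column 3 even though its brightness and contrast differ from the saved copy.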
A small optimization of this procedure is to start off with an approximately correct mapping. If small steps are taken then the inaccuracies in this "initial guess" mapping won't harm us, and the algorithm will eventually converge. The foveation code is in this shell script, and here is the main loop of normcorr.
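Independently of the linked sources, the claim that small steps tolerate an inaccurate initial guess can be checked with a toy one-dimensional simulation. Everything here is an assumption made for illustration: the function name `foveate`, the purely linear camera model, and the numbers.

```python
def foveate(x0, true_gain, guess_gain, step=0.25, tol=0.5, max_iters=100):
    """Simulate driving a target at pixel offset x0 (from image center) to the center.

    true_gain:  actual pan units per pixel (unknown to the controller)
    guess_gain: the controller's approximate pan units per pixel
    step:       fraction of the estimated correction applied per iteration
    Returns (converged, iterations_used, final_offset).
    """
    offset = x0
    for i in range(max_iters):
        if abs(offset) <= tol:
            return True, i, offset
        pan_delta = step * guess_gain * offset  # small commanded move
        offset -= pan_delta / true_gain         # how far the target really shifts
    return abs(offset) <= tol, max_iters, offset

# The guess is 50% too large, but small steps still converge.
ok_small, iters_small, final_small = foveate(200.0, true_gain=1.0, guess_gain=1.5)

# A guess 3x too large with full-size steps overshoots further every iteration.
ok_big, iters_big, final_big = foveate(200.0, true_gain=1.0, guess_gain=3.0, step=1.0)
```

In this model each iteration multiplies the offset by `1 - step * guess_gain / true_gain`, so the loop converges whenever that factor has magnitude below one. Taking small steps keeps it there even for a poor guess.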
Step 1 was automated by picking (x,y) values at random. This means that the computer may choose to foveate regions where there is nothing to track. This will result in more failures, but that just means that we'll have to let the process run longer.
The following image shows the data gathered by the above method over the course of a weekend. Green rectangles represent good data. Yellow rectangles represent failures identified by the tracker. Red rectangles represent mistakes (supposedly valid tracker output that is in error).
By plotting output parameters as a function of input parameters we can get an idea of what kind of mapping we're going to have to approximate from the shape of the resulting plots. The plots below show the surprising result that P is a linear function of x. Similarly, T seems to be a linear function of y:


Since the mapping is linear it looks like this:
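With coefficient names a, b, c, d chosen here purely for illustration (the original figure may use different symbols), a mapping in which P depends linearly on x alone and T on y alone can be written as:

```latex
\begin{aligned}
P &= a\,x + b \\
T &= c\,y + d
\end{aligned}
```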

If we expand the equations a bit by adding in zeros to create some symmetry we get this system of equations:
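As a sketch under assumed notation, write the mapping as P = a x + b and T = c y + d, where the coefficient names are illustrative. Padding with zeros then gives, for a single sample:

```latex
\begin{bmatrix} P \\ T \end{bmatrix}
=
\begin{bmatrix} x & 1 & 0 & 0 \\ 0 & 0 & y & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
```

Stacking these two rows for every collected sample yields one tall system, written here as A lambda = p, where lambda = (a, b, c, d) is the parameter vector, A holds the image coordinates, and p holds the measured pan-tilt values.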


The above equation is a simple linear matrix equation. There are many ways to solve such systems, but since we don't expect numerical stability to be a problem, we can safely use the pseudo-inverse method:
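For a tall system A lambda = p whose matrix A has full column rank (the symbols A, lambda, and p are assumed names for the coordinate matrix, the parameter vector, and the measured pan-tilt values), the standard pseudo-inverse is:

```latex
A^{+} = (A^{\top} A)^{-1} A^{\top}
```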


The solution is then
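In the assumed notation where the stacked system is A lambda = p, the least-squares estimate of the parameters takes the usual form:

```latex
\lambda = A^{+}\,\mathbf{p} = (A^{\top} A)^{-1} A^{\top}\,\mathbf{p}
```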

Note: I've recently changed offices. Because the camera is now in a slightly more zoomed state by default, I had to re-calibrate the camera. Because I knew that the model was linear, I only needed to collect a few points by hand and re-solve for lambda.
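A re-calibration of this kind can be sketched in a few lines. The sample points below are invented for illustration, not measurements from the camera, and the names `fit_line` and `calibrate` are hypothetical. The sketch assumes the model P = a x + b, T = c y + d. Because P depends only on x and T only on y, the least-squares problem splits into two independent straight-line fits, which give the same answer as the pseudo-inverse of the full zero-padded system.

```python
def fit_line(us, vs):
    """Least-squares fit of v = slope * u + intercept."""
    n = len(us)
    mu = sum(us) / n
    mv = sum(vs) / n
    suu = sum((u - mu) ** 2 for u in us)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    slope = suv / suu  # requires at least two distinct u values
    return slope, mv - slope * mu

def calibrate(samples):
    """samples: list of (x, y, P, T). Returns lambda = (a, b, c, d)."""
    xs, ys, ps, ts = zip(*samples)
    a, b = fit_line(xs, ps)  # P = a*x + b
    c, d = fit_line(ys, ts)  # T = c*y + d
    return a, b, c, d

# Invented hand-collected points: (x, y, P, T).
samples = [
    (0, 0, 160.0, -96.0),
    (640, 0, -160.0, -96.0),
    (0, 480, 160.0, 96.0),
    (320, 240, 0.0, 0.0),
]
a, b, c, d = calibrate(samples)

def pan_tilt(x, y):
    """Pan-tilt setting that should bring image location (x, y) to the center."""
    return a * x + b, c * y + d
```

Two points with distinct x values and two with distinct y values are the minimum. Extra points average out measurement error.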
©1998 Christopher R. Wren.
Use of this material without prior written consent is strictly forbidden.