Motion estimation in video processing refers to finding a sub-block in a *reference picture* that most accurately resembles or *predicts* a block in a *target picture*. The horizontal and vertical displacement between the target block and the reference block is represented by a two-dimensional *motion vector*.

Spatial domain block matching techniques are most commonly used for motion estimation in video compression encoders and other video processing applications. For each candidate motion vector, they compute a *distance criterion* value between the target block and the candidate reference block the vector points to. A common distance criterion function is the *sum of absolute differences* (SAD). The computation is repeated for every candidate motion vector, and the vector with the smallest distance criterion value is taken as the motion vector.
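As a minimal sketch of what an exhaustive SAD block matching search looks like (the function names and the ±`search`-pixel search window here are illustrative, not from any particular encoder):

```python
import numpy as np

def sad(target, reference):
    # Sum of absolute differences between two equally sized blocks.
    # Promote to int64 so uint8 pixel values don't wrap on subtraction.
    return np.abs(target.astype(np.int64) - reference.astype(np.int64)).sum()

def best_motion_vector(target, ref_frame, top, left, search=8):
    # Exhaustive search: try every displacement within +/- `search` pixels
    # of the target block's position (top, left) and keep the smallest SAD.
    h, w = target.shape
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            cost = sad(target, ref_frame[y:y + h, x:x + w])
            if cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best, best_cost
```

The cost of this brute-force search is what motivates the frequency-domain alternative below: every one of the (2·search+1)² candidates requires a full SAD over the block.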

The phase correlation approach uses a frequency domain transformation to find the motion vector in a single iteration. In some applications it can obtain the motion vector with fewer computations than the spatial domain block matching approach. It may also find the motion vector more accurately when there are extraneous differences between target and reference, such as different illumination levels or image noise.
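The core idea can be sketched in a few lines of numpy: normalize the cross-power spectrum of the two blocks to unit magnitude, so only phase (which encodes the translation) remains, and read the shift off the peak of the inverse transform. This is a generic sketch, not the exact code from the notebook:

```python
import numpy as np

def phase_correlation(reference, target):
    # Cross-power spectrum, normalized to unit magnitude so that only
    # the phase difference (which encodes the translation) remains.
    F = np.fft.fft2(reference)
    G = np.fft.fft2(target)
    cross = np.conj(F) * G
    cross /= np.maximum(np.abs(cross), 1e-12)  # guard against divide-by-zero
    response = np.fft.ifft2(cross).real
    # For a pure circular shift, the response is an impulse at the shift.
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    # Indices past the midpoint wrap around to negative displacements.
    if dy > reference.shape[0] // 2:
        dy -= reference.shape[0]
    if dx > reference.shape[1] // 2:
        dx -= reference.shape[1]
    return int(dy), int(dx)
```

A single FFT pair replaces the whole candidate search: the response surface is computed once, and every possible displacement is a position in it.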

We studied motion estimation by phase correlation and wrote up the results in a notebook article, which you can see on this Jupyter notebook page.

We started with a known test clip whose true motion we could verify by inspection. Here are the first two frames of the pedestrian_area test clip:

We did the phase correlation operations in just the horizontal dimension so we could visualize the data at each step and build a conceptual picture of what the computations were doing. Then we applied the algorithm at four different places on the image, as shown above, to see whether the real motion was correctly identified.
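Restricted to one dimension, the same procedure operates on a single row of pixels, which makes each intermediate array easy to plot. A minimal 1-D sketch (again illustrative, not the notebook's exact code) that returns both the detected shift and the full response curve for visualization:

```python
import numpy as np

def phase_correlation_1d(ref_row, tgt_row):
    # 1-D phase correlation along a single row of pixels.
    n = len(ref_row)
    F = np.fft.fft(ref_row)
    G = np.fft.fft(tgt_row)
    cross = np.conj(F) * G
    cross /= np.maximum(np.abs(cross), 1e-12)  # keep only the phase
    response = np.fft.ifft(cross).real
    shift = int(np.argmax(response))
    if shift > n // 2:
        shift -= n  # interpret wrap-around as a negative shift
    return shift, response
```

Plotting `response` for each of the four image segments is what produces the peak plots discussed below.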

The results so far aren’t very compelling. We expected the phase correlation response to show a strong peak at the position corresponding to the translation between the reference and target samples.

Here is the result for the segment at position (500, 900):

We do see a peak showing a translation of about 20 pixels, which is the correct amount. But there are other peaks nearly as high that don’t correspond to the real motion. Some of the results for the other segments were even more ambiguous.

The notebook article has all the details on the underlying math and the code used to generate the results. There are also some notes on possible reasons why the technique didn’t work very well.

This is part 1. More coming . . .