Optimize Absolute-Difference Motion Compensation on the Pentium® 4 Processor. One important technique used in video compression is to predict movement between consecutive frames. In many cases, the visual representation of a moving object stays the same from frame to frame, merely changing position within the viewing field. A substantially improved compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object separately in each frame. Calculating these vectors is called motion estimation, and it requires the calculation of an absolute difference between blocks of frames.
The motion-estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks and finds the best match. In MPEG1, for example, the calculation can be made in four ways. Either the absolute difference between the pixel values (L1 norm) or the square of the differences (L2 Norm) is summed. Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted, either by some number of whole pixel positions or by some number of half-pixel positions.
The sample below is a C code example for a 16 x 16-pixel full pel motion estimation using absolute differences. The code has a fast-out, so that if the difference so far accumulated across a subset of rows is more than the best match so far, then it aborts the rest of the absolute-difference calculation:
Usually, subtraction of two eight-bit unsigned numbers produces a nine-bit signed result. Thus, in order to retain precision, it may seem that a conversation to 16 bits is needed before the subtract operation, but such conversion is unnecessary when using the
psubusb instruction. By subtracting source 2 from source 1 and then performing the opposite operation (
on a copy of one of the sources), each result register has the absolute difference value for the case when a respective element in source 1 was larger than a respective element in source 2, and zero otherwise (
because a negative result saturates to zero). The same result occurs for the opposite subtraction, which generates the absolute differences in those cases where the elements in source 2 were larger than the elements in source 1.
Performing an
OR operation on the results of the two subtractions generates the eight desired absolute-difference results. This technique allows the absolute difference to be performed in byte precision, which is substantially faster than if it were done in 16-bit precision.
The SSE2 code for the inner-most computation of absolute difference is presented in the sample below. This code does not include the fast-out option that was incorporated into the C code given in the Challenge section. In general, this fast-out is not relevant after each 16 x 1 estimation, as the overhead cost of adding the fast-out for every iteration of the SSE2 instruction loop is prohibitive.