Chromebook* notebook computers powered by Intel® processors provide excellent power and performance for a variety of workloads. We aim to achieve great power efficiency and user experience for video conferencing scenarios, such as Google Meet, on Intel architecture platforms. In 2022, Google Meet started using hardware VP9 K-SVC encoding through the Video Acceleration API (VAAPI) on relatively new Intel platforms (such as those formerly code named Tiger Lake, Alder Lake, and Raptor Lake). Recently, a power regression appeared for multiparty calls in Google Meet: the media driver consumed a lot of power and CPU cycles in the VAAPI VP9 K-SVC encoding path. The regression was caused by the padding that the media driver added to fix a green line of corruption at the bottom of the frame during hardware video encoding. This document summarizes how memory padding regressed power efficiency and the optimizations that regained power efficiency for Google Meet multiparty calls.
Introduction
Google Meet is one of the most user-friendly video conferencing platforms on the market. We want to make sure Google Meet delivers excellent power and performance on Intel architecture platforms. In Google Meet video conferencing, multiple-stream encoding is used to adapt to the different network bandwidths of different end users, so each input video frame is encoded into streams of different resolutions.
As shown in figure 1, in Chromium, the video capture service outputs an NV12 video frame backed by a GpuMemoryBuffer and sends it to the VAAPI video encode accelerator. The frame is imported into the Intel® Media Driver for VAAPI as a VASurface. In the VP9 K-SVC encoding case, the encoder needs to generate three streams at different resolutions (180p, 360p, and 720p). Because the input video frame is 720p, VAAPI VPP scaling is used to blit the 720p surface to 360p and 180p, which are the middle and bottom spatial layers. Unlike the 720p surface, which is imported from a gbm buffer created by minigbm with a linear format (DRM_FORMAT_MOD_LINEAR), the 360p and 180p surfaces are created by the media driver with a tiled Y format (DRM_FORMAT_MOD_Y_TILED).
Minigbm aligns the buffer width and height to 64 and 4, respectively, but for VP9 encoding the driver needs the buffer width and height aligned to CODEC_VP9_MIN_BLOCK_WIDTH/HEIGHT (8x8). To fix the green line issue at the bottom of the video, the media driver landed a fix that fills the padding region with last-row data. Because the 180p buffer, whose height is not 8-aligned, is in tiled Y format, that fix first locks the buffer, then copies the frame data to a linear buffer and fills the padding there at frame level, and finally copies the padded linear buffer back over the entire tiled Y buffer. It does this for both the raw surface and the reconstructed surface. These copies are the root cause of both the power and the performance regression in the Google Meet KPI case.
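The alignment arithmetic above can be checked with a short sketch. The constant names below are illustrative stand-ins for the alignment values quoted in the text, not identifiers taken from minigbm or the media driver.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

# Illustrative constants mirroring the values quoted in the text.
VP9_MIN_BLOCK = 8          # CODEC_VP9_MIN_BLOCK_WIDTH/HEIGHT
MINIGBM_HEIGHT_ALIGN = 4   # height alignment applied by minigbm

def padding_rows(height):
    """Rows the encoder needs beyond the visible frame height."""
    return align_up(height, VP9_MIN_BLOCK) - height

for height in (180, 270, 360, 720):
    print(height,
          align_up(height, MINIGBM_HEIGHT_ALIGN),
          align_up(height, VP9_MIN_BLOCK),
          padding_rows(height))
```

A height of 180 is already a multiple of 4 but not of 8, so the encoder sees four rows past the visible frame. Heights of 360 and 720 need no padding rows.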
Initial Analysis
First, we re-examined whether padding is necessary for VP9 encoding, which led us to the root cause of the green line quality issue for hardware encoding at the 180p and 270p resolutions, whose heights are not 8-aligned. The bottom region consumes many bits, which makes the quality of that region quite bad in every frame. To investigate, we took a close look at the reconstructed surface and verified that its padding contains nonzero values. This does not match the padding region of the input raw surface, which is all zeros. As a result, the residual is large when doing motion estimation for the padding region, which consumes many bits and leads to bad quality. Filling the padding with last-row data makes the raw surface and the reconstructed surface match more closely in the padding region, which fixed the green line quality issue. But the padding implementation in the current media driver is not power efficient. The diagram in figure 3 below shows an optimized padding implementation.
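A toy calculation illustrates why the mismatch is expensive. The pixel values below are invented for illustration and are not measurements. The sum of absolute differences (SAD) stands in for the residual the encoder has to code.

```python
def sad(a, b):
    """Sum of absolute differences between two rows of pixels."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical luma values, chosen only to illustrate the effect.
last_visible_row = [120, 122, 125, 123]
recon_padding = [119, 121, 126, 124]   # predicted data in the reconstructed surface
zero_padding = [0, 0, 0, 0]            # raw-surface padding before the fix

print("zero-filled raw padding:", sad(zero_padding, recon_padding))
print("last-row raw padding:   ", sad(last_visible_row, recon_padding))
```

With these sample values the zero-filled padding gives a residual more than a hundred times larger than the last-row padding. This is the effect that shows up as wasted bits in the bottom region.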
On the one hand, we do not need to fill the padding for the reconstructed surface, because its padding already holds predicted data that is similar to the last-row data in the raw surface. So we removed the padding fill for the reconstructed surface, which eliminates two frame-level memory copies.
On the other hand, the padding of the input raw surface does need proper data so that the padding region matches more closely between the input raw surface and the reconstructed surface, but this does not require two frame-level memory copies. We only need to fill the padding region itself (four lines for the 180p frame and two lines for the 270p frame) of the input surface. In the optimized implementation, we lock and map the tiled Y hardware buffer as a tiled Y CPU-accessible buffer, and then write the proper pixel values to the destination coordinates through a software swizzle and de-swizzle algorithm, which maps each linear address to the corresponding tiled Y address.
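The linear-to-tiled address mapping can be sketched as follows. This is a minimal illustration, not the driver's code. It assumes the legacy tiled Y geometry (4 KB tiles that are 128 bytes wide and 32 rows high, stored as 16-byte-wide columns), a pitch that is a multiple of 128 bytes, a single 8-bit plane, and no additional address-bit swizzling. The function names are hypothetical.

```python
TILE_W = 128                  # bytes per tile row (assumed legacy Y-tile geometry)
TILE_H = 32                   # rows per tile
SPAN = 16                     # bytes per column span inside a tile
TILE_SIZE = TILE_W * TILE_H   # 4096 bytes

def tiled_y_offset(x, y, pitch):
    """Byte offset of linear coordinate (x, y) inside a Y-tiled plane."""
    tiles_per_row = pitch // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    xt, yt = x % TILE_W, y % TILE_H
    # Inside a tile, each 16-byte-wide column holds 32 consecutive rows.
    return (tile_index * TILE_SIZE
            + (xt // SPAN) * (SPAN * TILE_H)
            + yt * SPAN
            + (xt % SPAN))

def fill_padding_rows(buf, pitch, width, height, aligned_height):
    """Replicate the last visible row into the padding rows, in place."""
    for x in range(width):
        value = buf[tiled_y_offset(x, height - 1, pitch)]
        for y in range(height, aligned_height):
            buf[tiled_y_offset(x, y, pitch)] = value

# Usage: Y plane of an assumed 320x180 layer with a 384-byte pitch,
# allocated with 192 rows so that it covers whole tiles.
plane = bytearray(384 * 192)
fill_padding_rows(plane, pitch=384, width=320, height=180, aligned_height=184)
```

Only the four padding rows are written, directly at their tiled addresses, so no linear staging buffer or frame-level copy is needed.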
Power and Performance Improvement
The encode time for an SVC frame regressed from 10 ms to 18 ms after the media driver patch that fixed the green line issue landed in the master branch, as shown in figure 4.
The four frame-level memory copies cost many CPU cycles. After the optimized padding implementation landed in the media driver, we verified that the encode time for an SVC frame returned to the 10 ms level.
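A rough byte count shows the scale of the difference. The sketch below assumes a 320x180 NV12 layer whose pitch equals its width and whose height is padded to 184. It also assumes the chroma plane gets half as many padding rows as the luma plane. Real pitches and allocations in the driver may differ.

```python
def nv12_bytes(pitch, rows):
    """Bytes in an NV12 image: Y plane plus interleaved UV plane at half height."""
    return pitch * rows + pitch * (rows // 2)

# Assumed geometry of the 180p layer (pitch equal to width for simplicity).
WIDTH, HEIGHT, ALIGNED_HEIGHT = 320, 180, 184

frame_bytes = nv12_bytes(WIDTH, ALIGNED_HEIGHT)
copied_before = 4 * frame_bytes                            # four frame-level copies
written_after = nv12_bytes(WIDTH, ALIGNED_HEIGHT - HEIGHT)  # padding rows only

print("bytes moved per frame before:", copied_before)
print("bytes written per frame after:", written_after)
print("ratio:", copied_before // written_after)
```

Under these assumptions the original fix moves a few hundred kilobytes per 180p frame, while the padding-only fill touches about two kilobytes.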
We also measured the power consumption of the two padding implementations. The chart in figure 5 shows the power consumption data on a processor formerly code named Raptor Lake for Google Meet in a call with 10 participants.
The iHD memory padding optimization provides approximately 9% power savings on Google Meet for a 10-party call, which results in increased battery life for all users of Chromebooks with processors formerly code named Raptor Lake or Alder Lake.
Conclusions and Future Work
The media driver can only process blocks at 8x8 granularity for VP9 and the upcoming AV1 encoding, which requires the driver to properly fill the padding region for VP9. The driver must fill the padding with last-row data for VP9 so that the padding region matches more closely between the input and reconstructed surfaces, because the VP9 reconstructed surface has 8-alignment limitations. At the same time, memory copy is a major contributor to power consumption. We should eliminate memory copies as much as possible, for example by filling the padding directly at the original buffer address. Next, we can extend this quality and padding investigation to VP8/H264/AV1 on all platforms. For example, the tiled Y format is replaced by the tiled 4 format on processors formerly code named Meteor Lake, so we need to add tiled 4 swizzle and de-swizzle algorithms to fill the padding directly on the tiled 4 buffer for VP9.
We plan to continue improving hardware encoding quality and power and performance metrics for more customers.