The Bjøntegaard Bible: Why Your Way of Comparing Video Codecs May Be Wrong

Christian Herglotz; Hannah Och; Anna Meyer; Geetha Ramasubbu; Lena Eichermuller; Matthias Kranzler; Fabian Brand; Kristian Fischer; Dat Thanh Nguyen; Andy Regensky; Andre Kaup

doi:10.1109/TIP.2023.3346695

IEEE Trans Image Process. 2024;33:987-1001. doi: 10.1109/TIP.2023.3346695. Epub 2024 Jan 26.

PMID: 38231816

Abstract

In this paper, we provide an in-depth assessment of the Bjøntegaard Delta. We construct a large data set of video compression performance comparisons using a diverse set of metrics including PSNR, VMAF, bitrate, and processing energies. These metrics are evaluated for visual data types such as classic perspective video, 360° video, point clouds, and screen content. As compression technology, we consider multiple hybrid video codecs as well as state-of-the-art neural-network-based compression methods. Using additional supporting points between the standard points defined by parameters such as the quantization parameter, we assess the interpolation error of the Bjøntegaard-Delta (BD) calculus and its impact on the final BD value. From the analysis, we find that the BD calculus is most accurate in the standard application of rate-distortion comparisons, with mean errors below 0.5 percentage points. For other applications and special cases, e.g., VMAF quality, energy considerations, or inter-codec comparisons, the errors are higher (up to 5 percentage points), but can be halved by using a higher number of supporting points. Finally, we derive recommendations on how to use the BD calculus such that the validity of the resulting BD values is maximized. The main recommendations are as follows: First, relative curve differences should be plotted and analyzed. Second, the logarithmic domain should be used for saturating metrics such as SSIM and VMAF. Third, BD values below a certain threshold indicated by the subset error should not be used to draw recommendations. Fourth, using two supporting points is sufficient to obtain rough performance estimates.
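
To make the BD calculus discussed above concrete, the following is a minimal sketch of a BD-rate computation. It assumes the classic formulation: a cubic polynomial is fitted to the logarithm of the bitrate as a function of quality for each codec, and the two fits are averaged over the overlapping quality range. The abstract does not specify the interpolation method used in the paper, so this choice is an assumption. The function names `bd_rate` and `relative_rate_curve` are hypothetical, and the rate and PSNR values are invented purely for illustration. The second function evaluates the relative curve difference that the first recommendation suggests plotting.

```python
import numpy as np


def _fit_log_rate(rate, quality):
    """Fit a cubic polynomial of log10(rate) over quality (assumed classic BD fit)."""
    rate = np.asarray(rate, dtype=float)
    quality = np.asarray(quality, dtype=float)
    return np.polyfit(quality, np.log10(rate), 3)


def _overlap(quality_anchor, quality_test):
    """Return the quality interval covered by both curves."""
    lo = max(np.min(quality_anchor), np.min(quality_test))
    hi = min(np.max(quality_anchor), np.max(quality_test))
    if hi <= lo:
        raise ValueError("the two curves do not overlap in quality")
    return lo, hi


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference of test vs. anchor at equal quality, in percent."""
    p_anchor = _fit_log_rate(rate_anchor, quality_anchor)
    p_test = _fit_log_rate(rate_test, quality_test)
    lo, hi = _overlap(quality_anchor, quality_test)

    # Integrate both fitted log-rate curves over the common quality interval.
    int_anchor = np.polyint(p_anchor)
    int_test = np.polyint(p_test)
    avg_anchor = (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)

    # Convert the mean log10 difference back to a relative rate change.
    return (10.0 ** (avg_test - avg_anchor) - 1.0) * 100.0


def relative_rate_curve(rate_anchor, quality_anchor, rate_test, quality_test, num=50):
    """Pointwise relative rate difference (percent) over the common quality range."""
    p_anchor = _fit_log_rate(rate_anchor, quality_anchor)
    p_test = _fit_log_rate(rate_test, quality_test)
    lo, hi = _overlap(quality_anchor, quality_test)
    q = np.linspace(lo, hi, num)
    diff = np.polyval(p_test, q) - np.polyval(p_anchor, q)
    return q, (10.0 ** diff - 1.0) * 100.0


# Illustrative (made-up) rate-distortion points: four supporting points per codec.
psnr_anchor = [32.1, 35.0, 37.8, 40.2]          # dB
rate_anchor = [500.0, 1000.0, 2000.0, 4000.0]   # kbit/s
psnr_test = list(psnr_anchor)
rate_test = [0.9 * r for r in rate_anchor]      # test codec needs 10% less rate

print(round(bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test), 2))  # about -10.0
```

Because the synthetic test codec uses exactly 10% less rate at every quality level, the relative curve is flat and the BD-rate equals that constant difference. For real measurements the relative curve generally varies with quality, which is why inspecting it alongside the single averaged value is useful.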