Learned frame prediction for video codecs

Standard video codecs use a “motion compensation” module to predict the next frame. Rather than encoding each video frame in its entirety, they encode only the motion vectors that map blocks of the previous frame onto the current one, together with the residual image, which is typically sparse.
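
To make the idea concrete, here is a minimal sketch of block motion compensation with an exhaustive search. It is not the implementation used by x264 or any other codec: the function names, the 4x4 block size, the search range of 2 pixels, and the sum-of-absolute-differences cost are assumptions chosen for illustration, and the frame dimensions are assumed to be multiples of the block size.

```python
def sad(prev, cur, py, px, cy, cx, bs):
    """Sum of absolute differences between a block of `prev` and a block of `cur`."""
    return sum(
        abs(prev[py + i][px + j] - cur[cy + i][cx + j])
        for i in range(bs)
        for j in range(bs)
    )


def motion_compensate(prev, cur, bs=4, search=2):
    """Exhaustive block matching; returns (motion vectors, prediction, residual)."""
    h, w = len(cur), len(cur[0])
    pred = [[0] * w for _ in range(h)]
    vectors = {}
    for by in range(0, h, bs):
        for bx in range(0, w, bs):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = by + dy, bx + dx
                    # Skip candidates that fall outside the previous frame.
                    if 0 <= py <= h - bs and 0 <= px <= w - bs:
                        cost = sad(prev, cur, py, px, by, bx, bs)
                        if best is None or cost < best[0]:
                            best = (cost, dy, dx)
            _, dy, dx = best
            vectors[(by, bx)] = (dy, dx)
            # Copy the best-matching block of the previous frame.
            for i in range(bs):
                for j in range(bs):
                    pred[by + i][bx + j] = prev[by + dy + i][bx + dx + j]
    residual = [[cur[y][x] - pred[y][x] for x in range(w)] for y in range(h)]
    return vectors, pred, residual


# Toy example: an 8x8 frame whose content moves one pixel to the right.
prev = [[10 * y + x * x for x in range(8)] for y in range(8)]
cur = [[row[max(x - 1, 0)] for x in range(8)] for row in prev]
vectors, pred, residual = motion_compensate(prev, cur)
```

In this toy example the encoder would transmit `vectors` and `residual`; the decoder, which already holds `prev`, rebuilds the prediction from the vectors and adds the residual to recover `cur`.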

In this project, I replace the motion compensation module with a deep neural network (DNN), eliminating the use of motion vectors. This DNN predicts the entire frame in a single forward pass, without relying on blocks or motion prediction. On 10 MPEG test videos, this approach achieves better rate-distortion performance on average than the well-established x264 video codec, although it does not yet reach the level of x265.
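
The sketch below illustrates why such a codec needs no motion side information: the encoder and the decoder run the same predictor on previously decoded frames, so only the residual has to be transmitted. It is a toy under stated assumptions, not the codec from the paper: `predict_next` is a hypothetical stand-in for the DNN (a linear extrapolation of the last two decoded frames), uniform quantization stands in for the still-image codec that encodes the frame differences, and the first two frames are assumed to be available to the decoder as-is.

```python
def predict_next(decoded):
    """Hypothetical stand-in for the DNN: extrapolate from the last two decoded frames."""
    a, b = decoded[-2], decoded[-1]
    return [[2 * q - p for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(f, g):
    return [[p + q for p, q in zip(rf, rg)] for rf, rg in zip(f, g)]


def sub(f, g):
    return [[p - q for p, q in zip(rf, rg)] for rf, rg in zip(f, g)]


def quantize(residual, step):
    """Stand-in for the still-image codec: uniform quantization of the residual."""
    return [[round(v / step) for v in row] for row in residual]


def dequantize(symbols, step):
    return [[s * step for s in row] for row in symbols]


def encode(frames, step=4):
    decoded = [frames[0], frames[1]]  # assumption: first two frames are sent as-is
    stream = []
    for frame in frames[2:]:
        pred = predict_next(decoded)  # uses decoded frames only
        symbols = quantize(sub(frame, pred), step)
        stream.append(symbols)
        # Mirror the decoder so both sides predict from identical frames.
        decoded.append(add(pred, dequantize(symbols, step)))
    return stream, decoded


def decode(first_frames, stream, step=4):
    decoded = list(first_frames)
    for symbols in stream:
        pred = predict_next(decoded)  # same prediction as the encoder
        decoded.append(add(pred, dequantize(symbols, step)))
    return decoded


# Toy sequence of six 4x4 frames.
frames = [
    [[5 * t + 3 * x + 7 * y + (t * t) % 3 for x in range(4)] for y in range(4)]
    for t in range(6)
]
stream, recon = encode(frames)
received = decode(frames[:2], stream)
```

Because the prediction depends only on frames the decoder already has, `received` matches the encoder's own reconstruction `recon`, and the reconstruction error is bounded by half the quantization step.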

I furthermore show that incorporating adversarial learning yields more realistic video prediction results, at the cost of lower PSNR. The code and the papers are available (Sulun & Tekalp, 2021; Sulun, 2018). Supplementary material is presented below.

Supplementary material

Here are the qualitative results for our learned video prediction models. We compare two models: one trained with L2 loss only (L2) and one trained with a combination of adversarial and L2 loss (GAN). All results show the 9th frame of each video. The slideshow makes it easy to compare images and can be navigated with the arrow keys of the keyboard. Below the slideshow, ground-truth and predicted images are available at their original size.
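
When reading these comparisons, it helps to recall that PSNR is a monotonic function of the mean squared error, so a model trained with L2 loss alone optimizes it directly, while a sharper GAN prediction can score lower. A minimal sketch of the standard PSNR computation follows; the function names are illustrative, and a peak value of 255 (8-bit images) is an assumption.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized grayscale frames."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def psnr(a, b, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / err)


# Toy frames: a ground truth and two predictions with different error levels.
truth = [[100, 120], [140, 160]]
close = [[101, 119], [141, 159]]  # every pixel off by 1
far = [[110, 110], [150, 150]]    # every pixel off by 10
psnr_close = psnr(truth, close)
psnr_far = psnr(truth, far)
```

The prediction with the smaller pixel error (`close`) receives the higher PSNR, regardless of how realistic either image looks to a viewer.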

References

Can Learned Frame Prediction Compete with Block Motion Compensation for Video Coding?

Serkan Sulun and A. Murat Tekalp

Signal, Image and Video Processing, 2021

Given recent advances in learned video prediction, we investigate whether a simple video codec using a pre-trained deep model for next frame prediction based on previously encoded/decoded frames without sending any motion side information can compete with standard video codecs based on block-motion compensation. Frame differences given learned frame predictions are encoded by a standard still-image (intra) codec. Experimental results show that the rate-distortion performance of the simple codec with symmetric complexity is on average better than that of x264 codec on 10 MPEG test videos, but does not yet reach the level of x265 codec. This result demonstrates the power of learned frame prediction (LFP), since unlike motion compensation, LFP does not use information from the current picture. The implications of training with L1, L2, or combined L2 and adversarial loss on prediction performance and compression efficiency are analyzed.
@article{frame_prediction,
  ids = {sulun2020can},
  title = {Can Learned Frame Prediction Compete with Block Motion Compensation for Video Coding?},
  author = {Sulun, Serkan and Tekalp, A. Murat},
  year = {2021},
  journal = {Signal, Image and Video Processing},
  volume = {15},
  number = {2},
  pages = {401--410},
  publisher = {Springer},
  doi = {10.1007/s11760-020-01751-y}
}

Deep Learned Frame Prediction for Video Compression

Serkan Sulun

Koc University, 2018

Master’s Thesis

Motion compensation is one of the most essential methods for any video compression algorithm. Video frame prediction is a task analogous to motion compensation. In recent years, the task of frame prediction has been undertaken by deep neural networks (DNNs). In this thesis, we create a DNN to perform learned frame prediction and additionally implement a codec that contains our DNN. We train our network using two methods for two different goals. First, we train our network based on mean square error (MSE) only, aiming to obtain the highest PSNR values in frame prediction and video compression. Second, we use adversarial training to produce visually more realistic frame predictions. For frame prediction, we compare our method with the baseline methods of frame difference and 16x16 block motion compensation. For video compression, we further include the x264 video codec in the comparison. We show that in frame prediction, adversarial training produces frames that look sharper and more realistic compared to MSE-based training, but in video compression it consistently performs worse. This proves that even though adversarial training is useful for generating video frames that are more pleasing to the human eye, it should not be employed for video compression. Moreover, our network trained with MSE produces accurate frame predictions, and in quantitative results, for both tasks, it produces comparable results in all videos and outperforms other methods on average. More specifically, learned frame prediction outperforms other methods in terms of rate-distortion performance in the case of high-motion video, while the rate-distortion performance of our method is competitive with that of x264 in low-motion video.
@phdthesis{frame_prediction_thesis,
  title = {Deep Learned Frame Prediction for Video Compression},
  author = {Sulun, Serkan},
  year = {2018},
  eprint = {1811.10946},
  primaryclass = {cs, eess},
  address = {Istanbul, Turkey},
  urldate = {2022-07-20},
  archiveprefix = {arXiv},
  school = {Koc University},
  doi = {10.48550/arXiv.1811.10946},
  note = {Master's Thesis}
}