Researchers from MIT and Microsoft propose Gemino, a practical and robust neural compression system for videoconferencing


We all saw how important good video conferencing tools were during the COVID lockdowns. Education, entertainment, business meetings and family visits all moved to video calls, and we spent hours looking for the tool that gave us the best visual quality. Face-to-face communication continued through our screens while we were kept apart.

However, not everyone could rely on this replacement for in-person communication. Those with a poor internet connection saw blocky pixels and strange artifacts instead of the familiar faces of their loved ones. To make matters worse, the video stalled altogether when network conditions became really unstable.

There have been attempts to solve this problem with deep learning methods that aim to provide high-quality video conferencing even under poor network conditions.

When it comes to video conferencing, the most essential part of the video is the face of the person talking. Background details can be of lower quality without affecting the quality of experience for many people. Therefore, deep learning-based video conferencing solutions focus on improving faces.

These methods attempt to save bandwidth by sending only a compressed representation of each video frame and synthesizing the face image from it at the receiver. This can significantly reduce bandwidth. However, robustness is an issue, especially in the face synthesis step. Since these approaches synthesize faces by warping a reference image into different poses and orientations, reconstruction becomes problematic when facial motion is high.

Additionally, the high computational complexity of these methods makes them impractical at high resolutions, which most video conferencing applications use these days.

To solve these problems, researchers from MIT and Microsoft propose Gemino, a neural compression system designed to improve the quality of videoconferencing.

Existing methods transmit keyframes of a video instead of the entire video, hoping to synthesize the remaining content on the client side to reduce bandwidth. This can be problematic in some cases. For example, suppose an object that was not in the frame, like the speaker’s hand, suddenly appears. In this case, it is impossible to synthesize the object from the information transmitted. The only option is to send a new reference frame that includes the object, which incurs a significant cost for the network.

To solve this problem, Gemino transmits the entire video at a low resolution, which carries much more information than keyframes alone. The video is then upscaled on the client side to the desired resolution. This is feasible because low-resolution video becomes very small, almost negligible in size, when compressed with modern video codecs.
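As a rough illustration of why the low-resolution stream is cheap, the sketch below counts raw pixels only. The resolutions are hypothetical values chosen for the example, not figures taken from the article, and the raw pixel ratio is not the same as the bitrate after codec compression.

```python
def pixel_reduction(full_w, full_h, low_w, low_h):
    """Return how many times fewer raw pixels the low-resolution frame has."""
    return (full_w * full_h) / (low_w * low_h)

# Hypothetical resolutions, used only for illustration.
ratio = pixel_reduction(1024, 1024, 128, 128)
print(f"The low-resolution frame has {ratio:.0f}x fewer raw pixels")
```

Downscaling by a factor of 8 in each dimension leaves 64 times fewer pixels per frame for the codec to encode.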

The main challenge in this scenario is upscaling and restoring the information lost in the low-resolution space. To accurately restore high-frequency detail, Gemino uses a reference frame that provides this texture information, although in a different pose than the target frame. It warps features extracted from this reference frame based on the motion between the reference and target frames, similar to synthesis techniques. Gemino then combines the warped features with the information taken from the low-resolution target frame to build the final reconstruction. This approach is called high-frequency-conditional super-resolution.
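The idea can be illustrated with a toy sketch. This is not Gemino's neural model: it replaces learned features with plain pixel arithmetic, assumes the motion is a known horizontal shift, and uses a circular shift as a stand-in for warping. All function names, the scale factor and the test frame are hypothetical.

```python
SCALE = 2  # assumed downsampling factor

def shift_cols(frame, dx):
    """Circularly shift every row right by dx pixels (stand-in for warping)."""
    w = len(frame[0])
    return [[row[(x - dx) % w] for x in range(w)] for row in frame]

def downsample(frame):
    """Average each SCALE x SCALE block into one low-resolution pixel."""
    h, w = len(frame), len(frame[0])
    return [[sum(frame[y + i][x + j]
                 for i in range(SCALE) for j in range(SCALE)) / SCALE ** 2
             for x in range(0, w, SCALE)]
            for y in range(0, h, SCALE)]

def upsample(frame):
    """Nearest-neighbour upsampling back to full resolution."""
    return [[frame[y // SCALE][x // SCALE]
             for x in range(len(frame[0]) * SCALE)]
            for y in range(len(frame) * SCALE)]

def combine(a, b, sign):
    """Element-wise a + sign * b."""
    return [[p + sign * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def mean_abs_error(a, b):
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def reconstruct(low_res_target, reference, dx):
    """Upsample the target, then add reference detail warped by the motion."""
    base = upsample(low_res_target)
    # High-frequency detail: what the reference loses when downsampled.
    ref_detail = combine(reference, upsample(downsample(reference)), -1)
    return combine(base, shift_cols(ref_detail, dx), +1)

# Hypothetical 4x8 reference frame; the target is the reference moved 2 pixels.
reference = [[(3 * x + 5 * y * y + x * y) % 17 for x in range(8)]
             for y in range(4)]
dx = 2
target = shift_cols(reference, dx)
low_res_target = downsample(target)  # what would be sent over the network

naive = upsample(low_res_target)
guided = reconstruct(low_res_target, reference, dx)
print("naive upsampling error:", mean_abs_error(naive, target))
print("reference-guided error:", mean_abs_error(guided, target))
```

In this idealized setting the motion is known exactly and aligned with the pixel grid, so the reference detail fills in precisely what upsampling alone cannot recover. Real video has neither property, which is why a learned model is needed to estimate motion and merge the two sources.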

Gemino also personalizes the model by fine-tuning the network on a specific person’s videos so that it renders that person’s high-frequency details more accurately. In addition, codec-compressed low-resolution video frames are included in the training data to increase robustness against codec-induced artifacts. Finally, to reduce computational complexity, Gemino uses a multi-scale architecture in which different system components operate at different resolutions depending on the complexity of their task.

This was a brief summary of Gemino. If you want to know more about it, see the paper cited below.

This article is written as a research summary by Marktechpost staff based on the research paper 'Gemino: Practical and Robust Neural Compression for Video Conferencing'. All credit for this research goes to the researchers on this project.


Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a doctoral degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision and multimedia networks.

