Qualcomm at CVPR 2023: Advancing Research and Bringing Generative AI to the Edge

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.

ControlNet running entirely on device, fitness coaching with an LLM, 3D reconstruction for XR, our accepted papers and much more

The annual IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is regarded as one of the most important events not only in computer vision but also in the artificial intelligence (AI) field. This year it takes place in Vancouver from June 18 to June 22, and we are showcasing our accepted research papers and technology demos — drop by our booth 1212 to see us in person. Here are a few of our highlights at CVPR 2023.

Our CVPR demos

Our research in AI, computer vision, extended reality (XR), and autonomous vehicles ranges from core theoretical innovations to downstream real-world applications. We show examples of this work in the demonstrations highlighted below.

World’s fastest ControlNet demo running on a phone

A few months ago, we showcased the world’s first demo of Stable Diffusion running on an Android phone, which is an accepted demo at CVPR this year. Now, Qualcomm AI Research is demonstrating ControlNet, a 1.5 billion parameter image-to-image model, running entirely on a phone as well. ControlNet belongs to a class of generative AI solutions known as language-vision models, or LVMs. It allows more precise control over image generation by conditioning on both an input image and an input text description. In this demo, AI images are generated on the mobile device in under 12 seconds without requiring any cloud access, allowing an interactive user experience that is efficient, enjoyable, reliable and private. This performance was achieved with a suite of full-stack AI optimizations across the model architecture, AI software and neural hardware accelerators. The AI tools and hardware used for this process include the AI Model Efficiency Toolkit (AIMET), the Qualcomm AI Stack and the Qualcomm AI Engine.

Fitness coaching with an LLM grounded in real-time vision

Qualcomm AI Research has used generative AI to develop a digital fitness coach that improves upon existing solutions in terms of accuracy and realism. The fitness coach provides real-time interaction by encouraging, correcting, and helping the user meet their fitness goals. Our demo showcases how a visually grounded large language model (LLM) can enable natural interactions that are contextual, multimodal and real-time. A video stream of the user exercising is processed by our action recognition model. Based on the recognized action, our stateful orchestrator grounds the prompt and feeds it to the LLM. The fitness coach delivers the LLM’s answer to the user through a text-to-speech avatar. This is made possible thanks to three key innovations: a vision model that is trained to detect fine-grained fitness activities, a language model that is trained to generate language grounded in the visual concepts, and an orchestrator that coordinates the fluid interaction between these two modalities to facilitate live coaching dialogue and feedback. The result is an engaging and dynamic user experience.
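To make the data flow above concrete, here is a minimal Python sketch of what a stateful orchestrator could look like. It is an illustration under our own assumptions, not the demo's implementation: the class name CoachOrchestrator, the prompt wording, the per-action counters and the echo_llm placeholder are all hypothetical, and the action labels would come from a separate recognition model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CoachOrchestrator:
    """Hypothetical orchestrator: keeps session state and builds grounded prompts."""
    llm: Callable[[str], str]                      # stand-in for any text-generation callable
    counts: Dict[str, int] = field(default_factory=dict)
    last_action: Optional[str] = None

    def build_prompt(self, action: str) -> str:
        # State update: count how often each recognized action has been observed.
        self.counts[action] = self.counts.get(action, 0) + 1
        history = ", ".join(f"{name} x{n}" for name, n in sorted(self.counts.items()))
        self.last_action = action
        # Grounding: the visual observation and the session history go into the prompt.
        return (
            "You are a fitness coach. "
            f"The user is currently doing: {action}. "
            f"Session so far: {history}. "
            "Give one short sentence of feedback."
        )

    def step(self, action: str) -> str:
        # Build the grounded prompt, then query the language model with it.
        return self.llm(self.build_prompt(action))


def echo_llm(prompt: str) -> str:
    """Placeholder model that only reports the prompt length; swap in a real LLM call."""
    return f"[coach reply to a {len(prompt)}-character prompt]"


coach = CoachOrchestrator(llm=echo_llm)
first_prompt = coach.build_prompt("squat")
second_prompt = coach.build_prompt("squat")
reply = coach.step("push-up")
```

In a full system, the string returned by step would be handed to a text-to-speech component; that stage is omitted here.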

World’s first 1080p neural video coding on a phone

In another world’s first in terms of AI running on device, this demo showcases encoding and decoding 1080p videos on a mobile device. Neural codecs are versatile: they can be customized for specific video needs, can be optimized for perceptual quality through advances in generative AI, can be extended to new modalities, and can run on general-purpose AI hardware. However, they present numerous challenges which make them difficult to implement on compute-constrained devices. We designed a novel and efficient neural interframe video compression architecture which makes it possible to do 1080p video coding on device. In the demo, you can see that the rich visual structures and complex motions of the high-quality video are accurately preserved by the neural video codec.

3D reconstruction for XR

We’ve successfully developed a cutting-edge real-time 3D reconstruction system that excels in accuracy and efficiency, enabling the creation of highly detailed 3D models of any environment. Our solution runs on a mobile device, generates depth maps from individual images, and combines them into a 3D scene representation. With an accurate and real-time 3D map, developers can unlock a vast array of augmented and virtual reality applications. To showcase the capabilities of our innovation, we have designed an engaging demonstration where users can shoot virtual balls against the real objects in the scene, such as walls and furniture, witnessing realistic bounces based on accurate physics calculations. This perception technology fosters immersive experiences and promises to accelerate the widespread adoption of the metaverse.
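As a rough illustration of how per-image depth maps can be combined into a single scene, the sketch below back-projects each depth map through a pinhole camera model and merges the resulting points in world coordinates. This is a simplified stand-in written for this post, not the system described above: the pinhole intrinsics (fx, fy, cx, cy), the known camera poses and the plain point-cloud output are all assumptions, and a production pipeline would typically use a denser scene representation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(depth: Sequence[Sequence[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Turn a depth map (one depth value per pixel) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:            # skip invalid or missing depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points, rotation, translation) -> List[Point]:
    """Apply an assumed camera-to-world pose: p_world = R * p_cam + t."""
    return [
        tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
              for i in range(3))
        for p in points
    ]


def fuse(frames) -> List[Point]:
    """Merge the points of all frames into one scene-level cloud."""
    scene = []
    for depth, intrinsics, rotation, translation in frames:
        scene.extend(to_world(backproject(depth, *intrinsics), rotation, translation))
    return scene


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[2.0, 2.0], [0.0, 4.0]]          # toy 2x2 depth map, one invalid pixel
intrinsics = (1.0, 1.0, 0.5, 0.5)         # fx, fy, cx, cy (toy values)
cloud = fuse([
    (depth, intrinsics, identity, (0.0, 0.0, 0.0)),
    (depth, intrinsics, identity, (1.0, 0.0, 0.0)),   # same view, shifted 1 unit in x
])
```

Collision tests such as the bouncing-ball interaction would then run against a surface built from this kind of geometry.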

Computer vision for smart cameras

Photo and video capture continue to improve every year with new capabilities made possible by advancements from AI-based computer vision. Our demonstration shows semantic segmentation, monocular depth estimation, and instance segmentation enabling Bokeh effects, background replacement, cinematic mode, and class-dependent image quality improvement in sharpness, smoothness, clarity and contrast. These neural networks run video enhancement in real time on devices powered by Snapdragon platforms.

Driver monitoring technology for enhanced safety

The driver monitoring system (DMS) demonstration uses computer vision to infer dangerous driving conditions and improve safety. By using active infrared cameras within the cockpit, the DMS monitors the driver’s status in real time, including distraction and drowsiness, based on eye openness, gaze, head pose, facial expression, body activities and much more. The system warns the driver when dangerous driving is detected and can ultimately help save lives. The DMS runs in parallel with Advanced Driver Assistance Systems (ADAS) on the Snapdragon Ride Flex SoC.

Facial avatars for XR

Avatars are an essential ingredient for enabling immersive XR experiences in the metaverse, whether photorealistic or cartoonish. With one or more 2D photos, we use on-device AI to generate a personalized mesh and corresponding texture. For real-time rendering of the avatar, we use headset cameras that see the movements of the user’s eyes and mouth. The resulting demonstration is an avatar that is reconstructed and animated close to ground truth and relighted according to the environment. Our goal is to make a digital human available on the Snapdragon XR platform used in the metaverse and in human-machine interfaces.

Our CVPR papers

Premier conferences, such as CVPR, play a pivotal role in advancing the AI field, as they feature meticulously peer-reviewed papers that establish the new state of the art and contribute impactful research to the rest of the community. We’d like to highlight eight of our accepted papers at the main conference, advancing the frontiers in computer vision for two broad categories: making the best use of data and creating better architectures.

Making the best use of data

In our paper “DistractFlow: Improving Optical Flow Estimation Models via Realistic Distractions and Pseudo-Labeling,” we introduce a novel data augmentation technique that specifically tackles the challenge of limited data availability in training optical flow estimation models. This problem arises when representative and diverse data samples are scarce, which is inherent for motion estimation. Our proposed method overcomes this limitation by incorporating realistic distractions into the labeled input frames, enhancing the model’s generalization ability. When unlabeled data is accessible, we extend our augmentation to self-supervised settings using pseudo-labeling and cross-consistency regularization, which enables us to substantially increase the number of training pairs without requiring complex and expensive data collection. Comprehensive evaluations across multiple benchmarks show that our method consistently improves optical flow estimation performance.
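One simple way to picture adding a distraction to a labeled frame pair is a per-pixel blend of the second frame with a frame from an unrelated scene, while the ground-truth flow label stays unchanged. The Python sketch below shows that idea only. The convex blending rule, the alpha parameter and the function names are our assumptions for illustration, and the paper should be consulted for the exact formulation and losses.

```python
def blend_distractor(frame, distractor, alpha):
    """Per-pixel convex blend: alpha * frame + (1 - alpha) * distractor."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(frame) != len(distractor) or any(
        len(a) != len(b) for a, b in zip(frame, distractor)
    ):
        raise ValueError("frame and distractor must have the same shape")
    return [
        [alpha * f + (1.0 - alpha) * d for f, d in zip(frame_row, dist_row)]
        for frame_row, dist_row in zip(frame, distractor)
    ]


def augment_pair(frame1, frame2, flow, distractor, alpha=0.75):
    """Return a training triple whose second frame is perturbed.

    The flow label is passed through untouched, so each labeled pair yields
    an extra training sample without any new annotation.
    """
    return frame1, blend_distractor(frame2, distractor, alpha), flow


# Toy grayscale frames with intensities in [0, 1].
frame2 = [[0.0, 1.0]]
other_scene = [[1.0, 1.0]]
mixed = blend_distractor(frame2, other_scene, alpha=0.75)
```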

Our paper, “Progressive Random Convolutions for Single Domain Generalization,” presents a data-efficient framework that uses a novel image augmentation method based on Progressive Random Convolutions (Pro-RandConv). This progressive approach mitigates semantic distortions in augmented images by reducing the influence of non-local pixels in the receptive fields of the convolutional kernels, allowing the generation of more effective and representative domains by gradually increasing the style diversity in augmentation. This generalization strategy outperforms state-of-the-art methods on single-domain and multi-domain image classification, recognition, and segmentation benchmarks.
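The progressive idea can be sketched in a few lines of Python: sample one small random kernel and apply it repeatedly, so the effective receptive field grows step by step instead of coming from a single large kernel. This is a toy, single-channel illustration under our own assumptions. The Gaussian sampling, the L1 normalization of the kernel, the zero padding and the random repeat count are choices made here for simplicity and are not taken from the paper.

```python
import random


def random_kernel(rng, size=3):
    """Sample a small random kernel and scale it so its absolute weights sum to 1."""
    k = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    norm = sum(abs(w) for row in k for w in row) or 1.0
    return [[w / norm for w in row] for row in k]


def conv2d_same(image, kernel):
    """Zero-padded 2D filtering (cross-correlation) that keeps the image size."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + r][dx + r] * image[yy][xx]
            out[y][x] = acc
    return out


def pro_randconv(image, max_repeats, rng):
    """Apply the SAME random 3x3 kernel a random number of times in sequence."""
    kernel = random_kernel(rng)
    repeats = rng.randint(1, max_repeats)
    out = image
    for _ in range(repeats):
        out = conv2d_same(out, kernel)
    return out, repeats


rng = random.Random(0)
image = [[(x + y) / 6.0 for x in range(4)] for y in range(4)]   # toy 4x4 image in [0, 1]
augmented, repeats = pro_randconv(image, max_repeats=5, rng=rng)
```

Because each pass uses a 3x3 kernel, nearby pixels dominate the result, and distant pixels only contribute after several passes.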

Learning-based gaze estimation requires large amounts of training data with accurate gaze annotations. In our paper “ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection,” we propose a neural network called ReDirTrans that performs latent-to-latent translation to redirect gaze directions and head orientations in high-resolution full-face images, based on assigned directional values, in an interpretable manner. By combining ReDirTrans with a pretrained e4e-StyleGAN pair, we create ReDirTrans-GAN, which enables accurate gaze redirection while preserving other attributes such as identity, expression, and hairstyle.

In the paper “DejaVu: Regenerative Learning to Enhance Dense Prediction,” we show a novel framework which leverages conditional image regeneration as additional supervision during training to improve deep networks for dense prediction tasks such as segmentation, depth estimation, and surface normal prediction. Our framework encourages the base network to learn to embed accurate scene structure in its dense prediction. This leads to more accurate predictions with clearer boundaries and better spatial consistency. Through extensive experiments on multiple dense prediction benchmarks, we demonstrate the efficacy of employing our framework during training, as it outperforms state-of-the-art methods at no added computation cost.

Creating better architectures

The method presented in “X³-KD: Cross-modal Cross-stage Cross-task Knowledge Distillation for 3D Object Detection” is a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3D object detection (3DOD). Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the perspective-view feature extraction stage, which provides supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features through the information contained in a LiDAR-based 3DOD teacher. The model outperforms previous state-of-the-art approaches on key datasets and generalizes to RADAR-based 3DOD.

With “EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization,” we present a simple yet effective approach that improves continual test-time adaptation (TTA) in a memory-efficient manner. TTA is primarily conducted on edge devices with limited memory, so reducing memory is crucial but has been overlooked in previous TTA studies. In addition, long-term adaptation often leads to catastrophic forgetting and error accumulation, which hinders applying TTA in real-world deployments. Our method consists of two components to address these issues. First, it uses lightweight meta networks to adapt the original networks to the target domain. This minimizes memory by decreasing the size of intermediate activations required for backpropagation. Second, a novel self-distilled regularization keeps the output of the meta networks from deviating significantly from the output of the original networks, thereby preserving well-trained knowledge from the source domain. This effective strategy outperforms other state-of-the-art methods for image classification and semantic segmentation tasks on various benchmarks.
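The second component can be pictured as an extra penalty term in the adaptation objective. The Python sketch below pairs a prediction-entropy term with a penalty on the gap between the adapted output and the frozen original output. It is a sketch under stated assumptions: entropy minimization as the adaptation signal, a mean absolute difference as the penalty, the weight lam and all function names are our choices for illustration and are not confirmed details of the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy of the predicted class distribution (assumed adaptation loss)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def self_distill_penalty(adapted, frozen):
    """Mean absolute gap between adapted and frozen outputs (assumed L1 form)."""
    return sum(abs(a - f) for a, f in zip(adapted, frozen)) / len(adapted)


def tta_loss(logits, adapted_feats, frozen_feats, lam=0.5):
    """Adaptation signal plus a weighted penalty for drifting from the source model."""
    return entropy(logits) + lam * self_distill_penalty(adapted_feats, frozen_feats)


# Toy values: a maximally uncertain 2-class prediction and slightly drifted features.
loss = tta_loss([0.0, 0.0], adapted_feats=[1.0, 3.0], frozen_feats=[0.0, 1.0], lam=0.5)
```

When the adapted output matches the frozen one, the penalty vanishes and only the adaptation term remains, which is how such a term can limit forgetting over long adaptation runs.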

The problem of incremental learning is tackled in “Dense Network Expansion for Class Incremental Learning.” A new network expansion method, called dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity. This is accomplished by introducing dense connections between the intermediate layers of the task expert networks, which enable knowledge transfer from old to new tasks via feature sharing and reuse. This sharing is implemented with a cross-task attention mechanism, based on a new task attention block (TAB), that fuses information across tasks. The DNE-based approach outperforms the previous state-of-the-art methods by a margin of 4% in terms of accuracy, with a similar or even smaller model scale.

With “PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models,” we propose a novel approach that enables zero-shot and few-shot, generalizable 3D part segmentation by leveraging the latest advances in pretrained language-vision models (LVMs). Currently, LVMs can only operate on 2D images and thus cannot be directly applied to 3D part segmentation. We designed a 3D fusion module which processes the results from multiple views of an object, fuses them, and generates the part segmentation on the 3D point cloud, with compelling results on 3D benchmark datasets.
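A simple way to think about fusing multi-view 2D results onto a point cloud is per-point voting: every view labels the points it can see, and each point takes the label it received most often. The Python sketch below shows that baseline idea only. The majority-vote rule, the dictionary input format and the function name fuse_views are assumptions made for illustration, and the fusion module in the paper may work differently.

```python
from collections import Counter


def fuse_views(num_points, view_labels, unlabeled=-1):
    """Majority-vote fusion of per-view part labels onto 3D points.

    view_labels holds one dict per rendered view, mapping the index of each
    point visible in that view to the part label predicted for it in 2D.
    Points never seen in any view keep the `unlabeled` marker.
    """
    votes = [Counter() for _ in range(num_points)]
    for labels in view_labels:
        for point_index, part in labels.items():
            votes[point_index][part] += 1
    return [v.most_common(1)[0][0] if v else unlabeled for v in votes]


# Three toy views of a 3-point object; point 2 is never visible.
segmentation = fuse_views(
    num_points=3,
    view_labels=[
        {0: "lid", 1: "handle"},
        {0: "lid", 1: "lid"},
        {1: "handle"},
    ],
)
```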

Workshops

CVPR 2023 Workshop on Autonomous Driving, paper: EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation [creating better architectures]

CVPR 2023 Mobile AI Workshop, paper: DIFT: Dynamic Iterative Field Transforms for Memory Efficient Optical Flow [creating better architectures]

CVPR 2023 Mobile AI Workshop, paper: QuickSRNet: Plain Single-Image Super-Resolution Architecture for Faster Inference on Mobile Platforms [creating better architectures]

CVPR 2023 Workshop on Learning with Limited Labelled Data for Image and Video Understanding, paper: Neural Transformation Network to Generate Diverse Views for Contrastive Learning [making the best use of data]

CVPR 2023 Embodied AI Workshop, paper: Situated real-time interaction with a virtually embodied avatar [making the best use of data]

Continuing to push the boundaries of AI

These are just some of our highlights from this year’s edition of CVPR. If you are at CVPR, drop by the Qualcomm booth to find out more about our research work, experience the demos live, and learn more about our machine learning job openings.

Ning Bi
VP of Engineering, Qualcomm Technologies

Fatih Porikli
Senior Director of Technology, Qualcomm Technologies
