Advancing Perception Across Modalities with State-of-the-art AI

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.

Qualcomm AI Research papers at ICCV, Interspeech and more

The last few months have been exciting for the Qualcomm AI Research team, with the opportunity to present our latest papers and artificial intelligence (AI) demos at conferences such as ICCV (International Conference on Computer Vision), Interspeech, GLOBECOM (Global Communications Conference), and ICIP (International Conference on Image Processing).

We have achieved state-of-the-art (SOTA) results across a variety of AI research areas in machine learning, computer vision, personalization and federated learning, and wireless/radio frequency (RF) sensing. This research can be applied to exciting use cases such as video streaming, autonomous driving, voice assistance, indoor localization, gaming for extended reality (XR) headsets, computer graphics and more. Below are some highlights from these conferences, along with the publication links for further reading.

Creating 3D


Oftentimes, three-dimensional (3D) imagery is not available, and the solution is 3D scene reconstruction from monocular images through the fusing of features. In the ICCV-accepted paper “DG-Recon: Depth-Guided Neural 3D Scene Reconstruction,” we leverage monocular depth priors, which:

  • effectively guide the fusion to improve surface prediction and
  • skip over irrelevant, ambiguous or occluded features.

Furthermore, we revisit the average-based fusion used by most neural 3D reconstruction methods and propose two alternatives, a variance-based and a cross-attention-based fusion module, that are more efficient and effective than their average-based and self-attention-based counterparts.
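
To make the difference between fusion rules concrete, here is a minimal sketch. It is not the DG-Recon implementation: it assumes each voxel receives one feature vector per view, the function names `average_fusion` and `variance_fusion` are illustrative, and the depth-guided selection of views described above is left out.

```python
from statistics import fmean, pvariance

def average_fusion(view_features):
    """Fuse per-view feature vectors with a channel-wise mean."""
    return [fmean(channel) for channel in zip(*view_features)]

def variance_fusion(view_features):
    """Fuse per-view feature vectors with a channel-wise population variance.

    A low value means the views agree on that channel; a high value
    means they disagree (assumed here to signal ambiguity or occlusion).
    """
    return [pvariance(channel) for channel in zip(*view_features)]

# Three hypothetical views of the same voxel, two feature channels each.
features = [[1.0, 4.0], [1.0, 6.0], [1.0, 8.0]]

mean_feat = average_fusion(features)  # channel-wise mean
var_feat = variance_fusion(features)  # channel 0 agrees across views, channel 1 does not
```

The mean keeps the feature magnitude but hides disagreement between views, while the variance exposes it. That is one plausible reading of why a variance-based module can help surface prediction.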

Compared to the baseline, the proposed depth-guided reconstruction (DG-Recon) models significantly improve the reconstruction quality and completeness while remaining in real time. Our method achieves SOTA online reconstruction results.

Video depth estimation

Taking the task of video depth estimation one step further, we note the need for the depth network to cross-reference relevant features from the past when predicting depth on the current frame. Continuing the ICCV series, “MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation” does just that.

This paper introduces:

  • A novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information.
  • An attention-based approach to processing memory features, in which we first learn the spatiotemporal relations among the resulting visual and displacement memory tokens using a self-attention module.

Through extensive experiments on several benchmarks, we show that MAMo consistently improves monocular depth estimation networks and achieves superior accuracy.

Reflective surfaces

Depth estimation from monocular images is tricky when dealing with reflective surfaces. In the next ICCV paper “3D Distillation: Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces,” we propose a new method.

In this paper, we:

  • Observe that, when a model is trained to minimize photometric loss on spatially neighboring image pairs, reflective surfaces can be accurately reconstructed by aggregating the predicted depth of these views.
  • Propose, motivated by this observation, 3D distillation: a novel training framework that utilizes the projected depth of reconstructed reflective surfaces to generate reasonably accurate depth pseudo-labels.

We show the method not only significantly improves the prediction accuracy — especially on the problematic surfaces — but also that it generalizes well over various underlying network architectures and to new datasets.

Understanding 3D and 4D

Two more ICCV papers improve on current segmentation and image matching SOTA:

Equivariant neural networks

These are particularly useful for computer vision tasks because they are robust to rotation, translation and permutation. In the paper “4D Panoptic Segmentation as Invariant and Equivariant Field Prediction,” we develop rotation-equivariant neural networks for 4D panoptic segmentation.

  • This is especially useful for autonomous driving scenarios where sensing happens through light detection and ranging (LiDAR) scans.
  • Rotation-equivariance can provide better generalization and more robust feature learning.
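
The terms "equivariant" and "invariant" can be illustrated with a toy example. The sketch below is not the paper's network: it uses 2D points, rotation about the origin, and two hand-picked functions (`centroid` and `mean_radius`) chosen only because their behavior under rotation is easy to verify.

```python
import math

def rotate(points, angle):
    """Rotate 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def centroid(points):
    """Equivariant map: rotating the input rotates the output the same way."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def mean_radius(points):
    """Invariant map: rotating the input leaves the output unchanged."""
    return sum(math.hypot(x, y) for x, y in points) / len(points)

points = [(1.0, 0.0), (0.0, 2.0), (-3.0, 1.0)]
angle = math.pi / 3

# Equivariance: f(R x) equals R f(x).
rotated_then_mapped = centroid(rotate(points, angle))
mapped_then_rotated = rotate([centroid(points)], angle)[0]

# Invariance: g(R x) equals g(x).
radius_before = mean_radius(points)
radius_after = mean_radius(rotate(points, angle))
```

A network built from such maps does not need to see every orientation of a scene during training, which is one intuition behind the generalization benefit mentioned above.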

Through measuring against benchmark data, we show our equivariant models achieve superior accuracy with lower computational costs compared to their non-equivariant counterparts.

Our method sets new SOTA performance and achieves the top place on the SemanticKITTI 4D Panoptic Segmentation leaderboard.

Image matching

We tackle the challenge of matching together different images for 3D applications in “GlueStick: Robust Image Matching by Sticking Points and Lines Together.” Here, we:

  • Introduce a new matching paradigm, where points, lines and their descriptors are unified into a single wireframe structure.
  • Take two wireframes from different images and leverage the connectivity information between nodes to better glue them together.
  • Show that our matching strategy outperforms SOTA approaches that match line segments and points independently, across a wide variety of datasets and tasks. The code is available.

Computer vision for graphics and gaming

XR material and lighting estimation

In plenty of XR use cases, material and lighting estimation prove to be challenging. We address the task of estimating these components of an indoor scene based on image observations in the ICCV-accepted paper, “Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation.”

  • Our Factorized Inverse Path Tracing (FIPT) uses a factored light transport formulation and finds emitters driven by rendering errors.
  • The algorithm enables accurate material and lighting optimization faster than previous work and is more effective at resolving ambiguities. The source code is available.

Video game rendering

Furthermore, we tackle machine learning for real-time rendering in video games. This is a challenging task due to the need for:

  • high resolutions,
  • high frame rates, and
  • photorealism.

Super sampling has emerged as an effective solution to address this challenge. We introduce a novel method to achieve this goal efficiently and accurately in the next ICCV paper, “Efficient Neural Super Sampling on a Novel Gaming Dataset.”

Using this method leads to super resolution of the rendered content four times more efficiently than with existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth.

Efficient video perception

Video cross-frame redundancies

Video cross-frame redundancies can be leveraged to accelerate video perception. This is something we have done in our ICCV-presented research for “ResQ: Residual Quantization for Video Perception.”

  • We observe that residuals, as the difference in network activations between two neighboring frames, exhibit properties that make them highly quantizable.
  • Based on this observation, we propose a novel quantization scheme for video networks that extends the standard, frame-by-frame, quantization scheme by incorporating temporal dependencies that lead to better performance in terms of accuracy versus bit-width.
  • We demonstrate the superiority of our model against the standard quantization and existing efficient video perception models.
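
The observation about residuals can be illustrated with a toy example. This is a sketch under stated assumptions, not the ResQ scheme: the quantizer is a plain symmetric uniform quantizer, the bit-widths (8 bits for the previous frame, 4 bits for the residual) are arbitrary, and the activation values are made up.

```python
def quantize(values, bits):
    """Symmetric uniform quantization onto a grid scaled to the max magnitude."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]

def max_error(a, b):
    """Largest absolute element-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical activations of one layer on two neighboring frames.
prev_act = [0.90, -1.20, 2.00, 0.10]
curr_act = [0.95, -1.10, 2.05, 0.05]

# Frame-by-frame: quantize the current activations directly at 4 bits.
direct = quantize(curr_act, bits=4)

# Residual: quantize only the frame-to-frame difference at 4 bits and add
# it to the previous frame's activations (assumed available at 8 bits).
prev_q = quantize(prev_act, bits=8)
residual = [c - p for c, p in zip(curr_act, prev_q)]
residual_q = quantize(residual, bits=4)
recon = [p + r for p, r in zip(prev_q, residual_q)]

direct_err = max_error(direct, curr_act)
residual_err = max_error(recon, curr_act)
```

Because the residual spans a much smaller range than the activations themselves, the same 4-bit grid is far finer, so `residual_err` comes out smaller than `direct_err` in this example.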


Working with unlabeled or minimally labeled data requires cross-correlation mechanisms. In the task of action localization, the alignment between query and support videos is important. This can be attained by representing the common action cues of interest from the support videos in light of the query video’s context. That is what we aim to achieve with our ICCV-accepted paper, “Few-Shot Common Action Localization via Cross-Attentional Fusion of Context and Temporal Dynamics.” We show the effectiveness of our work with SOTA performance on benchmark datasets and analyze each component extensively.

Test-time adaptation

Label shift

Test-time adaptation (TTA) refers to adapting a pretrained model to the target domain in a batch-by-batch manner during inference. Certain classes appear more frequently in certain domains (e.g., buildings in cities and trees in forests), so it is natural that the label distribution shifts as the domain changes. This is what we explore in our ICCV-accepted paper, “Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts.”

  • We propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively.
  • Our approach is computationally efficient and can be easily applied, regardless of the model architecture.


TTA methods suffer from noisy signals originating from incorrect or open-set predictions. We address this challenge in our ICCV-accepted paper, “Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization,” with a simple, yet effective sample selection method.

  • Individual confidence values may rise or fall due to the influence of signals from numerous other predictions (i.e., wisdom of crowds).
  • Due to this fact, noisy signals fail to raise the individual confidence values of wrong samples, despite attempts to increase them.
  • Based on such findings, we filter out the samples whose confidence values are lower in the adapted model than in the original model, as they are likely to be noisy.
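
The selection rule in the last bullet can be sketched as follows. This is a minimal illustration, not the paper's code. It assumes that "confidence" means the softmax probability of the class predicted by the original model, evaluated in both models, and `select_samples` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_samples(original_logits, adapted_logits):
    """Return indices of samples whose confidence did not drop after adaptation.

    Confidence is measured on the class the original model predicted
    (an assumption made for this sketch).
    """
    keep = []
    for i, (orig, adapted) in enumerate(zip(original_logits, adapted_logits)):
        orig_probs = softmax(orig)
        label = orig_probs.index(max(orig_probs))
        if softmax(adapted)[label] >= orig_probs[label]:
            keep.append(i)
    return keep

# Two hypothetical test samples with two classes each.
original = [[2.0, 0.0], [1.0, 0.5]]
adapted = [[3.0, 0.0], [0.2, 0.5]]
kept = select_samples(original, adapted)  # sample 1 lost confidence and is filtered out
```

Only the kept samples would then contribute to the entropy-minimization update.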

Our method is widely applicable to existing TTA methods and improves performance in both image classification and semantic segmentation.

Wireless sensing

Estimating path loss for a transmitter-receiver location is key to many use cases including network planning and handover. Machine learning has become a popular tool to predict wireless channel properties based on map data.

In our paper “Transformer-Based Neural Surrogate for Link-Level Path Loss Prediction from Variable-Sized Maps,” accepted at GLOBECOM, we introduce a transformer-based neural network architecture that enables predicting link-level properties from maps of various dimensions and from sparse measurements.

  • The map contains information about buildings and foliage.
  • The transformer model attends to the regions that are relevant for path loss prediction and therefore scales efficiently to maps of different sizes.
  • In experiments, we show that the proposed model can efficiently learn dominant path losses from sparse training data and generalize well when tested on novel maps.
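
One reason attention suits variable-sized maps is that it can pool any number of map patches into a fixed-size vector. The sketch below shows that property only. It is not the paper's architecture: the `attention_pool` name, the use of the patch tokens as both keys and values, and the example numbers are all assumptions.

```python
import math

def attention_pool(query, patch_tokens):
    """Pool any number of patch tokens into one vector with dot-product attention.

    The tokens serve as both keys and values (a simplification).
    Returns the pooled vector and the attention weights.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
        for token in patch_tokens
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * token[i] for w, token in zip(weights, patch_tokens))
        for i in range(dim)
    ]
    return pooled, weights

query = [1.0, 0.0]  # hypothetical learned query for the link being predicted

small_map = [[0.5, 0.1], [2.0, 0.3]]                # 2 patches
large_map = small_map + [[0.1, 0.9], [0.0, 0.2], [1.5, 0.4]]  # 5 patches

pooled_small, weights_small = attention_pool(query, small_map)
pooled_large, weights_large = attention_pool(query, large_map)
```

Both calls return a vector of the same length, and the weights show which patches the query attends to most.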

Federated learning

Federated learning (FL) enables distributed machine learning model training on edge devices, ensuring data privacy. However, managing such training with the devices’ limited resources, heterogeneous architectures and unpredictable availability is challenging.
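
For readers new to FL, the server-side step is often a weighted average of the model weights that devices send back. The sketch below shows that common baseline, usually called federated averaging. The post does not say which aggregation rule the toolkit uses, so this is an assumption for illustration only.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical devices; only weights are shared, never the raw data.
weights = [[1.0, 0.0], [3.0, 4.0]]
sizes = [1, 3]
global_weights = federated_average(weights, sizes)
```

Devices with more local data pull the global model further toward their update, which is why the second device dominates here.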

To address these challenges and improve on-device training for mobile devices, at Interspeech we presented a joint effort by Qualcomm Technologies and Microsoft called “Federated Learning Toolkit with Voice-based User Verification Demo.”

The demonstration includes a technical display of FL on a device powered by Snapdragon technology, coordinated through Microsoft Florida, and a federated user verification example using voice samples.

Efficient speech recognition

Keyword models

Few-shot keyword spotting models usually require large-scale annotated datasets to generalize to unseen target keywords. However, existing datasets are limited in scale, and gathering keyword-like labeled data is costly.

To mitigate this issue, in the Interspeech-accepted paper “Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data,” we propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source.

  • We automatically annotate and filter the data to construct a keyword-like dataset, LibriWord, enabling supervision on auxiliary data.
  • We then adopt multi-task learning that helps the model enhance the representation power from out-of-domain auxiliary data.
  • Our method notably improves the performance over competitive benchmark methods.

Streaming automatic speech recognition (ASR)

Streaming ASR models are restricted from accessing future context, which results in worse performance compared to non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from the non-streaming to the streaming model has been studied. In the paper “Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer,” which we also presented at Interspeech 2023, we propose a layer-to-layer KD from the teacher encoder to the student encoder.

  • We designed a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts.
  • Experimental results show that the proposed method can significantly reduce the word error rate compared to previous token probability distillation methods.
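
A layer-to-layer distillation loss can be sketched as follows. This is an illustration, not the paper's loss: it assumes mean squared error between matching layers, and the `shift` parameter is a hypothetical stand-in for the idea of predicting future context, loosely inspired by the APC mechanism mentioned above.

```python
def mse(a, b):
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layerwise_kd_loss(teacher_layers, student_layers, shift=0):
    """Average MSE between matching teacher and student layer outputs.

    Each layer is a list of frames, and each frame is a feature vector.
    With shift > 0, the student's frame t is compared with the teacher's
    frame t + shift, so the student is pushed to anticipate future frames.
    """
    total, count = 0.0, 0
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        for t in range(len(s_layer) - shift):
            total += mse(t_layer[t + shift], s_layer[t])
            count += 1
    if count == 0:
        raise ValueError("shift must be smaller than the number of frames")
    return total / count

# One hypothetical encoder layer with two frames of two features each.
teacher = [[[1.0, 2.0], [3.0, 4.0]]]
student = [[[1.0, 2.0], [3.0, 2.0]]]

aligned_loss = layerwise_kd_loss(teacher, student)           # same-frame targets
future_loss = layerwise_kd_loss(teacher, student, shift=1)   # one-frame-ahead targets
```

Unlike token probability distillation, which supervises only the output distribution, a loss of this kind supervises intermediate encoder representations directly.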

This is a comprehensive — but not exhaustive — list of our recently published work. Stay tuned for our next blog post about our published research and the demos that we brought to NeurIPS 2023.

Fatih Porikli
Senior Director of Technology, Qualcomm Technologies

Armina Stepan
Senior Marketing Comms Coordinator, Qualcomm Technologies Netherlands B.V.
