We Built a Personalized, Multimodal AI Smart Glass Experience — Watch It Here

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.

Our demo shows the power of on-device AI and why smart glasses make the ideal AI user interface

Gabby walks into a gym while carrying a smartphone and wearing a pair of smart glasses. Unsure of where to start, she surveys the fitness area and spots a yoga mat, kettlebells and resistance bands. Without lifting her smartphone, she utters a simple voice command for her smart glasses to capture an image of the equipment, letting her ask the digital assistant for a workout recommendation.

The digital assistant taps into the personalized data and preferences stored on her phone to come up with a specific recommendation.

A gentle yoga routine, safe for someone expecting. Instructions pop up as a picture-in-picture display in her glasses, prompting her toward the yoga mat. This isn’t a hypothetical exercise. Rather, it is a real-world example of the types of benefits that can come when smart glasses work in tandem with a smartphone and a generative AI-powered digital assistant, taking advantage of their combined processing power to run smaller large language models (LLMs) and large multimodal models (LMMs).

The video above illustrates the exercise experience that Qualcomm Technologies demoed at Mobile World Congress in Barcelona in March. Read on to learn more about the demo and why this is just the start of the marriage between smart glasses and AI.

What is an LLM and an LMM?

A large language model (LLM) is a type of AI model designed to process and generate text. It’s trained on vast amounts of text data and can perform tasks like translation, question answering and content generation. A large multimodal model (LMM) can process and generate not only text but multiple data types, including photos, audio and video. An LMM’s ability to integrate multiple sources of data helps produce more accurate and relevant outputs than relying on text data alone for training and inferencing. The result is contextually relevant, personalized recommendations tailored to specific users that go beyond anything a simple workout app could provide.

Breaking down the experience

Attendees walking into the Qualcomm Technologies booth at the Mobile World Congress could put on a pair of RayNeo X3 Pro smart glasses powered by a Snapdragon AR1 Gen 1 processor, paired to a Snapdragon 8 Elite reference smartphone for the demo.

To start, the attendee would choose a persona, which ranged from Gabby, a pregnant woman, to Henry, a senior male with knee issues. Using the RayNeo X3 Pro glasses, the attendee snapped a photo of the scene at the gym and asked the assistant: “What workout should I do today with the following equipment?” The glasses provided a personalized response based on what the assistant knew about the user and the persona they chose.

But how did it work? Both the audio of the person’s question and the photo snapped by the RayNeo X3 Pro were sent to the reference smartphone, where the bulk of the computation was done using an LMM, llava-llama-3-8b, which can process multiple inputs such as language and images.
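To make that flow concrete, below is a minimal sketch of the image-plus-question step using the Hugging Face transformers library on a workstation. The checkpoint name, file path and simplified prompt format are assumptions for illustration; the spoken question is assumed to have been transcribed to text already, and the demo’s actual on-device pipeline on the Snapdragon NPU is not shown.

```python
# Minimal sketch: asking a LLaVA-style multimodal model about a photo.
# The checkpoint name, image path and prompt format are illustrative
# assumptions, not the demo's actual code.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "xtuner/llava-llama-3-8b-v1_1-transformers"  # assumed checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("gym_scene.jpg")  # placeholder for the photo from the glasses
question = "What workout should I do today with the following equipment?"

# The <image> placeholder marks where the processor splices in image features;
# a production prompt would follow the checkpoint's exact chat template.
prompt = f"USER: <image>\n{question}\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

In the demo this inference ran on the paired smartphone rather than in a Python process, but the shape of the exchange is the same: one image and one question in, one personalized answer out.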

At the same time, the phone employed Retrieval-Augmented Generation (RAG), a technique that grabs specific information from a knowledge base and feeds it into the model for a more tailored response. In this case, the RAG technique pulled data such as age, health, routines, hobbies, favorite meals and more from each persona, allowing the assistant to offer more customized recommendations.

To deliver these results, you need a powerful, purpose-built processor that accommodates both a lightweight design and on-device AI processing. The Snapdragon AR1 Gen 1 processor is equipped with the Hexagon NPU, designed to run generative AI, along with a high-quality camera ISP that captures images at the highest quality regardless of lighting conditions or motion.

What is RAG?

RAG, or Retrieval-Augmented Generation, is a technique in generative AI (GenAI) that combines the strengths of retrieval-based and generative models. It works by first retrieving relevant information from a knowledge base established from the user’s inputs and then using this information to generate a more accurate and contextually relevant response. This approach enhances the accuracy and personalization of AI outputs, reducing the risk of generating incorrect or irrelevant information.
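As a rough illustration of that retrieve-then-generate pattern, the sketch below builds a tiny persona knowledge base and uses naive keyword overlap as a stand-in for a real retriever. The persona facts, function names and prompt wording are invented for illustration and are not the demo’s actual data or code.

```python
# Minimal RAG sketch: retrieve persona facts relevant to the question, then
# inject them into the prompt that goes to the model. Keyword overlap stands
# in for a real embedding-based retriever; all data here is made up.

PERSONA_FACTS = [
    "Gabby is 31 years old and is currently pregnant.",
    "Gabby prefers low-impact workouts such as yoga and walking.",
    "Gabby usually exercises for 30 minutes in the morning.",
    "Gabby's favorite meals are Mediterranean salads and smoothies.",
]

def retrieve(question: str, facts: list[str], top_k: int = 2) -> list[str]:
    """Rank facts by naive word overlap with the question; keep the top_k."""
    q_words = set(question.lower().split())
    scored = sorted(facts,
                    key=lambda f: len(q_words & set(f.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(question: str, facts: list[str]) -> str:
    """Prepend retrieved facts so the model can personalize its answer."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (f"Known facts about the user:\n{context}\n\n"
            f"User question: {question}\n"
            f"Answer with a personalized recommendation:")

question = "What workout should I do today with the following equipment?"
prompt = build_prompt(question, retrieve(question, PERSONA_FACTS))
print(prompt)  # this augmented prompt is what the model actually sees
```

Because the retrieved facts land in the prompt itself, RAG can personalize a response without retraining or fine-tuning the model, which is part of what makes it practical on a phone.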

Benefits of on-device AI

A key point of the experience was to show how the AI work could be done locally, with some of the data preprocessing done on the glasses themselves and the smartphone handling the rest.

There are multiple advantages to running inference at the edge, starting with personalization. A GenAI assistant that runs on your glasses and your smartphone can offer more tailored responses and recommendations based on the detailed personal information stored on the device.

By keeping things local, personal and sensitive data and preferences remain on the smartphone. On-device processing is also less expensive for the service provider, since continuous real-time interactions can be costly to run in the cloud. Eliminating the need to keep pinging the cloud also results in a more responsive experience with lower latency. That’s particularly important if you’re running an app or query in an enterprise setting, such as a hospital or bank.

At the same time, LLMs and LMMs are becoming smaller without compromising on the quality of results, making them easier to run on device.
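One common way models shrink is weight quantization. As a hedged illustration only, the sketch below loads a multimodal checkpoint with 4-bit weights through the transformers and bitsandbytes integration on a CUDA workstation; actual Snapdragon deployments use Qualcomm’s own tooling, which is not shown here.

```python
# Minimal sketch: loading an ~8B-parameter multimodal model with 4-bit weight
# quantization, one common way to shrink models toward on-device budgets.
# Requires a CUDA GPU plus the bitsandbytes and accelerate packages.
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

MODEL_ID = "xtuner/llava-llama-3-8b-v1_1-transformers"  # assumed checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4 format
    bnb_4bit_compute_dtype=torch.float16,  # run the math in fp16
)

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

# Roughly a quarter of the fp16 weight memory for the quantized layers.
print(f"Approximate memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```

Cutting weights from 16-bit to 4-bit reduces memory and bandwidth roughly fourfold for the quantized layers, which is exactly the kind of headroom that makes phone- and glasses-class inference feasible.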

Smart glasses: the ideal AI interface

The demo illustrates why smart glasses are the ideal home for an AI assistant. Glasses outfitted with a camera and microphone can see and hear what’s around you. They process a wide range of inputs, whether voice commands, photos or even gestures.

Currently, glasses powered by a Snapdragon AR1 can run multimodal models on-glass, independently of the phone or the cloud, to further enhance personalization and create a seamless user experience. Or, if the glasses need to harness more processing power, the companion smartphone is just a quick Bluetooth or Wi-Fi link away.

We’re already working on running more use cases and different applications on AR glasses or through smartphones as we move closer to our ultimate vision for where this is all going.

After all, beneath both devices are powerful Snapdragon processors that are only getting smarter. With time, I’m confident this demo will be a feature that just about anyone can regularly take advantage of.

Ziad Asghar
SVP & GM, XR, Qualcomm Technologies, Inc.
