LLMs and MLLMs
The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to a lack of the necessary technical skills.
Recent advances in large language models (LLMs) and their variants, such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.
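To make the idea concrete, the sketch below shows what "comprehending both images and text" looks like in code: a single open VLM is given an image plus a natural-language question (the kind of ambiguous scene mentioned above) and returns a free-form answer. It is a minimal illustration using the Hugging Face Transformers API; the SmolVLM-Instruct checkpoint (discussed further down this page) and the local image filename are assumptions chosen for the example, and any chat-style VLM would work similarly.

    # Minimal VLM inference sketch (assumes: pip install torch transformers pillow,
    # and a local image file named frame.jpg; the checkpoint below is one example
    # of an open VLM, not a specific recommendation).
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed example checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)

    image = Image.open("frame.jpg")  # e.g., a frame from a security camera

    # One chat turn mixing an image with a natural-language question --
    # exactly the kind of ambiguity a fixed-class detector handles poorly.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Are the people in this image fighting or dancing? Answer briefly."},
            ],
        }
    ]

    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=96)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])

The same model and prompt interface can be reused for very different questions, which is precisely what makes VLMs attractive compared with training a task-specific classifier for each new query.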
The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models”, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:
If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.
View all LLM and MLLM Content

“Multimodal Enterprise-scale Applications in the Generative AI Era,” a Presentation from Skyworks Solutions
Mumtaz Vauhkonen, Senior Director of AI at Skyworks Solutions, presents the “Multimodal Enterprise-scale Applications in the Generative AI Era” tutorial at the May 2025 Embedded Vision Summit. As artificial intelligence makes rapid strides in the use of large language models, the need for multimodality arises in multiple application scenarios. Similar…

How to Integrate Computer Vision Pipelines with Generative AI and Reasoning
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Generative AI is opening new possibilities for analyzing existing video streams. Video analytics are evolving from counting objects to turning raw video footage into real-time understanding. This enables more actionable insights. The NVIDIA AI Blueprint for

“Unlocking Visual Intelligence: Advanced Prompt Engineering for Vision-language Models,” a Presentation from LinkedIn Learning
Alina Li Zhang, Senior Data Scientist and Tech Writer at LinkedIn Learning, presents the “Unlocking Visual Intelligence: Advanced Prompt Engineering for Vision-language Models” tutorial at the May 2025 Embedded Vision Summit. Imagine a world where AI systems automatically detect thefts in grocery stores, ensure construction site safety and identify patient…
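The presentation itself goes much deeper; as a small, hypothetical illustration of the general idea, the sketch below builds a prompt that constrains a VLM to return a fixed JSON schema for a worksite-safety check, so downstream code can parse the answer instead of free-form text. The schema, field names and helper function are invented for this example, and the resulting message list would be passed to a VLM chat template such as the one in the sketch near the top of this page.

    import json

    # Hypothetical output schema for a safety-monitoring prompt (field names invented for illustration).
    SCHEMA = {
        "people_count": "integer",
        "hard_hats_worn": "boolean",
        "hazards": "list of short strings",
        "confidence": "number between 0 and 1",
    }

    def build_safety_messages():
        """Build a chat-style message list that steers a VLM toward structured JSON output."""
        instruction = (
            "You are a worksite-safety inspector. Look at the image and respond ONLY with a "
            "JSON object matching this schema, with no extra prose: " + json.dumps(SCHEMA)
        )
        # The image itself is supplied separately to the model's processor.
        return [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": instruction},
                ],
            }
        ]

    if __name__ == "__main__":
        print(json.dumps(build_safety_messages(), indent=2))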

“Vision-language Models on the Edge,” a Presentation from Hugging Face
Cyril Zakka, Health Lead at Hugging Face, presents the “Vision-language Models on the Edge” tutorial at the May 2025 Embedded Vision Summit. In this talk, Zakka provides an overview of vision-language models (VLMs) and their deployment on edge devices using Hugging Face’s recently released SmolVLM as an example. He examines…
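For a rough sense of what edge deployment involves in practice, the sketch below reuses the SmolVLM loading pattern shown earlier on this page and simply times a short generation on whatever local hardware is available, since tokens per second and memory footprint are the numbers that matter on-device. The checkpoint name, image path and token budget are illustrative assumptions, not figures from the presentation.

    import time

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed example checkpoint
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_ID).to(DEVICE)
    model.eval()

    image = Image.open("frame.jpg")  # illustrative local image
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this scene in one sentence."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    elapsed = time.perf_counter() - start

    # Report throughput: generated tokens divided by wall-clock time.
    new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
    print(f"{new_tokens} new tokens in {elapsed:.2f} s ({new_tokens / elapsed:.1f} tokens/s) on {DEVICE}")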

“Vision LLMs in Multi-agent Collaborative Systems: Architecture and Integration,” a Presentation from Google
Niyati Prajapati, ML and Generative AI Lead at Google, presents the “Vision LLMs in Multi-agent Collaborative Systems: Architecture and Integration” tutorial at the May 2025 Embedded Vision Summit. In this talk, Prajapati explores how vision LLMs can be used in multi-agent collaborative systems to enable new levels of capability and…

“Building Agentic Applications for the Edge,” a Presentation from GMAC Intelligence
Amit Mate, Founder and CEO of GMAC Intelligence, presents the “Building Agentic Applications for the Edge” tutorial at the May 2025 Embedded Vision Summit. Along with AI agents, the new generation of large language models, vision-language models and other large multimodal models are enabling powerful new capabilities that promise to…

Build High-performance Vision AI Pipelines with NVIDIA CUDA-accelerated VC-6
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. The constantly increasing compute throughput of NVIDIA GPUs presents a new opportunity for optimizing vision AI workloads: keeping the hardware fed with data. As GPU performance continues to scale, traditional data pipeline stages, such as I/O from

“Enabling Ego Vision Applications on Smart Eyewear Devices,” a Presentation from EssilorLuxottica
Francesca Palermo, Research Principal Investigator at EssilorLuxottica, presents the “Enabling Ego Vision Applications on Smart Eyewear Devices” tutorial at the May 2025 Embedded Vision Summit. Ego vision technology is revolutionizing the capabilities of smart eyewear, enabling applications that understand user actions, estimate human pose and provide spatial awareness through simultaneous…

LLiMa: SiMa.ai’s Automated Code Generation Framework for LLMs and VLMs for <10W
This blog post was originally published at SiMa.ai’s website. It is reprinted here with the permission of SiMa.ai. In our blog post titled “Implementing Multimodal GenAI Models on Modalix”, we describe how SiMa.ai’s MLSoC Modalix enables Generative AI models to be implemented for Physical AI applications with low latency and low power consumption. We implemented

“Improving Worksite Safety with AI-powered Perception,” a Presentation from Arcure
Sabri Bayoudh, Chief Innovation Officer at Arcure, presents the “Improving Worksite Safety with AI-powered Perception” tutorial at the May 2025 Embedded Vision Summit. In this presentation, Bayoudh explores how embedded vision is being used in industrial applications, including vehicle safety and production. He highlights some of the challenging requirements of…

“Edge AI and Vision at Scale: What’s Real, What’s Next, What’s Missing?,” An Embedded Vision Summit Expert Panel Discussion
Sally Ward-Foxton, Senior Reporter at EE Times, moderates the “Edge AI and Vision at Scale: What’s Real, What’s Next, What’s Missing?” Expert Panel at the May 2025 Embedded Vision Summit. Other panelists include Chen Wu, Director and Head of Perception at Waymo, Vikas Bhardwaj, Director of AI in the Reality…

“A View From the 2025 Embedded Vision Summit (Part 2),” a Presentation from the Edge AI and Vision Alliance
Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2025 Embedded Vision Summit on May 22, 2025. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…

“A View From the 2025 Embedded Vision Summit (Part 1),” a Presentation from the Edge AI and Vision Alliance
Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2025 Embedded Vision Summit on May 21, 2025. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…

NVIDIA Blackwell-powered Jetson Thor Now Available, Accelerating the Age of General Robotics
News Summary: NVIDIA Jetson AGX Thor developer kit and production modules, robotics computers designed for physical AI and robotics, are now generally available. Over 2 million developers are using NVIDIA’s robotics stack, with Agility Robotics, Amazon Robotics, Boston Dynamics, Caterpillar, Figure, Hexagon, Medtronic and Meta among early Jetson Thor adopters. Jetson Thor, powered by NVIDIA

“The Future of Visual AI: Efficient Multimodal Intelligence,” a Keynote Presentation from Trevor Darrell
Trevor Darrell, Professor at the University of California, Berkeley, presents the “Future of Visual AI: Efficient Multimodal Intelligence” keynote at the May 2025 Embedded Vision Summit. AI is on the cusp of a revolution, driven by the convergence of several breakthroughs. One of the most significant of these advances is…

Maximize Robotics Performance by Post-training NVIDIA Cosmos Reason
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. First unveiled at NVIDIA GTC 2025, NVIDIA Cosmos Reason is an open and fully customizable reasoning vision language model (VLM) for physical AI and robotics. The VLM enables robots and vision AI agents to reason using prior