LLMs and MLLMs
The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and it has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And for many product developers, computer vision remains out of reach, whether due to the cost and complexity of obtaining the necessary training data or due to a lack of the necessary technical skills.
Recent advances in large language models (LLMs) and their variants, such as vision language models (VLMs), which comprehend both images and text, hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities, such as language, images, audio, and video, to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI, combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.
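To make this concrete, the short sketch below queries an off-the-shelf VLM with a natural-language question about an image, using the Hugging Face transformers pipeline API. The model checkpoint and image file name are illustrative assumptions, not recommendations.

```python
# Minimal sketch: cross-modal question answering with an off-the-shelf VLM.
from transformers import pipeline

vqa = pipeline(
    "visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",  # assumed example checkpoint
)

# Ask a language question about an image -- the kind of ambiguity
# mentioned above (fighting vs. dancing).
result = vqa(
    image="crowd_scene.jpg",  # hypothetical local image file
    question="Are the people fighting or dancing?",
)
print(result[0]["answer"], result[0]["score"])
```

The same pattern extends to captioning, grounding, and other cross-modal tasks by swapping the pipeline task and model.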
The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models,” sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:
If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.
View all LLM and MLLM Content

Implementing Multimodal GenAI Models on Modalix
This blog post was originally published at SiMa.ai’s website. It is reprinted here with the permission of SiMa.ai. It has been our goal since starting SiMa.ai to create one software and hardware platform for the embedded edge that empowers companies to make their AI/ML innovations come to life. With the rise of Generative AI already…

“Customizing Vision-language Models for Real-world Applications,” a Presentation from NVIDIA
Monika Jhuria, Technical Marketing Engineer at NVIDIA, presents the “Customizing Vision-language Models for Real-world Applications” tutorial at the May 2025 Embedded Vision Summit. Vision-language models (VLMs) have the potential to revolutionize various applications, and their performance can be improved through fine-tuning and customization. In this presentation, Jhuria explores the concept…
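As a rough illustration of this kind of customization, the sketch below attaches LoRA adapters to an open VLM using the Hugging Face peft library. The checkpoint, target modules, and hyperparameters are assumptions for illustration, not the recipe from the presentation.

```python
# Hedged sketch: parameter-efficient VLM fine-tuning with LoRA adapters.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # assumed example checkpoint
    torch_dtype=torch.float16,
)

lora = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapters are trained
```

Because only the adapter weights update, a modest domain-specific dataset of image-text pairs is often enough to specialize the model for a target application.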

XR Tech Market Report
Woodside Capital Partners (WCP) is pleased to share its XR Tech Market Report, authored by senior bankers Alain Bismuth and Rudy Burger, and by analyst Alex Bonilla. Why are we interested in the XR ecosystem? Investors have been pouring billions of dollars into developing enabling technologies for augmented reality (AR) glasses aimed at the consumer market…

The Era of Physical AI is Here
This blog post was originally published at SiMa.ai’s website. It is reprinted here with the permission of SiMa.ai. The AI landscape is undergoing a monumental shift. After a decade where AI flourished in the cloud, scaled by hyperscalers, we are now entering the era of Physical AI. Physical AI is poised to touch every facet…

R²D²: Boost Robot Training with World Foundation Models and Workflows from NVIDIA Research
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. As physical AI systems advance, the demand for richly labeled datasets is accelerating beyond what we can manually capture in the real world. World foundation models (WFMs), which are generative AI models trained to simulate, predict, and…

“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applications,” a Presentation from Camio
Lazar Trifunovic, Solutions Architect at Camio, presents the “LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applications” tutorial at the May 2025 Embedded Vision Summit. By using vision-language models (VLMs) or combining large language models (LLMs) with conventional computer vision models, we can create vision systems that are…
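One way to picture the combination described here: a conventional detector emits structured observations, and an LLM judges them against a plain-language rule. The sketch below is hypothetical throughout (the checkpoint, detections, and safety rule are all invented for illustration) and is not Camio’s implementation.

```python
# Hedged sketch: checking detector output against a natural-language rule.
import json
from transformers import pipeline

llm = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")  # assumed checkpoint

detections = [  # hypothetical output of a conventional CV detector
    {"label": "person", "ppe": {"hardhat": False, "vest": True}},
    {"label": "person", "ppe": {"hardhat": True, "vest": True}},
]

prompt = (
    "Safety rule: everyone on the factory floor must wear a hardhat "
    "and a high-visibility vest.\n"
    f"Detections: {json.dumps(detections)}\n"
    "Is this scene compliant? Answer yes or no, then explain briefly."
)

print(llm(prompt, max_new_tokens=80)[0]["generated_text"])
```

The appeal of this pattern is that compliance rules stay in plain language, so they can change without retraining the vision model.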

Collaborating With Robots: How AI Is Enabling the Next Generation of Cobots
This blog post was originally published at Ambarella’s website. It is reprinted here with the permission of Ambarella. Collaborative robots, or cobots, are reshaping how we interact with machines. Designed to operate safely in shared environments, AI-enabled cobots are now embedded across manufacturing, logistics, healthcare, and even the home. But their role goes beyond automation—they…

R²D²: Building AI-based 3D Robot Perception and Mapping with NVIDIA Research
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Robots must perceive and interpret their 3D environments to act safely and effectively. This is especially critical for tasks such as autonomous navigation, object manipulation, and teleoperation in unstructured or unfamiliar spaces. Advances in robotic perception increasingly…

A World’s First On-glass GenAI Demonstration: Qualcomm’s Vision for the Future of Smart Glasses
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Our live demo of a generative AI assistant running completely on smart glasses — without the aid of a phone or the cloud — and the reveal of the new Snapdragon AR1+ platform spark new possibilities for…

We Built a Personalized, Multimodal AI Smart Glass Experience — Watch It Here
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Our demo shows the power of on-device AI and why smart glasses make the ideal AI user interface. Gabby walks into a gym while carrying a smartphone and wearing a pair of smart glasses. Unsure of where…

AI Blueprint for Video Search and Summarization Now Available to Deploy Video Analytics AI Agents Across Industries
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. The age of video analytics AI agents is here. Video is one of the defining features of the modern digital landscape, accounting for over 50% of all global data traffic. Dominant in media and increasingly important for…

R²D²: Unlocking Robotic Assembly and Contact Rich Manipulation with NVIDIA Research
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. This edition of NVIDIA Robotics Research and Development Digest (R2D2) explores several contact-rich manipulation workflows for robotic assembly tasks from NVIDIA Research and how they can address key challenges with fixed automation, such as robustness, adaptability, and…

NVIDIA Powers Humanoid Robot Industry With Cloud-to-robot Computing Platforms for Physical AI
New NVIDIA Isaac GR00T Humanoid Open Models Soon Available for Download on Hugging Face; GR00T-Dreams Blueprint Generates Data to Train Humanoid Robot Reasoning and Behavior; NVIDIA RTX PRO 6000 Blackwell Workstations and RTX PRO Servers Accelerate Robot Simulation and Training; Agility Robotics, Boston Dynamics, Foxconn, Lightwheel, NEURA Robotics and XPENG Robotics Among Many Robot Makers Adopting NVIDIA Isaac. COMPUTEX—NVIDIA today announced…

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
This blog post was originally published at Nota AI’s website. It is reprinted here with the permission of Nota AI. Our method, Trimmed-Llama, reduces the key-value cache (KV cache) and latency of cross-attention-based Large Vision Language Models (LVLMs) without sacrificing performance. We identify sparsity in LVLM cross-attention maps, showing a consistent layer-wise pattern where most…
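The core mechanism is easy to sketch: rank the cached visual key/value entries by the cross-attention mass they receive, and keep only the most-attended fraction. The toy code below illustrates that idea with random tensors; it is not the paper’s implementation, and the keep ratio and shapes are arbitrary.

```python
# Toy sketch: pruning a visual KV cache by cross-attention importance.
import torch

def trim_visual_kv(attn, keys, values, keep_ratio=0.3):
    """attn:         [heads, text_tokens, visual_tokens] cross-attention weights
    keys, values: [visual_tokens, dim] cached visual features."""
    # Average the attention each visual token receives over heads and queries.
    importance = attn.mean(dim=(0, 1))          # [visual_tokens]
    k = max(1, int(keep_ratio * importance.numel()))
    idx = importance.topk(k).indices            # indices of most-attended tokens
    return keys[idx], values[idx]               # a much smaller KV cache

# Stand-in activations: 576 visual tokens, 8 heads, 16 text queries.
attn = torch.rand(8, 16, 576).softmax(dim=-1)
keys, values = torch.randn(576, 128), torch.randn(576, 128)
small_k, small_v = trim_visual_kv(attn, keys, values)
print(small_k.shape)  # roughly 30% of the 576 visual entries kept
```

If the sparsity pattern is consistent across layers, as the post describes, the ranking could be computed once at an early layer and reused, which is one plausible source of the latency savings.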

Deploying an Efficient Vision-Language Model on Mobile Devices
This blog post was originally published at Nota AI’s website. It is reprinted here with the permission of Nota AI. Recent large language models (LLMs) have demonstrated unprecedented performance in a variety of natural language processing (NLP) tasks. Thanks to their versatile language processing capabilities, it has become possible to develop various NLP applications that…
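For a flavor of the kind of compression typically involved in mobile deployment, the sketch below applies post-training dynamic quantization to a model’s linear layers with PyTorch. This illustrates one generic technique and is not Nota AI’s actual pipeline; the toy model stands in for the linear-heavy parts of a real VLM.

```python
# Generic sketch: int8 dynamic quantization of linear layers for mobile.
import torch
import torch.nn as nn

# Toy stand-in for the linear-heavy parts of a VLM's language model.
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 32000),  # hypothetical vocabulary-sized output head
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights, float activations
)
print(quantized)  # Linear layers replaced by dynamically quantized versions
```

Quantizing weights to int8 roughly quarters their memory footprint relative to float32, which matters on memory-constrained handsets.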

LM Studio Accelerates LLM Performance With NVIDIA GeForce RTX GPUs and CUDA 12.8
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Latest release of the desktop application brings enhanced dev tools and model controls, as well as better performance for RTX GPUs. As AI use cases continue to expand — from document summarization to custom software agents —…