Multimodal Large Language Models

LLMs and MLLMs

The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to lack of necessary technical skills.

Recent advances in large language models (LLMs) and their variants, such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.

The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications, especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models”, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:


If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.

View all LLM and MLLM Content

R²D²: Unlocking Robotic Assembly and Contact Rich Manipulation with NVIDIA Research

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. This edition of NVIDIA Robotics Research and Development Digest (R2D2) explores several contact-rich manipulation workflows for robotic assembly tasks from NVIDIA Research and how they can address key challenges with fixed automation, such as robustness, adaptability, and

NVIDIA Powers Humanoid Robot Industry With Cloud-to-robot Computing Platforms for Physical AI

New NVIDIA Isaac GR00T Humanoid Open Models Soon Available for Download on Hugging Face GR00T-Dreams Blueprint Generates Data to Train Humanoid Robot Reasoning and Behavior NVIDIA RTX PRO 6000 Blackwell Workstations and RTX PRO Servers Accelerate Robot Simulation and Training Agility Robotics, Boston Dynamics, Foxconn, Lightwheel, NEURA Robotics and XPENG Robotics Among Many Robot Makers Adopting NVIDIA Isaac COMPUTEX: NVIDIA today announced NVIDIA

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

This blog post was originally published at Nota AI’s website. It is reprinted here with the permission of Nota AI. Our method, Trimmed-Llama, reduces the key-value cache (KV cache) and latency of cross-attention-based Large Vision Language Models (LVLMs) without sacrificing performance. We identify sparsity in LVLM cross-attention maps, showing a consistent layer-wise pattern where most

Deploying an Efficient Vision-Language Model on Mobile Devices

This blog post was originally published at Nota AI’s website. It is reprinted here with the permission of Nota AI. Recent large language models (LLMs) have demonstrated unprecedented performance in a variety of natural language processing (NLP) tasks. Thanks to their versatile language processing capabilities, it has become possible to develop various NLP applications that

Advancing Generative AI at the Edge During CES 2025

This blog post was originally published at Ambarella’s website. It is reprinted here with the permission of Ambarella. For this year’s CES, our theme was Your GenAI Edge—highlighting how Ambarella’s AI SoCs continue to redefine what’s possible with generative AI at the edge. Building on last year’s edge GenAI demos, we debuted a new 25-stream,

R²D²: Adapting Dexterous Robots with NVIDIA Research Workflows and Models

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Robotic arms are used today for assembly, packaging, inspection, and many more applications. However, they are still preprogrammed to perform specific and often repetitive tasks. To meet the increasing need for adaptability in most environments, perceptive arms

Using AI to Better Understand the Ocean

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Humans know more about deep space than we know about Earth’s deepest oceans. But scientists have plans to change that—with the help of AI. “We have better maps of Mars than we do of our own exclusive

Rockets to Retail: Intel Core Ultra Delivers Edge AI for Video Management

At Intel Vision, Network Optix debuts natural language prompt prototype to redefine video management, offering industries faster AI-driven insights and efficiency. On the surface, aerospace manufacturers, shopping malls, universities, police departments and automakers might not have a lot in common. But they each collectively use and manage hundreds to thousands of video cameras across their

R²D²: Advancing Robot Mobility and Whole-body Control with Novel Workflows and AI Foundation Models from NVIDIA Research

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Welcome to the first edition of the NVIDIA Robotics Research and Development Digest (R2D2). This technical blog series will give developers and researchers deeper insight and access to the latest physical AI and robotics research breakthroughs across

Ambarella Debuts Next-generation Edge GenAI Technology at ISC West, Including Reasoning Models Running on its CVflow Edge AI SoCs

With Over 30 Million Edge AI Systems-on-Chip Shipped, Ambarella is Driving Innovation for a Broad Range of On-Device and On-Premise Generative AI Applications SANTA CLARA, Calif., March 31, 2025 — Ambarella, Inc. (NASDAQ: AMBA), an edge AI semiconductor company, today announced during the ISC West security expo that it is continuing to push the envelope

Video Understanding: Qwen2-VL, An Expert Vision-language Model

This article was originally published at Tenyks’ website. It is reprinted here with the permission of Tenyks. Qwen2-VL, an advanced vision language model built on Qwen2 [1], sets new benchmarks in image comprehension across varied resolutions and ratios, while also tackling extended video content. Though Qwen2-VL excels on many fronts, this article explores the model’s

NVIDIA Announces Isaac GR00T N1 — the World’s First Open Humanoid Robot Foundation Model — and Simulation Frameworks to Speed Robot Development

Now Available, Fully Customizable Foundation Model Brings Generalized Skills and Reasoning to Humanoid Robots NVIDIA, Google DeepMind and Disney Research Collaborate to Develop Next-Generation Open-Source Newton Physics Engine New Omniverse Blueprint for Synthetic Data Generation and Open-Source Dataset Jumpstart Physical AI Data Flywheel March 18, 2025—GTC—NVIDIA today announced a portfolio of technologies to supercharge humanoid

NVIDIA Announces Major Release of Cosmos World Foundation Models and Physical AI Data Tools

New Models Enable Prediction, Controllable World Generation and Reasoning for Physical AI Two New Blueprints Deliver Massive Physical AI Synthetic Data Generation for Robot and Autonomous Vehicle Post-Training 1X, Agility Robotics, Figure AI, Skild AI Among Early Adopters March 18, 2025—GTC—NVIDIA today announced a major release of new NVIDIA Cosmos™ world foundation models (WFMs), introducing

Here you’ll find a wealth of practical technical insights and expert advice to help you bring AI and visual intelligence into your products without flying blind.

Contact

Address

Berkeley Design Technology, Inc.
PO Box #4446
Walnut Creek, CA 94596

Phone

+1 (925) 954-1411