Trevor Darrell, Professor at the University of California, Berkeley, presents the “Future of Visual AI: Efficient Multimodal Intelligence” keynote at the May 2025 Embedded Vision Summit.
AI is on the cusp of a revolution, driven by the convergence of several breakthroughs. One of the most significant of these advances is the development of large language models (LLMs) that can reason like humans, enabling them to make decisions and take actions based on complex, nuanced inputs. Another is the integration of natural language processing and computer vision through vision-language models (VLMs). In this keynote talk, Darrell shares his perspective on the current state and trajectory of research advancing machine intelligence. He presents highlights of his group’s groundbreaking work, including methods for training vision models when labeled data is unavailable and techniques that enable robots to determine appropriate actions in novel situations.
Particularly relevant to edge applications, much of Darrell’s work aims to overcome obstacles, such as massive memory and compute requirements, that limit the practical deployment of state-of-the-art models. For example, he discusses approaches to making VLMs smaller and more efficient while retaining accuracy, and he shows how LLMs can serve as visual reasoning coordinators, orchestrating multiple task-specific models to achieve superior performance. He also demonstrates how multimodal AI, visual perception and prompt-tuned reasoning are enabling consumers to use visual intelligence at home while preserving privacy.
See here for a PDF of the slides.