Jacob Marks, Senior ML Engineer and Researcher at Voxel51, presents the “How Large Language Models Are Impacting Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
Large language models (LLMs) are revolutionizing the way we interact with computers and the world around us. However, to truly understand the world, LLM-powered agents need to be able to see.
Will models in production be multimodal, or will text-only LLMs leverage purpose-built vision models as tools? Where do techniques like multimodal retrieval-augmented generation (RAG) fit in? In this talk, Marks gives an overview of key LLM-centered projects that are reshaping the field of computer vision and discusses where we are headed in a multimodal world.
See here for a PDF of the slides.