Adel Ahmadyan, Staff Engineer at Meta Reality Labs, presents the “Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models” tutorial at the May 2024 Embedded Vision Summit.
In this talk, Ahmadyan explores the use of multimodal large language models in real-world edge applications. He begins by explaining how these large multimodal models (LMMs) work and highlighting their key components, paying special attention to how LMMs fuse understanding across the vision and language domains.
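To make the fusion idea concrete: a common pattern in published LMMs is to run an image through a vision encoder, project the resulting features into the language model's embedding space, and feed them to the LLM as a prefix to the text tokens. The PyTorch sketch below is a minimal, hypothetical illustration of that general pattern only; the ToyLMM class, its placeholder modules, and all dimensions are assumptions made for illustration and are not taken from the talk or its slides.

```python
import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    """Toy illustration: project vision features into an LLM's token embedding space."""
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Placeholders standing in for a pretrained vision encoder and a pretrained LLM.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)   # bridges the two modalities
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_token_ids):
        # image_patches: (batch, num_patches, vision_dim)
        # text_token_ids: (batch, seq_len)
        vis = self.projector(self.vision_encoder(image_patches))  # map into LLM embedding space
        txt = self.token_embed(text_token_ids)
        fused = torch.cat([vis, txt], dim=1)  # image tokens act as a prefix to the text tokens
        return self.lm_head(self.decoder(fused))

model = ToyLMM()
logits = model(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```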
Next, Ahmadyan discusses the process of training LMMs and the types of data needed to tune them for specific tasks. Finally, he highlights some of the key challenges in deploying LMMs on resource-constrained edge devices and shares techniques for overcoming them.
See here for a PDF of the slides.