“Frontiers in Perceptual AI: First-person Video and Multimodal Perception,” a Keynote Presentation from Kristen Grauman

Kristen Grauman, Professor at the University of Texas at Austin and Research Director at Facebook AI Research, presents the “Frontiers in Perceptual AI: First-person Video and Multimodal Perception” keynote presentation at the May 2023 Embedded Vision Summit.

First-person or “egocentric” perception requires understanding the video and multimodal data that streams from wearable cameras and other sensors. The egocentric view offers a special window into the camera wearer’s attention, goals, and interactions with people and objects in the environment, making it an exciting avenue for both augmented reality and robot learning. The multimodal nature is particularly compelling, with opportunities to bring together audio, language, and vision.

Grauman begins her presentation by introducing Ego4D, a massive new open-sourced multimodal egocentric dataset that captures the daily-life activity of people around the world. The result of a multi-year, multi-institution effort, Ego4D pushes the frontiers of first-person multimodal perception with a suite of research challenges ranging from activity anticipation to audio-visual conversation.

Building on this resource, Grauman presents her group’s ideas for searching egocentric videos with natural language queries (“Where did I last see X? Did I leave the garage door open?”), injecting semantics from text and speech into powerful video representations, and learning audio-visual models to understand a camera wearer’s physical environment or augment their hearing in busy places. She also touches on the performance challenges raised by very long video sequences (hours!) and on ideas for learning to scale retrieval and encoders.

See here for a PDF of the slides.

