The Alliance is delighted to host a thought-provoking keynote presentation at the Embedded Vision Summit (May 22-24 in Santa Clara, California) by renowned AI expert Kristen Grauman, Professor of Computer Science at the University of Texas at Austin and Research Scientist at Facebook AI Research. Her talk will launch our 2023 Summit program, focused on creating accurate, efficient perceptual AI implementations that can be deployed to solve real-world problems.
Dr. Grauman’s keynote, “Frontiers in Perceptual AI: First-Person Video and Multimodal Perception,” will illuminate some of the most exciting leading-edge work going on today in our field. Among other topics, she will share recent insights on how to enable more robust machine perception by using data from different types of sensors, and by combining natural language and sensor data.
Here’s the full abstract from Dr. Grauman:
First-person or “egocentric” perception requires understanding the video and multimodal data that streams from wearable cameras and other sensors. The egocentric view offers a special window into the camera wearer’s attention, goals, and interactions with people and objects in the environment, making it an exciting avenue for both augmented reality and robot learning. The multimodal nature is particularly compelling, with opportunities to bring together audio, language and vision.
To begin, I’ll introduce Ego4D, a massive new open-sourced multimodal egocentric dataset that captures the daily-life activity of people around the world. The result of a multi-year, multi-institution effort, Ego4D pushes the frontiers of first-person multimodal perception with a suite of research challenges ranging from activity anticipation to audio-visual conversation.
Building on this resource, I’ll present our ideas for searching egocentric videos with natural language queries (“Where did I last see X? Did I leave the garage door open?”), injecting semantics from text and speech into powerful video representations, and learning audio-visual models to understand a camera wearer’s physical environment or augment their hearing in busy places. I’ll also touch on interesting performance-oriented challenges raised by having very long video sequences (hours!) and ideas for learning to scale retrieval and encoders.
For more on Dr. Grauman’s presentation, including her bio, please see her session page on the Embedded Vision Summit website.