Devi Parikh, Research Scientist at Facebook AI Research (FAIR) and Assistant Professor at Georgia Tech, presents the “Words, Pictures, and Common Sense: Visual Question Answering” tutorial at the May 2018 Embedded Vision Summit.

Wouldn’t it be nice if machines could understand content in images and communicate this understanding as effectively as humans? Such technology would be immensely powerful, be it for aiding a visually-impaired user navigate a world built by the sighted, assisting an analyst in extracting relevant information from a surveillance feed, educating a child playing a game on a touch screen, providing information to a spectator at an art gallery, or interacting with a robot. As computer vision and natural language processing techniques are maturing, we are closer to achieving this dream than we have ever been.

Visual Question Answering (VQA) is one step in this direction. Given an image and a natural language question about the image (e.g., “What kind of store is this?”, “How many people are waiting in the queue?”, “Is it safe to cross the street?”), the machine’s task is to automatically produce an accurate natural language answer (“bakery”, “5”, “Yes”). In this talk, Parikh presents her research group’s dataset, the results it has obtained using neural models, and open research questions in free-form and open-ended VQA.

Here you’ll find a wealth of practical technical insights and expert advice to help you bring AI and visual intelligence into your products without flying blind.



1646 N. California Blvd.,
Suite 360
Walnut Creek, CA 94596 USA

Phone: +1 (925) 954-1411
Scroll to Top