Devi Parikh, Research Scientist at Facebook AI Research (FAIR) and Assistant Professor at Georgia Tech, presents the “Words, Pictures, and Common Sense: Visual Question Answering” tutorial at the May 2018 Embedded Vision Summit.
Wouldn’t it be nice if machines could understand the content of images and communicate that understanding as effectively as humans do? Such technology would be immensely powerful, whether helping a visually impaired user navigate a world built by the sighted, assisting an analyst in extracting relevant information from a surveillance feed, educating a child playing a game on a touch screen, providing information to a spectator at an art gallery, or interacting with a robot. As computer vision and natural language processing techniques mature, we are closer to achieving this dream than we have ever been.
Visual Question Answering (VQA) is one step in this direction. Given an image and a natural language question about the image (e.g., “What kind of store is this?”, “How many people are waiting in the queue?”, “Is it safe to cross the street?”), the machine’s task is to automatically produce an accurate natural language answer (“bakery”, “5”, “Yes”). In this talk, Parikh presents her research group’s dataset, the results it has obtained using neural models, and open research questions in free-form and open-ended VQA.
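To make the task concrete, a common formulation of VQA (not necessarily the exact model Parikh presents) treats answering as classification over a fixed vocabulary of frequent answers: image features (e.g., from a CNN) and question features (e.g., from a recurrent encoder) are fused and passed to a classifier. The sketch below is a toy illustration of that pipeline; the feature dimensions, fusion scheme, and random weights are assumptions standing in for a trained model.

```python
# Toy sketch of a VQA-as-classification baseline (illustrative only):
# fuse image and question features, then score a fixed answer vocabulary.
import numpy as np

rng = np.random.default_rng(0)

ANSWERS = ["bakery", "5", "yes"]          # toy answer vocabulary
D_IMG, D_TXT, D_HID = 2048, 1024, 512     # assumed feature sizes

# Randomly initialized weights stand in for a trained model.
W_img = rng.standard_normal((D_IMG, D_HID)) * 0.01
W_txt = rng.standard_normal((D_TXT, D_HID)) * 0.01
W_out = rng.standard_normal((D_HID, len(ANSWERS))) * 0.01

def answer(img_feat, q_feat):
    """Fuse image and question features, then classify over answers."""
    fused = np.tanh(img_feat @ W_img) * np.tanh(q_feat @ W_txt)  # elementwise fusion
    logits = fused @ W_out
    return ANSWERS[int(np.argmax(logits))]

img_feat = rng.standard_normal(D_IMG)     # stand-in for a CNN image embedding
q_feat = rng.standard_normal(D_TXT)       # stand-in for a question embedding
print(answer(img_feat, q_feat))           # prints one of the toy answers
```

Real systems replace the random vectors with learned encoders and train end to end on question–answer pairs; the classification framing works because a small answer vocabulary covers the vast majority of answers in VQA datasets.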