Vladimir Haltakov, Self-Driving Car Engineer at BMW Group, presents the “Data Collection in the Wild” tutorial at the May 2021 Embedded Vision Summit.
In scientific papers, computer vision models are usually evaluated on well-defined training and test datasets. In practice, however, collecting high-quality data that accurately represents the real world is a challenging problem. A model developed on a non-representative dataset may achieve high accuracy during testing yet perform poorly when deployed in the real world.
In this presentation, Haltakov discusses the challenges, common pitfalls and possible solutions for creating datasets for real-world problems. He also explains how to avoid typical biases while curating data, takes a deep dive into imbalanced distributions, and presents techniques for handling them. Finally, he discusses strategies to detect and deal with model drift after a model is deployed in production.
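The abstract does not say which techniques for imbalanced distributions the talk covers. As one widely used illustration only, the sketch below computes inverse-frequency class weights, so that rare classes count for more in a training loss. The function name `inverse_frequency_weights` and the `car` / `pedestrian` labels are hypothetical and are not taken from the presentation.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, heavily imbalanced label set (assumption for illustration).
labels = ["car"] * 90 + ["pedestrian"] * 10
weights = inverse_frequency_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Weights of this kind are typically passed to a loss function or sampler. Whether reweighting, resampling, or targeted data collection is the better choice depends on the dataset and the task.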
See here for a PDF of the slides.