[VGS-IT] Learning visual representations from Internet data

FIT VUT in Brno, 22.4.2016

The next speaker in the VGS-IT series will be Josef Sivic.

The talk will be given on Friday, April 22 at 10:30 in room E105.

Title: Learning visual representations from Internet data

Abstract: An unprecedented amount of visual data is now available on the Internet. Wouldn't it be great if a machine could automatically learn from this data? For example, imagine a machine that can learn how to change a flat tire on a car by watching instruction videos on YouTube, or that can learn how to navigate in a city by observing street-view imagery. Learning from Internet data is, however, a very challenging problem, as the data comes only with weak supervisory signals such as human narration of the instruction video or noisy geotags for street-level imagery. In this talk, I will describe our recent progress on learning visual representations from such weakly annotated visual data.

In the first part of the talk, I will describe a new convolutional neural network architecture that is trainable in an end-to-end manner for the visual place recognition task. I will show that the network can be trained from weakly annotated Google Street View Time Machine imagery and significantly improves over the current state of the art in visual place recognition.

In the second part of the talk, I will describe a technique for automatically learning the main steps needed to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The method solves two clustering problems, one in text and one in video, linked by joint constraints to obtain a single coherent sequence of steps in both modalities. I will show results on a newly collected dataset of instruction videos from YouTube that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings.

All are cordially invited.

