Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities
Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
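To make the abstract's central idea concrete, here is a minimal sketch of what a temporal feature similarity loss could look like. It assumes PyTorch and (B, N, D) patch features from a frozen self-supervised encoder (e.g., DINO) for frames t and t+k; the function names, temperature value, and normalization choices below are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def temporal_similarity_targets(feats_t, feats_tk, temperature=0.1):
    # Hypothetical helper: build soft targets from patch-feature affinities.
    # feats_t, feats_tk: (B, N, D) frozen patch features for frames t and t+k.
    # Returns (B, N, N) row-normalized similarities: for each patch at time t,
    # a distribution over patches at time t+k (moving patches shift this mass,
    # which is how the loss encodes a motion bias).
    feats_t = F.normalize(feats_t, dim=-1)
    feats_tk = F.normalize(feats_tk, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", feats_t, feats_tk)  # patch-to-patch affinities
    return F.softmax(sim / temperature, dim=-1)

def temporal_similarity_loss(pred_logits, feats_t, feats_tk):
    # Cross-entropy between the model's predicted affinity distributions
    # (pred_logits: (B, N, N), e.g., decoded from slots) and the targets.
    targets = temporal_similarity_targets(feats_t, feats_tk)
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return -(targets * log_pred).sum(dim=-1).mean()

In this reading, the pre-trained features supply the targets rather than being reconstructed directly, which is why the loss can be combined with a standard feature reconstruction loss as described above.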
Author(s): Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius
Book Title: Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)
Year: 2023
Month: December
BibTeX Type: Conference Paper (inproceedings)
Event Name: Advances in Neural Information Processing Systems 36
Event Place: New Orleans, USA
URL: https://openreview.net/forum?id=t1jLRFvBqm
BibTeX
@inproceedings{Zadaianchuk2023VideoSAUR,
  title = {Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
  abstract = {Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.},
  month = dec,
  year = {2023},
  slug = {zadaianchuk2023videosaur},
  author = {Zadaianchuk, Andrii and Seitzer, Maximilian and Martius, Georg},
  url = {https://openreview.net/forum?id=t1jLRFvBqm},
  month_numeric = {12}
}