Temporal Interpolation as an Unsupervised Pretraining Task for Optical Flow Estimation

The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues, and start by training a network for a task for which annotation is easier or which can be trained unsupervised. The trained network is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.
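As a rough illustration of why temporal interpolation needs no annotation, the sketch below builds training samples directly from a video: the two neighbours of a frame are the input and the frame itself is the target. The function names, the L1 loss, and the frame-averaging stand-in for the network are assumptions made for this example only; the abstract does not specify the loss or the architecture used.

```python
import numpy as np

def make_triplets(video):
    """Turn a video of shape (T, H, W, C) into interpolation samples.

    Frames t-1 and t+1 form the input, frame t is the target, so the
    supervision signal comes from the video itself (no flow labels).
    """
    inputs = np.stack([video[:-2], video[2:]], axis=1)  # (T-2, 2, H, W, C)
    targets = video[1:-1]                               # (T-2, H, W, C)
    return inputs, targets

def l1_loss(prediction, target):
    """Mean absolute error between predicted and true middle frames.

    Assumption: a plain L1 reconstruction loss, chosen for illustration.
    """
    return float(np.mean(np.abs(prediction - target)))

def average_baseline(inputs):
    """Hypothetical stand-in for the CNN: average the two input frames."""
    return inputs.mean(axis=1)

# Toy usage on random frames (5 frames of 4x4 RGB).
rng = np.random.default_rng(0)
video = rng.random((5, 4, 4, 3))
inputs, targets = make_triplets(video)
loss = l1_loss(average_baseline(inputs), targets)
```

In an actual setup, `average_baseline` would be replaced by a trainable network, and the weights learned this way would then be fine-tuned on a small set of ground truth flow fields.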
Author(s): Jonas Wulff and Michael J. Black
Book Title: German Conference on Pattern Recognition (GCPR)
Volume: LNCS 11269
Pages: 567-582
Year: 2018
Month: October
Publisher: Springer, Cham
Bibtex Type: Conference Paper (inproceedings)
DOI: https://doi.org/10.1007/978-3-030-12939-2_39
BibTeX:
@inproceedings{Wulff:GCPR:2018,
  title = {Temporal Interpolation as an Unsupervised Pretraining Task for Optical Flow Estimation},
  booktitle = {German Conference on Pattern Recognition (GCPR)},
  abstract = {The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues, and start by training a network for a task for which annotation is easier or which can be trained unsupervised. The trained network is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.},
  volume = {LNCS 11269},
  pages = {567--582},
  publisher = {Springer, Cham},
  month = oct,
  year = {2018},
  slug = {wulff-gcpr-2018},
  author = {Wulff, Jonas and Black, Michael J.},
  month_numeric = {10}
}