Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes

Filip Ilic; Axel Pinz

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

As top-down based approaches of object recognition from video are getting more powerful, a structured way to combine them with bottom-up grouping processes becomes feasible. When done right, the resulting representation is able to describe objects and their decomposition into parts at appropriate spatio-temporal scales. We propose a method that uses a modern object detector to focus on salient structures in video, and a dense optical flow estimator to supplement feature extraction. From these structures we extract space-time volumes of interest (STVIs) by smoothing in spatio-temporal Gaussian Scale Space that guides bottom-up grouping. The resulting novel representation enables us to analyze and visualize the decomposition of an object into meaningful parts while preserving temporal object continuity. Our experimental validation is twofold. First, we achieve competitive results on a common video object segmentation benchmark. Second, we extend this benchmark with high quality object part annotations, DAVIS Parts, on which we establish a strong baseline by showing that our method yields spatio-temporally meaningful object parts. Our new representation will support applications that require high-level space-time reasoning at the parts level.
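The abstract does not spell out how the scale-space smoothing is carried out, so the following is only an illustrative sketch of the general idea: a per-frame saliency volume is smoothed with a separable Gaussian that uses one scale along time and another along the two spatial axes, and a threshold then yields a candidate space-time volume. Everything here is an assumption made for illustration (the function names `gaussian_kernel`, `smooth_1d`, `smooth_volume`, `threshold_volume`, the `volume[t][y][x]` layout, the border clamping, and the toy input); it is not the authors' implementation.

```python
import math


def gaussian_kernel(sigma, truncate=3.0):
    """Return a normalized 1-D Gaussian kernel for the given sigma."""
    radius = max(1, int(truncate * sigma + 0.5))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(signal, kernel):
    """Filter a 1-D list with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def smooth_volume(volume, sigma_t, sigma_s):
    """Separable Gaussian smoothing of a volume indexed as volume[t][y][x].

    sigma_t is the temporal scale, sigma_s the spatial scale (assumed
    isotropic in y and x for this sketch).
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, ks = gaussian_kernel(sigma_t), gaussian_kernel(sigma_s)
    # Pass 1: smooth every row along x (this also copies the input).
    vol = [[smooth_1d(row, ks) for row in frame] for frame in volume]
    # Pass 2: smooth every column along y.
    for t in range(T):
        for x in range(W):
            col = smooth_1d([vol[t][y][x] for y in range(H)], ks)
            for y in range(H):
                vol[t][y][x] = col[y]
    # Pass 3: smooth every pixel trajectory along t.
    for y in range(H):
        for x in range(W):
            line = smooth_1d([vol[t][y][x] for t in range(T)], kt)
            for t in range(T):
                vol[t][y][x] = line[t]
    return vol


def threshold_volume(volume, level):
    """Mark voxels whose smoothed response reaches the given level."""
    return [[[v >= level for v in row] for row in frame] for frame in volume]


# Toy input (hypothetical): 5 frames of 7x7 pixels with one salient voxel.
toy = [[[0.0] * 7 for _ in range(7)] for _ in range(5)]
toy[2][3][3] = 1.0
smoothed = smooth_volume(toy, sigma_t=1.0, sigma_s=1.0)
mask = threshold_volume(smoothed, level=0.01)
```

Varying `sigma_t` and `sigma_s` independently is what makes the smoothing spatio-temporal: a larger temporal scale links responses across more frames, while a larger spatial scale merges nearby structures within a frame.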

Original language	English
Title	Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
Pages	1903-1911
Number of pages	9
ISBN (electronic)	9781728165530
Publication status	Published - 1 March 2020
Event	2020 IEEE/CVF Winter Conference on Applications of Computer Vision: WACV 2020 - Snowmass Village, United States Duration: 1 March 2020 → 5 March 2020

Publication series

Name	Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020

Conference

Conference	2020 IEEE/CVF Winter Conference on Applications of Computer Vision
Short title	WACV 2020
Country/Territory	United States
City	Snowmass Village
Period	1/03/20 → 5/03/20

ASJC Scopus subject areas

Computer Vision and Pattern Recognition
Computer Science Applications

Access to document

http://openaccess.thecvf.com/content_WACV_2020/papers/Ilic_Representing_Objects_in_Video_as_Space-Time_Volumes_by_Combining_Top-Down_WACV_2020_paper.pdf

Cite this

Ilic, F., & Pinz, A. (2020). Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes. In Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020 (pp. 1903-1911). Article 9093410 (Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020). http://openaccess.thecvf.com/content_WACV_2020/papers/Ilic_Representing_Objects_in_Video_as_Space-Time_Volumes_by_Combining_Top-Down_WACV_2020_paper.pdf

Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes. / Ilic, Filip ; Pinz, Axel.
Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020. 2020. p. 1903-1911 9093410 (Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020).

Ilic, F & Pinz, A 2020, Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes. in Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020., 9093410, Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020, pp. 1903-1911, 2020 IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, Colorado, United States, 1/03/20. <http://openaccess.thecvf.com/content_WACV_2020/papers/Ilic_Representing_Objects_in_Video_as_Space-Time_Volumes_by_Combining_Top-Down_WACV_2020_paper.pdf>

@inproceedings{d7247264839143f1a442f872c75c3762,

title = "Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes",

abstract = "As top-down based approaches of object recognition from video are getting more powerful, a structured way to combine them with bottom-up grouping processes becomes feasible. When done right, the resulting representation is able to describe objects and their decomposition into parts at appropriate spatio-temporal scales. We propose a method that uses a modern object detector to focus on salient structures in video, and a dense optical flow estimator to supplement feature extraction. From these structures we extract space-time volumes of interest (STVIs) by smoothing in spatio-temporal Gaussian Scale Space that guides bottom-up grouping. The resulting novel representation enables us to analyze and visualize the decomposition of an object into meaningful parts while preserving temporal object continuity. Our experimental validation is twofold. First, we achieve competitive results on a common video object segmentation benchmark. Second, we extend this benchmark with high quality object part annotations, DAVIS Parts, on which we establish a strong baseline by showing that our method yields spatio-temporally meaningful object parts. Our new representation will support applications that require high-level space-time reasoning at the parts level.",

author = "Filip Ilic and Axel Pinz",

year = "2020",

month = mar,

day = "1",

language = "English",

series = "Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020",

pages = "1903--1911",

booktitle = "Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020",

note = "wacv2020 : WACV 2020, WACV 2020 ; Conference date: 01-03-2020 Through 05-03-2020",

}

TY - GEN

T1 - Representing Objects in Video as Space-Time Volumes by Combining Top-Down and Bottom-Up Processes

AU - Ilic, Filip

AU - Pinz, Axel

PY - 2020/3/1

Y1 - 2020/3/1

N2 - As top-down based approaches of object recognition from video are getting more powerful, a structured way to combine them with bottom-up grouping processes becomes feasible. When done right, the resulting representation is able to describe objects and their decomposition into parts at appropriate spatio-temporal scales. We propose a method that uses a modern object detector to focus on salient structures in video, and a dense optical flow estimator to supplement feature extraction. From these structures we extract space-time volumes of interest (STVIs) by smoothing in spatio-temporal Gaussian Scale Space that guides bottom-up grouping. The resulting novel representation enables us to analyze and visualize the decomposition of an object into meaningful parts while preserving temporal object continuity. Our experimental validation is twofold. First, we achieve competitive results on a common video object segmentation benchmark. Second, we extend this benchmark with high quality object part annotations, DAVIS Parts, on which we establish a strong baseline by showing that our method yields spatio-temporally meaningful object parts. Our new representation will support applications that require high-level space-time reasoning at the parts level.

M3 - Conference paper

T3 - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020

SP - 1903

EP - 1911

BT - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020

T2 - wacv2020

Y2 - 1 March 2020 through 5 March 2020

ER -
