ATLAS-MVSNet: Attention Layers for Feature Extraction and Cost Volume Regularization in Multi-View Stereo

Research output: Chapter in Book/Report/Conference proceeding > Conference paper > peer-review

Abstract

We present ATLAS-MVSNet, an end-to-end deep learning architecture relying on local attention layers for depth map inference from multi-view images. Distinct from existing works, we introduce a novel module design for neural networks, which we term the hybrid attention block, that utilizes the latest insights into attention in vision models. We are able to reap the benefits of attention in both the carefully designed multi-stage feature extraction network and the cost volume regularization network. Our new approach shows a significant improvement over its counterpart based purely on convolutions. While many state-of-the-art methods need multiple high-end GPUs in the training phase, we are able to train our network on a single consumer-grade GPU. ATLAS-MVSNet exhibits excellent performance, especially in terms of accuracy, on the DTU dataset. Furthermore, ATLAS-MVSNet ranks amongst the top published methods on the online Tanks and Temples benchmark.
Original language: English
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: ACM/IEEE
Pages: 3557-3563
Number of pages: 7
ISBN (Electronic): 9781665490627
ISBN (Print): 978-1-6654-9063-4
DOIs
Publication status: Published - 25 Aug 2022
Event: 26th International Conference on Pattern Recognition: ICPR 2022 - Montreal, Canada
Duration: 21 Aug 2022 to 25 Aug 2022

Conference

Conference: 26th International Conference on Pattern Recognition
Abbreviated title: ICPR 2022
Country/Territory: Canada
City: Montreal
Period: 21/08/22 to 25/08/22

Keywords

  • Training
  • Three-dimensional displays
  • Costs
  • Memory management
  • Neural networks
  • Graphics processing units
  • Deep architecture

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
