Abstract:
Sparse-view planar 3D reconstruction, a fundamental problem in computer vision, aims to recover scene information from a limited number of camera frames. Although previous methods have made significant progress, they do not adequately account for the multi-scale properties of the surrounding environment, which limits reconstruction performance. In addition, the conventional feed-forward network in the vanilla Transformer is built from fully connected layers and therefore lacks the ability to capture local information from image features. To address these two problems, this paper proposes HTP-Formers, a sparse-view planar 3D reconstruction method based on a hierarchical token pooling Transformer. Specifically, we use average pooling layers with various ratios in the Transformer model to capture multi-scale features. We then propose an inverted residual feed-forward network based on depth-wise convolution to enhance local information extraction at negligible computational cost. To demonstrate the effectiveness of HTP-Formers on planar 3D reconstruction tasks, we thoroughly evaluate the proposed model on the public Matterport3D dataset. In particular, HTP-Formers improves translational and rotational error by 6.1% and 18.3%, respectively, outperforming most existing planar 3D reconstruction methods in planar correspondence inference and relative camera pose estimation.
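The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the pooling ratios, how pooled tokens enter attention, the activation, or the kernel size, so the ratios `(2, 4)`, the use of pooled tokens as keys and values, the ReLU, the 3x3 kernel, and all function names below are assumptions chosen for illustration.

```python
import numpy as np

def avg_pool_tokens(x, ratio):
    """Average-pool an (H, W, C) token grid with a ratio x ratio window."""
    h, w, c = x.shape
    assert h % ratio == 0 and w % ratio == 0, "grid must divide evenly"
    x = x.reshape(h // ratio, ratio, w // ratio, ratio, c)
    return x.mean(axis=(1, 3))

def multi_scale_tokens(x, ratios=(2, 4)):
    """Concatenate tokens pooled at several ratios into one (N, C) sequence."""
    pooled = [avg_pool_tokens(x, r).reshape(-1, x.shape[-1]) for r in ratios]
    return np.concatenate(pooled, axis=0)

def pooled_attention(x, ratios=(2, 4)):
    """Single-head attention (assumed design): full-resolution queries,
    multi-scale pooled tokens as keys and values, no learned projections."""
    h, w, c = x.shape
    q = x.reshape(-1, c)
    kv = multi_scale_tokens(x, ratios)
    scores = q @ kv.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over pooled tokens
    return (weights @ kv).reshape(h, w, c)

def depthwise_conv3x3(x, kernels):
    """3x3 depth-wise cross-correlation with zero padding.
    kernels has shape (3, 3, C): one 3x3 filter per channel."""
    h, w, c = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, c), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out

def inverted_residual_ffn(x, w_expand, dw_kernels, w_project):
    """Expand channels, mix neighbours with a depth-wise conv,
    project back to C channels, and add the skip connection."""
    hidden = np.maximum(x @ w_expand, 0.0)                          # (H, W, C_hidden)
    hidden = np.maximum(depthwise_conv3x3(hidden, dw_kernels), 0.0)  # local mixing
    return x + hidden @ w_project                                    # (H, W, C)
```

In this sketch a 4x4 grid yields 4 + 1 = 5 pooled tokens for ratios `(2, 4)`, so attention cost scales with the pooled sequence length rather than the full 16 tokens. The depth-wise step adds only 9 weights per hidden channel, which is one plausible reading of "negligible computational cost".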
Source:
APPLIED SOFT COMPUTING
ISSN: 1568-4946
Year: 2025
Volume: 174
Impact Factor: 8.700 (JCR@2022)