Title: Local Attention Guided Joint Depth Upsampling
Authors: Mallick, Arijit; Engelhardt, Andreas; Braun, Raphael; Lensch, Hendrik P. A.
Editors: Bender, Jan; Botsch, Mario; Keim, Daniel A.
Date: 2022-09-26
Year: 2022
ISBN: 978-3-03868-189-2
DOI: 10.2312/vmv.20221197 (https://doi.org/10.2312/vmv.20221197)
Handle: https://diglib.eg.org:443/handle/10.2312/vmv20221197
Pages: 1-8 (8 pages)
License: Attribution 4.0 International License
CCS Concepts: Computing methodologies --> Computer vision; Image representations; Reconstruction

Abstract: Image super-resolution is a classical computer vision problem. One branch of this field addresses guided depth super-resolution: the goal is to accurately upsample a given low-resolution depth map using features aggregated from the high-resolution color image of the same scene. Recently, transformers have improved performance on general image processing tasks, a gain largely credited to self-attention. Unlike previous methods for guided joint depth upsampling, which rely mostly on CNNs, we compute self-attention efficiently via local image attention, which avoids the quadratic growth typically found in self-attention layers. Our work combines CNNs and transformers to analyze the two input modalities and employs a cross-modal fusion network to predict both a weighted per-pixel filter kernel and a residual for the depth estimation. To further enhance the final output, we integrate a differentiable and trainable deep guided filtering network, which provides an additional depth prior. An ablation study and empirical trials demonstrate the importance of each proposed module. Our method achieves performance comparable to or exceeding the state of the art on the guided depth upsampling task.
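
The abstract's key efficiency claim is that restricting self-attention to local image windows avoids the quadratic cost of global attention. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the non-overlapping window partition, single attention head, and window size are illustrative assumptions. Attention within k x k windows costs O(H*W*k^2) rather than the O((H*W)^2) of attending over all pixel pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_window_attention(q, k, v, window=8):
    """Self-attention restricted to non-overlapping windows (illustrative sketch).

    q, k, v: (H, W, C) feature maps; H and W must be divisible by `window`.
    Each pixel attends only to the window*window pixels of its own window.
    """
    H, W, C = q.shape
    n = window

    def to_windows(x):
        # (H, W, C) -> (H//n, W//n, n*n, C): each window becomes a token sequence
        x = x.reshape(H // n, n, W // n, n, C).transpose(0, 2, 1, 3, 4)
        return x.reshape(H // n, W // n, n * n, C)

    qw, kw, vw = map(to_windows, (q, k, v))
    # Scaled dot-product attention per window: (.., n*n, n*n) score matrices
    attn = softmax(qw @ kw.transpose(0, 1, 3, 2) / np.sqrt(C))
    out = attn @ vw
    # Undo the window partition back to (H, W, C)
    out = out.reshape(H // n, W // n, n, n, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)
```

With `window=1` each pixel attends only to itself, so the operation reduces to the identity; growing the window trades cost for a larger attention neighborhood, and the full transformer layer in the paper adds learned projections and multiple heads on top of this core operation.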