Abstract:
This paper proposes MSFNet, a multi-modal 3D point cloud registration model with multi-scale feature fusion. It addresses two weaknesses of existing descriptor extraction methods in 3D point cloud registration: they capture point cloud structure information only weakly, and they lose fine details of the point cloud data. First, a channel attention module based on sparse convolution (SCCAM) is employed in the encoder so that the model adaptively focuses on the feature structure of the point cloud. Then, a multi-scale spatial point cloud encoding (MSPCE) structure extracts and fuses point cloud features at different scales, thereby enlarging the receptive field of the point cloud descriptor. Finally, a multi-modal feature fusion module fuses the point cloud features extracted by the encoder with image features; the fused features are fed into the decoder for supervised training to generate the final point cloud descriptor. Experiments are conducted on the 3DMatch dataset with Feature-Match Recall (FMR) as the evaluation metric. The results show that MSFNet achieves an FMR of 98.4%, which is 0.8 percentage points higher than that of IMFNet (Interpretable Multimodal Fusion).
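For readers unfamiliar with the metric, the following is a minimal sketch of how FMR is commonly computed; it is not taken from the MSFNet implementation. It assumes that each fragment pair comes with putative correspondences obtained by descriptor matching and with a ground-truth rigid transform. The function names and the thresholds `tau1` (inlier distance, 0.1 m) and `tau2` (inlier ratio, 0.05) are illustrative assumptions; the abstract does not state which thresholds were used.

```python
import math


def inlier_ratio(src_pts, tgt_pts, transform, tau1=0.1):
    """Fraction of correspondences (src_pts[i] <-> tgt_pts[i]) whose distance
    is below tau1 after applying the ground-truth 4x4 transform to src_pts.
    tau1 = 0.1 is an assumed threshold, not a value given in the abstract."""
    if not src_pts:
        return 0.0
    inliers = 0
    for p, q in zip(src_pts, tgt_pts):
        # Apply rotation (upper-left 3x3) and translation (last column).
        moved = [
            sum(transform[r][c] * p[c] for c in range(3)) + transform[r][3]
            for r in range(3)
        ]
        dist = math.sqrt(sum((m - t) ** 2 for m, t in zip(moved, q)))
        if dist < tau1:
            inliers += 1
    return inliers / len(src_pts)


def feature_match_recall(pairs, tau1=0.1, tau2=0.05):
    """Fraction of fragment pairs whose inlier ratio exceeds tau2.
    `pairs` is a list of (src_pts, tgt_pts, transform) tuples.
    tau2 = 0.05 is an assumed threshold, not a value given in the abstract."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for src, tgt, T in pairs if inlier_ratio(src, tgt, T, tau1) > tau2
    )
    return hits / len(pairs)
```

Under this definition, a fragment pair counts as recalled when enough of its matched descriptors land close to their partners under the ground-truth alignment, so FMR measures descriptor quality independently of the downstream pose estimator.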