Introduction
The field of Simultaneous Localization and Mapping (SLAM) has seen significant advances with the advent of deep learning. Deep SLAM methods are commonly divided into two classes according to how the camera pose is estimated: direct regression-based methods and optimization-based methods.
Direct regression-based methods, such as DeMoN, DeepTAM, and DytanVO, regress the camera pose directly with deep networks. In contrast, optimization-based methods, including CodeSLAM, BA-Net, DeepFactors, and DROID-SLAM, estimate the camera pose by minimizing residuals derived from one or more factors.
While direct regression methods offer simplicity, optimization-based methods tend to be more robust, especially when incorporating additional data sources such as Inertial Measurement Units (IMUs). IMUs can strengthen visual SLAM, particularly during rapid motion, where image features blur or drop out. However, effectively fusing visual and IMU factors within a deep network remains a challenge.
Method
The proposed Dual Visual Inertial SLAM network (DVI-SLAM) addresses this challenge by dynamically fusing multiple factors within an end-to-end trainable, differentiable structure. The factors provide complementary cues for exploiting visual information, and the formulation extends naturally to additional factors such as IMU measurements.
Two-view Reconstruction
The DVI-SLAM network is designed for two-view reconstruction, estimating camera pose, IMU motion, and inverse depth from two views. The network utilizes a feature extraction module, a multi-factor data association module, and a multi-factor Dense Bundle Adjustment (DBA) layer.
Feature Extraction Module
This module extracts features from the input images, providing a basis for further processing.
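The paper does not specify the extractor's architecture here, but the idea of turning each image into a downsampled dense feature map can be illustrated with a toy stand-in. The sketch below replaces the learned CNN encoder with hand-crafted intensity-and-gradient channels followed by average pooling; the function name and the stride of 4 are assumptions, not DVI-SLAM's actual design.

```python
import numpy as np

def extract_features(image, stride=4):
    """Toy dense feature extractor: per-pixel intensity and gradient
    channels, average-pooled to 1/stride resolution. A stand-in for the
    learned CNN encoder, for illustration only."""
    gy, gx = np.gradient(image.astype(np.float64))     # vertical/horizontal gradients
    feat = np.stack([image, gx, gy], axis=-1)          # H x W x 3 feature map
    H, W, C = feat.shape
    H2, W2 = H // stride, W // stride
    feat = feat[:H2 * stride, :W2 * stride]            # crop to a multiple of stride
    # Average-pool each (stride x stride) patch to downsample.
    return feat.reshape(H2, stride, W2, stride, C).mean(axis=(1, 3))

img = np.random.rand(64, 96)
f = extract_features(img)
print(f.shape)  # (16, 24, 3)
```

A learned encoder would produce many more channels, but the interface is the same: image in, low-resolution dense feature map out.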
Multi-factor Data Association Module
This module associates features from the current frame with features from previous frames, considering both visual and IMU factors.
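For the visual part of this association, a common mechanism (used, e.g., in DROID-SLAM, which the paper builds on) is an all-pairs correlation volume between the two frames' feature maps. The sketch below is a minimal version of that idea; the source does not confirm that DVI-SLAM computes its visual association exactly this way.

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs feature correlation between two frames.
    f1, f2: (H, W, C) dense feature maps; returns (H, W, H, W) scores,
    where entry [i, j, k, l] scores matching pixel (i, j) in frame 1
    against pixel (k, l) in frame 2."""
    H, W, C = f1.shape
    a = f1.reshape(H * W, C)
    b = f2.reshape(H * W, C)
    corr = (a @ b.T) / np.sqrt(C)   # scaled dot-product similarity
    return corr.reshape(H, W, H, W)

f1 = np.random.rand(8, 8, 16)
f2 = np.random.rand(8, 8, 16)
vol = correlation_volume(f1, f2)
print(vol.shape)  # (8, 8, 8, 8)
```

In a full system the volume is sampled around pose-induced reprojections rather than stored densely, and the IMU factor constrains which correspondences are plausible.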
Multi-factor DBA Layer
This layer performs DBA to optimize the camera pose, IMU motion, and inverse depth estimates, dynamically adjusting the confidence maps of the different factors.
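The core numerical operation of such a layer is a weighted least-squares update: each factor contributes a residual scaled by its confidence, and one damped Gauss-Newton step is solved per iteration. The toy sketch below shows this on a 2-D state with two linear "factors"; the function names, the linear residuals, and the fixed weights are illustrative assumptions, not the paper's actual DBA formulation.

```python
import numpy as np

def dba_step(x, residual_fns, jacobian_fns, weights, damping=1e-4):
    """One damped Gauss-Newton update fusing several weighted factors.
    Each factor i contributes w_i * ||r_i(x)||^2 to the objective."""
    JtJ = damping * np.eye(len(x))
    Jtr = np.zeros(len(x))
    for r_fn, J_fn, w in zip(residual_fns, jacobian_fns, weights):
        r, J = r_fn(x), J_fn(x)
        JtJ += w * J.T @ J          # accumulate weighted normal equations
        Jtr += w * J.T @ r
    return x - np.linalg.solve(JtJ, Jtr)

# Toy example: a "visual" and an "inertial" factor pulling the state
# toward different targets; the confidence weights arbitrate.
r_vis = lambda x: x - np.array([1.0, 0.0])
r_imu = lambda x: x - np.array([0.0, 1.0])
J_eye = lambda x: np.eye(2)
x = np.zeros(2)
for _ in range(50):
    x = dba_step(x, [r_vis, r_imu], [J_eye, J_eye], weights=[0.8, 0.2])
print(np.round(x, 3))  # converges near the confidence-weighted average
```

In the real layer the state holds poses, IMU motion, and inverse depths, and the solve exploits the sparse block structure of the problem, but the weighted normal equations are the same idea.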
Dynamic Multi-factor Fusion
The DVI-SLAM network dynamically adjusts the confidence maps of the various factors during optimization, ensuring that each factor contributes appropriately to the overall estimate. This dynamic fusion allows for a more robust and accurate estimation of camera pose and depth.
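One simple way to realize such dynamic weighting is to map network-predicted per-factor scores through a softmax, so the weights stay positive, sum to one, and can shift between factors at every iteration. This is a hedged sketch of the mechanism, assuming scalar per-factor scores; DVI-SLAM's actual confidence maps are dense and learned.

```python
import numpy as np

def fuse_confidences(scores):
    """Softmax over raw per-factor scores: positive weights summing to 1,
    so each factor's influence in the DBA objective adapts per iteration."""
    e = np.exp(scores - np.max(scores))   # shift for numerical stability
    return e / e.sum()

# Illustration: during fast motion the visual score drops and the
# IMU factor automatically gains weight. Scores are made up.
w_slow = fuse_confidences(np.array([2.0, 0.5]))  # [visual, imu]
w_fast = fuse_confidences(np.array([0.3, 1.5]))
print(np.round(w_slow, 2), np.round(w_fast, 2))
```

These weights would then scale the corresponding residuals inside the DBA solve, closing the loop between confidence prediction and optimization.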
Figure 1. Overview of the DVI-SLAM structure for two-view reconstruction.
In conclusion, the DVI-SLAM network offers a novel approach to fusing visual and IMU factors in a deep learning framework. Its dynamic multi-factor fusion and tight coupling between visual-inertial SLAM and DBA optimization make it a promising solution for accurate and robust SLAM in various applications.