Page 18 in Proceedings of the OAGM&ARW Joint Workshop - Vision, Automation and Robotics

Two binary descriptors are compared by computing their Hamming distance [14], i.e. the number of differing bits, which is a very efficient operation. Normally, the descriptor with the minimum Hamming distance is chosen as the best match. To improve the robustness of the matching, we additionally apply the distance ratio test proposed in [20]. It accepts a match only if the ratio between the distances to the two closest neighbors is below a threshold $r_{\max} \in \mathbb{R}$ with $0 < r_{\max} < 1$. For binary descriptors, the ratio $r_H \in \mathbb{R}$ is defined as

$$r_H = \frac{d_{H,1}}{d_{H,2}} < r_{\max}, \tag{2}$$

where $d_{H,1} \in \mathbb{N}$ and $d_{H,2} \in \mathbb{N}$ are the Hamming distances to the closest and the second-closest neighbor, respectively. In our case, we use an empirical threshold of $r_{\max} = 0.71$, which helps to remove ambiguous matches that can occur at repetitive structures such as branches.

C. Motion Estimation and Key Frame Selection

In this step, the relative camera motion, i.e. the relative transformation $T^{C,m}_{C,m-1}$ between an image pair $\{m-1, m\}$, is calculated. For this purpose, we use calibrated stereo cameras and two sets of corresponding features $F_{m-1}$ and $F_m$ of the images $m-1$ and $m$, respectively. For the 3D-to-2D algorithm, the features of $F_{m-1}$ are given as 3D points in $C_{m-1}$ and those of $F_m$ as 2D image points [24]. Normally, we use 2D features of the left image with coordinate system $v_m$. Alternatively, if the motion estimation fails because there are too few feature matches, features of the right image with coordinate system $v'_m$ can be used instead to prevent a failure of the VO. The 3D points are estimated with the linear triangulation method of Hartley and Zisserman [16], which is implemented in OpenCV [5]. Using a function $d_E$ that calculates the Euclidean distance [11], the transformation $T^{C,m}_{C,m-1}$ can be found by minimizing the image reprojection error of all features:

$$\min_{T^{C,m}_{C,m-1}} \sum_{i=1}^{n} d_E\left({}^{v,m}t^{v,m}_{x,i},\; {}^{v,m}\hat{t}^{v,m}_{x,i}\left(T^{C,m}_{C,m-1}\right)\right)^2. \tag{3}$$

Here, ${}^{v,m}t^{v,m}_{x,i}$ is the 2D coordinate vector of the image point $x_i$, and ${}^{v,m}\hat{t}^{v,m}_{x,i}$ is the image coordinate vector of the 3D point $X_i$, which is observed in $C_{m-1}$ and projected into image $m$ through $T^{C,m}_{C,m-1}$ and the corresponding camera projection matrix [16]. Equation (3) can be solved with as few as three 3D-to-2D correspondences; this minimal case is known as P3P (Perspective from three Points) and returns up to four solutions. Therefore, at least one additional point is necessary to obtain a unique solution. PnP algorithms (Perspective from n Points) like EPnP (Efficient PnP) [18] use $n \ge 3$ correspondences to solve the problem. These methods yield accurate results only if the correspondences used are correct. If this is not guaranteed, the well-known RANSAC procedure (Random Sample Consensus) [12] should be used to remove wrong correspondences, so-called outliers. Such a robust motion estimation using RANSAC is explained in more detail in [13]. Our VO uses EPnP for the pose estimation and a preliminary non-minimal RANSAC with five points to obtain a trustworthy outlier removal, as suggested by Fraundorfer et al. [13].

If the first motion estimation with the left image fails because there are too few feature matches, or if the estimated motion is implausible (the position or orientation is unrealistic), the estimation is retried with the 2D features of another image as a backup. The images are tried in the following order. First, the right image of the current stereo frame is used. If the motion estimation with the features of this image is also unsuccessful, the next still unused left or right image is used, until the motion estimation step succeeds. This procedure avoids a failure of the VO with high probability.

The selection of key frames is another important component of our VO. In general, the drift of a VO increases with every frame, i.e. with every relative motion that is used to update the absolute motion.
Therefore, the concatenation of small motions should be avoided to keep the drift as low as possible. This means that the transformation $T^{C,m}_{C,m-1}$ should not be used to update the absolute transformation $T^{C,m}_{C,1}$ if the motion between the image pair $\{m-1, m\}$ is small or even zero. Instead, we keep $T^{C,m-1}_{C,1}$. We define a stereo frame $m$ as a key frame if its relative transformation is used for the absolute motion update. Our requirement is that the relative change in position is larger than 2 m or the relative angle of rotation [9] is larger than 20°.

D. Bundle Adjustment

Windowed bundle adjustment [28] is the last important step in our feature-based VO system. It is used to optimize the relative transformations of the most recent $M$ key frames. For simplicity, we assume $n$ 3D points $i \in \{1, \dots, n\}$ that are seen in a window of $M \le m$ key frames $j \in \{\underline{m}, \dots, m\}$. Here, the index of the oldest stereo frame in the window is defined as $\underline{m} = m - M + 1$. To reduce the computational demand, our VO uses a window with only the most recent $M = 2$ key frames, i.e. the features of four images in total are used for the optimization. As in (3), bundle adjustment is a minimization of the image reprojection error and is given by

$$\min_{T^{C,j}_{C,1},\; {}^{C,1}t^{C,1}_{X,i}} \sum_{i=1}^{n} \sum_{j=\underline{m}}^{m} d_E\left({}^{v,j}t^{v,j}_{x,i},\; {}^{v,j}\hat{t}^{v,j}_{x,i}\left(T^{C,j}_{C,1},\, {}^{C,1}t^{C,1}_{X,i}\right)\right)^2. \tag{4}$$

Here, ${}^{v,j}t^{v,j}_{x,i}$ and ${}^{v,j}\hat{t}^{v,j}_{x,i}$ are the vectors of the observed and the estimated 2D coordinates of point $i$ in key frame $j$, respectively. Because the point $X_i$ is projected into the image plane, the estimated coordinates depend on the absolute transformations $T^{C,j}_{C,1}$, the 3D coordinate vector ${}^{C,1}t^{C,1}_{X,i}$ and the corresponding camera projection matrices. The camera parameters are assumed to be constant and known from a prior calibration. The minimization of (4) is done using the sparse bundle adjustment library of Lourakis et al. [19].
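The descriptor matching with the distance ratio test of (2) can be illustrated with a minimal Python sketch. It assumes that descriptors are stored as equal-length byte strings and uses a brute-force search; the function names `hamming_distance` and `ratio_test_match` are hypothetical and not taken from the described system. Only the threshold of 0.71 comes from the text.

```python
from typing import List, Optional


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits of two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR marks the differing bits; count the set bits byte by byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def ratio_test_match(query: bytes, candidates: List[bytes],
                     r_max: float = 0.71) -> Optional[int]:
    """Return the index of the best candidate, or None if it is ambiguous.

    Brute-force search (an assumption of this sketch): the match is kept
    only if r_H = d_H1 / d_H2 < r_max, as in equation (2).
    """
    if len(candidates) < 2:
        return None  # the ratio test needs two neighbors
    ranked = sorted((hamming_distance(query, c), idx)
                    for idx, c in enumerate(candidates))
    (d1, best_idx), (d2, _) = ranked[0], ranked[1]
    if d2 == 0:
        return None  # two identical candidates: ambiguous by definition
    return best_idx if d1 / d2 < r_max else None
```

A candidate that is clearly closer than the runner-up is accepted, while two equally close candidates (ratio 1) are rejected, which is the behavior intended for repetitive structures.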
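The cost function in (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, and a relative transformation given as a rotation matrix `R` and translation `t` that map a point from the previous camera frame into the current one via `R X + t`. The direction of the transformation and all names are assumptions of this sketch, and the minimization itself (EPnP with RANSAC in the described system) is not shown.

```python
def project(K, R, t, X):
    """Project a 3D point X from the previous camera frame into the image.

    K = (fx, fy, cx, cy) are assumed pinhole intrinsics; R is a 3x3
    rotation (nested lists) and t a 3-vector, applied as R X + t.
    """
    fx, fy, cx, cy = K
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def reprojection_cost(K, R, t, points_3d, points_2d):
    """Sum of squared Euclidean distances between observed and projected points."""
    cost = 0.0
    for X, (u, v) in zip(points_3d, points_2d):
        u_hat, v_hat = project(K, R, t, X)
        cost += (u - u_hat) ** 2 + (v - v_hat) ** 2
    return cost
```

A pose estimator searches for the `R` and `t` that make this cost as small as possible. The bundle adjustment cost in (4) has the same form, with an additional sum over the key frames in the window and the 3D points as further unknowns.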
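The key frame criterion and the window indexing described above can be sketched as follows. The thresholds of 2 m and 20° and the index of the oldest frame, m − M + 1, come from the text. Computing the rotation angle from the trace of the rotation matrix is an assumption of this sketch (the text only cites [9] for the angle), and all function names are hypothetical.

```python
import math


def rotation_angle_deg(R):
    """Rotation angle of a 3x3 rotation matrix (nested lists) in degrees.

    Uses theta = acos((trace(R) - 1) / 2); this formula is an assumption,
    not necessarily the definition used in [9].
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_theta))


def is_key_frame(R, t, min_translation_m=2.0, min_rotation_deg=20.0):
    """True if the relative motion (R, t) is large enough for a key frame."""
    translation = math.sqrt(sum(c * c for c in t))
    return (translation > min_translation_m
            or rotation_angle_deg(R) > min_rotation_deg)


def window_indices(m, M):
    """Key frame indices in the window, from the oldest (m - M + 1) to m."""
    if not 1 <= M <= m:
        raise ValueError("window size must satisfy 1 <= M <= m")
    return list(range(m - M + 1, m + 1))
```

With M = 2, as used in the text, the window for key frame m = 5 contains the key frames 4 and 5.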
