Joint Austrian Computer Vision and Robotics Workshop 2020
Page 8
…end-to-end training. Instead, Shalnov et al. [30] were able to create a deep model using a CNN for camera pose estimation via object detections of human heads. In the work of Pavlakos et al. [24], a geometric approach to object pose estimation using semantic keypoints is taken, but their published dataset only uses outdoor objects and is thus not applicable to docking. Lastly, as part of their CenterNet, Zhou et al. [35] propose to regress from center points to other object properties including pose, but their framework is unnecessarily complex for the task at hand.

While the methods are numerous, no single framework exists that combines deep learning for object detections with a PnP-solver, all for the application of mobile robot docking. This work shows that the hesitation to use CNNs for robot docking is unwarranted, as long as the learning task is manageable in complexity.

3. METHODS AND IMPLEMENTATION

The Robotino mobile robot used in this project was equipped with a Logitech C920 USB webcam. A remote desktop with an NVIDIA GTX 1080 GPU runs ROS to control the robot and process the images.

To showcase the flexibility of the pipeline regarding the visual target, no QR-tags or ArUco markers [8, 27] were used. Instead, three small paper printouts of a logo were fixed on a board roughly 20 by 15 centimeters in size, and this board was used for training the detector. Video data was collected while arbitrarily moving the robot around close to the target. From the roughly 4500 recorded images, 100 were selected to form the training set. The chosen images show the target from different viewing angles, distances, and lighting conditions, while a few images do not show the target at all, to control for false positives. The bounding box coordinates of the three logos in all 100 images were manually annotated; creating the annotations took around three hours. Resizing the images to 512x512 RGB images allows the usage of Che et al. [4]'s toolbox, which has many different object detectors implemented.

Accuracy of the detector is important, since wrong detections would lead to wrong pose estimates and erroneous controls, while inference speed is important to enable smooth docking. Inference times below 70 milliseconds are unnecessary, however, due to the bottleneck imposed by transporting 960x720 images from the camera to the remote desktop via Wi-Fi using the ROS image transport package for compressed transfer. Looking at various speed-accuracy trade-off comparisons between object detectors, Faster R-CNN [26] with pretrained ResNet [10] backbones seems to be a sweet spot; ResNet50 was chosen for this implementation. Faster R-CNN belongs to the class of detectors that use a separate region proposal network to generate bounding box proposals.

For the optimizer, the default stochastic gradient descent with a momentum of 0.9 was used, and the learning rate was kept at its default of 0.01 with a linear step learning rate scheduler and warmup. Other parameters and image augmentation steps were kept at the defaults of Che et al. [4]'s configuration of Faster R-CNN for Pascal VOC [5], including a 50 percent chance of a random horizontal flip. From the inferred bounding boxes, the image coordinates of the upper-left and lower-right corners of all three logos are saved for the PnP-solver. The Faster R-CNN network was trained for fifty epochs, which amounted to 37 minutes of training time on a GTX 1080 graphics card. GPU memory usage was 2 GB, showing that a weaker graphics unit would suffice. Both bounding box and classification loss plateaued after ten epochs of training.
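As a rough illustration of the training configuration just described (ResNet50 backbone, SGD with momentum 0.9, learning rate 0.01, fifty epochs), the following is a minimal sketch using torchvision's Faster R-CNN implementation rather than the toolbox of [4]; the dummy image, box, and scheduler parameters are hypothetical stand-ins, not the authors' setup.

```python
# Sketch of the detector fine-tuning described above, using torchvision's
# Faster R-CNN with a pretrained ResNet50-FPN backbone instead of the
# toolbox of [4]. The dummy data below stands in for the 100 annotated
# 512x512 training images; in practice a real DataLoader with the manual
# logo annotations (and the 50% random horizontal flip) goes here.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained detector; swap the head for background + one "logo" class.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
model.to(device).train()

# SGD with momentum 0.9 and learning rate 0.01, as reported; the step
# scheduler here is a stand-in for the linear warmup + step schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# One dummy 512x512 image with one hypothetical logo box, for illustration.
images = [torch.rand(3, 512, 512, device=device)]
targets = [{
    "boxes": torch.tensor([[100.0, 120.0, 180.0, 170.0]], device=device),
    "labels": torch.tensor([1], device=device),
}]

for epoch in range(50):  # fifty epochs, as in the text
    loss_dict = model(images, targets)  # box regression + classification losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

At inference time, the top-left and bottom-right corners of the three predicted boxes are read off and passed to the PnP-solver, as described above.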
The required pose estimate at timestep $t$ for path planning can be described with the transformation matrix $T^{\text{target}}_{\text{base},t} \in \mathbb{R}^{4 \times 4}$ from the base link of the robot to the target position near the station:

$$
T^{\text{target}}_{\text{base},t} =
\begin{bmatrix}
R & \vec{t} \\
\vec{0}^{\top} & 1
\end{bmatrix}_t
\qquad (2)
$$

with $R \in \mathbb{R}^{3 \times 3}$ and $\vec{t} \in \mathbb{R}^{3 \times 1}$ being the rotation matrix and translation vector to be estimated at sampling time $t$, respectively. Physically measuring the transformation from the base link of the robot to the camera sensor, as well as relating the logos at $K_{\text{logo}}$ to the target location, allows an estimated camera pose from a reference coordinate system on the logo-board, $T^{\text{camera}}_{\text{logo}}$, to be linearly transformed into $T^{\text{target}}_{\text{base}}$. Obtaining the transformation $T^{\text{camera}}_{\text{logo}}$ with a calibrated camera, assuming the pinhole camera model, means solving correspondences between points in 2D image space and those same points in the 3D real world. After calibrating the camera using the ROS camera calibration package, the measured points in the object frame and the saved image coordinates are combined in the OpenCV solvePnP algorithm using the intrinsic camera parameters. Among the available variations of the algorithm, the iterative method is the default; it is based on Levenberg-Marquardt optimization [15] and finds a pose which minimizes the reprojection error.
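The step from detected corners to the docking pose can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: the corner coordinates, intrinsics, and fixed transforms below are hypothetical placeholders, and the frame convention T_a_b (pose of frame b expressed in frame a, so the chain runs opposite to the superscript/subscript notation of Eq. (2)) is an assumption made for the sketch.

```python
# Minimal sketch of the PnP step: recover the camera pose relative to the
# logo board with OpenCV's default iterative (Levenberg-Marquardt) solver,
# then chain it with the physically measured transforms. All numbers are
# hypothetical placeholders. Convention: T_a_b maps points from frame b
# into frame a.
import cv2
import numpy as np

# 3D corners (upper-left, lower-right) of the three logos in the
# logo-board frame, in meters; hypothetical board measurements.
object_points = np.array([
    [0.00, 0.00, 0.0], [0.05, 0.04, 0.0],   # logo 1
    [0.08, 0.00, 0.0], [0.13, 0.04, 0.0],   # logo 2
    [0.04, 0.08, 0.0], [0.09, 0.12, 0.0],   # logo 3
], dtype=np.float64)

# Matching 2D corners taken from the Faster R-CNN bounding boxes, in
# pixels of the 960x720 camera image; hypothetical detections.
image_points = np.array([
    [410.0, 280.0], [460.0, 320.0],
    [500.0, 280.0], [550.0, 320.0],
    [455.0, 360.0], [505.0, 400.0],
], dtype=np.float64)

# Intrinsics as produced by the ROS camera calibration package
# (hypothetical values); distortion is taken as zero for the sketch.
K = np.array([[700.0,   0.0, 480.0],
              [  0.0, 700.0, 360.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)

# Assemble the homogeneous transform T_camera_logo from rvec/tvec.
R_cl, _ = cv2.Rodrigues(rvec)
T_camera_logo = np.eye(4)
T_camera_logo[:3, :3] = R_cl
T_camera_logo[:3, 3] = tvec.ravel()

# Fixed transforms measured once by hand (identity stands in here):
T_base_camera = np.eye(4)   # base link -> camera sensor
T_logo_target = np.eye(4)   # logo board -> target pose near the station

# Chain the transforms to express the target pose in the base frame.
T_base_target = T_base_camera @ T_camera_logo @ T_logo_target
print(T_base_target)
```

The chaining step is the linear transformation referred to above: once the two fixed transforms are measured, each new solvePnP result yields an updated docking pose by two matrix multiplications.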
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Computer Science, Technology