Page 8, Joint Austrian Computer Vision and Robotics Workshop 2020
to-end training. Instead, Shalnov et al. [30] were able to create a deep model using a CNN for camera pose estimation via object detections of human heads. In the work of Pavlakos et al. [24] a geometric approach to object pose estimation using semantic keypoints is taken, but their published dataset only uses outdoor objects and is thus not applicable to docking. Lastly, as part of their CenterNet, Zhou et al. [35] propose regressing from center points to other object properties, including pose, but their framework is unnecessarily complex for the task at hand.
While the methods are numerous, no single framework exists that combines deep learning for object detection with a PnP solver for the application of mobile robot docking. This work shows that the hesitation to use CNNs for robot docking is unwarranted, as long as the learning task is manageable in complexity.
3. METHODS AND IMPLEMENTATION
The Robotino mobile robot used in this project was equipped with a Logitech C920 USB webcam. A remote desktop with an NVIDIA GTX 1080 GPU runs ROS to control the robot and process the images.
To showcase the flexibility of the pipeline regarding the visual target, no QR-tags or ArUco markers [8, 27] were used. Instead, three small paper printouts of a logo were fixed on a board roughly 20 by 15 centimeters in size, and this board was used for training the detector. Video data was collected while arbitrarily moving the robot around close to the target. From the roughly 4500 recorded images, 100 were selected to form the training set. The chosen images show the target from different viewing angles, distances, and lighting conditions, while a few images do not show the target at all to control for false positives. The bounding box coordinates of the three logos in all 100 images were manually annotated; creating the annotations took around three hours. Resizing the images to 512x512 RGB images allows the use of Che et al. [4]'s toolbox, which has many different object detectors implemented.
Accuracy of the detector is important, since wrong detections would lead to wrong pose estimates and erroneous controls, while inference speed is important to enable smooth docking. Inference times below 70 milliseconds are unnecessary, however, due to the bottleneck imposed by transporting 960x720 images from the camera to the remote desktop via Wi-Fi using the ROS image transport package for compressed transfer. Looking at various speed-accuracy trade-off comparisons between object detectors, Faster R-CNN [26] with a pretrained ResNet [10] backbone appears to be a sweet spot; ResNet50 was chosen for this implementation. Faster R-CNN belongs to the class of detectors that use a separate region proposal network to generate bounding box proposals. For the optimizer, the default stochastic gradient
descent with a momentum of 0.9 was used, and the learning rate was kept at its default of 0.01 with a linear step learning rate scheduler and warmup. Other parameters and image augmentation steps were kept at the defaults of Che et al. [4]'s configuration of Faster R-CNN for Pascal VOC [5], including a 50 percent chance of a random horizontal flip. From the inferred bounding boxes, the image coordinates of the upper-left and lower-right corners of all three logos are saved for the PnP solver. The Faster R-CNN network was trained for fifty epochs, which amounted to 37 minutes of training time on a GTX 1080 graphics card. GPU memory usage was 2 GB, showing that a weaker graphics unit would suffice. Both the bounding box and classification losses plateaued after ten epochs of training.
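The bookkeeping that turns the three detected boxes into the six 2D points for the PnP solver can be sketched as follows. The box values and the left-to-right ordering convention are made-up assumptions for illustration; the real coordinates come from the Faster R-CNN detections:

```python
import numpy as np

# Hypothetical detector output: one (x_min, y_min, x_max, y_max) box per
# logo, sorted left-to-right so the 2D-3D correspondence stays fixed.
boxes = np.array([[310.0, 220.0, 370.0, 268.0],   # logo 1
                  [402.0, 221.0, 460.0, 270.0],   # logo 2
                  [492.0, 223.0, 551.0, 271.0]])  # logo 3

# Upper-left and lower-right corner of each box: six 2D points in total,
# in the same order as the corresponding 3D model points on the board.
image_points = boxes.reshape(-1, 2)
print(image_points.shape)  # -> (6, 2)
```

A fixed ordering of the boxes (here assumed left-to-right) is what keeps each image point matched to the correct 3D point on the logo board.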
The required pose estimate at timestep t for path planning can be described with the transformation matrix T^{target}_{base,t} \in \mathbb{R}^{4 \times 4} from the base link of the robot to the target position near the station,

T^{target}_{base,t} = \begin{bmatrix} R & \vec{t} \\ \vec{0}^{\top} & 1 \end{bmatrix}_t    (2)

with R \in \mathbb{R}^{3 \times 3} and \vec{t} \in \mathbb{R}^{3 \times 1} being the rotation matrix and translation vector to be estimated at sampling time t, respectively. Physically measuring the
transformation from the base link of the robot to the camera sensor, as well as relating the logos at K_logo to the target location, allows an estimated camera pose from a reference coordinate system on the logo board, T^{camera}_{logo}, to be linearly transformed into T^{target}_{base}. Getting the transformation T^{camera}_{logo} with a
calibrated camera and assuming the pinhole camera model means solving correspondences between points in 2D image space and those same points in the 3D real world. After calibrating the camera using the ROS camera calibration package, the measured points in the object frame and the saved image coordinates are combined in the OpenCV solvePnP algorithm using the intrinsic camera parameters. Among the available variations of the algorithm, the iterative method is the default; it is based on Levenberg-Marquardt optimization [15] and finds a pose which minimizes the reprojection error