J. Imaging 2018, 4, 37
5.1. Data Sets and Evaluation Protocols
In this subsection, we discuss the datasets and the experimental settings that we follow in the
experiments. Our datasets, given in Table 2, comprise scanned English books from a digital library
collection. We manually created ground truth at word level for the quantitative evaluation of the
methods. The first collection (D1) of words is from a book which is reasonably clean. The second
dataset (D2) is larger in size and is used to demonstrate the performance in case of heterogeneous
print styles. The third dataset (D3) is a noisy book and is used to demonstrate the utility of our
method on degraded collections. We also report results on the popular George Washington
dataset. For the experiments, we extract profile features [11] for each of the word images. For this,
we divide the image horizontally into two parts and compute the following features: (i) the vertical
profile, i.e., the number of ink pixels in each column, (ii) the location of the lowermost ink pixel,
(iii) the location of the uppermost ink pixel, and (iv) the number of ink-to-background transitions.
The profile features are calculated on binarized word images obtained using the Otsu thresholding
algorithm. The features are normalized to [0, 1] so as to avoid dominance of any specific feature.
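As a rough illustration of the four profile features listed above, the following is a minimal sketch and not the implementation of [11]. It assumes an 8-bit grayscale image with dark ink on a light background, computes the features over the full image height (the split into two parts is not shown), and assigns position 0 to columns without ink. The function names `otsu_threshold` and `profile_features` are hypothetical.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold of a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of class "<= t"
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    denom = omega * (1.0 - omega)
    numer = (mu[-1] * omega - mu) ** 2
    # Between-class variance; defined as 0 where one class is empty.
    sigma_b = np.where(denom > 0, numer / np.where(denom > 0, denom, 1.0), 0.0)
    return int(np.argmax(sigma_b))

def profile_features(gray):
    """Return a (width, 4) array of per-column features scaled to [0, 1].

    Columns: ink count, lowermost ink row, uppermost ink row,
    ink-to-background transitions (scanning top to bottom).
    """
    ink = gray <= otsu_threshold(gray)       # assumption: ink is darker
    height = ink.shape[0]
    has_ink = ink.any(axis=0)
    vertical = ink.sum(axis=0)
    upper = np.where(has_ink, ink.argmax(axis=0), 0)
    lower = np.where(has_ink, height - 1 - ink[::-1].argmax(axis=0), 0)
    transitions = (ink[:-1] & ~ink[1:]).sum(axis=0)
    feats = np.stack([vertical, lower, upper, transitions], axis=1).astype(float)
    # Min-max normalization per feature; constant features map to 0.
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return (feats - lo) / np.where(hi > lo, hi - lo, 1.0)
```

Each row of the returned array corresponds to one image column, so a word image of width W yields a sequence of W four-dimensional feature vectors.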
To evaluate the quantitative performance, multiple query images were generated. The query
images are selected such that they have multiple occurrences in the database and are mostly functional
words and do not include the stop words. The performance is measured by mean Average Precision
(mAP), which is the mean of the area under the precision-recall curve for all the queries.
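The mAP measure can be sketched as follows. This is an illustrative computation, not the evaluation code used in the experiments. It assumes each query returns a full ranking of the database, given as a list of 0/1 relevance flags in rank order, and it uses the common finite-sum form of average precision as the approximation to the area under the precision-recall curve. The function names are hypothetical.

```python
import numpy as np

def average_precision(relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    rel = np.asarray(relevance, dtype=float)
    n_relevant = rel.sum()
    if n_relevant == 0:
        return 0.0
    # Precision at each rank, kept only at the ranks of relevant items.
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / n_relevant)

def mean_average_precision(ranked_relevances):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in ranked_relevances]))
```

For example, a query whose relevant items appear at ranks 1 and 3 of four results has average precision (1 + 2/3) / 2 = 5/6.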
Table 2. Details of the datasets considered in the experiments. The first collection (D1) of words is
from a book which is reasonably clean. The second dataset (D2) is obtained from 2 books and is
used to demonstrate the performance in case of heterogeneous print styles. The third dataset (D3) is
a noisy book.
Dataset   Source    Type    #Images   #Queries
D1        1 Book    Clean   14,510    100
D2        2 Books   Clean   32,180    100
D3        1 Book    Noisy   4,100     100
5.2. Experimental Settings
For representing word images, we prefer a fixed length sequence representation of the visual
content, i.e., each word image is represented as a fixed length sequence of vertical strips. A set of
features $f_1, \ldots, f_L$ is extracted, where $f_i \in \mathbb{R}^M$ is the feature representation of the
$i$th vertical strip and $L$ is the number of vertical strips. This can be considered as a single feature
vector $F \in \mathbb{R}^d$ of size $d = LM$. We implement the query specific alignment based solution as
discussed in Section 4. For the query expansion based solution, we identify the five most similar
samples to the query using approximate nearest neighbor search and compute their mean.
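The fixed length representation and the query expansion step can be sketched as below. This is a minimal illustration under stated assumptions: an exact brute-force Euclidean search stands in for the approximate nearest neighbor search, the expanded query is taken to be the mean of the five neighbors only (the text does not say whether the query itself is included), and the function names `flatten_strips` and `expand_query` are hypothetical.

```python
import numpy as np

def flatten_strips(strip_features):
    """Concatenate L strip features of dimension M into one vector of size L*M."""
    f = np.asarray(strip_features, dtype=float)   # shape (L, M)
    return f.reshape(-1)

def expand_query(query, database, k=5):
    """Return the mean of the k database vectors closest to the query.

    Assumption: exact Euclidean search replaces approximate search here.
    """
    dists = np.linalg.norm(database - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return database[nearest].mean(axis=0)
```

Here `database` is an array with one flattened word representation per row, so the expanded query has the same dimension $d$ as the original query and can be used in its place for retrieval.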
Each dataset contains certain words which are more frequent than others. The number of samples
in the frequent word classes is larger than in the rare classes. The retrieval results for frequent
queries give better performance because the number of relevant samples available in the dataset is
greater. It is worth emphasizing that for the method proposed in this paper (QS DTW), the degradation
in the performance for rare queries is much smaller than for other methods.
5.3. Results for Frequent Queries
Table 3 compares the retrieval performance of the direct query classifier (DQC) with the nearest
neighbor classifier using different options for distance measures. The performance is shown in terms
of mean average precision (mAP) values on three datasets. For the nearest neighbor classifier, we
experimented with five distance measures: naive DTW distance, Fast approximate DTW distance [20],
query specific DTW (QS DTW) distance, FastDTW [30] and Euclidean distance. We see that DTW
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science