Page - 182 - in Document Image Processing
J. Imaging 2017, 3, 62
created using the 127 original images and transformed using our 3D distortion model. The tests
presented in [39,59] confirm the conclusion of [60] about the impact of the degradation level on
re-training, either for a task of character recognition or layout extraction.
4.2. New Results on Performance Prediction Using DocCreator
Here, we show whether DocCreator can be useful for performance prediction of existing methods.
4.2.1. Increase the Prediction Rate of Predictive Binarization Algorithm
In [61], we have presented an algorithm to predict error rates of 11 binarization methods on
given document images so that the best binarization method is automatically chosen for any image
depending on its quality. This method requires ground-truthed data as input of the training step.
The DIBCO database [62] was used. However, the DIBCO database contains only 36 images.
We propose here to extend the original DIBCO database by using the ink degradation model.
Since the DIBCO database contains 36 images, we extend it with the same number of semi-synthetic
document images. This extended dataset is then used to train the prediction model of [61].
Our retraining tests show that the use of this extended dataset improves the
performance of the prediction model of [61]. More precisely, the error rate of the prediction model
decreases (until it levels off) when the number of semi-synthetic images in the training set increases.
On average, the error rate drops by about 15% compared with using only real images in the training
set. The error rate converges when the proportion of semi-synthetic images is around 50% of the
training set.
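As a rough illustration of how a training set with a chosen proportion of semi-synthetic images could be assembled, consider the minimal sketch below. It is not the procedure of [61]: the function name `build_training_set`, the file names, and the fixed random seed are assumptions introduced only for this example.

```python
import random


def build_training_set(real, synthetic, synthetic_fraction, seed=0):
    """Mix real and semi-synthetic items (hypothetical helper).

    All real items are kept. Semi-synthetic items are sampled so that they
    make up roughly `synthetic_fraction` of the returned list, capped by the
    number of semi-synthetic items available.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_s / (n_real + n_s) = fraction for n_s.
    wanted = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_synth = min(wanted, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


# Placeholder file names: 36 real images and 36 semi-synthetic ones.
real_images = [f"real_{i:02d}.png" for i in range(36)]
synthetic_images = [f"synth_{i:02d}.png" for i in range(36)]

# A 50% proportion, the level at which the error rate is reported to converge.
train = build_training_set(real_images, synthetic_images, 0.5)
```

Sweeping `synthetic_fraction` from 0 upward and retraining at each step is one way to reproduce the kind of error-rate curve described above.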
4.2.2. Predict OCR Recognition Rate Using Synthetic Images
Many on-line digital libraries provide a text search engine. To this end, the text within the
document images has to be transcribed. Depending on the OCR recognition rate, three options
are available: (1) directly use the OCR result when the recognition rate is close to 100%; (2) manually
correct the OCR result when the automatic transcription gives “acceptable quality”; or (3) do a complete
manual transcription (often quite expensive). As a consequence, it is very important to know
the OCR recognition rate before choosing among these three solutions. The number of recent
publications on this subject ([63–66]) reflects the scientific interest in predicting OCR recognition rates.
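The three options can be read as a simple decision rule on the predicted recognition rate. The sketch below is only illustrative: the text gives no numeric cut-offs, so the thresholds `direct_threshold` and `correction_threshold` and the function name are assumptions that would have to be set by each library.

```python
def choose_transcription_strategy(predicted_rate,
                                  direct_threshold=0.98,
                                  correction_threshold=0.80):
    """Map a predicted OCR recognition rate (0.0 to 1.0) to one of the
    three options. Both thresholds are hypothetical placeholder values."""
    if not 0.0 <= predicted_rate <= 1.0:
        raise ValueError("predicted_rate must be between 0 and 1")
    if predicted_rate >= direct_threshold:
        return "use OCR result directly"
    if predicted_rate >= correction_threshold:
        return "manually correct OCR result"
    return "complete manual transcription"


# Example: a book whose predicted rate is 91% would be routed to correction.
strategy = choose_transcription_strategy(0.91)
```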
We propose here to use synthetic images to predict the OCR rate of a digitized book as follows:
(1) Font, background and layout are extracted from original images (with methods described in II).
It is worth noting that the fonts were extracted thoroughly, in particular to include even
characters not recognized by the OCR, or even to adjust margins of correctly labeled characters.
(2) An adapted Lorem ipsum text is randomly generated and used to create synthetic images with the
font and background previously extracted. This adapted Lorem ipsum is generated with accented
characters (é, à, ù, etc.) and old characters (ff, fi, s, fl, ffi) if the original text contains such
characters. Generating such characters is important to have a representative dataset for fair OCR
testing. As a result, images like the one presented in Figure 2 are generated with the associated
XML ground truth. (3) An OCR (Tesseract) is finally used to recognize the text on these synthesized
images. This text is compared with the Lorem ipsum ground truth text, giving an OCR recognition
rate. We consider that this recognition rate is a prediction of the OCR rate if the OCR software was
applied on original images. Table 2 Column 1 provides the average OCR recognition rate obtained
on the original images; Table 2 Column 3 refers to the average OCR rates computed on the synthetic
“Lorem ipsum” image versions.
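Step (3) compares the OCR output with the ground truth text. The text does not state which metric is used, so the sketch below assumes a common choice: a character-level rate derived from the Levenshtein edit distance. The function names and the sample strings are illustrative only, and the OCR call itself is left out.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def recognition_rate(ground_truth, ocr_output):
    """Character-level recognition rate in [0, 1] (assumed metric)."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - distance / len(ground_truth))


# Illustrative strings: the OCR output confuses "m" with "rn".
truth = "Lorem ipsum dolor"
ocr_text = "Lorem ipsurn dolor"
rate = recognition_rate(truth, ocr_text)  # 1 - 2/17
```

Averaging this rate over all synthetic pages of a book would give a single predicted value to compare with the rate measured on the original pages.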
We also propose to evaluate the capacity of our method to correctly predict the OCR recognition
rate by comparing original images with their synthetic version generated with exactly the same text
(see Figure 11 to compare the original images and their synthetic versions). These images are generated
following this protocol: (1) pages from three books (2 typewritten and 1 manuscript book) have
been manually transcribed; (2) font, background and layout are automatically extracted from original
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science