Page - 174 - in Document Image Processing
J. Imaging 2017, 3, 62
the extracted font, onto the reconstructed background, with a layout similar to one of the original
documents (see Figure 2).
The font is extracted using a semi-automatic method. First, an Optical Character Recognition
(OCR) system automatically associates a label (Unicode value) with each character in the image.
The Tesseract OCR engine [40] is used. Then the user can modify the character properties (label,
baseline, margins, etc.). Our software allows us to pair a given symbol with several character
images in the extracted font. Thus, when the font is used to write a text, the software randomly
chooses an image for the required symbol. This makes the final output look more realistic by
reducing the strict uniformity between similar letters.
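The random choice among several images of the same symbol can be sketched as follows. This is only an illustration under stated assumptions, not DocCreator's actual code: the `ExtractedFont` class, its method names and the image file names are hypothetical.

```python
import random
from collections import defaultdict


class ExtractedFont:
    """Hypothetical container pairing each symbol with several character images."""

    def __init__(self, seed=None):
        self._variants = defaultdict(list)
        self._rng = random.Random(seed)

    def add_variant(self, symbol, image_id):
        # image_id stands in for a cropped character image (assumption).
        self._variants[symbol].append(image_id)

    def pick(self, symbol):
        # Randomly choose one of the images paired with the requested symbol.
        candidates = self._variants.get(symbol)
        if not candidates:
            raise KeyError(f"no image extracted for symbol {symbol!r}")
        return self._rng.choice(candidates)


font = ExtractedFont(seed=0)
font.add_variant("e", "e_001.png")
font.add_variant("e", "e_002.png")
font.add_variant("t", "t_001.png")

# Each occurrence of "e" may be drawn from a different image.
rendered = [font.pick(ch) for ch in "tee"]
```

Storing a list of images per symbol keeps the selection step trivial, and a seeded generator makes a given synthetic document reproducible.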
In order to correctly write text with this font, the baseline of each character has to be computed.
The baseline is the imaginary straight line on which semicursive or cursive text is aligned, upon
which most letters “sit” and below which descenders extend. In order to extract each character
baseline and deal with various documents, we propose a different approach from classical ones [40,41].
The main originality of our baseline extraction method is that the baseline is computed for each
character individually instead of finding a baseline for the whole line. We evaluated this method on
more than 5000 manually annotated baselines considered as the ground truth. The baseline extraction
error rate is relatively close to the one obtained using [40] (the same baseline extraction error rate).
However, our method has the main advantage of being robust to skew orientation, the wavy pattern of
handwriting and unaligned columns on the same page. Besides, the inter-word distance is automatically
computed as the average character width. No inter-character distance is specified by default.
However, the user can interactively specify the left and right margins to better position a
character relative to the others. The GUI of DocCreator also allows the user to modify several
parameters to improve the extracted characters. The user can change the baseline or the letter
assigned to a character and smooth the border of a character. With this semi-automatic font
extraction method, the user is able to correct mistakes made by the OCR (frequent on old documents).
From several testing sessions, we estimate that the time needed to correctly extract a font ranges
from 30 seconds (when the OCR works accurately) to 60 min (when the OCR fails and the user has to
manually extract the characters).
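To make the role of these quantities concrete, the sketch below places characters so that their individual baselines fall on a common text line, uses the average character width as the inter-word distance, and applies optional left and right margins. It does not reproduce the baseline extraction method itself, which is not detailed here; the `Glyph` fields, function names and sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Glyph:
    symbol: str
    width: int
    height: int
    baseline: int          # baseline row, measured from the top of the glyph image
    left_margin: int = 0   # optional user-specified margins
    right_margin: int = 0


def inter_word_distance(glyphs):
    # Average of the character widths, as stated in the text.
    return round(mean(g.width for g in glyphs))


def layout_line(text, font, x0, line_y):
    """Return (symbol, x, y) top-left positions so every baseline lies on line_y."""
    space = inter_word_distance(list(font.values()))
    x, placed = x0, []
    for ch in text:
        if ch == " ":
            x += space
            continue
        g = font[ch]
        x += g.left_margin
        placed.append((ch, x, line_y - g.baseline))
        x += g.width + g.right_margin
    return placed


font = {
    "a": Glyph("a", width=10, height=12, baseline=12),
    "p": Glyph("p", width=10, height=18, baseline=12),  # descender extends below the baseline
    "l": Glyph("l", width=4, height=20, baseline=20, left_margin=1, right_margin=1),
}
positions = layout_line("pal a", font, x0=0, line_y=100)
```

Because the vertical offset is derived from each glyph's own baseline, a letter with a descender such as "p" starts at the same height as "a" but extends below the text line.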
Once the font is extracted, the background of the document can be computed. This background
extraction is performed completely automatically. For that purpose, we apply an inpainting method to
remove all the characters. We use the OpenCV implementation of [42].
To construct a realistic document, the layout of the document image is also extracted.
Document image physical layout analysis algorithms can be categorized into three classes: top-down
approaches [43], bottom-up approaches [44] and hybrid approaches [45,46]. As word segmentation is
already available via Tesseract OCR (but not the complete document layout), we use a hybrid approach
proposed by [45]. With only one parameter to adjust the number of extracted blocks, this method
ensures a good layout segmentation of many different classes of typewritten documents. DocCreator,
as interactive software, once again lets the user adjust the result to obtain the desired
segmentation. This method has the advantage of a very low computational cost, without any
preprocessing training required.
At this point, the three characteristics used in the synthetic image generation process have been
extracted (background, font and layout). The next step is to assemble these elements with a given
text in order to build the final output, which is the created synthetic image and the associated XML
ground truth. Figure 2 illustrates a synthetic image (right) created automatically from a given
original document image (left). As this example illustrates, a fully automatic generation may still
produce imperfect results. In particular, if the original image suffers from local deformations (as
the original image in Figure 2 does), the characters extracted to build the font may have different
shapes or sizes and, when assembled to compose the final document, may locally look too different and
thus not realistic.
Obviously, one can combine fonts, background images and layouts from different images with various
texts to generate many synthetic document images.
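As a rough illustration of how quickly such combinations multiply, the sketch below enumerates every pairing of fonts, backgrounds, layouts and texts. The resource names are placeholders, and the dictionary describing a generation job is a hypothetical structure, not DocCreator's input format.

```python
from itertools import product

# Placeholder names for resources extracted from different original documents.
fonts = ["font_doc_A", "font_doc_B"]
backgrounds = ["bg_doc_A", "bg_doc_B", "bg_doc_C"]
layouts = ["layout_doc_A"]
texts = ["text_1.txt", "text_2.txt"]

# One generation job per (font, background, layout, text) combination.
jobs = [
    {"font": f, "background": b, "layout": l, "text": t}
    for f, b, l, t in product(fonts, backgrounds, layouts, texts)
]

print(len(jobs))  # 2 * 3 * 1 * 2 = 12
```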
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science