Page - 174 - in Document Image Processing


J. Imaging 2017, 3, 62

the extracted font, onto the reconstructed background, with a layout similar to one of the original documents (see Figure 2).

The font is extracted using a semi-automatic method. First, an Optical Character Recognition (OCR) system automatically associates a label (Unicode value) to each character on the image. The Tesseract OCR engine [40] is used. Then the user can modify the character properties (label, baseline, margins, etc.). Our software allows us to pair a given symbol with several character images in the extracted font. Thus, when the font is used to write a text, the software will randomly choose an image for the required symbol. This will make the final output look more realistic by reducing the strict uniformity between similar letters.

In order to correctly write text with this font, the baseline of each character has to be computed. The baseline is the imaginary straight line on which semi-cursive or cursive text is aligned, upon which most letters "sit" and below which descenders extend. In order to extract each character baseline and deal with various documents, we propose a different approach from classical ones [40,41]. The main originality of our baseline extraction method is that the baseline is computed for each character individually instead of finding a baseline for the whole line. We evaluated this method on more than 5000 manually annotated baselines considered as the ground truth. The baseline extraction error rate is relatively close to the one obtained using [40] (the same baseline extraction error rate). However, our method has the main advantage of being robust to skew orientation, the hand-written wavy pattern and unaligned columns in a same page. Besides, the interword distance is automatically computed as the average of the character widths. For the intercharacter distance, none is specified by default. However, the user can interactively specify the left and right margins to better position the character relative to others. The GUI of DocCreator also allows the user to modify several parameters to improve the extracted characters.
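The font model described above (several character images per symbol, a baseline stored per character, an interword distance equal to the average character width) can be sketched roughly as follows. This is a minimal illustration and not DocCreator's actual code: the `Glyph` and `ExtractedFont` classes, their field names and the `layout_line` helper are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Glyph:
    """One extracted character image (hypothetical structure)."""
    label: str             # Unicode label assigned by the OCR or by the user
    width: int             # width of the character image, in pixels
    baseline: int          # per-character baseline: row offset from the image top
    left_margin: int = 0   # optional margins set interactively by the user
    right_margin: int = 0


@dataclass
class ExtractedFont:
    """Maps each symbol to one or more glyph variants (hypothetical structure)."""
    glyphs: dict = field(default_factory=dict)

    def add(self, glyph):
        self.glyphs.setdefault(glyph.label, []).append(glyph)

    def interword_distance(self):
        # Average width over all character images, as described in the text.
        widths = [g.width for variants in self.glyphs.values() for g in variants]
        return round(mean(widths))

    def layout_line(self, text, baseline_y, rng=random):
        """Return (glyph, x, y_top) placements for one line of text.

        Raises KeyError if the text uses a symbol missing from the font.
        """
        placements, x = [], 0
        space = self.interword_distance()
        for ch in text:
            if ch == " ":
                x += space
                continue
            glyph = rng.choice(self.glyphs[ch])  # random variant per occurrence
            x += glyph.left_margin
            # Align the glyph's own baseline with the baseline of the line.
            placements.append((glyph, x, baseline_y - glyph.baseline))
            x += glyph.width + glyph.right_margin
        return placements


font = ExtractedFont()
font.add(Glyph("a", width=10, baseline=14))
font.add(Glyph("a", width=12, baseline=15))  # second image for the same symbol
font.add(Glyph("b", width=8, baseline=20))
placed = font.layout_line("a b", baseline_y=100, rng=random.Random(0))
```

Storing the baseline on each glyph rather than on the line is what lets characters taken from skewed or wavy text be reassembled on a straight line.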
The user can change the baseline or the letter assigned to a character and smooth the border of a character. Via this semi-automatic font extraction method, the user is able to correct mistakes made by the OCR (frequent on old documents). From several testing sessions, we evaluate the time needed to correctly extract a font between 30 seconds (when the OCR works accurately) and 60 min (when the OCR fails and the user has to manually extract the characters).

Once the font is extracted, the background of the document can be computed. This background extraction is performed completely automatically. For that purpose, we apply an inpainting method to remove all the characters. We use the OpenCV implementation of [42].

To construct a realistic document, the layout of the document image is also extracted. Document image physical layout analysis algorithms can be categorized into three classes: top-down approaches [43], bottom-up approaches [44] and hybrid approaches [45,46]. As word segmentation is already available via Tesseract OCR (but not the complete document layout), we use a hybrid approach proposed by [45]. With only one parameter to adjust the number of extracted blocks, this method ensures a good layout segmentation of many different classes of typewritten documents. DocCreator, as an interactive software, leaves once more the possibility to adapt to the wished segmentation results. This method has the advantage of a very low computational cost, without any preprocessing training required.

At this point, the three characteristics used in the synthetic image generation process have been extracted (background, font and layout). The next step is to assemble these elements with a given text in order to build the final output, which is the created synthetic image and the associated XML ground truth. Figure 2 illustrates a synthetic image (right) created automatically from a given original document image (left). As this example illustrates, a complete automatic generation may still produce perfectible results.
In particular, if the original image suffers from local deformations (as the original image in Figure 2), the characters extracted to build the font may have different forms or sizes, and, when assembled to compose the final document, may locally look too different and thus not realistic. Obviously, one can combine fonts, background images and layouts from different images, together with various texts, to generate many synthetic document images.

Document Image Processing

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Editor: MDPI
Location: Basel
Date: 2018
Language: English
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Size: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science