Document Image Processing
Page - 174 - in Document Image Processing


J. Imaging 2017, 3, 62

the extracted font, onto the reconstructed background, with a layout similar to one of the original documents (see Figure 2).

The font is extracted using a semi-automatic method. First, an Optical Character Recognition (OCR) system automatically associates a label (Unicode value) to each character on the image. The Tesseract OCR engine [40] is used. Then the user can modify the character properties (label, baseline, margins, etc.). Our software allows us to pair a given symbol with several character images in the extracted font. Thus, when the font is used to write a text, the software will randomly choose an image for the required symbol. This will make the final output look more realistic by reducing the strict uniformity between similar letters.

In order to correctly write text with this font, the baseline of each character has to be computed. The baseline is the imaginary straight line on which semi-cursive or cursive text is aligned, upon which most letters "sit" and below which descenders extend. In order to extract each character baseline and deal with various documents, we propose a different approach from classical ones [40,41]. The main originality of our baseline extraction method is that the baseline is computed for each character individually instead of finding a baseline for the whole line. We evaluated this method on more than 5000 manually annotated baselines considered as the ground truth. The baseline extraction error rate is relatively close to the one obtained using [40] (the same baseline extraction error rate). However, our method has the main advantage of being robust to skew orientation, the hand-written wavy pattern and unaligned columns in the same page. Besides, the inter-word distance is automatically computed as the average of the character widths. For the inter-character distance, none is specified by default. However, the user can interactively specify the left and right margins to better position the character relative to others. The GUI of DocCreator also allows the user to modify several parameters to improve the extracted characters.
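The random choice among several images of the same symbol, the per-character baseline, and the inter-word distance described above can be illustrated with a minimal sketch. It is not DocCreator's actual code: the `Glyph` and `ExtractedFont` names, their fields, and the convention that `baseline` is measured in pixels from the top of the glyph image are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Glyph:
    """One extracted character image (hypothetical structure)."""
    label: str     # Unicode value assigned by the OCR or corrected by the user
    image_id: str  # stand-in for the cropped character image
    width: int     # glyph width in pixels
    baseline: int  # assumed: distance in pixels from glyph top to its baseline


class ExtractedFont:
    """Maps each symbol to one or more glyph variants (hypothetical class)."""

    def __init__(self, glyphs):
        glyphs = list(glyphs)
        self.variants = {}
        for g in glyphs:
            self.variants.setdefault(g.label, []).append(g)
        # Inter-word distance: the average of the character widths.
        self.word_gap = mean(g.width for g in glyphs)

    def write(self, text, line_y=0, rng=None):
        """Return (image_id, x, top_y) placements for one line of text."""
        rng = rng or random.Random()
        x = 0
        placed = []
        for ch in text:
            if ch == " ":
                x += self.word_gap
                continue
            # Randomly pick one of the images paired with this symbol;
            # a symbol missing from the font raises KeyError.
            g = rng.choice(self.variants[ch])
            # Align this glyph's own baseline on the text line.
            placed.append((g.image_id, x, line_y - g.baseline))
            # No inter-character distance is added by default.
            x += g.width
        return placed
```

Writing the same text twice with different random generators may pick different images for a repeated letter, which is what reduces the strict uniformity between similar letters.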
The user can change the baseline or the letter assigned to a character and smooth the border of a character. Via this semi-automatic font extraction method, the user is able to correct mistakes made by the OCR (frequent on old documents). From several testing sessions, we estimate the time needed to correctly extract a font at between 30 seconds (when the OCR works accurately) and 60 min (when the OCR fails and the user has to manually extract the characters).

Once the font is extracted, the background of the document can be computed. This background extraction is performed completely automatically. For that purpose, we apply an inpainting method to remove all the characters. We use the OpenCV implementation of [42].

To construct a realistic document, the layout of the document image is also extracted. Document image physical layout analysis algorithms can be categorized into three classes: top-down approaches [43], bottom-up approaches [44] and hybrid approaches [45,46]. As word segmentation is already available via Tesseract OCR (but not the complete document layout), we use a hybrid approach proposed by [45]. With only one parameter to adjust the number of extracted blocks, this method ensures a good layout segmentation of many different classes of typewritten documents. DocCreator, as an interactive software, once more leaves the user the possibility to adjust the segmentation to the desired result. This method has the advantage of a very low computational cost, without any preprocessing training required.

At this point, the three characteristics used in the synthetic image generation process have been extracted (background, font and layout). The next step is to assemble these elements with a given text in order to build the final output, which is the created synthetic image and the associated XML ground truth. Figure 2 illustrates a synthetic image (right) created automatically from a given original document image (left). As this example illustrates, a completely automatic generation may still produce imperfect results.
In particular, if the original image suffers from local deformations (as the original image in Figure 2), the characters extracted to build the font may have different forms or sizes, and, when assembled to compose the final document, may locally look too different and thus not realistic. Obviously, one can combine fonts, background images and layouts from different images with various texts, to generate many synthetic document images.
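To give a sense of how quickly such combinations multiply, here is a minimal sketch that enumerates every (font, background, layout, text) combination. The function name and the string identifiers are hypothetical placeholders, not part of DocCreator; in practice each entry would refer to an extracted font, an inpainted background, a segmented layout, or a text file.

```python
from itertools import product


def enumerate_combinations(fonts, backgrounds, layouts, texts):
    """Yield one generation job per (font, background, layout, text) tuple."""
    for font, background, layout, text in product(fonts, backgrounds, layouts, texts):
        yield {"font": font, "background": background, "layout": layout, "text": text}


# Hypothetical identifiers standing in for extracted resources.
jobs = list(
    enumerate_combinations(
        fonts=["font_A", "font_B"],
        backgrounds=["bg_1", "bg_2", "bg_3"],
        layouts=["layout_1"],
        texts=["text_x", "text_y"],
    )
)
print(len(jobs))  # 2 * 3 * 1 * 2 = 12 combinations
```

Even a handful of extracted fonts, backgrounds and layouts therefore yields a much larger set of distinct synthetic images.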
Title
Document Image Processing
Authors
Ergina Kavallieratou
Laurence Likforman-Sulem
Editor
MDPI
Location
Basel
Date
2018
Language
German
License
CC BY-NC-ND 4.0
ISBN
978-3-03897-106-1
Size
17.0 x 24.4 cm
Pages
216
Keywords
document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category
Computer Science (Informatik)