Page - 172 - in Document Image Processing
J. Imaging 2017, 3, 62
are reliably annotated, copyright-free, up-to-date or easily available to download. An alternative for
researchers and digital curators is to create their own ground truth by manually annotating document
images. To assist them in the tedious task of ground truth creation, multiple software tools have been
proposed during the last two decades.
As detailed in Table 1, some are fully manual stand-alone software (PinkPanther (1998) [17],
TrueViz (2003) [18]), while others provide semi-automatic annotation modules (GEDI (2010) [19],
Aletheia (2011) [20,21]). Some of the most recent solutions are based on an online collaborative platform
(Transcriptorium (2014) [22], DIVADIAWI [23] (2015), [24] (2016), Recital manuscript platform [25]
(2017)). Among non-open-source solutions, some have an academic licence: [20,26]. These tools
assist the user in creating the ground truth associated with real documents, which are intrinsically limited in
number because of acquisition procedures and copyright issues. Moreover, despite the use of such
software, manual annotation remains a costly task that cannot always be performed by a non-specialist.
Another solution is available for getting (quickly and with lower human cost) large
ground-truthed document image datasets. This solution, investigated since the beginning of the
nineties [27], is to generate synthetic images with controlled ground truth. The authors of [28,29]
propose two similar systems. They consist of using a text editor (e.g., Microsoft Word, LaTeX, etc.) to
automatically create multiple documents with varied contents (in terms of font, background, layout).
Alternative approaches consist of re-arranging, in a new way, elements extracted from real images
so as to generate (manually, semi-automatically or automatically) multiple semi-synthetic document
images [12,30]. Recently, in particular with the advent of deep learning techniques which require huge
masses of training data, the need for synthetic data generation seems to be ever-growing. In [31],
among the 60,000 character patches that were used to train a convolutional network for text recognition,
only 3000 (5%) were real.
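The idea of generating documents whose ground truth is known by construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [28,29] nor DocCreator's implementation: the names `Box` and `synthesize_page`, the fixed-pitch font model, and the default page size are all hypothetical, and no image is actually rendered. Only the layout and its ground-truth boxes are produced.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Ground-truth record for one placed word (pixel coordinates)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def synthesize_page(words, page_width=600, margin=20, seed=0):
    """Lay out `words` on a blank page and return the ground-truth boxes.

    The font size and character width are hypothetical stand-ins for a
    real rendering step (assumption: fixed-pitch font).
    """
    rng = random.Random(seed)
    font_size = rng.choice([10, 12, 14])   # the "font" varies per document
    char_width = font_size // 2            # crude fixed-pitch assumption
    line_height = font_size + 4
    x, y = margin, margin
    boxes = []
    for word in words:
        width = char_width * len(word)
        # Wrap to the next line when the word would cross the right margin.
        if x + width > page_width - margin and x > margin:
            x = margin
            y += line_height
        boxes.append(Box(word, x, y, width, font_size))
        x += width + char_width            # one space between words
    return boxes


# Several documents with varied appearance from the same content.
words = "synthetic images with controlled ground truth".split()
pages = [synthesize_page(words, page_width=200, seed=s) for s in range(3)]
```

Because every box is recorded at the moment the word is placed, the annotation costs nothing beyond the generation itself, which is the appeal of the synthetic route over manual ground-truthing.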
In this paper we present DocCreator, an open-source and multi-platform software that is able to
create virtually unlimited amounts of different ground-truthed synthetic document images based on
a small number of real images.
Table 1. Technical and functional characteristics of existing annotation software. Six features are
presented: export format, source availability, desktop/online software, groundtruthing assistance
(whether the software provides features that help the user to quickly create the groundtruth),
collaborative/crowd-sourcing software, and year of distribution.
Software                  Export    Open-Source  Desktop/Online  Groundtruthing Assistance  Collaborative  Year

Software for manual ground truth creation
PinkPanther [17]          ASCII     n/a          desktop         no                         no             1998
TrueViz [18]              XML       yes          desktop         no                         no             2003
PerfectDoc [32]           XML       yes          desktop         ?                          no             2005
PixLabeler [33]           XML       no           desktop         no                         no             2009
GEDI [19]                 XML       yes          desktop         yes                        no             2010
DAE [34]                  no        yes          online          yes                        yes            2011
Aletheia [20,26]          XML       no           online/desktop  yes                        no             2011
Transcriptorium [22]      TEI-XML   no           online          yes                        yes            2014
DIVADIAWI [23]            XML       n/a          online          yes                        n/a            2015
Recital [25]              no        yes          online          yes                        yes            2017

Algorithms for synthetic data augmentation
Baird et al. [27]         no        n/a          n/a             n/a                        no             1990
Zhao et al. [28]          no        n/a          n/a             n/a                        no             2005
Delalandre et al. [12]    no        n/a          n/a             n/a                        no             2010
Yin et al. [30]           no        n/a          n/a             n/a                        no             2013
Mas et al. [24]           no        n/a          n/a             n/a                        yes            2016
Seuret et al. [35]        no        yes          n/a             n/a                        no             2015

Software for semi-automatic ground truth creation and data augmentation capabilities
DocCreator                XML       yes          online/desktop  yes                        no             2017
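Most of the tools in Table 1 export their ground truth as XML. As a rough illustration of what such an export might contain, the sketch below serializes a list of annotated regions with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`groundtruth`, `region`, `label`, and so on) and the function `boxes_to_xml` are hypothetical and do not reproduce the schema of any tool listed in the table.

```python
import xml.etree.ElementTree as ET


def boxes_to_xml(image_name, boxes):
    """Serialize (label, x, y, width, height) tuples to an XML string.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("groundtruth", image=image_name)
    for label, x, y, width, height in boxes:
        ET.SubElement(
            root, "region",
            label=label, x=str(x), y=str(y),
            width=str(width), height=str(height),
        )
    return ET.tostring(root, encoding="unicode")


# Two annotated regions on a hypothetical page image.
xml_text = boxes_to_xml(
    "page.png",
    [("text", 20, 20, 300, 14), ("illustration", 20, 60, 100, 80)],
)
```

Keeping the coordinates as attributes of one element per region makes the file easy to read back with any XML parser when the annotations are later used for evaluation.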
As illustrated in Figure 1, DocCreator can handle the creation of ground-truthed synthetic
images from a limited set of real images. Various realistic degradation models can be applied on
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science