J. Imaging 2018, 4, 43
IAM Historical Document Database (IAM-HistDB)
(http://www.fki.inf.unibe.ch/databases/iam-historical-document-database) [3], which includes
handwritten historical manuscript images from the Saint Gall Database from the 9th century in Latin;
the Parzival Database from the 13th century in German; the Washington Database from the 18th century
in English; the Ancient Lives Project (https://www.ancientlives.org/) [4], which asks volunteers to
transcribe Ancient Greek text fragments from the Oxyrhynchus Papyri collection; and many other projects.
To accelerate the process of accessing, preserving, and disseminating the contents of heritage
documents, a DIA system is needed. Besides aiming to preserve the existence of such ancient
documents physically, the DIA system is expected to enable open access to the contents of the
documents and provide opportunities for a wider audience to access all the important information
stored in them. DIA is the process of using various technologies to extract text, printed
or handwritten, and graphics from digitized document files
(http://www.cvisiontech.com/library/pdf/pdf-document/document-image-analysis.html) [5]. DIA systems
generally have a major role in identifying, analyzing, extracting, structuring, and transferring
document contents more quickly, effectively, and efficiently. Such a system can work
semi-automatically or even fully automatically, without human intervention. The DIA system is
expected to save time, cost, and effort at many points in the heritage document preservation process.
However, although DIA research is developing rapidly, it is undeniable that most of the document
collections used in its initial stage are from developed regions such as America and European countries.
The document samples from these countries are mostly written in English or old English with
Latin/Roman script. Several important document collections eventually came to be used as standard
benchmarks for the evaluation of the latest DIA research results. The next wave of DIA research then
began to deal with documents from non-English-speaking areas with non-Latin scripts, such as Arabic,
Chinese, and Japanese documents. During the evolution of DIA research in the last two decades, DIA
researchers have proposed and achieved satisfactory solutions for many complex problems of document
analysis for these types of documents. However, the DIA research challenge is ongoing. The latest
challenge is documents from Asia, with new languages and more complex scripts to explore, such as
Devanagari script [6], Gurmukhi script [7–10], Bangla script [11], and Malayalam script [12], and the
case of multiple languages and scripts in documents from India. Optical character recognition (OCR) for
Indian languages is considered more difficult in general than for European languages because of the
large number of vowels, consonants, and conjuncts (combinations of vowels and consonants) [13].
This work was part of exploring DIA research for a palm leaf manuscript collection from
Southeast Asia. This collection offers a new challenge for DIA researchers because palm leaves are
used as the writing medium and the language and script have never been analyzed before. In this
paper, we conducted a comprehensive experimental benchmark of the principal tasks in the DIA
system: binarization, text line segmentation, isolated character/glyph recognition, word
recognition, and transliteration. To the best of our knowledge, this work is the first comprehensive
study in the DIA research community, and the first to perform a complete series of experimental
benchmarking analyses, of palm leaf manuscripts. The results of this research will be very useful
in accelerating, evaluating, and improving the performance of existing DIA systems for a new type
of document.
This paper is organized as follows. Section 2 gives a brief description of the palm leaf manuscript
collections from Southeast Asia, especially the Khmer palm leaf manuscript corpus from Cambodia
and two palm leaf manuscript corpuses from Indonesia, the Balinese and Sundanese manuscripts.
The challenges of DIA for these manuscript corpuses are also presented in this section. Section 3
describes the DIA tasks that need to be developed for the palm leaf manuscript collections, followed
by a description of the methods investigated for those tasks. The datasets and evaluation methods for
each DIA task used in the experimental studies for this work are presented in Section 4. Section 5
reports and analyzes the detailed results of the experiments. Finally, conclusions are given in Section 6.
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Publisher
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Dimensions
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science