Page 102 in Document Image Processing

J. Imaging 2018, 4, 43

IAM Historical Document Database (IAM-HistDB) (http://www.fki.inf.unibe.ch/databases/iam-historical-document-database) [3], which includes handwritten historical manuscript images from the Saint Gall Database from the 9th century in Latin; the Parzival Database from the 13th century in German; the Washington Database from the 18th century in English; the Ancient Lives Project (https://www.ancientlives.org/) [4], which asks volunteers to transcribe Ancient Greek text fragments from the Oxyrhynchus Papyri collection; and many other projects.

To accelerate the process of accessing, preserving, and disseminating the contents of the heritage documents, a DIA system is needed. Besides aiming to preserve the existence of such ancient documents physically, the DIA system is expected to enable open access to the contents of the documents and provide opportunities for a wider audience to access all the important information stored in the document. DIA is the process of using various technologies to extract text, printed or handwritten, and graphics from digitized document files (http://www.cvisiontech.com/library/pdf/pdf-document/document-image-analysis.html) [5]. DIA systems generally have a major role in identifying, analyzing, extracting, structuring, and transferring document contents more quickly, effectively, and efficiently. This system is able to work semi-automatically or even fully automatically without human intervention. The DIA system is expected to save time, cost, and effort at many points in the heritage document preservation process.

However, although DIA research is developing rapidly, it is undeniable that most of the document collections used in the initial step are from developed regions such as America and European countries. The document samples from these countries are mostly written in English or old English with Latin/Roman script. Several important document collections were finally used as standard benchmarks for the evaluation of the latest DIA research results. The next wave of DIA research finally began to deal with documents from non-English-speaking areas with non-Latin scripts, such as Arabic, Chinese, and Japanese documents. During the evolution of DIA research in the last two decades, DIA researchers have proposed and achieved satisfactory solutions for many complex problems of document analysis for these types of documents. However, the DIA research challenge is ongoing. The latest challenge is documents from Asia, with new languages and more complex scripts to explore, such as Devanagari script [6], Gurmukhi script [7–10], Bangla script [11], and Malayalam script [12], and the case of multiple languages and scripts in documents from India. Optical character recognition (OCR) for Indian languages is considered more difficult in general than for European languages because of the large number of vowels, consonants, and conjuncts (combinations of vowels and consonants) [13].

This work was part of exploring DIA research for a palm leaf manuscripts collection from Southeast Asia. This collection offers a new challenge for DIA researchers because palm leaves are used as the writing medium and the language and script have never been analyzed before. In this paper, we conducted a comprehensive experimental benchmark of some principal tasks in the DIA system, starting with binarization, text line segmentation, isolated character/glyph recognition, word recognition, and transliteration. To the best of our knowledge, this work is the first comprehensive study in the DIA research community and the first to perform a complete series of experimental benchmarking analyses of palm leaf manuscripts. The results of this research will be very useful in accelerating, evaluating, and improving the performance of existing DIA systems for a new type of document.

This paper is organized as follows. Section 2 gives a brief description of the palm leaf manuscripts collection from Southeast Asia, especially the Khmer palm leaf manuscript corpus from Cambodia and two palm leaf manuscript corpuses, the Balinese and Sundanese manuscripts from Indonesia. The challenges of DIA for this manuscript corpus are also presented in this section. Section 3 describes the DIA tasks that need to be developed for the palm leaf manuscript collections, followed by a description of the methods investigated for those tasks. The datasets and evaluation methods for each DIA task used in the experimental studies for this work are presented in Section 4. Section 5 reports and analyzes the detailed results of the experiments. Finally, conclusions are given in Section 6.

Document Image Processing

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Publisher: MDPI
Location: Basel
Date: 2018
Language: German
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Dimensions: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science