Document Image Processing
Page - 102 -

J. Imaging 2018, 4, 43

IAM Historical Document Database (IAM-HistDB) (http://www.fki.inf.unibe.ch/databases/iam-historical-document-database) [3], which includes handwritten historical manuscript images from the Saint Gall Database from the 9th century in Latin; the Parzival Database from the 13th century in German; the Washington Database from the 18th century in English; the Ancient Lives Project (https://www.ancientlives.org/) [4], which asks volunteers to transcribe Ancient Greek text fragments from the Oxyrhynchus Papyri collection; and many other projects.

To accelerate the process of accessing, preserving, and disseminating the contents of heritage documents, a DIA system is needed. Besides aiming to preserve the existence of such ancient documents physically, the DIA system is expected to enable open access to the contents of the documents and provide opportunities for a wider audience to access all the important information stored in them. DIA is the process of using various technologies to extract text, printed or handwritten, and graphics from digitized document files (http://www.cvisiontech.com/library/pdf/pdf-document/document-image-analysis.html) [5]. DIA systems generally have a major role in identifying, analyzing, extracting, structuring, and transferring document contents more quickly, effectively, and efficiently. Such a system is able to work semi-automatically or even fully automatically without human intervention. The DIA system is expected to save time, cost, and effort at many points in the heritage document preservation process.

However, although DIA research is developing rapidly, it is undeniable that most of the document collections used in the initial step are from developed regions such as America and European countries. The document samples from these countries are mostly written in English or old English with Latin/Roman script. Several important document collections were eventually used as standard benchmarks for the evaluation of the latest DIA research results. The next wave of DIA research began to deal with documents from non-English-speaking areas with non-Latin scripts, such as Arabic, Chinese, and Japanese documents. During the evolution of DIA research in the last two decades, DIA researchers have proposed and achieved satisfactory solutions for many complex problems of document analysis for these types of documents. However, the DIA research challenge is ongoing. The latest challenge is documents from Asia, with new languages and more complex scripts to explore, such as Devanagari script [6], Gurmukhi script [7–10], Bangla script [11], and Malayalam script [12], and the case of multiple languages and scripts in documents from India. Optical character recognition (OCR) for Indian languages is considered more difficult in general than for European languages because of the large number of vowels, consonants, and conjuncts (combinations of vowels and consonants) [13].

This work was part of exploring DIA research for a palm leaf manuscripts collection from Southeast Asia. This collection offers a new challenge for DIA researchers because palm leaves are used as the writing medium and the language and script have never been analyzed before. In this paper, we did a comprehensive benchmark experimental test of some principal tasks in the DIA system, starting with binarization, text line segmentation, isolated character/glyph recognition, word recognition, and transliteration. To the best of our knowledge, this work is the first comprehensive study of the DIA researchers' community and the first to perform a complete series of experimental benchmarking analyses of palm leaf manuscripts. The results of this research will be very useful in accelerating, evaluating, and improving the performance of existing DIA systems for a new type of document.

This paper is organized as follows. Section 2 gives a brief description of the palm leaf manuscripts collection from Southeast Asia, especially the Khmer palm leaf manuscript corpus from Cambodia and two palm leaf manuscript corpuses, the Balinese and Sundanese manuscripts from Indonesia. The challenges of DIA for this manuscript corpus are also presented in this section. Section 3 describes the DIA tasks that need to be developed for the palm leaf manuscript collections, followed by a description of the methods investigated for those tasks. The datasets and evaluation methods for each DIA task used in the experimental studies for this work are presented in Section 4. Section 5 reports and analyzes the detailed results of the experiments. Finally, conclusions are given in Section 6.
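
As an illustration of binarization, the first DIA task named on this page, the following is a minimal sketch of a global threshold chosen by Otsu's method. Otsu's method is an assumption made here for illustration only; this page does not say which binarization methods the paper evaluates. The function names `otsu_threshold` and `binarize` and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form one class (assumed here to be dark ink),
    pixels > t form the other (assumed to be the lighter background).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels with value <= t
    sum_dark = 0    # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows of ints 0..255) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 3x3 grayscale patch: dark strokes on a light leaf surface.
sample = [
    [200, 210, 30],
    [205, 25, 35],
    [220, 215, 20],
]
binary = binarize(sample)
```

A single global threshold is the simplest possible choice. Degraded or unevenly lit manuscripts may call for locally adaptive thresholds instead, which this sketch does not attempt.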
Title: Document Image Processing
Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
Editor: MDPI
Location: Basel
Date: 2018
Language: German
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Size: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science