Page 128 of Document Image Processing

Journal of Imaging
Article

Transcription of Spanish Historical Handwritten Documents with Deep Neural Networks

Emilio Granell 1,*, Edgard Chammas 2, Laurence Likforman-Sulem 3, Carlos-D. Martínez-Hinarejos 1, Chafic Mokbel 2 and Bogdan-Ionuţ Cîrstea 3

1 PRHLT Research Center, Universitat Politècnica de València, 46022 València, Spain; cmartine@dsic.upv.es
2 Department of Computer Engineering, University of Balamand, 2960 Balamand, Lebanon; edgard@balamand.edu.lb (E.C.); chafic.mokbel@balamand.edu.lb (C.M.)
3 Institut Mines-Télécom/Télécom ParisTech, Université Paris-Saclay, 75013 Paris, France; likforman@telecom-paristech.fr (L.L.-S.); bogdan-ionut.cirstea@telecom-paristech.fr (B.-I.C.)
* Correspondence: egranell@dsic.upv.es

Received: 30 October 2017; Accepted: 2 January 2018; Published: 11 January 2018

Abstract: The digitization of historical handwritten document images is important for the preservation of cultural heritage. Moreover, the transcription of text images obtained from digitization is necessary to provide efficient information access to the content of these documents. Handwritten Text Recognition (HTR) has become an important research topic in the areas of image and computational language processing that allows us to obtain transcriptions from text images. State-of-the-art HTR systems are, however, far from perfect. One difficulty is that they have to cope with image noise and handwriting variability. Another difficulty is the presence of a large amount of Out-Of-Vocabulary (OOV) words in ancient historical texts. A solution to this problem is to use external lexical resources, but such resources might be scarce or unavailable given the nature and the age of such documents. This work proposes a solution to avoid this limitation. It consists of associating a powerful optical recognition system that will cope with image noise and variability, with a language model based on sub-lexical units that will model OOV words. Such a language modeling approach reduces the size of the lexicon while increasing the lexicon coverage. Experiments are first conducted on the publicly available Rodrigo dataset, which contains the digitization of an ancient Spanish manuscript, with a recognizer based on Hidden Markov Models (HMMs). They show that sub-lexical units outperform word units in terms of Word Error Rate (WER), Character Error Rate (CER) and OOV word accuracy rate. This approach is then applied to deep net classifiers, namely Bi-directional Long-Short Term Memory (BLSTMs) and Convolutional Recurrent Neural Nets (CRNNs). Results show that CRNNs outperform HMMs and BLSTMs, reaching the lowest WER and CER for this image dataset and significantly improving OOV recognition.

Keywords: historical handwritten transcription; out-of-vocabulary word recognition; character-level language model; word structure retrieval

1. Introduction

The digitization of historical handwritten document images is important for the preservation of cultural heritage. Moreover, the transcription of text images obtained from digitization is necessary to provide efficient information access to the content of these documents.

Automatic transcription of these documents is performed by Handwriting Text Recognition (HTR) systems, which are traditionally composed of an optical model, a dictionary and a Language Model (LM). However, HTR systems face several challenges at both the image and language modeling levels. Historical document images may include defects due to age, manipulation and bleed-through of ink. They may also include calligraphic initial letters and long character strokes as ornaments. This is particularly

J. Imaging 2018, 4, 15; www.mdpi.com/journal/jimaging
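The abstract reports results in terms of Word Error Rate (WER) and Character Error Rate (CER). As a rough illustration of how these two metrics are commonly defined (edit distance between reference and hypothesis, divided by the reference length), here is a minimal Python sketch. It is not the authors' scoring code: the tokenization (whitespace splitting), the absence of any text normalization, the function names and the example strings are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate, assuming whitespace tokenization."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate, counting spaces as characters."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical transcription pair, not taken from the Rodrigo dataset.
    reference = "el rey don rodrigo"
    hypothesis = "el rey dom rodrigo"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this made-up pair, a single wrong character makes one word out of four incorrect but only one character out of eighteen, which is why WER is typically much higher than CER for the same output.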
Document Image Processing
Title: Document Image Processing
Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
Publisher: MDPI
Place: Basel
Date: 2018
Language: German
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Dimensions: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science