Page 128 of Document Image Processing
Journal of Imaging
Article
Transcription of Spanish Historical Handwritten
Documents with Deep Neural Networks
Emilio Granell 1,*, Edgard Chammas 2, Laurence Likforman-Sulem 3,
Carlos-D. Martínez-Hinarejos 1, Chafic Mokbel 2 and Bogdan-Ionuț Cîrstea 3
1 PRHLT Research Center, Universitat Politècnica de València, 46022 València, Spain; cmartine@dsic.upv.es
2 Department of Computer Engineering, University of Balamand, 2960 Balamand, Lebanon;
edgard@balamand.edu.lb (E.C.); chafic.mokbel@balamand.edu.lb (C.M.)
3 Institut Mines-Télécom/Télécom ParisTech, Université Paris-Saclay, 75013 Paris, France;
likforman@telecom-paristech.fr (L.L.-S.); bogdan-ionut.cirstea@telecom-paristech.fr (B.-I.C.)
* Correspondence: egranell@dsic.upv.es
Received: 30 October 2017; Accepted: 2 January 2018; Published: 11 January 2018
Abstract: The digitization of historical handwritten document images is important for
the preservation of cultural heritage. Moreover, the transcription of text images obtained from
digitization is necessary to provide efficient information access to the content of these documents.
Handwritten Text Recognition (HTR) has become an important research topic in the areas of image
and computational language processing that allows us to obtain transcriptions from text images.
State-of-the-art HTR systems are, however, far from perfect. One difficulty is that they have to cope
with image noise and handwriting variability. Another difficulty is the presence of a large amount
of Out-Of-Vocabulary (OOV) words in ancient historical texts. A solution to this problem is to use
external lexical resources, but such resources might be scarce or unavailable given the nature and
the age of such documents. This work proposes a solution to avoid this limitation. It consists of
associating a powerful optical recognition system that will cope with image noise and variability, with
a language model based on sub-lexical units that will model OOV words. Such a language modeling
approach reduces the size of the lexicon while increasing the lexicon coverage. Experiments are first
conducted on the publicly available Rodrigo dataset, which contains the digitization of an ancient
Spanish manuscript, with a recognizer based on Hidden Markov Models (HMMs). They show
that sub-lexical units outperform word units in terms of Word Error Rate (WER), Character Error
Rate (CER) and OOV word accuracy rate. This approach is then applied to deep net classifiers,
namely Bi-directional Long-Short Term Memory (BLSTMs) and Convolutional Recurrent Neural
Nets (CRNNs). Results show that CRNNs outperform HMMs and BLSTMs, reaching the lowest WER
and CER for this image dataset and significantly improving OOV recognition.
Keywords: historical handwritten transcription; out-of-vocabulary word recognition; character-level
language model; word structure retrieval
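The abstract reports results as Word Error Rate (WER) and Character Error Rate (CER). As a minimal sketch, assuming the usual definition of both metrics as Levenshtein edit distance divided by reference length, they could be computed as follows. The function names and the sample strings are illustrative only and do not come from the article or from the Rodrigo dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcription with a single misrecognized character.
    ref = "el rey don alfonso"
    hyp = "el rei don alfonso"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

In this toy case one wrong character costs a whole word at the word level but only one character at the character level, which is why the two rates are typically reported together.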
1. Introduction
The digitization of historical handwritten document images is important for the preservation of
cultural heritage. Moreover, the transcription of text images obtained from digitization is necessary
to provide efficient information access to the content of these documents. Automatic transcription
of these documents is performed by Handwriting Text Recognition (HTR) systems, which are
traditionally composed of an optical model, a dictionary and a Language Model (LM). However,
HTR systems face several challenges at both the image and language modeling levels. Historical
document images may include defects due to age, manipulation and bleed-through of ink. They may
also include calligraphic initial letters and long character strokes as ornaments. This is particularly
J. Imaging 2018, 4, 15 128 www.mdpi.com/journal/jimaging
Document Image Processing
- Title: Document Image Processing
- Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
- Publisher: MDPI
- Location: Basel
- Date: 2018
- Language: German
- License: CC BY-NC-ND 4.0
- ISBN: 978-3-03897-106-1
- Dimensions: 17.0 x 24.4 cm
- Pages: 216
- Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category: Computer science