J. Imaging 2018, 4, 15
of different characters, and the transition probabilities between the character models are given by
a character LM. Character-based LMs are also useful for related tasks such as word spotting [10].
In the previous character LM approach, or even in general word LM approaches, the optical models
still model characters. However, in works such as [11,12], the optical models model strokes that are
concatenated to form words.
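As a rough illustration of what a character LM provides, the sketch below estimates character bigram probabilities from a word list. The add-one smoothing, the boundary symbols `<s>` and `</s>`, and the function names are assumptions made for this example; they are not taken from the systems cited above.

```python
from collections import Counter
import math

BOS, EOS = "<s>", "</s>"  # hypothetical word-boundary symbols


def train_char_bigram(words):
    """Return a function prob(prev, cur) with add-one-smoothed bigram estimates."""
    bigrams, contexts, alphabet = Counter(), Counter(), {EOS}
    for word in words:
        symbols = [BOS] + list(word) + [EOS]
        alphabet.update(word)
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    n_symbols = len(alphabet)  # number of possible next symbols

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (contexts[prev] + n_symbols)

    return prob


def log_prob(word, prob):
    """Log-probability of a word as a sequence of character transitions."""
    symbols = [BOS] + list(word) + [EOS]
    return sum(math.log(prob(p, c)) for p, c in zip(symbols, symbols[1:]))


char_lm = train_char_bigram(["pequeños", "pequeño", "que"])
```

With this toy training list, `log_prob("que", char_lm)` is higher than `log_prob("qqq", char_lm)`, because the transitions in "que" were observed and those in "qqq" were not.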
Figure 2. Scheme of a handwritten text recognition system (blocks: optical model, lexical model, language model; example shown: "y p e q u e ñ o s", "y pequeños").
When a word-based dictionary helps the recognition process, the handwriting recognition system
can only transcribe a limited number of words. The size of the dictionary is a compromise between
a dictionary that is too large, yielding word confusions, and one that is too small, yielding many
unknown words. Words of the test set that are not present in the HTR dictionary are denoted as
Out-Of-Vocabulary (OOV) words. Several types of OOV words exist, such as common words using
a less common grammatical form, misspellings, words attached to punctuation marks, hyphenated
words or words containing rare characters (abbreviations, special signs, etc.).
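The OOV definition above can be made concrete with a small sketch that measures the proportion of test tokens missing from a dictionary. The function name and the toy data are hypothetical and only serve the illustration.

```python
def oov_rate(test_tokens, dictionary):
    """Return (rate, oov_tokens) for test tokens that are absent from the dictionary."""
    if not test_tokens:
        return 0.0, []
    oov_tokens = [tok for tok in test_tokens if tok not in dictionary]
    return len(oov_tokens) / len(test_tokens), oov_tokens


# Toy example: a three-word dictionary and a five-token test line.
dictionary = {"y", "de", "los"}
test_tokens = ["y", "pequeños", "de", "los", "reyes"]
rate, missing = oov_rate(test_tokens, dictionary)
# rate is 0.4 and missing is ["pequeños", "reyes"]
```

A word-based system restricted to this dictionary could not output the two missing tokens, however good its optical model is.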
An approach to cope with OOV words consists of extending the dictionary with external lexical
resources, such as Wikipedia [13], or in the case of historical documents, with the transcription
of other documents from the same period and topic [14]. From these resources, the language
model can also be refined. However, in the general case, such resources may not be available,
and a proportion of words (such as named entities and rare words) still remains as OOV. Another
approach for coping with OOV words consists of modeling text at a sub-word level, as a sequence
of characters, syllables or multi-grams [15]. Hybrid approaches [16,17] consist of using word-based
language models for the most frequent words and character-based models for the less frequent ones.
In sub-word approaches, the dictionary size is considerably reduced, to the number of lexical units,
and so is the computational complexity. In addition, the language model can model unknown words by
combining such lexical units.
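A minimal sketch of the hybrid idea, assuming the simplest possible split: the most frequent training words are kept as whole units and every other word is decomposed into characters. The frequency cut-off `max_words`, the function names, and the omission of word-boundary symbols are simplifications chosen for this example, not details of the approaches in [16,17].

```python
from collections import Counter


def frequent_words(training_tokens, max_words):
    """Return the set of the max_words most frequent training tokens."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(max_words)}


def to_units(tokens, word_units):
    """Keep frequent words whole; split all other tokens into characters."""
    units = []
    for tok in tokens:
        if tok in word_units:
            units.append(tok)
        else:
            units.extend(tok)  # a string extends the list character by character
    return units


word_units = frequent_words(["y", "y", "y", "de", "de", "rey"], max_words=2)
units = to_units(["y", "rey", "de"], word_units)
# units is ["y", "r", "e", "y", "de"]
```

Because the rare word "rey" is emitted as characters, a language model over these units can still produce it, even though it is absent from the word-level part of the lexicon.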
In this work, we compare several HTR systems, based on HMMs, RNNs and convolutional
RNNs (CRNNs). The CRNN is inspired by a very deep architecture presented in [18]. It consists
of stacking BLSTMs and associating them with convolutional layers. Features are thus automatically
extracted by the convolutional layers and processed by the BLSTM layers. We also model dictionaries
and language models of our HTR systems with sub-word units. We apply this approach to
the recognition of a publicly available Spanish historical documents dataset. We compare several HTR
systems based on different types of sub-word units, and we show that sub-word units are more efficient
than word units. We obtain, to our knowledge, the best recognition results on this Spanish dataset by
associating sub-word units with the deepest HTR optical system, namely the CRNN. We also obtain
high rates for the recognition of OOV words.
The rest of the paper is structured as follows: the Spanish historical manuscript used in
the experimentation is presented in the next section (Section 2); the HTR systems and the experimental
conditions are described in Section 3; our experiments and the obtained results are reported in Section 4;
the conclusions and future work are drawn in Section 5; finally, in Appendix A, several recognition
examples are shown.
2. The Rodrigo Dataset
The Rodrigo corpus [19] was obtained from the digitization of the book “Historia de España
del arçobispo Don Rodrigo”, written in ancient Spanish in 1545. It is a single writer book where
most pages consist of a single block of well-separated lines of calligraphical text, as the examples
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Publisher
- MDPI
- Place
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Dimensions
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science