J. Imaging 2018, 4, 15
Agora cuenta la historia
would be transformed into the following character sequence:
A g o r a <SPACE> c u e n t a <SPACE> l a <SPACE> h i s t o r i a
or into the following sequence, following the hyphenation rules for Spanish:
Ago ra <SPACE> cuen ta <SPACE> la <SPACE> his to ria
Then, these preprocessed transcriptions can be used to train the sub-word unit language model.
Usually, n-gram language models of sub-word units are trained with a large n (large context). On
the other hand, the lexicon is reduced to match the list of sub-word units.
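
The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and `toy_hyphenate` is a lookup table holding just the splits of the example sentence, standing in for a real Spanish hyphenator. Treating <SPACE> as a lexicon entry is also an assumption.

```python
SPACE = "<SPACE>"  # symbol marking the separation between words


def to_char_units(text):
    """Split every word into characters and join words with <SPACE>."""
    return f" {SPACE} ".join(" ".join(word) for word in text.split())


def to_subword_units(text, hyphenate):
    """Split every word with `hyphenate` (word -> list of units)."""
    return f" {SPACE} ".join(" ".join(hyphenate(word)) for word in text.split())


# Hypothetical stand-in for a Spanish hyphenator: it only knows the
# splits of the example sentence and leaves any other word whole.
EXAMPLE_SPLITS = {
    "Agora": ["Ago", "ra"],
    "cuenta": ["cuen", "ta"],
    "la": ["la"],
    "historia": ["his", "to", "ria"],
}


def toy_hyphenate(word):
    return EXAMPLE_SPLITS.get(word, [word])


def unit_lexicon(unit_sequences):
    """Collect the distinct units seen in the preprocessed transcriptions.

    Assumption: <SPACE> is kept as a lexicon entry like any other unit.
    """
    return sorted({unit for seq in unit_sequences for unit in seq.split()})


sentence = "Agora cuenta la historia"
print(to_char_units(sentence))
print(to_subword_units(sentence, toy_hyphenate))
print(unit_lexicon([to_subword_units(sentence, toy_hyphenate)]))
```

The resulting unit sequences would then be the training text for the n-gram language model, and the set returned by `unit_lexicon` plays the role of the reduced lexicon.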
In the decoding process, the best hypothesis is post-processed to obtain the final hypothesis. This final
step consists of collapsing the sub-word unit sequence to form words and replacing the symbol
used to mark the separation between words (<SPACE>) with a space. Figure 4 presents a text line example
from the test partition whose reference transcription is:
vio e recognoscio el Astragamiento que perdiera de su gente
In this example, the words recognoscio and Astragamiento are OOV words. It is interesting to
note their etymology. They are archaic forms from Early Modern Spanish (15th–17th century) that in
Modern Spanish correspond to the forms reconoció and Estragamiento. For that reason, we could not
find them in any external resource, not even in Google N-Grams [22].
Figure 4. Text line sample. “Recognoscio” and “Astragamiento” are rare words; recognoscio is an archaic
form of reconoció and Astragamiento an ancient form of Estragamiento.
The HMM decoding process with a traditional word-based approach offers the following
best hypothesis:
vno & rea gustio el Astragar mando que perdona de lugar
which represents a Character Error Rate (CER) equal to 35.6% with respect to the reference text-line
transcription. However, using a sub-word based approach, the following best hypothesis is obtained:
vio <SPACE> & <SPACE> re ca ges cio <SPACE> el <SPACE> As tra ga mien to
<SPACE> que <SPACE> per do na <SPACE> de <SPACE> lu gar <SPACE>
which is transformed into the improved hypothesis (CER = 22.0%):
vio & recagescio el Astragamiento que perdona de lugar
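
The collapsing step applied to the sub-word hypothesis above can be sketched as follows. The function name `collapse_units` is hypothetical and the code is a minimal illustration, not the authors' implementation; it assumes units are separated by whitespace and that <SPACE> is the only special symbol.

```python
SPACE = "<SPACE>"  # symbol marking the separation between words


def collapse_units(hypothesis):
    """Concatenate sub-word units and turn each <SPACE> into a blank."""
    tokens = hypothesis.split()
    text = "".join(" " if token == SPACE else token for token in tokens)
    # Normalise whitespace so a leading or trailing <SPACE> leaves no blank.
    return " ".join(text.split())


subword_hypothesis = (
    "vio <SPACE> & <SPACE> re ca ges cio <SPACE> el <SPACE> As tra ga mien to "
    "<SPACE> que <SPACE> per do na <SPACE> de <SPACE> lu gar <SPACE>"
)
print(collapse_units(subword_hypothesis))
```

The same function applies unchanged to a character-level hypothesis, since single characters are just one-symbol units.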
On the other hand, with a character-based approach, the following best hypothesis is obtained:
v i o <SPACE> & <SPACE> r e c e g e s c i o <SPACE> e l <SPACE> A s t r a g a m i e n t o
<SPACE> q u e <SPACE> p e r d i e r a <SPACE> d e l <SPACE> s e g u n d o
which results in the next final best hypothesis (CER = 17.0%):
vio & recegescio el Astragamiento que perdiera del segundo
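
For readers who want to check the CER figures quoted for the three hypotheses, the sketch below computes a character error rate as the Levenshtein distance between hypothesis and reference divided by the reference length, spaces included. That definition is an assumption (the usual one); the text does not state the exact normalisation used, so small deviations from the reported values are possible. The function and variable names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


REFERENCE = "vio e recognoscio el Astragamiento que perdiera de su gente"
HYPOTHESES = {
    "word": "vno & rea gustio el Astragar mando que perdona de lugar",
    "sub-word": "vio & recagescio el Astragamiento que perdona de lugar",
    "character": "vio & recegescio el Astragamiento que perdiera del segundo",
}

for name, hypothesis in HYPOTHESES.items():
    print(f"{name:10s} CER = {cer(REFERENCE, hypothesis):.1f}%")
```

Under this definition the ranking of the three approaches should match the one reported above, with the word-based hypothesis worst and the character-based hypothesis best.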
As can be observed, the final hypotheses obtained at sub-word levels (characters, hyphenation
sub-word units) in HTR are considerably better than those obtained with the word-based approach.
In addition, the OOV word Astragamiento has been fully recognized. The second OOV word is
recognized as recegescio or recagescio, which also improves on the word-based recognition rea gustio.
In Section 4, word and sub-word language modeling approaches will be compared with several types
of optical HTR systems.
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science