J. Imaging 2018, 4, 15
[Figure 10: plot of perplexity (axis ticks 10^0 to 10^3) for recognised OOV words and not recognised OOV words]
Figure 10. Distribution of the perplexity presented by the 10-gram character Language Model (LM) per
recognized and unrecognized OOV words (decomposed into character sequences) by the HMM system.
Table 3. Features of the distributions of perplexity per recognized and unrecognized OOV word
for the HMM character-based 10-gram LM. Q1, Q2 and Q3 are the 1st, 2nd and 3rd quartiles,
respectively; IQR is the interquartile range; Min. and Max. are the minimum and maximum values;
and SD is the standard deviation.
Distribution    Q1     Q2     Q3     IQR   Min.  Max.    SD
Recognized      6.64   9.22   12.57  5.94  3.26  46.05   5.37
Unrecognized    8.70   12.21  17.75  9.05  3.06  367.07  16.25
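As an illustration of how statistics of this kind can be obtained, the following minimal sketch computes a per-sequence perplexity from character log-probabilities and then summarizes a list of perplexities. It rests on several assumptions that the text does not confirm: perplexity is taken as the exponential of the negative mean natural-log probability per character, quartiles use the "inclusive" interpolation of Python's statistics module, and SD is the sample standard deviation. The function names and the toy values are hypothetical and are not data from the paper.

```python
import math
import statistics


def perplexity(char_log_probs):
    """Perplexity of one character sequence from its natural-log probabilities.

    Assumption: PP = exp(-(1/N) * sum(ln p_i)); the normalization used in the
    paper is not stated in this section.
    """
    return math.exp(-sum(char_log_probs) / len(char_log_probs))


def summarize(perplexities):
    """Return quartiles, IQR, minimum, maximum and sample SD of a list."""
    # Assumption: "inclusive" quartile interpolation; other conventions exist.
    q1, q2, q3 = statistics.quantiles(perplexities, n=4, method="inclusive")
    return {
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "IQR": q3 - q1,
        "Min": min(perplexities),
        "Max": max(perplexities),
        "SD": statistics.stdev(perplexities),
    }


# Hypothetical per-word perplexities, for illustration only.
toy_perplexities = [3.0, 5.0, 7.0, 9.0, 11.0]
toy_summary = summarize(toy_perplexities)

# A sequence of four characters, each with probability 0.5.
toy_pp = perplexity([math.log(0.5)] * 4)
```

Different quartile conventions can shift Q1 and Q3 slightly on small samples, so values computed this way need not match a published table to the last digit.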
4.3. Study of the Effect of Closing the Vocabulary and Adding the Transcription of the Validation Set for
Training the LM
After the adjustment of the decoding parameters with the validation set, the transcription of
the text lines contained in this partition can be used to train an improved LM that, hopefully, will reduce
the amount of OOV words. Moreover, the OOV words can be included in the vocabulary as unigrams
(closed-vocabulary experiments) to verify their influence on the recognition. These conditions were
tested for the best language models at word and character levels (3-gram for the word-based
system and 10-gram for the character-based system). Given that the sub-word approach presented
no significant difference in terms of WER compared to the word-based system (see Table 2),
this approach was not tested in this experiment.
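The effect of these two conditions on the OOV rate can be sketched with a toy example. The code below is only an illustration under stated assumptions: the vocabulary is modelled as a plain set of word types, the function name oov_rate is hypothetical, and the three tiny token lists are invented and do not come from the paper's corpus. Adding the validation transcription enlarges the vocabulary, while closing the vocabulary adds every test word to it.

```python
def oov_rate(vocabulary, test_tokens):
    """Fraction of running test tokens that are absent from the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


# Invented toy partitions, for illustration only.
train_tokens = "the cat sat on the mat".split()
valid_tokens = "the dog sat".split()
test_tokens = "the dog saw the bird".split()

# Open vocabulary built from the training transcription only.
open_vocab = set(train_tokens)
# Open vocabulary after adding the validation transcription.
open_vocab_with_valid = open_vocab | set(valid_tokens)
# Closed vocabulary: the test words are added as well.
closed_vocab = open_vocab_with_valid | set(test_tokens)

rate_open = oov_rate(open_vocab, test_tokens)
rate_with_valid = oov_rate(open_vocab_with_valid, test_tokens)
rate_closed = oov_rate(closed_vocab, test_tokens)
```

In this toy case the OOV rate falls from 0.6 to 0.4 once the validation words are added, and to 0.0 with the closed vocabulary.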
Figures 11–13 compare the results obtained for the word-based system and
the character-based approach with open and closed vocabulary, with and without the use of
the validation samples when training the LM (see Section 3.4). On the one hand, as can be seen
in Figures 11 and 13, the use of the validation set does not significantly improve the word-based
recognition in terms of WER or CER. However, this additional information is very useful in
the character-based approach. As can be observed in Figure 11, a statistically significant improvement
in terms of CER is achieved (16.9% ± 0.3 instead of 17.6% ± 0.3). This improvement
increases the OOV word recognition accuracy (see Figure 12). On the other hand, although
closing the vocabulary significantly improves the recognition performance, it is interesting to note
the beneficial effect of the use of the validation samples in the character-based approach. It is also
interesting to note in Figures 11 and 13 that the character-based system, even in the more difficult
case (“open-vocabulary”), outperforms, in terms of CER, the word-based system in the best case
(“closed-vocabulary”). In the closed-vocabulary conditions, the word-based system recognizes more
OOV words than the character-based system: 34.7% ± 1.2 instead of 29.6% ± 1.1 (see Figure 12).
However, in the real-world case, i.e., the open-vocabulary conditions, the character-based system
performs better.
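One simple way to read the "±" figures quoted above is to check whether the reported intervals overlap. The sketch below does only that; it is an assumption for illustration, since this section does not state which significance test or confidence level was used, and non-overlap of intervals is a conservative check rather than a formal test. The function name intervals_overlap is hypothetical.

```python
def intervals_overlap(mean_a, half_width_a, mean_b, half_width_b):
    """Return True if [mean - half_width, mean + half_width] intervals overlap."""
    low_a, high_a = mean_a - half_width_a, mean_a + half_width_a
    low_b, high_b = mean_b - half_width_b, mean_b + half_width_b
    return low_a <= high_b and low_b <= high_a


# CER of the character-based system, with and without validation samples
# (values quoted in the text).
cer_overlap = intervals_overlap(16.9, 0.3, 17.6, 0.3)

# OOV word recognition with closed vocabulary: word-based versus
# character-based (values quoted in the text).
oov_overlap = intervals_overlap(34.7, 1.2, 29.6, 1.1)
```

Both pairs of intervals are disjoint, which is consistent with the differences being described as significant.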
Document Image Processing
- Title: Document Image Processing
- Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
- Editor: MDPI
- Location: Basel
- Date: 2018
- Language: German
- License: CC BY-NC-ND 4.0
- ISBN: 978-3-03897-106-1
- Size: 17.0 x 24.4 cm
- Pages: 216
- Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category: Computer Science