Page - 140 - in Document Image Processing

J. Imaging 2018, 4, 15

[Figure 10 plot: logarithmic axis from 10^0 to 10^3; groups: recognised OOV words, not recognised OOV words.]

Figure 10. Distribution of the perplexity presented by the 10-gram character Language Model (LM) per recognized and unrecognized OOV words (decomposed into character sequences) by the HMM system.

Table 3. Features of the distributions of the perplexity per recognized and unrecognized OOV word for the HMM character-based 10-gram LM. Q1, Q2 and Q3 are respectively the first, second and third quartiles, IQR the interquartile range, Min. and Max. the minimum and maximum values and SD the standard deviation.

Distribution    Q1     Q2      Q3      IQR    Min.   Max.     SD
Recognized      6.64   9.22    12.57   5.94   3.26   46.05    5.37
Unrecognized    8.70   12.21   17.75   9.05   3.06   367.07   16.25

4.3. Study of the Effect of Closing the Vocabulary and Adding the Transcription of the Validation Set for Training the LM

After the adjustment of the decoding parameters with the validation set, the transcription of the text lines contained in this partition can be used to train an improved LM that, hopefully, will reduce the number of OOV words. Moreover, the OOV words can be included in the vocabulary as unigrams (closed-vocabulary experiments) to verify their influence on the recognition. These conditions were tested for the best language models at word and character levels (3-gram for the word-based system and 10-gram for the character-based system). Given that the sub-word approach presented no significant difference in terms of WER compared to the word-based system (see Table 2), this approach was not tested in this experiment.

Figures 11–13 compare the results obtained for the word-based system and the character-based approach with open and closed vocabulary, with and without the use of the validation samples when training the LM (see Section 3.4). On the one hand, as can be seen in Figures 11 and 13, the use of the validation set does not significantly improve the word-based recognition in terms of WER or CER. However, this additional information is very useful in the character-based approach. As can be observed in Figure 11, a statistically significant improvement in terms of CER is achieved (16.9% ± 0.3 instead of 17.6% ± 0.3). This improvement increases the OOV word recognition accuracy (see Figure 12). On the other hand, although closing the vocabulary significantly improves the recognition performance, it is interesting to note the beneficial effect of the use of the validation samples in the character-based approach. It is also interesting to note in Figures 11 and 13 that the character-based system, even in the more difficult case ("open vocabulary"), outperforms, in terms of CER, the word-based system in the best case ("closed vocabulary"). In the closed-vocabulary conditions, the word-based system recognizes more OOV words than the character-based system, 34.7% ± 1.2 instead of 29.6% ± 1.1 (see Figure 12). However, in the real-world case, i.e., the open-vocabulary conditions, the character-based system performs better.
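The per-word perplexity statistics of the kind reported in Table 3 can be illustrated with a minimal Python sketch. It assumes that the perplexity of a word is the exponential of the mean negative log-probability of its characters, and that quartiles are taken with the "inclusive" interpolation method; neither detail is stated on this page, so both are assumptions. The function names `word_perplexity` and `summarize` and all input values are hypothetical and are not taken from the paper.

```python
import math
import statistics


def word_perplexity(char_log_probs):
    """Perplexity of one word, given the natural-log probability that a
    character LM assigned to each of its characters (assumed input format)."""
    if not char_log_probs:
        raise ValueError("a word needs at least one character")
    mean_neg_log_prob = -sum(char_log_probs) / len(char_log_probs)
    return math.exp(mean_neg_log_prob)


def summarize(perplexities):
    """Quartiles, IQR, extremes and sample standard deviation of a list."""
    # The quartile convention is an assumption; other methods give
    # slightly different Q1 and Q3 values on small samples.
    q1, q2, q3 = statistics.quantiles(perplexities, n=4, method="inclusive")
    return {
        "Q1": q1, "Q2": q2, "Q3": q3, "IQR": q3 - q1,
        "Min": min(perplexities), "Max": max(perplexities),
        "SD": statistics.stdev(perplexities),
    }


# Toy inputs, not data from the paper.
uniform_word = [math.log(0.25)] * 5          # every character has p = 1/4
toy_perplexities = [2.0, 4.0, 6.0, 8.0, 10.0]
stats = summarize(toy_perplexities)
```

With a uniform probability of 1/4 per character the perplexity is 4, which matches the usual reading of perplexity as an effective branching factor. A higher perplexity, as in the unrecognized row of Table 3, means the LM found the character sequence less predictable.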
Title
Document Image Processing
Authors
Ergina Kavallieratou
Laurence Likforman-Sulem
Editor
MDPI
Location
Basel
Date
2018
Language
German
License
CC BY-NC-ND 4.0
ISBN
978-3-03897-106-1
Size
17.0 x 24.4 cm
Pages
216
Keywords
document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category
Informatik