Document Image Processing
Page - 68 - in Document Image Processing

J. Imaging 2018, 4, 6

From the results in Table 4 we can see that our Arabic holistic OCR system achieved 77.3% WRR for recent books and 47.8% WRR for old books. Considering the top-10 hypotheses, the WRR increased to 87.7% for recent books and to 65.7% for old books. When considering the top-20 hypotheses, the WRR increased to 89% and 69% for recent and old books, respectively.

A data analysis of the recognition errors on the books data sets revealed several reasons that contributed to the reduction of the WRR. We found that these data sets had a high out-of-vocabulary (OOV) rate of around 6% for recent books and 7% for old books. It is known that the effect of OOV words is cumulative, which means that a single OOV word can result in recognition errors for more than one of its neighboring words. Another phenomenon that we noticed in these data sets is the frequent use of the Kashida character, which was 4% for recent books and 6% for old books. The Kashida character altered the shapes of some characters, which caused some word recognition errors. Also, we noticed that some fonts of the old books differed greatly from the fonts used in training the system, such as the Anglo-font, which resulted in very low WRR for some pages.

When we applied 4-gram language model rescoring to the books data sets using the top-10 hypotheses, we achieved 83% WRR for the recent books set and 53% WRR for the old books set. We got an absolute gain of 6% in WRR for both the recent and old books data sets. This result shows that a high percentage of the system's recognition errors can be corrected using the top-n hypotheses and a language model.

In the fourth evaluation, we compared the performance of the proposed system with three commercial Arabic OCR systems, Sakhr, ABBYY and NovoDynamics, which represent the best-performing Arabic OCR packages currently available. Table 5 shows these comparative results.

Table 5. Recognition rate (percent) of recent computerized and uncomputerized books. ED stands for Euclidean distance.
| Books Type | NovoDynamics | Sakhr | ABBYY | Holistic (Using Top 15 with LM), Squared ED/Absolute ED |
|---|---|---|---|---|
| Computerized | 88.45 | 82.17 | 54.33 | 82.97/84.76 |
| Uncomputerized | 78.15 | 54.94 | 29.22 | 53.21/58.04 |

The results in Table 5 show that, when using the squared Euclidean distance as the distance measure, our system achieved better performance than two systems, ABBYY and Sakhr, on the computerized books data set and better performance than the ABBYY system on the uncomputerized books data set. When we used the absolute Euclidean distance, the recognition rate increased from 82.97% to 84.76% for the computerized books set and from 53.21% to 58.04% for the uncomputerized books set, and the proposed system outperformed the Sakhr and ABBYY systems on both data sets, although the NovoDynamics system outperforms the proposed one. Our system is still much faster, as we will see in the next section.

As heavy computation is one of the main drawbacks of the holistic approach, we evaluated the runtime speed of the presented system. Table 6 shows the processing times of the proposed system before and after lexical reduction versus the number of selected word clusters. These experiments were run on a Core i7 2.8 GHz machine with single-thread execution.

Table 6. Processing time of word search and LM vs. word candidates.

| Selected Words | Processing Time (s/word) |
|---|---|
| No Reduction | 0.545 |
| Lexicon Reduction (1 cluster) | 0.0005 |
| Lexicon Reduction (5 clusters) | 0.0026 |
| Lexicon Reduction (10 clusters) | 0.0051 |
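To put the Table 6 timings in perspective, the following minimal sketch computes the speed-up of each lexicon-reduction setting relative to the unreduced search. The numbers are copied from Table 6; the names `TIMES`, `speedups` and `words_per_second` are illustrative choices for this example and do not come from the paper.

```python
# Processing times in seconds per word, copied from Table 6.
# The dictionary and function names are hypothetical, chosen for this sketch.
TIMES = {
    "No reduction": 0.545,
    "Lexicon reduction (1 cluster)": 0.0005,
    "Lexicon reduction (5 clusters)": 0.0026,
    "Lexicon reduction (10 clusters)": 0.0051,
}


def speedups(times, baseline="No reduction"):
    """Return how many times faster each setting is than the baseline."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if name != baseline}


def words_per_second(times):
    """Convert seconds per word into throughput in words per second."""
    return {name: 1.0 / t for name, t in times.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMES).items():
        print(f"{name}: about {factor:.0f}x faster than no reduction")
```

Under these figures, searching a single cluster is roughly three orders of magnitude faster than the unreduced search, and searching ten clusters is still about two orders of magnitude faster. The reported times also grow approximately linearly with the number of selected clusters.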
Title: Document Image Processing
Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
Editor: MDPI
Location: Basel
Date: 2018
Language: German
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Size: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Informatik (Computer Science)