Page - 62 - in Document Image Processing
J. Imaging 2018, 4, 6
trained by typewritten Arabic words in five fonts with size 14 points and lexicon size of 252 words.
Vector quantization was used to map each feature vector to the closest symbol in the codebook.
The multiple recognition hypotheses (N-best word lattice) of that system achieved a 97.65% accuracy.
Also, the holistic approach was successfully used on the subword level. Nasrollahi and Ebrahimi [15]
presented an approach to offline OCR for printed Persian subwords using wavelet packet transform.
The proposed technique extracted font invariant and size invariant features from different subwords
of four fonts and three sizes and compressed them using Principal Component Analysis (PCA). When
tested on a subset of 2000 words of printed Persian text documents, that system achieved an accuracy
of 97.9%.
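As a rough illustration of the vector quantization step mentioned above, the following minimal sketch maps each feature vector to the index of its nearest codebook entry. The codebook values, the feature vectors, and the function name `quantize` are hypothetical and chosen only for illustration; the cited system's actual codebook and features are not given here.

```python
import math

def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

# Hypothetical 2-D codebook with three symbols (values are illustrative only).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.5)]

# Hypothetical feature vectors extracted from a word image.
features = [(0.2, -0.1), (0.9, 1.2), (3.5, 0.0)]

# Each feature vector is replaced by the index of its nearest codebook symbol.
symbols = [quantize(f, codebook) for f in features]
print(symbols)  # [0, 1, 2]
```

The resulting symbol sequence is the kind of discrete observation stream that a downstream recognizer could consume in place of the raw feature vectors.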
In a later work [16], Slimane et al. organized the ICDAR2013 competition on multi-font
and multi-size digitally represented Arabic text. The main characteristic of the winning system,
the Siemens system submitted by Marc-Peter Schambach et al., was the use of a neural network with
three hidden layers that transforms a two-dimensional pixel plane into a sequence of class probabilities.
The system was applied to a subset of the APTI dataset [17] and achieved an accuracy
of over 99%.
While the holistic approach avoids the challenging segmentation task of Arabic cursive scripts,
it still faces the challenge of dealing with the large lexicon size of Arabic words. As the number of
words in the lexicon grows, the recognition task becomes more computationally expensive. Most of the
previously proposed holistic-based Arabic OCR systems were tested with small vocabularies, but this is
not practical for Arabic, a morphologically rich language with a huge vocabulary.
In this paper, we propose a computationally efficient holistic Arabic OCR system for a large
vocabulary size. For the sake of a practical approach, a lexicon reduction technique based on clustering
similarly shaped words is used to minimize the word recognition time. The proposed system utilizes
a hybrid of several holistic features that combine global word-level DCT-based features and local
block-based features. Using these types of features, the system manages to achieve omni-font performance
with font and size independence. Also, the presented system has a flexible architecture for integrating
language modelling constraints by using a second rescoring pass for the top n-best word hypotheses.
This rescoring operation provided a significant enhancement in the recognition accuracy of the system.
The rest of the paper is organized as follows. Section 2 includes a description of the proposed holistic
OCR system. The holistic DCT features used are described in Section 3. The developed lexicon
reduction technique is illustrated in Section 4. Section 5 describes the language rescoring process used
by the system. Section 6 presents system evaluation results and a performance comparison with
state-of-the-art commercial Arabic OCR systems. The final conclusions and prospects for future work are
included in Section 7.
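To make the idea of combining global DCT-based features with local block-based features more concrete, here is a minimal sketch. It assumes a small binary word image, keeps the lowest-frequency coefficients of an unnormalized 2-D DCT-II as the global part, and uses mean ink density per grid cell as the local part. The function names, the choice of block density as the local feature, and the parameters `k` and `blocks` are assumptions made for illustration, not the features defined in the paper.

```python
import math

def dct_1d(xs):
    """Unnormalized DCT-II of a sequence."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(xs))
            for k in range(n)]

def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in image]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def global_dct_features(image, k=3):
    """Keep the k x k lowest-frequency coefficients (top-left corner)."""
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(k) for v in range(k)]

def block_density_features(image, blocks=2):
    """Mean pixel value in each cell of a blocks x blocks grid (assumed local feature)."""
    h, w = len(image), len(image[0])
    feats = []
    for by in range(blocks):
        for bx in range(blocks):
            ys = range(by * h // blocks, (by + 1) * h // blocks)
            xs = range(bx * w // blocks, (bx + 1) * w // blocks)
            cell = [image[y][x] for y in ys for x in xs]
            feats.append(sum(cell) / len(cell))
    return feats

# Toy 4x4 binary "word image" (1 = ink); a real word image would be much larger.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

# Hybrid vector: 4 global DCT coefficients followed by 4 local block densities.
feature_vector = global_dct_features(image, k=2) + block_density_features(image)
print(len(feature_vector))  # 8
```

Keeping only low-frequency coefficients gives a compact description of the overall word shape, while the block features retain some coarse spatial detail; how the paper weights or normalizes the two parts is not specified in this section.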
2. System Description
The developed holistic OCR system consists of two modules. The first one is the training module,
where the holistic features are extracted from the training set of word images. The extracted
features are used to build the set of clusters of similar word shapes. The generated word clusters and
their extracted features represent the knowledge base that is used in the recognition phase. The second
module is the recognition module. In that module, after applying the preprocessing operations to the
input image, the detected text blocks are segmented into lines and words. The features are extracted for
each word image, then the word cluster, or the n-best clusters, that have the minimum Euclidean distance
to the test image vector are assigned. The word list generated from the selected cluster is used to
construct a word lattice of the possible recognition hypotheses for the whole line. This word lattice
is rescored using an n-gram language model to get the best recognition hypothesis. Figure 2 shows the
block diagram of the proposed holistic OCR system.
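The recognition flow described above can be sketched roughly as follows: pick the nearest cluster(s) for each word vector by Euclidean distance, collect their word lists into a lattice, and run a Viterbi search with a bigram language model to pick the best line hypothesis. Everything concrete here is an assumption for illustration: the centroids, the placeholder English words, the toy bigram scores, the back-off value, and the function names `n_best_clusters` and `rescore`. The sketch also scores paths with the language model only, whereas a real system might combine it with shape distances.

```python
import math

def n_best_clusters(vector, centroids, n=2):
    """Indices of the n cluster centroids nearest to `vector` (Euclidean)."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))
    return order[:n]

def rescore(lattice, bigram_logp, start="<s>"):
    """Viterbi search for the best word sequence through the lattice.

    lattice: one list of candidate words per word position in the line.
    bigram_logp(prev, word): log-probability supplied by a language model.
    """
    best = {start: (0.0, [])}  # last word -> (score, path ending in that word)
    for candidates in lattice:
        step = {}
        for word in candidates:
            score, path = max(
                ((s + bigram_logp(prev, word), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            step[word] = (score, path + [word])
        best = step
    return max(best.values(), key=lambda t: t[0])[1]

# Hypothetical knowledge base: cluster centroids and the words in each cluster.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
cluster_words = [["cat", "cut"], ["sat", "set"], ["mat"]]

# Toy bigram table; unseen pairs get an assumed low back-off score.
BIGRAMS = {("<s>", "cat"): -0.5, ("cat", "sat"): -0.4, ("<s>", "cut"): -2.0}

def bigram_logp(prev, word):
    return BIGRAMS.get((prev, word), -5.0)

# Hypothetical feature vectors for the two word images of one text line.
line_vectors = [(0.1, 0.1), (0.9, 0.2)]
lattice = []
for v in line_vectors:
    words = [w for c in n_best_clusters(v, centroids, n=1) for w in cluster_words[c]]
    lattice.append(words)

best_line = rescore(lattice, bigram_logp)
print(best_line)  # ['cat', 'sat']
```

Restricting each position to the words of the nearest cluster(s) is what keeps the lattice small: the language model only has to choose among a handful of similarly shaped candidates instead of the full lexicon.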
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science