Page - 131 - in Document Image Processing
J. Imaging 2018, 4, 15
presented in Figures 1 and 3. It is composed of 853 pages that were automatically divided into lines,
giving a total of 20,356 lines. In the standard training partition, the vocabulary size is
about 11,000 words with a set of 106 characters (the 105 different characters that appear in the text
of the training partition and one extra character that appears in the text of the validation partition),
including 10 numbers, 72 upper and lowercase letters with and without accents, 5 punctuation marks,
1 blank space and 18 special symbols. The first 15,010 lines are publicly available on the website of
the Pattern Recognition and Human Language Technology (PRHLT) research center [20]. In this work,
we used this publicly available partition. The first 9000 lines were used for training the optical and
language models, the next 1000 for validation and the last 5010 lines for testing.
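The split described above can be sketched as follows. This is only a minimal illustration, assuming the 15,010 public lines are already loaded, in their original order, into a Python list; the helper name `split_rodrigo` and the placeholder line strings are hypothetical and do not come from the original work.

```python
from typing import List, Sequence, Tuple

# Partition sizes stated in the text, in line order.
N_TRAIN, N_VALID, N_TEST = 9000, 1000, 5010


def split_rodrigo(lines: Sequence[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split the ordered public lines into train/validation/test (hypothetical helper)."""
    expected = N_TRAIN + N_VALID + N_TEST  # 15,010 publicly available lines
    if len(lines) != expected:
        raise ValueError(f"expected {expected} lines, got {len(lines)}")
    train = list(lines[:N_TRAIN])                   # first 9000 lines
    valid = list(lines[N_TRAIN:N_TRAIN + N_VALID])  # next 1000 lines
    test = list(lines[N_TRAIN + N_VALID:])          # last 5010 lines
    return train, valid, test


# Placeholder strings standing in for the real transcriptions (assumption).
dummy_lines = [f"line-{i}" for i in range(15010)]
train, valid, test = split_rodrigo(dummy_lines)
```

Slicing by position keeps the partitions contiguous, which matches the "first / next / last" wording of the text.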
Figure 3. Page 515 of the Rodrigo dataset.
In the Rodrigo corpus, there are many rare words and words in their archaic forms, yielding a large
number of OOV words. Moreover, this corpus contains scarce OOV characters (such as: \, p´, g¯, and w)
that do not belong to the training set. OOV words generally include words that appear in distinct forms
in the training and test sets (e.g., portugal and portug¯l), abbreviations and words hyphenated differently
in the training and test sets.
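To make the notion of OOV words and OOV characters concrete, the following sketch collects the units seen in training and reports those that appear only in another partition. It assumes whitespace tokenization, and the two toy lines are invented for illustration (the spelling variant is arbitrary and is not taken from the corpus); the function names are hypothetical.

```python
from typing import Iterable, Set


def vocabulary(lines: Iterable[str]) -> Set[str]:
    """Set of whitespace-separated words seen in the given lines."""
    return {word for line in lines for word in line.split()}


def charset(lines: Iterable[str]) -> Set[str]:
    """Set of characters (including the blank space) seen in the given lines."""
    return {ch for line in lines for ch in line}


def oov(train_units: Set[str], other_units: Set[str]) -> Set[str]:
    """Units present in the other partition but never seen in training."""
    return other_units - train_units


# Toy lines, invented for illustration (assumption, not corpus data).
train_lines = ["el rey de portugal"]
test_lines = ["el rey de portugál"]

oov_words = oov(vocabulary(train_lines), vocabulary(test_lines))
oov_chars = oov(charset(train_lines), charset(test_lines))
```

In this toy case a single unseen character makes the whole word OOV, which is the situation the paragraph describes for words that differ in form between the training and test sets.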
Table 1 presents a summary of the information contained in the partitions of the Rodrigo corpus
used in this work for the three lexical units studied: words, sub-words and characters. For each
lexical unit, this table presents the total amount, the vocabulary size (different units), the amount of
OOV units and the overlap between the OOV units contained in the validation and test partitions, i.e.,
the amount of OOV units contained in the test partition that are present in the validation partition.
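The quantities summarized in the table can be computed along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes each partition is already segmented into a flat list of units (words, sub-words or characters), and it counts distinct OOV units, which is one possible reading of "amount of OOV units". The function name `unit_stats` and the toy data are hypothetical.

```python
from typing import Dict, List


def unit_stats(train: List[str], valid: List[str], test: List[str]) -> Dict[str, int]:
    """Per-unit summary: totals, vocabulary size, OOV counts and OOV overlap."""
    train_vocab = set(train)
    valid_oov = set(valid) - train_vocab  # distinct validation units unseen in training
    test_oov = set(test) - train_vocab    # distinct test units unseen in training
    return {
        "train_total": len(train),        # running units in the training partition
        "train_vocab": len(train_vocab),  # different units in the training partition
        "valid_oov": len(valid_oov),
        "test_oov": len(test_oov),
        # Test OOV units that are also present in the validation partition.
        "overlap": len(test_oov & valid_oov),
    }


# Toy unit sequences, invented for illustration (assumption).
stats = unit_stats(
    train=["a", "b", "a", "c"],
    valid=["a", "d", "e"],
    test=["d", "f", "b", "d"],
)
```

Because a test OOV unit is by definition absent from training, any such unit that also occurs in validation is necessarily a validation OOV unit, so intersecting the two OOV sets gives the overlap described in the text.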
Document Image Processing
- Title: Document Image Processing
- Authors: Ergina Kavallieratou, Laurence Likforman-Sulem
- Editor: MDPI
- Location: Basel
- Date: 2018
- Language: German
- License: CC BY-NC-ND 4.0
- ISBN: 978-3-03897-106-1
- Size: 17.0 x 24.4 cm
- Pages: 216
- Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category: Computer Science