Document Image Processing

J. Imaging 2018, 4, 15

presented in Figures 1 and 3. It is composed of 853 pages that were automatically divided into lines, giving a total number of 20,356 lines. In the standard training partition, the vocabulary size is about 11,000 words with a set of 106 characters (the 105 different characters that appear in the text of the training partition and one extra character that appears in the text of the validation partition), including 10 numbers, 72 upper and lowercase letters with and without accents, 5 punctuation marks, 1 blank space and 18 special symbols. The first 15,010 lines are publicly available on the website of the Pattern Recognition and Human Language Technology (PRHLT) research center [20]. In this work, we used this publicly available partition. The first 9000 lines were used for training the optical and language models, the next 1000 for validation and the last 5010 lines for testing.

Figure 3. Page 515 of the Rodrigo dataset.

In the Rodrigo corpus, there are many rare words and words in their archaic forms, yielding a large amount of OOV words. Moreover, this corpus contains scarce OOV characters (such as: \, p´, g¯, and w) that do not belong to the training set. OOV words generally include words that appear in distinct form in the training and test sets (e.g., portugal and portug¯l), abbreviations and words hyphenated differently in the training and test sets.

Table 1 presents a summary of the information contained in the partitions of the Rodrigo corpus used in this work at the three lexical units studied: words, sub-words and characters. This table presents for each lexical unit the total amount, the vocabulary size (different units), the amount of OOV units and the overlap between the OOV units contained in the validation and test partitions, i.e., the amount of OOV units contained in the test partition that are also present in the validation partition.
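The partitioning and the OOV statistics described above can be illustrated with a minimal Python sketch. It is not the authors' tooling. The function names (`split_partitions`, `vocabulary`, `oov_summary`) are hypothetical, and the sketch assumes that the transcriptions are available as a list of strings in corpus order, that words are separated by whitespace, and that OOV units are counted as distinct types rather than running tokens. Sub-word units are omitted because their definition is not given in this section.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Tokenizer = Callable[[str], List[str]]


def words(line: str) -> List[str]:
    """Split a transcribed line into whitespace-separated words (assumption)."""
    return line.split()


def chars(line: str) -> List[str]:
    """Split a transcribed line into individual characters, blanks included."""
    return list(line)


def split_partitions(
    lines: Sequence[str], n_train: int = 9000, n_val: int = 1000
) -> Tuple[List[str], List[str], List[str]]:
    """Sequential split: first n_train lines, next n_val lines, then the rest."""
    train = list(lines[:n_train])
    val = list(lines[n_train:n_train + n_val])
    test = list(lines[n_train + n_val:])
    return train, val, test


def vocabulary(lines: Sequence[str], tokenize: Tokenizer = words) -> Set[str]:
    """Set of distinct lexical units occurring in the given lines."""
    return {unit for line in lines for unit in tokenize(line)}


def oov_summary(
    train: Sequence[str],
    val: Sequence[str],
    test: Sequence[str],
    tokenize: Tokenizer = words,
) -> Dict[str, Set[str]]:
    """Distinct OOV units of validation and test, and the units they share."""
    train_vocab = vocabulary(train, tokenize)
    oov_val = vocabulary(val, tokenize) - train_vocab
    oov_test = vocabulary(test, tokenize) - train_vocab
    return {
        "train_vocab": train_vocab,
        "oov_val": oov_val,
        "oov_test": oov_test,
        "overlap": oov_val & oov_test,  # test OOV units also seen in validation
    }
```

With the 15,010 publicly available lines and the default arguments, `split_partitions` yields partitions of 9000, 1000 and 5010 lines. Passing `tokenize=chars` to `oov_summary` gives the same statistics at the character level.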

Title
Document Image Processing
Authors
Ergina Kavallieratou
Laurence Likforman-Sulem
Publisher
MDPI
Place
Basel
Date
2018
Language
German
License
CC BY-NC-ND 4.0
ISBN
978-3-03897-106-1
Dimensions
17.0 x 24.4 cm
Pages
216
Keywords
document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category
Computer Science