Page 200 in Document Image Processing


J. Imaging 2018, 4, 32

textline formation method (see [46] for more details). The second stage uses CAE to automatically produce features, instead of hard-coding them. These features have been learned in an unsupervised way from the textline candidates obtained in the first stage. Then, to discriminate text objects from non-text ones, an SVM classifier with RBF kernel is trained on the patches extracted from the textline candidates by using the generated CAE features. Note that the whole algorithm is performed twice (for each image) to handle both dark-on-light and light-on-dark texts, once along the gradient direction and once along the inverse direction. The results of the two passes are combined to make final decisions.

Figure 12. Pipeline of the text detection algorithm. Two passes are performed, one for each text polarity (Dark text on Light background or Light text on Dark background).

5.2. SIDOCR

The SIDOCR system [51] relies specifically on a Multi-Dimensional Long Short Term Memory (MDLSTM) with a CTC output layer. The proposed network is composed of three levels: an input layer, five hidden layers and an output layer. The hidden layers are MDLSTM layers that respectively have 2, 10, and 50 cells and are separated by feedforward layers with 6 and 20 cells. In fact, we have created a hierarchical structure by repeatedly composing MDLSTM layers with feedforward layers.

Firstly, the image is divided into small patches using a pixel window called the “input block”, each of which is presented to the first MDLSTM layer as a feature vector of pixel intensities. These vectors are then scanned by four MDLSTM layers in different directions (i.e., up, down, left and right). After that, the cell activations of the MDLSTM layers are sequentially fed to the first and second feedforward layers through sub-sample windows, namely the “hidden block”. This can be seen as a subsampling step with trainable weights, in which the activations are summed and squashed by the hyperbolic tangent (tanh) function. This step aims to greatly reduce the number of weight connections between hidden layers. The final level is the CTC output layer, which labels the sequences of textlines.
This layer has n cells, where n is the number of classes, in our case 165 (164 characters and one cell for the ‘blank’ output). The output activations are normalized at each time step with the softmax activation function. The use of such a layer allows working on unsegmented input sequences, which is not the case for standard RNN objective functions. A separate network has been trained for each TV channel of the reference protocol. All input images have been scaled to a common height (70 pixels) and converted to gray-scale. The training is carried out with the back-propagation through time (BPTT) algorithm, and the steepest descent optimizer has been used with a learning rate of 10⁻⁴ and a momentum value of 0.9. We performed several experiments to find the optimal sizes of the MDLSTM layers, feedforward layers, input block and hidden block. Table 6 summarizes the best obtained values of the network parameters.
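As a rough illustration of the subsampling step between an MDLSTM layer and a feedforward layer, the following Python sketch tiles a 2D grid of activation vectors into non-overlapping "hidden blocks", computes a weighted sum over each block and squashes it with tanh. It is a simplified sketch, not the SIDOCR implementation: the function name `subsample_blocks`, the bias term, the use of a single activation map (rather than one per scanning direction) and the toy sizes are all assumptions made for illustration.

```python
import math


def subsample_blocks(activations, block_h, block_w, weights, bias):
    """Hypothetical helper: weighted sum + tanh over non-overlapping blocks.

    activations: H x W grid (list of rows) of feature vectors of length C.
    weights:     out_dim rows, each of length block_h * block_w * C.
    bias:        out_dim floats (assumed; the text does not mention a bias).
    Returns a (H // block_h) x (W // block_w) grid of out_dim-long vectors.
    """
    rows = len(activations) // block_h
    cols = len(activations[0]) // block_w
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Flatten every activation inside the current hidden block.
            block = [
                value
                for dy in range(block_h)
                for dx in range(block_w)
                for value in activations[r * block_h + dy][c * block_w + dx]
            ]
            # Trainable weighted sum, then tanh squashing.
            out_row.append([
                math.tanh(sum(w * x for w, x in zip(w_row, block)) + b)
                for w_row, b in zip(weights, bias)
            ])
        out.append(out_row)
    return out


# Toy usage: a 2 x 4 map with one feature per position, 2 x 2 hidden block.
toy_map = [[[1.0] for _ in range(4)] for _ in range(2)]
toy_out = subsample_blocks(toy_map, 2, 2, [[0.25, 0.25, 0.25, 0.25]], [0.0])
```

With a 2 x 2 block, the 2 x 4 map shrinks to 1 x 2 positions, which is the sense in which this step cuts the number of connections feeding the next layer.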
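The output stage described above (a softmax over n classes at each time step, with one class reserved for 'blank') can be illustrated with best-path decoding, a common way to turn CTC outputs into a label sequence. The text does not state which decoding scheme SIDOCR uses, so best-path decoding is an assumption here, as are the function names and the choice of index 0 for the blank class.

```python
import math

BLANK = 0  # assumed index of the 'blank' class; the source does not specify it


def softmax(scores):
    """Normalize one time step of raw output activations into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_best_path(frames, blank=BLANK):
    """Pick the most probable class per time step, merge repeats, drop blanks."""
    labels = []
    previous = None
    for scores in frames:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        # A label is emitted only when it differs from the previous frame's
        # label and is not the blank symbol.
        if best != previous and best != blank:
            labels.append(best)
        previous = best
    return labels


# Toy usage with 3 classes (blank + 2 characters) instead of 165.
toy_frames = [
    [0.1, 2.0, 0.3],  # class 1
    [0.0, 1.5, 0.2],  # class 1 again (merged with the previous frame)
    [3.0, 0.1, 0.1],  # blank
    [0.2, 2.5, 0.1],  # class 1 (a new occurrence, separated by the blank)
    [0.1, 0.2, 1.9],  # class 2
    [0.0, 0.1, 2.2],  # class 2 again (merged)
    [2.0, 0.3, 0.1],  # blank
]
decoded = ctc_best_path(toy_frames)
```

The blank class is what lets the same character appear twice in a row in the output: repeated frames are merged unless a blank separates them, so no pre-segmentation of the input is needed.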
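For reference, one common formulation of steepest descent with momentum updates a velocity v and the weights w as v ← μ·v − η·g and w ← w + v, where g is the gradient, η the learning rate and μ the momentum. The sketch below applies this with the reported values (η = 10⁻⁴, μ = 0.9). The exact variant used by the authors is not given in the text, so the formulation and the helper name `momentum_step` are assumptions.

```python
def momentum_step(params, grads, velocity, lr=1e-4, momentum=0.9):
    """One steepest-descent update with momentum (assumed classical form)."""
    # The velocity accumulates a decaying sum of past gradient steps.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    # Weights move along the velocity rather than the raw gradient.
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy usage: one weight, constant gradient of 2.0, two consecutive steps.
p1, v1 = momentum_step([1.0], [2.0], [0.0])
p2, v2 = momentum_step(p1, [2.0], v1)
```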


Document Image Processing

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Publisher: MDPI
Place: Basel
Date: 2018
Language: English
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Dimensions: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer science