Page - 200 - in Document Image Processing

Text of the Page - 200 -

J. Imaging 2018, 4, 32

textline formation method (see [46] for more details). The second stage uses CAE to automatically produce features, instead of hard-coding them. These features have been learned in an unsupervised way from the textline candidates obtained in the first stage. Then, to discriminate text objects from non-text ones, an SVM classifier with an RBF kernel is trained on the patches extracted from the textline candidates by using the generated CAE features. Note that the whole algorithm is performed twice (for each image) to handle both dark-on-light and light-on-dark texts, once along the gradient direction and once along the inverse direction. The results of the two passes are combined to make final decisions.

Figure 12. Pipeline of the text detection algorithm. Two passes are performed, one for each text polarity (Dark text on Light background or Light text on Dark background).

5.2. SIDOCR

The SIDOCR system [51] relies specifically on a Multi-Dimensional Long Short Term Memory (MDLSTM) with a CTC output layer. The proposed network is composed of three levels: an input layer, five hidden layers and an output layer. The hidden layers are three MDLSTM layers that respectively have 2, 10, and 50 cells, separated by two feedforward layers with 6 and 20 cells. In fact, we have created a hierarchical structure by repeatedly composing MDLSTM layers with feedforward layers. Firstly, the image is divided into small patches using a pixel window called the "input block", each of which is presented to the first MDLSTM layer as a feature vector of pixel intensities. These vectors are then scanned by four MDLSTM layers in different directions (i.e., up, down, left and right). After that, the cell activations of the MDLSTM layers are sequentially fed to the first and second feedforward layers through sub-sample windows, namely the "hidden block". This can be seen as a subsampling step with trainable weights, in which the activations are summed and squashed by the hyperbolic tangent (tanh) function. This step aims to greatly reduce the number of weight connections between hidden layers. The final level is the CTC output layer, which labels the sequences of textlines.
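The "hidden block" subsampling step described above can be illustrated with a minimal sketch. It assumes that each block of activation vectors is flattened, combined through a trainable weighted sum, and squashed by tanh. The function name `block_subsample`, the grid layout, and the weight shapes are hypothetical choices made for illustration and are not taken from the SIDOCR implementation.

```python
import math
from typing import List

# Activation grid: rows x cols x channels (plain nested lists).
Grid = List[List[List[float]]]


def block_subsample(acts: Grid, block_h: int, block_w: int,
                    weights: List[List[float]], bias: List[float]) -> Grid:
    """Collapse each block_h x block_w block of activation vectors into one
    output vector: a weighted sum of the block, squashed by tanh.

    weights has one row per output unit; each row has
    block_h * block_w * channels entries (hypothetical layout).
    Assumption: rows/columns that do not fill a whole block are dropped.
    """
    rows, cols = len(acts), len(acts[0])
    out: Grid = []
    for r in range(0, rows - block_h + 1, block_h):
        out_row = []
        for c in range(0, cols - block_w + 1, block_w):
            # Flatten every activation inside the block into one vector.
            flat = [v
                    for dr in range(block_h)
                    for dc in range(block_w)
                    for v in acts[r + dr][c + dc]]
            # One tanh unit per weight row.
            out_row.append([
                math.tanh(b + sum(w * x for w, x in zip(w_row, flat)))
                for w_row, b in zip(weights, bias)
            ])
        out.append(out_row)
    return out


# Toy usage: a 4 x 4 grid with 1 channel, 2 x 2 hidden block, 1 output unit.
toy = [[[0.5] for _ in range(4)] for _ in range(4)]
reduced = block_subsample(toy, 2, 2, weights=[[1.0, 1.0, 1.0, 1.0]], bias=[0.0])
```

In this toy case the 4 x 4 grid shrinks to 2 x 2, so the next layer sees four positions instead of sixteen. This is the kind of reduction in weight connections that the text attributes to the subsampling step.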
The CTC output layer has n cells, where n is the number of classes, in our case 165 (164 characters and one cell for the 'blank' output). The output activations are normalized at each time step with the softmax activation function. The use of such a layer allows working on unsegmented input sequences, which is not the case for standard RNN objective functions. A separate network has been trained for each TV channel of the reference protocol. All input images have been scaled to a common height (70 pixels) and converted to gray-scale. The training is carried out with the back-propagation through time (BPTT) algorithm, and the steepest descent optimizer has been used with a learning rate of 10⁻⁴ and a momentum value of 0.9. We performed several experiments to find the optimal sizes of the MDLSTM layers, feedforward layers, input block and hidden block. Table 6 summarizes the best obtained values of the network parameters.
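The per-time-step softmax normalization of the CTC output layer, and the role of the 'blank' cell, can be sketched as follows. The decoding rule shown is best-path decoding: take the most probable class at each time step, merge repeats, and drop blanks. It is used here only as a simple illustration; the text does not state which decoding scheme SIDOCR applies. The blank index `BLANK = 0` and the function names are assumptions.

```python
import math
from typing import List, Sequence

# Assumption: the 'blank' cell sits at index 0; the source does not specify this.
BLANK = 0


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw output activations into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_path_decode(activations: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable class at each time step, then merge consecutive
    repeats and remove blanks (illustrative decoding, not the authors' method)."""
    labels: List[int] = []
    prev = BLANK
    for scores in activations:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != BLANK and best != prev:
            labels.append(best)
        prev = best
    return labels


# Toy usage with 3 classes (blank + 2 characters) instead of the 165 in the text.
toy_activations = [
    [0.0, 5.0, 0.0],  # class 1
    [0.0, 5.0, 0.0],  # class 1 again (merged with the previous step)
    [5.0, 0.0, 0.0],  # blank
    [0.0, 5.0, 0.0],  # class 1 (a new occurrence, separated by the blank)
    [0.0, 0.0, 5.0],  # class 2
    [0.0, 0.0, 5.0],  # class 2 again (merged)
    [5.0, 0.0, 0.0],  # blank
]
decoded = best_path_decode(toy_activations)
```

The blank between the two runs of class 1 is what allows a repeated character to be emitted twice. This is why the network needs no pre-segmented input.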

Document Image Processing

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Editor: MDPI
Location: Basel
Date: 2018
Language: English
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Size: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science