Page 135 of Document Image Processing

J. Imaging 2018, 4, 15

partition. The Weighted Finite State Transducers (WFST) decoding (see Section 3.5) can be designed to output word, sub-word or character sequences. For each output type, the lexicon and language model have to be modified accordingly, and no additional modification is necessary in the system.

[Figure 5 diagram: preprocessing and feature extraction produce the input sequence x (x1, x2, ..., x60); the Recurrent Neural Network (BLSTM layer) outputs o (o1, ..., o106); WFST decoding with the word lexicon and language model produces the transcription.]

Figure 5. Bi-directional Long-Short Term Memory (BLSTM) system architecture. The BLSTM RNN outputs posterior distributions o at each time step. The decoding is performed with Weighted Finite State Transducers (WFST) using a lexicon and a language model at word level.

3.4.3. Deep Models Based on Convolutional Recurrent Neural Networks

The Convolutional Recurrent Neural Network (CRNN) [32] is inspired by the VGG16 architecture [33] that was developed for image recognition. We use a stack of 13 convolutional (3 × 3 filters, 1 × 1 stride) layers followed by three bi-directional LSTM layers with 256 units per layer (see Figure 6). Each LSTM unit has one cell with enabled peephole connections. Spatial pooling (max) is employed after some convolutional layers. To introduce non-linearity, the Rectified Linear Unit (ReLU) activation function was used after each convolution. It has the advantage of being resistant to the vanishing gradient problem while being simple in terms of computation and was shown to work better than sigmoid and hyperbolic tangent activation functions [34].

A square-shaped sliding window is used to scan the text-line image in the direction of the writing. The height of the window is equal to the height of the text-line image, which has been normalized to 64 pixels. The window overlap is equal to two pixels to allow continuous transition of the convolution filters. For each analysis window of 64 × 64 pixels in size, 16 feature vectors are extracted from the feature maps produced by the last convolutional layer and fed into the observation sequence.
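The sliding-window scan described above can be illustrated with a minimal NumPy sketch. The window size (64 pixels) and the overlap (two pixels) come from the text; the function name `sliding_windows` is hypothetical, and the handling of the right edge of the line (a trailing strip narrower than one window is dropped) is an assumption, since the text does not say how the edge is treated.

```python
import numpy as np

WINDOW = 64   # window height and width in pixels (from the text)
OVERLAP = 2   # overlap between consecutive windows in pixels (from the text)


def sliding_windows(line_image, window=WINDOW, overlap=OVERLAP):
    """Cut a height-normalized text-line image into square windows.

    Assumption: a trailing strip narrower than one full window is dropped;
    the source does not specify the treatment of the right edge.
    """
    height, width = line_image.shape
    if height != window:
        raise ValueError(f"expected line height {window}, got {height}")
    stride = window - overlap  # 62 pixels between consecutive window starts
    starts = range(0, width - window + 1, stride)
    windows = [line_image[:, s:s + window] for s in starts]
    if not windows:
        # Line is narrower than one window: return an empty stack.
        return np.empty((0, window, window), dtype=line_image.dtype)
    return np.stack(windows)


# Usage: a 64 x 250 line image is cut at column offsets 0, 62, 124 and 186.
line = np.zeros((64, 250))
windows = sliding_windows(line)
print(windows.shape)
```

With 16 feature vectors per window, as stated in the text, the four windows of this example would contribute 64 vectors to the observation sequence.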
For each of the 16 columns of the last 512 feature maps, the columns of a height of two pixels are concatenated into a feature vector of size 1024 (512 × 2). Thanks to the CTC transcription layer [35], the system is end-to-end trainable. The convolutional filters and the LSTM units weights are thus jointly learned using the back-propagation procedure.

We combined the forward and backward outputs at the end of the BLSTM stack [36] rather than after each BLSTM layer, in order to decrease the number of parameters. We also chose not to add additional fully-connected layers since, by adding such layers, the network had more parameters, converged more slowly and performed worse. Hyperparameters such as the number of convolution layers and the number of BLSTM layers were set up on a validation set.

The LSTM unit weights were initialized as per the method of [37], which proved to work well and helps the network to converge faster. This allows the network to maintain a constant variance across the network layers, which keeps the signal from exploding to a high value or vanishing to zero. The weight matrices were initialized with a uniform distribution.
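As a rough illustration of the column-wise feature extraction described in this section (16 columns, 512 feature maps, column height of two pixels, vectors of size 1024), the reshaping can be sketched in NumPy as follows. The function name `maps_to_feature_vectors` is hypothetical, and the ordering of values inside each vector (map by map, top pixel first) is an assumption; the text only fixes the sizes.

```python
import numpy as np


def maps_to_feature_vectors(feature_maps):
    """Turn (channels, height, width) feature maps into one vector per column.

    For the sizes given in the text (512 maps, height 2, 16 columns) the
    result has shape (16, 1024).
    Assumption: values are ordered map by map within each vector.
    """
    channels, height, width = feature_maps.shape
    # Put the column axis first, then flatten each (channels, height) block.
    return feature_maps.transpose(2, 0, 1).reshape(width, channels * height)


# Usage with the dimensions stated in the text.
maps = np.random.rand(512, 2, 16)   # output of the last convolutional layer
vectors = maps_to_feature_vectors(maps)
print(vectors.shape)                # one 1024-dimensional vector per column
```

Each row of `vectors` is one observation for the recurrent layers, so a single 64 × 64 analysis window contributes 16 consecutive time steps.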

Document Image Processing

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Publisher: MDPI
Location: Basel
Date: 2018
Language: German
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Dimensions: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science