partition. The Weighted Finite State Transducers (WFST) decoding (see Section 3.5) can be designed to output word, sub-word or character sequences. For each output type, the lexicon and language model have to be modified accordingly; no other modification of the system is necessary.
[Figure 5 diagram: text-line image → preprocessing and feature extraction → frame sequence x1, x2, ..., x60 → BLSTM layer → output posteriors o1, ..., o106 → WFST decoding with word lexicon and language model → recognized transcription.]
Figure 5. Bi-directional Long Short-Term Memory (BLSTM) system architecture. The BLSTM RNN outputs posterior distributions o at each time step. The decoding is performed with Weighted Finite State Transducers (WFST) using a lexicon and a language model at the word level.
3.4.3. Deep Models Based on Convolutional Recurrent Neural Networks
The Convolutional Recurrent Neural Network (CRNN) [32] is inspired by the VGG16 architecture [33] that was developed for image recognition. We use a stack of 13 convolutional layers (3 × 3 filters, 1 × 1 stride) followed by three bi-directional LSTM layers with 256 units per layer (see Figure 6). Each LSTM unit has one cell with enabled peephole connections. Spatial pooling (max) is employed after some convolutional layers. To introduce non-linearity, the Rectified Linear Unit (ReLU) activation function was used after each convolution. It has the advantage of being resistant to the vanishing gradient problem while being simple in terms of computation, and was shown to work better than sigmoid and hyperbolic tangent activation functions [34].
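As an illustration, a minimal PyTorch sketch of such a stack follows. The channel progression is taken from VGG16 and the asymmetric pooling schedule is an assumption chosen to reproduce the geometry described below (a 64-pixel-high input reduced to 512 maps of height 2, with the width divided by 4); the paper's exact pooling placement, its peephole connections, and its end-of-stack combination of the two directions are not reproduced here.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Sketch of the CRNN described above: 13 convolutional layers
    (3x3 filters, stride 1) with ReLU, max pooling after some of them,
    and three bi-directional LSTM layers of 256 units. Channel widths
    and pooling kernels are assumptions; nn.LSTM has no peepholes and
    concatenates directions between layers, a simplification of the
    paper's end-of-stack combination."""

    def __init__(self, n_classes: int):
        super().__init__()
        blocks = [(1, [64, 64], (2, 2)),          # (in_ch, conv widths, pool)
                  (64, [128, 128], (2, 2)),
                  (128, [256, 256, 256], (2, 1)),
                  (256, [512, 512, 512], (2, 1)),
                  (512, [512, 512, 512], (2, 1))]  # 13 convs, height /32, width /4
        layers = []
        for c_in, widths, pool in blocks:
            for c_out in widths:
                layers += [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                c_in = c_out
            layers.append(nn.MaxPool2d(pool))
        self.cnn = nn.Sequential(*layers)
        # Three stacked BLSTMs; input frames are 512 maps x 2 rows = 1024-dim.
        self.rnn = nn.LSTM(1024, 256, num_layers=3, bidirectional=True)
        self.fc = nn.Linear(2 * 256, n_classes)  # per-frame scores for CTC

    def forward(self, x):            # x: (batch, 1, 64, width)
        f = self.cnn(x)              # (batch, 512, 2, width // 4)
        b, c, h, w = f.shape
        f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (time, batch, 1024)
        out, _ = self.rnn(f)         # (time, batch, 512)
        return self.fc(out)          # (time, batch, n_classes)
```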
A square-shaped sliding window is used to scan the text-line image in the direction of the writing. The height of the window is equal to the height of the text-line image, which has been normalized to 64 pixels. The window overlap is equal to two pixels to allow a continuous transition of the convolution filters. For each analysis window of 64 × 64 pixels in size, 16 feature vectors are extracted from the feature maps produced by the last convolutional layer and fed into the observation sequence. For each of the 16 columns of the last 512 feature maps, the columns of a height of two pixels are concatenated into a feature vector of size 1024 (512 × 2).
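To make the column-to-frame construction concrete, the following NumPy sketch (array names hypothetical) performs the same concatenation for one analysis window:

```python
import numpy as np

# Hypothetical output of the last conv layer for one 64 x 64 window:
# 512 feature maps of height 2 and width 16.
feature_maps = np.random.rand(512, 2, 16).astype(np.float32)

# For each of the 16 columns, the two-pixel-high columns of all 512 maps
# are concatenated into one feature vector of size 512 x 2 = 1024.
frames = feature_maps.transpose(2, 0, 1).reshape(16, 512 * 2)

print(frames.shape)  # (16, 1024) -> 16 observation vectors per window
```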
Thanks to the CTC transcription layer [35], the system is end-to-end trainable. The convolutional filters and the LSTM unit weights are thus jointly learned using the back-propagation procedure. We combined the forward and backward outputs at the end of the BLSTM stack [36] rather than after each BLSTM layer, in order to decrease the number of parameters. We also chose not to add additional fully-connected layers since, by adding such layers, the network had more parameters, converged more slowly and performed worse.
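A minimal sketch of such CTC-based end-to-end training, using PyTorch's nn.CTCLoss with illustrative shapes (the symbol inventory and sequence lengths are placeholders, not the paper's values):

```python
import torch
import torch.nn as nn

# Illustrative shapes: T time steps, batch of N, C symbol classes
# (index 0 reserved for the CTC blank).
T, N, C = 16, 2, 80
log_probs = torch.randn(T, N, C).log_softmax(dim=2).requires_grad_()
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back through conv and LSTM weights alike
```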
Hyperparameters such as the number of convolution layers and the number of BLSTM layers were set using a validation set. The LSTM unit weights were initialized as per the method of [37], which proved to work well and helps the network to converge faster. This allows the network to maintain a constant variance across the network layers, which keeps the signal from exploding to a high value or vanishing to zero. The weight matrices were initialized with a uniform distribution.
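The constant-variance, uniform-distribution initialization described here is characteristic of Glorot-style (Xavier) schemes; assuming that reading of [37], a sketch:

```python
import math
import torch

def glorot_uniform_(weight: torch.Tensor) -> torch.Tensor:
    """Uniform initialization keeping activation variance roughly constant
    across layers. Assumed here to match [37]; the fan computation is a
    simplification for 2-D weight matrices."""
    fan_out, fan_in = weight.shape[0], weight.shape[1]
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return weight.uniform_(-bound, bound)

# Usage on an LSTM's input-to-hidden and hidden-to-hidden matrices:
lstm = torch.nn.LSTM(input_size=1024, hidden_size=256)
for name, param in lstm.named_parameters():
    if "weight" in name:
        glorot_uniform_(param.data)
```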