Page - 200 - in Document Image Processing
J. Imaging 2018, 4, 32
textline formation method (see [46] for more details). The second stage uses CAE to automatically
produce features, instead of hard-coding them. These features have been learned in an unsupervised
way from the textline candidates obtained in the first stage. Then, to discriminate text objects from
non-text ones, an SVM classifier with RBF kernel is trained on the patches extracted from the textline
candidates by using the generated CAE features.
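As a minimal sketch of how an RBF-kernel SVM scores a feature vector once it has been trained, the decision function can be written out by hand. The support vectors, coefficients, bias and `gamma` below are hypothetical toy values, not parameters from the paper, and the two-dimensional vectors merely stand in for the CAE features.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score of a trained kernel SVM; positive is read as 'text' here."""
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


# Hypothetical toy model: one 'text' and one 'non-text' support vector.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [1.0, -1.0]  # each entry is alpha_i * y_i

score_text = svm_decision([0.1, 0.0], support_vectors, dual_coefs, 0.0, gamma=1.0)
score_non_text = svm_decision([0.9, 1.0], support_vectors, dual_coefs, 0.0, gamma=1.0)
```

A patch whose features lie close to the 'text' support vector gets a positive score, and one close to the 'non-text' support vector gets a negative score.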
Note that the whole algorithm is performed twice (for each image) to handle both dark-on-light
and light-on-dark texts, once along the gradient direction and once along the inverse direction.
The results of two passes are combined to make final decisions.
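The two-pass scheme can be sketched as a small driver that runs one detection pass per polarity and merges the results. The text does not say how the two passes are combined, so the duplicate-free union used here is an assumption, and `toy_detect_pass` is a hypothetical stand-in for the real detector.

```python
def detect_text_two_pass(image, detect_pass):
    """Run one pass per text polarity and merge the detected regions.

    Assumption: the passes are merged by a duplicate-free union.
    """
    merged = []
    for polarity in ("dark_on_light", "light_on_dark"):
        for region in detect_pass(image, polarity):
            if region not in merged:
                merged.append(region)
    return merged


def toy_detect_pass(image, polarity):
    """Hypothetical detector: flags very dark or very bright pixels."""
    hits = []
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if polarity == "dark_on_light" and pixel < 64:
                hits.append((r, c))
            elif polarity == "light_on_dark" and pixel > 191:
                hits.append((r, c))
    return hits


regions = detect_text_two_pass([[0, 128, 255]], toy_detect_pass)
```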
Figure 12. Pipeline of the text detection algorithm. Two passes are performed, one for each text polarity
(Dark text on Light background or Light text on Dark background).
5.2. SIDOCR
The SIDOCR system [51] relies specifically on a Multi-Dimensional Long Short Term Memory
(MDLSTM) with a CTC output layer. The proposed network is composed of three levels: an input
layer, five hidden layers and an output layer. The hidden layers are MDLSTM layers that respectively have
2, 10, and 50 cells and are separated by feedforward layers with 6 and 20 cells. In fact, we have created
a hierarchical structure by repeatedly composing MDLSTM layers with feedforward layers. Firstly,
the image is divided into small patches using a pixel window called the “input block”, each of which
is presented to the first MDLSTM layer as a feature vector of pixel intensities. These vectors are then
scanned by four MDLSTM layers in different directions (i.e., up, down, left and right).
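The input-block step can be illustrated with a short sketch that cuts a gray-scale image into blocks and flattens each one into a vector of pixel intensities. The block size used below is a hypothetical value, and the choices of non-overlapping blocks and of dropping incomplete blocks at the border are assumptions that the text does not specify.

```python
def image_to_input_blocks(image, block_h, block_w):
    """Split an image (list of pixel rows) into non-overlapping blocks.

    Each block is flattened row by row into one feature vector.
    Assumption: incomplete blocks at the right and bottom edges are dropped.
    """
    rows, cols = len(image), len(image[0])
    grid = []
    for top in range(0, rows - block_h + 1, block_h):
        grid_row = []
        for left in range(0, cols - block_w + 1, block_w):
            vector = [
                image[r][c]
                for r in range(top, top + block_h)
                for c in range(left, left + block_w)
            ]
            grid_row.append(vector)
        grid.append(grid_row)
    return grid


# Toy 4 x 6 image whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
blocks = image_to_input_blocks(image, block_h=2, block_w=3)
```

The result is a 2 x 2 grid of six-dimensional vectors, which is the kind of two-dimensional sequence a multi-dimensional recurrent layer can scan in several directions.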
After that, the cell activations of the MDLSTM layers are sequentially fed to the first and second
feed-forward layers through sub-sample windows, called “hidden blocks”. This can be seen as
a subsampling step with trainable weights, in which the activations are summed and squashed by
the hyperbolic tangent (tanh) function. This step aims to greatly reduce the number of weight
connections between hidden layers.
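A minimal sketch of this subsampling step, assuming non-overlapping hidden blocks, no bias term, and one weight vector per feedforward unit that is shared across all block positions. The function name, block size and weights are hypothetical.

```python
import math


def subsample_hidden_blocks(activations, block_h, block_w, unit_weights):
    """Collapse each hidden block into one tanh-squashed vector.

    activations[r][c] is the list of cell activations at grid position (r, c).
    unit_weights holds one weight vector per feedforward unit (assumed layout).
    """
    rows, cols = len(activations), len(activations[0])
    output = []
    for top in range(0, rows - block_h + 1, block_h):
        output_row = []
        for left in range(0, cols - block_w + 1, block_w):
            # Gather every cell activation inside the current hidden block.
            inputs = [
                value
                for r in range(top, top + block_h)
                for c in range(left, left + block_w)
                for value in activations[r][c]
            ]
            # Weighted sum followed by tanh, once per feedforward unit.
            output_row.append([
                math.tanh(sum(w * a for w, a in zip(weights, inputs)))
                for weights in unit_weights
            ])
        output.append(output_row)
    return output


# Toy 2 x 4 grid with one cell activation per position, one feedforward unit.
activations = [[[0.5], [0.5], [0.1], [0.2]],
               [[-0.5], [0.5], [0.3], [0.4]]]
unit_weights = [[1.0, 1.0, 1.0, 1.0]]
reduced = subsample_hidden_blocks(activations, 2, 2, unit_weights)
```

The 2 x 4 grid shrinks to 1 x 2, so the next layer sees four times fewer positions, and because the weights are shared across positions the number of connections stays small.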
The final level is the CTC output layer, which labels the sequences of textlines. This layer has
n cells, where n is the number of classes, in our case 165 (164 characters and one cell for the ‘blank’
output). The output activations are normalized at each timestep with the softmax activation function.
The use of such a layer allows working on unsegmented input sequences, which is not the case for
standard RNN objective functions. A separate network has been trained for each TV channel of the
reference protocol. All input images have been scaled to a common height (70 pixels) and converted
to gray-scale. The training is carried out with the back-propagation through time (BPTT) algorithm, and a
steepest descent optimizer has been used with a learning rate of 10^-4 and a momentum value of 0.9.
We performed several experiments to find the optimal sizes of the MDLSTM layers, feedforward layers,
input block and hidden block. Table 6 summarizes the best obtained values of the network parameters.
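To illustrate how a softmax-normalized CTC output with a 'blank' class turns into a label sequence without prior segmentation, here is a minimal sketch of best-path (greedy) decoding. The text does not state which decoding method SIDOCR uses, so best-path decoding is an assumption, as are the toy three-class label set and the position of the blank at index 0.

```python
import math


def softmax(scores):
    """Normalize raw output activations into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_best_path(score_sequence, labels, blank_index=0):
    """Pick the most probable class per timestep, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for scores in score_sequence:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != previous and best != blank_index:
            decoded.append(labels[best])
        previous = best
    return "".join(decoded)


# Hypothetical toy alphabet: blank plus two characters (the paper uses 165 classes).
labels = ["<blank>", "a", "b"]
score_sequence = [
    [0.0, 5.0, 0.0],  # a
    [0.0, 4.0, 1.0],  # a (repeat, merged)
    [6.0, 0.0, 0.0],  # blank (separates the two a's)
    [0.0, 3.0, 0.0],  # a
    [0.0, 0.0, 2.0],  # b
    [1.0, 0.0, 7.0],  # b (repeat, merged)
]
text = ctc_best_path(score_sequence, labels)
```

The blank between the second and fourth timesteps is what lets the decoder emit two consecutive 'a' characters instead of merging them into one.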
Document Image Processing
- Title
- Document Image Processing
- Authors
- Ergina Kavallieratou
- Laurence Likforman-Sulem
- Editor
- MDPI
- Location
- Basel
- Date
- 2018
- Language
- German
- License
- CC BY-NC-ND 4.0
- ISBN
- 978-3-03897-106-1
- Size
- 17.0 x 24.4 cm
- Pages
- 216
- Keywords
- document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, indic/arabic/asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
- Category
- Computer Science