Page - 189 - in Document Image Processing

J. Imaging 2018, 4, 32

identification [6]. A Video OCR system is generally composed of four stages: detection, tracking, extraction and recognition. The first two steps consist in locating text regions in video frames and generating the bounding boxes of text lines as an output. Text extraction aims at extracting text pixels and removing background ones. The recognition task converts image regions into text strings. In this work, we focus especially on the detection and recognition steps.

Figure 1. Example of an Arabic video frame including scene and artificial texts (a). Decomposition of an Arabic word into characters (b).

Compared to scanned documents, text detection and recognition in video frames is more challenging. The major challenges are:

• Text patterns variability: unknown font-size and font-family, different colors and alignment (even in the same TV channel).
• Background complexity: text-like objects in video frames, such as fences, bricks and signs, can be confused with text characters.
• Video quality: acquisition conditions, compression artifacts and low resolution.

All these challenges may give rise to failures in video text detection. The present study focuses on the Arabic video OCR problem. This introduces many additional challenges related to Arabic script [7]. Compared to Latin, the Arabic text has special characteristics such as presence of diacritics, non-uniform inter/intra-word distance and cursiveness of the script, i.e., characters may have up to four shapes depending on their position in the word (for example, see Figure 1b).

Several techniques have been proposed in the conventional field of Arabic OCR in scanned documents [7–10]. However, few attempts have been made on the development of detection and recognition systems for overlaid text in Arabic news video [11–13]. These systems were tested on private datasets with different evaluation protocols and metrics that make direct comparison and objective benchmarking rather impractical. For instance, in [11], the proposed text detector was evaluated on a private set of 150 video images. In [13], Yousfi et al. evaluated their text detection system on two private test sets of 164 and 201 video frames. Therefore, the availability of an annotated and public dataset is of key importance for the Arabic video text analysis community.

In this paper, we present AcTiV 2.0 as an open Arabic-Text-in-Video dataset dedicated to benchmarking and comparison of systems for Arabic text detection, tracking and recognition. AcTiV 2.0 is an important extension of the one published in ICDAR 2015 [14]. It includes 189 video clips with an average length of 10 min per sequence for a global duration of about 31 h. These video sequences have been collected from four different Arabic news channels during the period between October 2013 and March 2016. In the present work, three video resolutions were chosen: HD (High Definition, 1920 × 1080), SD (Standard Definition, 720 × 576) and SD (480 × 360). The latter resolution concerns video clips that have been downloaded from the official YouTube channel of TunisiaNat1 TV.

The paper is organized as follows: In Section 2, we present related work on datasets for text detection/recognition problems. Then, we present in Section 3 the AcTiV 2.0 dataset in terms of features, statistics and annotations. We detail the evaluation protocols in Section 4 and present the experimental results in Section 5. In Section 6, we draw the conclusions and discuss future work.
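The four-stage Video OCR pipeline described at the top of this page (detection, tracking, extraction, recognition) can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption and not the method of any cited system: the function names, the (x, y, width, height) box convention, the brightness threshold, and the toy frames are hypothetical, and the recognizer is only a stand-in that does not read any text.

```python
from typing import List, Tuple

Frame = List[List[int]]            # toy grayscale frame: rows of 0-255 intensities
Box = Tuple[int, int, int, int]    # (x, y, width, height); convention assumed here
Track = Tuple[Box, List[int]]      # a box plus the indices of the frames it spans


def detect(frame: Frame, threshold: int = 128) -> List[Box]:
    """Toy detector: one box around all pixels brighter than the threshold."""
    points = [(x, y) for y, row in enumerate(frame)
              for x, value in enumerate(row) if value > threshold]
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [(min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)]


def track(per_frame_boxes: List[List[Box]]) -> List[Track]:
    """Toy tracker: merge identical boxes seen in consecutive frames."""
    tracks: List[Track] = []
    for index, boxes in enumerate(per_frame_boxes):
        for box in boxes:
            for tracked_box, frame_ids in tracks:
                if tracked_box == box and frame_ids[-1] == index - 1:
                    frame_ids.append(index)   # same text line, next frame
                    break
            else:
                tracks.append((box, [index]))  # a new text line appears
    return tracks


def extract(frame: Frame, box: Box, threshold: int = 128) -> List[List[int]]:
    """Toy extractor: crop the box, mark text pixels 1 and background 0."""
    x, y, w, h = box
    return [[1 if value > threshold else 0 for value in row[x:x + w]]
            for row in frame[y:y + h]]


def recognize(mask: List[List[int]]) -> str:
    """Stand-in recognizer: reports the mask size instead of reading text."""
    return f"<unrecognized {len(mask[0])}x{len(mask)} region>"


def run_video_ocr(frames: List[Frame]) -> List[Tuple[Box, List[int], str]]:
    """Chain the four stages: detection, tracking, extraction, recognition."""
    per_frame_boxes = [detect(frame) for frame in frames]
    results = []
    for box, frame_ids in track(per_frame_boxes):
        mask = extract(frames[frame_ids[0]], box)
        results.append((box, frame_ids, recognize(mask)))
    return results


# Three toy frames: text is absent in the first and present in the next two.
blank = [[0] * 4 for _ in range(3)]
with_text = [[0, 0, 0, 0],
             [0, 200, 255, 0],
             [0, 0, 0, 0]]
frames = [blank, with_text, with_text]
print(run_video_ocr(frames))
# [((1, 1, 2, 1), [1, 2], '<unrecognized 2x1 region>')]
```

The sketch only shows how the stages hand data to one another: boxes per frame, then tracks over frames, then a binary text mask, then a string. A real system would replace each toy function with a trained detector, a proper tracker, a binarization step, and an Arabic text recognizer.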

Title: Document Image Processing
Authors: Ergina Kavallieratou; Laurence Likforman-Sulem
Editor: MDPI
Location: Basel
Date: 2018
Language: English
License: CC BY-NC-ND 4.0
ISBN: 978-3-03897-106-1
Size: 17.0 x 24.4 cm
Pages: 216
Keywords: document image processing, preprocessing, binarization, text-line segmentation, handwriting recognition, Indic/Arabic/Asian script, OCR, Video OCR, word spotting, retrieval, document datasets, performance evaluation, document annotation tools
Category: Computer Science