Monday, April 1, 2019

Study of Document Layout Analysis Algorithms

Comparative Study of Document Layout Analysis Algorithms for Printed Document Images

Divya Kamat, Divya Sharma, Parag Chitale, Prateek Dasgupta

ABSTRACT

In the following survey paper, the different algorithms that could be used for document layout analysis have been studied and their results have been compared. For removal of the image mask, Bloomberg's algorithm and CRLA have been described. For the purpose of text segmentation, we have studied the Recursive XY Cut algorithm and the RLSA and RLSO algorithms.

Introduction

Physical layout analysis of printed document images is the first step of OCR conversion. For the OCR to work effectively, we need to provide an input in which no images are present in the document, i.e. the image contains only text. If this is not done properly, the OCR will return garbage values. To avoid this, we have discussed two algorithms, Bloomberg's algorithm and CRLA, that could be used for the removal of images from the document images.

The next step is text segmentation, in which we find the text blocks inside the document. The coordinates of these text blocks are then passed as input to the OCR. To perform this segmentation, we have discussed the Recursive XY Cut algorithm and the RLSA and RLSO algorithms.

Removal of Image from Document

The first step in document layout analysis is to remove the images present in the original document. We will be discussing Bloomberg's algorithm along with its variations, and the CRLA algorithm, for image removal.

Bloomberg's Algorithm

Bloomberg's algorithm is primarily used to find the image mask of halftone images. The implementation of this algorithm uses basic morphological operations.
The algorithm has the following steps:

1. In the first step, binarization of the input image is performed.
2. Next, 4×1 threshold reduction is performed twice using threshold T=1.
3. 4×1 threshold reduction is performed using T=4.
4. 4×1 threshold reduction is performed using T=3.
5. The image is opened with a structuring element of size 5×5.
6. Next, 1×4 expansion of the image is performed twice.
7. Next, the union of the overlapping components of the seed image obtained from step 6 with the image obtained from step 2 is performed.
8. Dilation with a 3×3 structuring element is performed, followed by 1×4 expansion, which is performed twice.
9. The halftone mask obtained from step 8 is then subtracted from the binarized input image.

The main issue with Bloomberg's algorithm is that it is unable to distinguish between text and sketches (i.e. line drawings) in a printed document image.

Enhanced CRLA Algorithm

CRLA stands for Constrained Run-Length Algorithm. In this algorithm we apply horizontal and vertical smoothing to the document image to get a clear separation between text and images in the document.

Enhanced CRLA is used to smooth only the text part of the image and avoid smoothing the non-textual part of the document image.

Algorithm

1. Label the connected components in the document image.
2. Classify the components with respect to their height as follows:
   a. Height less than or equal to 1 cm: label it as 1.
   b. Height between 1 and 3 cm: label it as 2.
   c. Height greater than 3 cm: label it as 3.
3. Apply horizontal smoothing to the components with label 1 only.
4. Apply vertical smoothing to the components with label 1 only.
5. Logically AND the two images obtained previously.
6. Apply horizontal smoothing to the output image of the AND operation.
7. Calculate the Mean Black Run Length:
   a. Calculate the Black Run Length (BRL) row-wise for the region under consideration.
   b. Maintain a Black-White Transition Count (TC) for the region.
   c. Calculate the Mean BRL as MBRL = BRL / TC.
8. Calculate the Mean Transition Count:
   a. Maintain a Black-White Transition Count (TC) for the region.
   b. Calculate W, the width of the region.
   c. Calculate the Mean TC as MTC = TC / W.
9. Extract the components with label 1 from the image whose values of MBRL and MTC are in the acceptable range for a typical document image.
10. Apply horizontal smoothing to the components with label 2 only.
11. Apply vertical smoothing to the components with label 2 only.
12. Logically AND the two images obtained previously.
13. Apply horizontal smoothing to the output image of the AND operation.
14. Calculate MBRL and MTC.
15. Extract the components with labels 2 and 3 from the image whose values of MBRL and MTC are in the acceptable range for a typical document image.

At step 9 we extract the text part of the document image and at step 15 we extract the non-text part of the document image.

The main advantage of the CRLA algorithm is the clear separation of the text and non-text parts of the document image. It also works effectively for sketches as well as halftones. It has considerably less complexity because only selective smoothing is done.

However, after the removal of the non-textual part of the document image, some stray pixels remain in the image. The connected components in the halftone image whose height is less than 1 cm are assumed to be text elements by the algorithm. This results in the presence of unwanted components in the final image.

Text Segmentation

The next step in document layout analysis is the segmentation of text into text blocks that can be provided as input to the OCR. The following algorithms have been studied for this.

Recursive XY Cut algorithm

The recursive XY cut algorithm is used for obtaining text blocks from an image that does not contain any images from the original printed document.
The XY cut algorithm works in the following way:

1. The bounding boxes of the connected components in the image are calculated.
2. Next, we calculate the horizontal and vertical projections of the image.
3. After calculating the projections, we perform X cuts at all the valleys in the horizontal projection whose width is greater than the threshold th.
4. Next, we perform Y cuts between these X cuts at all the valleys in the vertical projection whose width is greater than the threshold tv.
5. We repeat steps 3 and 4 until no further X or Y cuts are possible in a region.

One of the problems with the XY cut algorithm is that there is no method to find a threshold that will work for all documents. Instead, a new threshold needs to be fixed for each document, and this cannot be done without manual intervention.

Another major issue with the recursive XY cut algorithm is its time complexity: it requires a long time to complete execution. Despite these disadvantages, the algorithm successfully separates the text blocks, provided that a manual threshold is supplied.

RLSA

The Run-Length Smoothing Algorithm (RLSA) works on black and white scanned images of documents. It finds runs of white pixels and converts them into black pixels whenever they are shorter than a given threshold. The RLSA works in four steps:

1. In the first step, we perform horizontal smoothing. For this, we scan the image row-wise and replace runs of white pixels by black pixels if they are shorter than a threshold th.
2. In the second step, we perform vertical smoothing. For this, we scan the image column-wise and replace runs of white pixels by black pixels if they are shorter than a threshold tv.
3. Next, we perform a logical AND of the images obtained from the first and second steps.
4. Then we perform horizontal smoothing on the image obtained from step 3 with a threshold ta.

RLSO

A simplified version of the RLSA, RLSO (Run-Length Smoothing with OR) works as follows:

1. In the first step, we perform horizontal smoothing. For this, we scan the image row-wise and replace runs of white pixels by black pixels if they are shorter than a threshold th.
2. In the second step, we perform vertical smoothing. For this, we scan the image column-wise and replace runs of white pixels by black pixels if they are shorter than a threshold tv.
3. Next, we perform a logical OR of the images obtained from the first and second steps.

The RLSA algorithm returns rectangular frames for documents with Manhattan layouts. RLSO, on the other hand, also works well with non-Manhattan layouts. The problem with both RLSA and RLSO is that the threshold for smoothing needs to be determined manually. The threshold required is also different for each document image, so determining it manually for every document is almost impossible.

Conclusion

We have compared the algorithms described above for document layout analysis. During our research we found that, while Bloomberg's algorithm faces problems with images that contain sketches, CRLA faces problems with images that contain extremely small non-textual elements.

We also observed that the recursive XY Cut algorithm and RLSA both fail on printed documents with non-Manhattan layouts. The RLSO algorithm, on the other hand, gives comparatively better results for Manhattan as well as non-Manhattan layouts. However, all three algorithms face the common problem of manual threshold determination, which is document specific.

References

Syed Saqib Bukhari, Faisal Shafait and Thomas M. Breuel, Improved Document Image Segmentation Algorithm using Multiresolution Morphology.

Jaekyu Ha, Robert M. Haralick and Ihsin T. Phillips, Recursive XY Cut using Bounding Boxes of Connected Components, Third International Conference on Document Analysis and Recognition, ICDAR, 1995.

Stefano Ferilli, Teresa M.A. Basile and Floriana Esposito, A Histogram-based Technique for Automatic Threshold Assessment in a Run Length Smoothing-based Algorithm, ACM, 2010.

Hung-Ming Sun, Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing, International Journal of Applied Science and Engineering, 2006.
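
As an illustration of the RLSA and RLSO steps described in the text segmentation section, here is a minimal Python sketch. It is not taken from any of the referenced papers. It assumes the image is a list of rows with 1 for black and 0 for white, and that only white runs enclosed by black pixels on both sides are filled (runs touching the image border are left alone). The function names are hypothetical; th, tv and ta are the thresholds named in the text.

```python
from typing import Callable, List

Image = List[List[int]]  # assumption: 1 = black (ink), 0 = white (background)


def smooth_row(row: List[int], threshold: int) -> List[int]:
    """Turn white runs shorter than `threshold` black when they sit between two black pixels."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1  # length of the white run just closed
                if 0 < gap < threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smooth_horizontal(img: Image, threshold: int) -> Image:
    """Row-wise smoothing."""
    return [smooth_row(row, threshold) for row in img]


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def smooth_vertical(img: Image, threshold: int) -> Image:
    """Column-wise smoothing, done as row-wise smoothing of the transposed image."""
    return transpose(smooth_horizontal(transpose(img), threshold))


def combine(a: Image, b: Image, op: Callable[[int, int], int]) -> Image:
    """Apply a pixel-wise binary operation to two images of equal size."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rlsa(img: Image, th: int, tv: int, ta: int) -> Image:
    """Horizontal and vertical smoothing, logical AND, then a final horizontal pass."""
    horizontal = smooth_horizontal(img, th)
    vertical = smooth_vertical(img, tv)
    both = combine(horizontal, vertical, lambda x, y: x & y)
    return smooth_horizontal(both, ta)


def rlso(img: Image, th: int, tv: int) -> Image:
    """Horizontal and vertical smoothing combined with a logical OR."""
    horizontal = smooth_horizontal(img, th)
    vertical = smooth_vertical(img, tv)
    return combine(horizontal, vertical, lambda x, y: x | y)
```

Calling rlsa(img, th, tv, ta) or rlso(img, th, tv) on a binarized page returns a smoothed image of the same size. The connected black regions of that result are the candidate text blocks, and the thresholds still have to be chosen per document, as noted in the text.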
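
The recursive XY cut described in the text segmentation section can be sketched in a similar way. This is a simplified illustration, not the method of the referenced paper: it works directly on pixel projection profiles instead of connected-component bounding boxes, and it assumes that th applies to empty valleys in the per-row profile and tv to empty valleys in the per-column profile. The function names and the (top, bottom, left, right) box convention are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # assumption: 1 = black (ink), 0 = white
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive


def profile(img: Image, box: Box, axis: str) -> List[int]:
    """Ink count per row (axis='row') or per column (axis='col') inside `box`."""
    top, bottom, left, right = box
    if axis == "row":
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def segments(prof: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Split a profile into inked spans separated by empty valleys wider than `min_gap`."""
    spans = []
    start = None  # start index of the span currently being built
    gap = 0       # length of the current run of empty entries
    for i, value in enumerate(prof):
        if value > 0:
            if start is None:
                start = i
            elif gap > min_gap:
                spans.append((start, i - gap))  # close the span before the valley
                start = i
            gap = 0
        elif start is not None:
            gap += 1
    if start is not None:
        spans.append((start, len(prof) - gap))  # drop trailing empty entries
    return spans


def xy_cut(img: Image, box: Box, th: int, tv: int) -> List[Box]:
    """Recursively cut `box` at wide valleys; return boxes that cannot be cut further."""
    top, _, left, right = box
    blocks: List[Box] = []
    for r0, r1 in segments(profile(img, box, "row"), th):
        strip = (top + r0, top + r1, left, right)
        for c0, c1 in segments(profile(img, strip, "col"), tv):
            child = (top + r0, top + r1, left + c0, left + c1)
            if child == box:
                blocks.append(child)  # no cut and no margin left to trim
            else:
                blocks.extend(xy_cut(img, child, th, tv))
    return blocks
```

Each recursive call either returns the box unchanged or continues on strictly smaller boxes, so the recursion terminates. As with the smoothing algorithms, the result depends entirely on th and tv, which is the manual-threshold problem discussed in the text.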
