
A Model Guided Document Image Analysis Scheme

Gaurav Harit Santanu Chaudhury P. Gupta N. Vohra S. D. Joshi

g-harit@hotmail.com santanuc@ee.iitd.ac.in gupta@cfar.umd.edu nvohra@hotmail.com sdjoshi@ee.iitd.ac.in

Department of Electrical Engineering Indian Institute of Technology, Delhi

New Delhi. India

Abstract

This paper presents a new model based document image segmentation scheme that uses XML-DTDs (eXtensible Markup Language - Document Type Definitions). Given a document image, the algorithm has the ability to select the appropriate model. A new wavelet based tool has been designed for distinguishing text from non-text regions and for characterization of font sizes. Our model based analysis scheme makes use of this tool for identifying the logical components of a document image.

keywords: segmentation, layout analysis, XML-DTD

1 Introduction

Analysis and interpretation of document images is important for providing electronic access to legacy documents. In this paper, a new model based document image segmentation scheme is presented.

An image of a document page needs to be segmented into logical components for subsequent machine processing. A well known approach for page segmentation is connected component aggregation [2]. Tan et al. [5] have used a pyramid structure for a multi-resolution connected component analysis of character strings to separate text from graphics in a document page. But the primary limitation of the existing techniques [1], [2], [3], [4], [5], [6] is that they generally assume a predefined layout of the document page, or make assumptions about the shape and geometry of the text and image regions. Use of standard XML-DTDs to model the structure of electronic text documents has been reported in [11]. However, no published work could be cited which uses DTDs to model the logical structure and layout of document images.

In this paper, we have proposed a model based document image segmentation scheme that uses DTDs for modeling pages. Given an image, the system has the ability to select the appropriate model on the fly. So, the present approach is not constrained by an a priori assumption about the model.

This model is used for refining the result of a model free bottom-up segmentation scheme. A new wavelet based tool has been designed for distinguishing between text and non-text regions and for characterization of font sizes. Our model based analysis scheme makes use of this tool for identifying logical components of a document page. Experimental results have established the effectiveness of our approach.

2 Document Model for Layout Representation

The image segmentation scheme proposed in this paper is guided by an XML-DTD model which depicts the layout structure of the document. For example, the components of a document page can be specified in the element type declaration. Additional information, like the alignment (horizontal, vertical or random) of child elements within the parent element, the content of an element (image or text), the font size of text, the presence of embedded images in text regions, the language of the document page, etc., can be specified in the attribute list declarations. As opposed to examples of document classes where the page has a fixed layout (i.e. the relative position of every element is fixed), there are document categories where the layout of the page cannot be ascertained just by reading the DTD. In such cases the DTD can define rules for grouping the child elements to form a parent element. For example, a newspaper page has one or more Article elements placed in a random way. The rule for identifying an Article element is that there should be a Heading element which has text of larger font size followed by a Body element which has text of smaller font size. Obeying this rule, the segmentation module can try to group together text blocks of large font size followed by text blocks of small font size (aligned horizontally) to form an Article element. Several Article elements thus found will be grouped together to form the root element, i.e. the entire newspaper document page.
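The paper does not reproduce one of its DTD models, but a newspaper page along the lines described above might be modeled by a fragment like the following; the element and attribute names here are illustrative, not taken from the paper:

```xml
<!ELEMENT Newspaper (Article+)>
<!ATTLIST Newspaper alignment (horizontal | vertical | random) "random">
<!ELEMENT Article (Heading, Body)>
<!ELEMENT Heading (#PCDATA)>
<!ATTLIST Heading content (text | image) "text"
                  fontsize CDATA "large">
<!ELEMENT Body (#PCDATA)>
<!ATTLIST Body content (text | image) "text"
               fontsize CDATA "small">
```

The grouping rule of the example is encoded here by the ordered content model `(Heading, Body)` together with the fontsize attributes.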


When a new image is to be segmented, we need to select the best matching DTD model to guide the top-down segmentation process. This model is selected using layout features.

3 Selection of DTD model for a Document Image

Wavelet based features have been used in this scheme for layout based similarity determination among document pages. The present approach has been motivated by [10]. Since the Daubechies-8 wavelet has been found to be useful for establishing similarity between images in [10], we have adopted a similar approach.

A document image is rescaled to a size of 512 x 512 for the purpose of layout feature extraction. The steps involved in feature extraction are the following:

• Compute a 5-level DWT of the document image. Store the lowest 32 x 32 sub-image M1 as the first component of the feature vector.

• After 5 levels of multi-resolution analysis, the size of each of the sub-images is 16 x 16. Compute the standard deviation (σ) of the lowest 16 x 16 DWT coefficients. This is the second component of the feature vector.

• Compute the 6th-level DWT as well and then store the lowest 16 x 16 DWT matrix (M2) as the third component of the feature vector.

In this way, we have extracted the layout characteristics of the page in the feature vector F given as F = [M1, σ, M2].
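The feature extraction above can be sketched as follows. This is a minimal illustration using a plain Haar decomposition in place of the Daubechies-8 wavelet the paper uses (so the sub-image sizes, not the filter, match the description); all function names are ours:

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar transform; the approximation lands in the
    top-left quadrant of the returned array."""
    # Transform the rows: averages on the left, differences on the right.
    a = (img[:, 0::2] + img[:, 1::2]) / 2.0
    d = (img[:, 0::2] - img[:, 1::2]) / 2.0
    rows = np.hstack([a, d])
    # Transform the columns the same way.
    a = (rows[0::2, :] + rows[1::2, :]) / 2.0
    d = (rows[0::2, :] - rows[1::2, :]) / 2.0
    return np.vstack([a, d])

def multilevel_dwt2(img, levels):
    """Repeatedly transform the shrinking top-left approximation block."""
    out = img.astype(float)
    n = img.shape[0]
    for _ in range(levels):
        out[:n, :n] = haar_dwt2(out[:n, :n])
        n //= 2
    return out

def layout_features(img512):
    """Feature vector F = [M1, sigma, M2] for a 512 x 512 document image."""
    t5 = multilevel_dwt2(img512, 5)
    M1 = t5[:32, :32]            # lowest 32 x 32 of the 5-level transform
    sigma = t5[:16, :16].std()   # std of the lowest 16 x 16 coefficients
    t6 = multilevel_dwt2(img512, 6)
    M2 = t6[:16, :16]            # lowest 16 x 16 of the 6-level transform
    return M1, sigma, M2
```

For a blank page the detail coefficients vanish, so σ is zero and M1 is flat, which is why the σ threshold in the retrieval step discriminates layouts rather than absolute intensities.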

The algorithm for layout based model retrieval selects images from the database similar to the query image using experimentally determined thresholds for the proximity of the corresponding σ values and also the corresponding matrices M1 and M2.

The DTD model of the majority of the database samples which pass through the 3-stage comparison process is chosen as the appropriate model for the given document.

For successful operation of this scheme, we need to have a large database with appropriate samples and the DTD models of those samples.

4 The Segmentation Algorithm

The segmentation scheme proposed in this paper involves a combination of bottom-up processing with model based verification and refinement. The overall scheme involves the following steps:

1. Bottom-up model free segmentation using white space information.

2. Analysis of the segments for identification of textual and image blocks.

3. Model guided identification and separation of logical components.

4.1 Bottom-Up Segmentation

The bottom-up segmentation scheme aims at finding clusters of smaller entities using the white spaces surrounding their constellation, with appropriate heuristics. Runs of white spaces greater than a threshold (100 pixels) are identified. These runs of white spaces are then thinned. The thinned lines form the separators. Separators having overlapping extent are merged. The same steps are followed to detect vertical separators. These separators form a mesh.

The next step is to identify the blocks out of this mesh.

The algorithm [9] developed for identifying rectangular blocks can be briefly outlined in the following steps:

1. Mark all the crossings of horizontal and vertical lines.

2. Pick up the closest possible pair of vertical crossings on a given horizontal line satisfying the eligibility criterion that at least one member of the pair has not been used to form any other block. If none of the members is found to be eligible, pick up the next horizontal line and test for the criterion.

3. Try to traverse the smallest possible rectangle in the anti-clockwise direction with the two vertical crossings as edges. If all possibilities of traversing a rectangle fail, go to step 2. If a rectangle is formed, note the block dimensions, mark those vertical crossings as used and go to step 1.

4. If there is any left-over region which has not been covered by any block, form new rectangular blocks to cover it.

The bottom-up segmentation algorithm has been found to work quite well with document images of high textual densities. For others, we may get extra separators. A content based block merging scheme takes care of this error in the top-down analysis process.
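The white-space run detection in the first step can be sketched as below. This is a minimal horizontal-only version; thinning, merging of overlapping separators, and the vertical pass proceed analogously and are omitted here:

```python
import numpy as np

def horizontal_separators(binary, min_run=100):
    """Mark horizontal white-space runs longer than min_run pixels.

    binary: 2-D boolean array, True where a pixel is white (background).
    Returns a boolean mask of the same shape marking separator pixels.
    """
    h, w = binary.shape
    mask = np.zeros_like(binary, dtype=bool)
    for y in range(h):
        row = binary[y]
        x = 0
        while x < w:
            if row[x]:
                start = x
                while x < w and row[x]:   # extend the current white run
                    x += 1
                if x - start >= min_run:  # keep only long runs (threshold
                    mask[y, start:x] = True  # from the paper: 100 pixels)
            else:
                x += 1
    return mask
```

Rows of the resulting mask that stack into thick clusters correspond to the separators that are later thinned and merged into the mesh.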

4.2 Content characterization of Document Image regions

The ability of wavelets to encode the energy distribution over different spatial scales has been used for content characterization. A scalogram represents the variation of the energy of the transformed image with scale (resolution). The energy associated with each scale can be estimated by the expression e_j = sum_k (d_{j,k})^2, where e_j is the energy at the j-th resolution and the d_{j,k} are the DWT coefficients at resolution (scale) j.
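A scalogram along these lines can be sketched for a 1-D signal as follows, again with a Haar filter standing in for the paper's wavelet; the 2-D case sums the detail sub-bands at each level in the same way:

```python
import numpy as np

def scalogram(signal, levels):
    """Energy of the detail coefficients at each scale:
    e_j = sum_k d_{j,k}^2, computed with a 1-D Haar DWT."""
    approx = np.asarray(signal, dtype=float)
    energies = []
    for _ in range(levels):
        if len(approx) % 2:          # drop a trailing sample so pairs align
            approx = approx[:-1]
        a = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)
        d = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)
        energies.append(float(np.sum(d ** 2)))
        approx = a                   # recurse on the approximation
    return energies
```

A rapidly alternating signal concentrates all of its energy at the finest scale, which is the 1-D analogue of the sharp single-scale peak that text regions exhibit.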


Figure 1. Scalograms for Text and Non-Text Regions (three panels: one text scalogram, two non-text scalograms)

4.2.1 Identification of text and non-text regions

In the case of text regions, the characteristics of the structure of text alphabets become most apparent at a certain scale. Therefore, the scalogram shows a distinct peak at a certain scale. However, for images or pictures (non-text), the scalogram may or may not peak specifically at only one scale. This is so because, depending upon the distribution of gray scales over the image, the details contained in the image (information) may become apparent at any arbitrary scale or at more than one scale. Figure 1 compares the shapes of the scalograms for text and non-text document pages. In general, the scalogram of an image is much more flat as compared to that for text. Moreover, the scalogram of text is quite skewed towards its predominant scale, unlike the scalogram of an image. It has been experimentally found that for text regions the third order central moment of the scalogram is negative, but for images it is random. The value of the fourth order central moment has been found to be less than a threshold value of 30 for text and greater than 30 for non-text. These scalogram based criteria form the basis of our approach for classifying a document image block as text or non-text.
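The moment-based test can be sketched as below. The paper does not state how the scalogram is normalized before the moments are taken, so the plain central moments used here are an assumption; the threshold of 30 is the paper's:

```python
import numpy as np

# Threshold on the fourth central moment, from Section 4.2.1.
FOURTH_MOMENT_THRESHOLD = 30.0

def central_moment(values, order):
    """Plain central moment of a sequence of scalogram energies."""
    v = np.asarray(values, dtype=float)
    return float(np.mean((v - v.mean()) ** order))

def is_text_region(scalogram_energies):
    """Classify a block as text when the scalogram is negatively skewed
    (third central moment < 0) and the fourth central moment falls
    below the threshold, following the criteria above."""
    m3 = central_moment(scalogram_energies, 3)
    m4 = central_moment(scalogram_energies, 4)
    return m3 < 0 and m4 < FOURTH_MOMENT_THRESHOLD
```

A scalogram skewed toward one dominant scale (e.g. several high energies and one low tail value) passes the skewness test, while the mirror-image shape does not.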

4.2.2 Characterization of text regions

A text region generally has white spaces surrounding the text blocks. After the surrounding white spaces have been removed from the text regions, their font-size based segmentation is done using a scalogram based technique. It has been shown in [12] that the energy of the scaled image at the i-th scale is related to the original energy value of the signal by a magnitude scaling and a peak translation. It has also been found experimentally that the scalogram of text of smaller font size peaks at a higher scale than that of text of larger font size. Such a difference in the location of the peak of the scalogram has been used for the characterization of the font sizes of text regions.

The algorithm used involves a region growing approach. Starting with a small window size, the DWT and hence the scalogram peak is computed for the portion of the image spanned by this window. The window is grown at each step by a fixed fraction. This process is continued as long as the value of the peak scale remains stable. This is an indication that the window contains text belonging to the same font size. On continuing this process further, the peak scale value may get destabilized. At this moment, we can infer that the window now includes some text that does not belong to the previous font size. So, a region boundary can be identified. Now, the peak is found for the base window starting from the place where the last boundary was detected.

This algorithm can segment a textual region into logical components of varying font sizes without using arbitrary thresholds on the dimensions of connected components or the height of the lines [1].
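A coarse sketch of the region-growing procedure on a 1-D intensity profile follows. The base window size, growth fraction and the exact placement of the boundary are illustrative assumptions; in particular, this sketch places the boundary at the destabilized window edge, whereas the paper refines the boundary location:

```python
import numpy as np

def _scalogram(signal, levels):
    """Haar-based scalogram (energy per scale), as in Section 4.2."""
    approx = np.asarray(signal, dtype=float)
    energies = []
    for _ in range(levels):
        if len(approx) % 2:
            approx = approx[:-1]
        a = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)
        d = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)
        energies.append(float(np.sum(d ** 2)))
        approx = a
    return energies

def _peak_scale(energies):
    """Index of the scale carrying the most energy."""
    return max(range(len(energies)), key=lambda j: energies[j])

def font_boundaries(profile, base=64, grow=1.25, levels=4):
    """Grow a window over the profile; record a boundary whenever the
    scalogram's peak scale destabilises, then restart from there."""
    boundaries, start, n = [], 0, len(profile)
    while start + base < n:
        size = base
        ref = _peak_scale(_scalogram(profile[start:start + size], levels))
        while start + int(size * grow) <= n:
            size = int(size * grow)
            if _peak_scale(_scalogram(profile[start:start + size], levels)) != ref:
                break                      # peak scale changed: boundary
        boundaries.append(start + size)
        start += size                      # restart past the boundary
    return boundaries
```

On a profile whose oscillation period doubles partway through (a stand-in for a font-size change), the detected boundary falls after the true transition, reflecting the overshoot this sketch accepts.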

4.3 Top-Down Segmentation: Identification of the elements defined by the DTD

Given the content model of the parent element and the parent block dimensions, the following strategy is used to identify the child element blocks within it using the information provided in the DTD.

In the case of horizontal alignment of child elements in the parent element, the child elements are identified in order from top to bottom. The block which has to be identified will have only one unknown edge, the bottom-most, since the leftmost edge and the rightmost edge follow those of the parent block and the topmost edge follows either the parent block's topmost edge or the previously identified child block's bottom-most edge. The goal is to locate the unknown edge, which is the dividing line between two types of child elements whose content information is given in the DTD.

Depending on the content information of the elements on either side of the unknown edge (line), it is possible to predict the expected region where the horizontal dividing line is to be searched. For example, if the first region has text-only content and the other region has images with text aligned on the right side (this case arises when the dividing line between the ContentLabel element and the Contents element is to be detected in the Spectrum magazine contents page), a search is made for all the images, with text on the right side, in the unlabelled region of the parent block, and the top edge of the topmost image is considered as the lower bound of the expected region. Since the first region has text-only content, all the blocks within the parent block whose top edge touches the top edge of the first element are identified, and the lower-most edge of all these blocks is considered as the top bound of the expected region. Similar heuristics have been designed to locate the expected region in other cases by exploiting distinguishing features of the contents, e.g. the presence of image(s), image alignment, font sizes of texts, relative placement of text blocks of different font sizes, etc.

Once the expected region has been identified, all the horizontal lines (possible dividing lines) which run across the entire parent block are searched. Each such line is then subjected to a confirmation test, in which the blocks lying on either side of the candidate line are checked for their contents and relative positions to confirm that they comply with the DTD specifications. If a candidate line fails this test, the next line is considered. If all the lines fail the test, the line which complies best with the DTD specifications is selected.
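The confirmation test with its fallback can be sketched as a small selection loop; `score_fn` is a hypothetical stand-in for the DTD compliance checks on the blocks around a candidate line:

```python
def select_dividing_line(candidates, score_fn):
    """Accept the first candidate line that fully complies with the DTD;
    otherwise fall back to the best-scoring line, as in Section 4.3.

    score_fn(line) returns a compliance score in [0, 1], where 1.0
    means the confirmation test passes outright.
    """
    best, best_score = None, -1.0
    for line in candidates:
        s = score_fn(line)
        if s >= 1.0:                 # passes the confirmation test
            return line
        if s > best_score:           # remember the best-complying line
            best, best_score = line, s
    return best                      # fallback when every line fails
```

The fallback mirrors the paper's rule that, when every candidate fails, the line complying best with the DTD is still selected so that segmentation can proceed.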

The selected line's position is taken as the lower edge of the child element. A similar procedure is followed to identify the regions of the other child elements. The last element in order gets whatever place is left for it. If dividing lines are not obtained in the expected region, the longest line can be considered as a candidate dividing line. An analogous procedure is used to detect the child elements if the alignment is vertical.

In the case of random alignment of the child elements within the parent element, the layout of the child elements is not known, but the DTD specifies the contents of the child elements. Based on this content information, detection and grouping of blocks which comply with the DTD information is carried out.

5 Results and Discussions

Before obtaining the layout features, the document image is binarized so that the wavelet analysis is not affected by the background texture, which depends on the quality of the document paper. Figures 2 through 4 illustrate the segmentation details and the element blocks identified. The original document image is shown in figure 2(a). This grey-scale image is binarized with a threshold value of 80 (on a 0 to 255 scale). On the binary image, white space runs of length greater than 100 pixels are detected in the horizontal and vertical directions. These white space runs are marked as black lines on a white background. Parts (b) and (c) show these black lines (on a white background) in the vertical and horizontal directions respectively. These black lines form clusters which appear to be thick or thin depending on the number of neighboring black lines in them. Thin clusters (<= 3 pixels) generally correspond to white spaces in between text lines and are ignored. Thick clusters more closely approximate the white separating boundaries between semantically different blocks. These clusters are now thinned to obtain the boundary lines of the blocks. The thinned horizontal lines and vertical lines are shown in (d) and (e). The mesh of blocks formed by these lines is shown in (f). Using the block seeking algorithm, the blocks are identified and then subjected to the image/text identification and font size characterization modules, which finally yield the results shown in (g). The image regions are shown in black. The text regions are shown in various gray shades, with the lighter shades depicting larger fonts and the darker shades depicting smaller fonts. Several font sizes can be identified in a given block. The font sizes within a block are averaged and each block is assigned only one font size, as shown in (h). Now the DTD comes into play at the final step of identification and labelling of the blocks. The strategy and heuristics used have been described in the previous section. The blocks identified and labelled are shown in figure 4. Overall, we have got about 90% correct segmentation and identification results for documents of different languages.

6 Conclusions

In this paper we have proposed an XML-DTD guided analysis scheme for document images which is novel in (a) its ability to select the appropriate model, (b) the use of a new wavelet based tool for distinguishing text from non-text regions and characterization of font sizes, and (c) the use of a combination of bottom-up and top-down segmentation algorithms. The scheme has produced good results. It can be used for automatic conversion of document images into XML documents to facilitate web access and query processing.

References

[1] Anil K. Jain and Yu Zhong, 'Page Segmentation Using Texture Analysis', Pattern Recognition, Vol. 29, No. 5, pp. 743-770, 1996.

[2] Antonacopoulos A., 'Page Segmentation Using the Description of the Background', Computer Vision and Image Understanding, Vol. 70, No. 3, June 1998, pp. 350-369.

[3] Jain A.K., Yu Bin, 'Document Representation and its Application to Page Decomposition', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 3, March 1998.

[4] Feng L., Tang Y.Y., Yang Y.L., 'A Wavelet Based Approach for Extracting Contours of Document Images', International Conference on Document Analysis and Recognition 1999, Bangalore, pp. 71-74.

[5] Tan Chew Lim, Yuan Bo, Huang W., Zhang Zheng, 'Text/Graphics Separation using Pyramid Operations', International Conference on Document Analysis and Recognition 1999, Bangalore, pp. 169-172.


Figure 2. The various stages in the segmentation algorithm for a Spectrum magazine page: (a) the original document image; (b) vertical runs of white spaces; (c) horizontal runs of white spaces; (d) thinned horizontal lines.

Figure 3. (continued from figure 2) The various stages in the segmentation algorithm for a Spectrum magazine page: (e) thinned vertical lines; (f) final blocks after merging lines; (g) content identification of the blocks; (h) images (black), large fonts (lighter shades).

Figure 4. Results: element blocks identified in the Spectrum magazine contents page (labels: ContentLabel, Contents, Advertisement, Images, Topics, AnImage 1, AnImage 2, AnImage 3).

[6] Jain A.K., Yu Bin, 'Page Segmentation Using Document Model', International Conference on Document Analysis and Recognition 1997, Munich, Vol. 1, pp. 173-176.

[7] Neeti Vohra and Puneet Gupta, 'Wavelet Based Characterisation of Image Classes', B.Tech. Project, IIT Delhi, 2000.

[8] Doermann David, 'The Indexing and Retrieval of Document Images: A Survey', Computer Vision and Image Understanding, Vol. 70, No. 3, June 1998, pp. 287-298. Article no. IV 980692.

[9] Gaurav Harit, 'Management of Multi-Lingual Document Images', M.Tech. thesis, IIT Delhi, New Delhi, Dec 2000.

[10] Wang J.Z., Wiederhold G., Firschein O., Wei Xin S., 'Content Based Image Indexing and Searching using Daubechies Wavelets', Digital Library (1997), pp. 311-328.

[11] 'DocBase - A Database Environment for Structured Documents', Ph.D. thesis, Indiana University, Bloomington, Dec 1997.

[12] Puneet Gupta, Neeti Vohra, Santanu Chaudhury, S.D. Joshi, 'Wavelet Based Page Segmentation', Proceedings of ICVGIP 2000, Bangalore, India.
