Towards e-Library of Telugu literature

Academic year: 2023

(1)

Rajesh Babu Arja (Roll no: 09305914)

Guide: Prof. J. Saketha Nath

Co-Guide: Prof. Parag Chaudhuri

Department of Computer Science and Engineering

Indian Institute of Technology Bombay

(2)

1 Introduction

2 Work done in stage 1

3 Binarization and Skew detection

4 Segmentation and Noise removal

5 Line Segmentation

6 Classification

7 Post Processing

8 Conclusion and Future Work

(3)

What do we want to do?

Digitize Telugu literature

Why do we want to do it?

For easy cataloging of documents

To enable search over a large body of literature

To preserve Telugu literature

To encourage transliteration of Telugu documents to other languages

(4)

How can we do it?

By implementing OCR for Telugu

(5)

What is OCR?

(6)

Applications

Library cataloging

Base for many NLP applications.

Applications in banks and post offices.

(7)

Our Scope

Scanned Telugu text documents

Documents with printed text

Documents without any images

Challenges involved

Resolution of the document is not in our control

Noise can be present in the scanned document

(8)

Why is Telugu OCR difficult?

Telugu vs English

| | English | Telugu |
|---|---|---|
| No. of classes | 52 | more than 400 |
| No. of connected components per character | 1 | 1-5 |
| Dependency among connected components | No | Yes |
| Confusion characters | No | Yes |

Table: Comparison of Telugu and English languages

(9)

Confusion characters

Figure: Confusion Characters example

(10)

Why is Telugu OCR difficult?

Telugu vs Other Indian Languages

| | Devanagari | Telugu |
|---|---|---|
| Word segmentation | easy | difficult |
| Confusion characters | fewer | more |

Table: Comparison of Telugu and Devanagari languages

| | Tamil | Telugu |
|---|---|---|
| Vowel modifiers | connected | disconnected |

Table: Comparison of Telugu and Tamil languages

(11)

Comparison of Telugu OCR

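| S.No | Features | Classification | Font independent | Noise free | Open source |
|---|---|---|---|---|---|
| [1] | Primitive | Decision trees | No | No | n/a |
| [2] | Circular | Template matching | Yes | No | n/a |
| [3] | Wavelet | Neural networks | No | No | n/a |
| [4] | Fringe map | Template matching | No | No | Yes |
| [5] | Gradient direction | Nearest neighbor | No | No | n/a |
| 6 | ? | ? | Yes | Yes | Yes |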

Table: Comparison of Telugu OCR literature

(12)

Comparison with state-of-the-art Telugu OCR

| S.No | Input | Output | Font independent | Noise free | Open source |
|---|---|---|---|---|---|
| 1 | PGM | ACI | No | No | Yes |
| 2 | any image format | txt/pdf | Yes | Yes | Yes |

(13)

Our Goal

Our goal is to implement an OCR system with the following features:

Works as a font-independent system

Works on noisy data

Capable of digitizing documents in the Digital Library of India (DLI)

Open source project

Web interface for OCR

(14)

Work done in stage 1

Implemented an end-to-end system

Explored the various stages of OCR

Implemented each stage with basic methodologies

Observed the problems involved in each stage

Identified potential improvements for each stage

(15)

Stages of OCR

Binarization

Skew detection and correction

Noise removal

Segmentation of lines, words and characters

Feature extraction and classification

Rendering data

(16)

What is Binarization?

Figure: Binarization example

(17)

Binarization

Our Methodology:

Using global thresholding

Using Java Advanced Imaging API [1]

Results:

Obtained more than 90% accuracy.
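As an illustration of global thresholding, here is a minimal Python sketch (not the JAI-based implementation used in this work). The slides do not say how the threshold was chosen; picking it with Otsu's method is an assumption made here, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # pixels with value <= t
        sum_dark += t * hist[t]
        w_light = total - w_dark   # pixels with value > t
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image, threshold):
    """Map a greyscale image (list of rows) to 1 for ink, 0 for background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny example: three dark (ink) pixels and three light (paper) pixels.
grey = [[10, 12, 200],
        [11, 210, 205]]
t = otsu_threshold([p for row in grey for p in row])
binary = binarize(grey, t)
```

One threshold is applied to the whole page, which is what makes the method global. Pages with uneven illumination may need a locally adaptive threshold instead.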

(18)

What is Skew Detection?

Figure: Skew Correction example

(19)

Skew Detection and Correction

Our Methodology:

Using the Hough transform

Using the Java Deskew implementation [2]

Results:

Corrects skew angles of up to ±20 degrees.

Limitations

Does not work for multi-skewed documents
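The idea behind Hough-based skew detection can be sketched as follows. This is a simplified illustration, not the Java Deskew code: it searches whole-degree angles within ±20, lets every ink pixel vote for a line (rho, theta), and returns the angle whose accumulator has the strongest peak. The function name, the angle step and the sign convention (a line y = x·tan(angle) has skew `angle`) are assumptions.

```python
import math


def estimate_skew(points, max_angle=20, step=1):
    """Estimate the dominant line angle in degrees from ink pixel coordinates.

    points: list of (x, y) positions of ink pixels.
    """
    if not points:
        return 0
    best_angle, best_votes = 0, -1
    for angle in range(-max_angle, max_angle + 1, step):
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        votes = {}
        for x, y in points:
            # Signed distance of the pixel from a line through the origin
            # with slope tan(theta); collinear pixels share the same bin.
            rho = round(y * cos_t - x * sin_t)
            votes[rho] = votes.get(rho, 0) + 1
        peak = max(votes.values())
        if peak > best_votes:
            best_angle, best_votes = angle, peak
    return best_angle


# A synthetic baseline tilted by 5 degrees, and a perfectly horizontal one.
tilted = [(x, round(x * math.tan(math.radians(5)))) for x in range(200)]
flat = [(x, 3) for x in range(50)]
skew_tilted = estimate_skew(tilted)
skew_flat = estimate_skew(flat)
```

Correction then amounts to rotating the page by the negative of the estimated angle. Because a single angle is returned for the whole page, a sketch like this shares the multi-skew limitation noted above.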

(20)

What is Segmentation?

Figure: Line segmentation example

Figure: Connected component segmentation example

(21)

Why Segmentation?

Advantages of Segmentation

To make processing easier in the next stages

To minimize the number of classes

Different types of segmentation

Line segmentation

Word segmentation

Connected component segmentation

(22)

Connected component segmentation

Our Methodology

Found 8-way connected components

Implemented an algorithm similar to flood fill [3]

Figure: Connected components example
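A flood-fill style labeling of 8-way connected components might look like the sketch below. It assumes the binarized page is a list of rows with 1 for ink and 0 for background; the function name is hypothetical and the queue-based traversal is one of several equivalent ways to write flood fill.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) ink pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or labels[r][c]:
                continue
            # Start a new component and flood outwards from this seed pixel.
            label = len(components) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            pixels = []
            while queue:
                cr, cc = queue.popleft()
                pixels.append((cr, cc))
                # Visit all 8 neighbours (horizontal, vertical, diagonal).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and image[nr][nc] == 1
                                and not labels[nr][nc]):
                            labels[nr][nc] = label
                            queue.append((nr, nc))
            components.append(pixels)
    return components


page = [[1, 0, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
comps = connected_components(page)
```

In this example the two pixels touching only diagonally form one component, which is the difference between 8-way and 4-way connectivity.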

(23)

Why Line Segmentation?

Characters are not labeled sequentially

To bundle the consonant modifiers with corresponding characters

To process document in order

(24)

Our Methodology

Found that each line consists of four states

Computed a histogram of the density of black pixels in each line

Used a Hidden Markov Model (HMM) to segment lines

Found line boundaries based on state changes
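The sketch below illustrates the general approach of decoding line boundaries from the pixel-density histogram with an HMM. It is a deliberate simplification: the four states used in this work are not listed on the slide, so the sketch uses only two hypothetical states ("gap" and "text"), and all probabilities and the density cut-off are hand-picked assumptions rather than trained values.

```python
import math

STATES = ("gap", "text")
LOG = math.log
# Hypothetical, hand-set model parameters (log probabilities).
START = {"gap": LOG(0.5), "text": LOG(0.5)}
TRANS = {"gap": {"gap": LOG(0.8), "text": LOG(0.2)},
         "text": {"gap": LOG(0.2), "text": LOG(0.8)}}
EMIT = {"gap": {"low": LOG(0.9), "high": LOG(0.1)},
        "text": {"low": LOG(0.2), "high": LOG(0.8)}}


def row_densities(image):
    """Fraction of ink pixels in each pixel row of a binary image."""
    return [sum(row) / len(row) for row in image]


def viterbi_rows(densities, low=0.05):
    """Most likely gap/text state for every pixel row (Viterbi decoding)."""
    obs = ["low" if d < low else "high" for d in densities]
    if not obs:
        return []
    score = {s: START[s] + EMIT[s][obs[0]] for s in STATES}
    back = []
    for o in obs[1:]:
        prev, new_score = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + TRANS[p][s])
            prev[s] = best
            new_score[s] = score[best] + TRANS[best][s] + EMIT[s][o]
        back.append(prev)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


def text_lines(path):
    """Turn state changes into (first_row, last_row) line boundaries."""
    lines, start = [], None
    for i, s in enumerate(path):
        if s == "text" and start is None:
            start = i
        elif s != "text" and start is not None:
            lines.append((start, i - 1))
            start = None
    if start is not None:
        lines.append((start, len(path) - 1))
    return lines


densities = [0.0, 0.0, 0.3, 0.4, 0.01, 0.35, 0.0, 0.0, 0.0,
             0.3, 0.3, 0.4, 0.0, 0.0]
lines = text_lines(viterbi_rows(densities))
```

With these parameters the single nearly empty row inside the first line (row 4) is kept as text, while the three-row gap separates the two lines. That smoothing is the main advantage over thresholding the histogram row by row.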

(25)

Example

(26)

Example

Figure: Before line segmentation [5]

Figure: After line segmentation

(27)

Observations and Future Work

Observations

Works well on documents with uniform line spacing

Documents with non-uniform spacing are not segmented properly

Planned Improvements:

Planning to apply Conditional Random Fields (CRF) for line segmentation.

(28)

Feature Vector generation and classification

Our Methodology

Using 0/1 feature vector

Generated synthetic data for three Telugu fonts

Using Euclidean distance for classification
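A nearest-template classifier over 0/1 feature vectors can be sketched as below. It assumes every glyph has already been scaled to the same fixed grid, which the slides do not state, and the 3x3 shapes and labels are toy stand-ins for the synthetic Telugu font data.

```python
import math


def to_feature_vector(glyph):
    """Flatten a binary glyph image (list of rows) into a 0/1 vector."""
    return [pixel for row in glyph for pixel in row]


def classify(glyph, templates):
    """Return the label of the template closest in Euclidean distance.

    templates: list of (label, glyph) pairs on the same grid size.
    """
    vec = to_feature_vector(glyph)
    best_label, best_dist = None, math.inf
    for label, template in templates:
        dist = math.dist(vec, to_feature_vector(template))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy templates standing in for synthetic training glyphs.
templates = [
    ("bar_h", [[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
    ("bar_v", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
]
noisy = [[0, 0, 0], [1, 1, 1], [0, 0, 1]]  # horizontal bar with one extra pixel
predicted = classify(noisy, templates)
```

For 0/1 vectors the squared Euclidean distance equals the number of differing pixels, so this amounts to matching on pixel disagreement.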

(29)

Feature Vector generation and classification

Observations

Dividing the characters into groups will improve accuracy.

Experiments

Measured accuracy without classifying connected components into groups

Measured accuracy with connected components classified into groups

(30)

Feature Vector generation and classification

Results

| Method | Without grouping | With grouping |
|---|---|---|
| 0/1 | 33.75% | 41.58% |
| SIFT | 10.40% | 15.97% |

Table: Classification Accuracy statistics

(31)

Post Processing

Our Methodology

Created a property file that maps each character label to its corresponding Unicode

Picked the Unicode corresponding to the classified character label

Identified the character position and handled it accordingly
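The label-to-Unicode lookup could be sketched as follows. The property-file layout (`label=hex code points`) and the label names are hypothetical, since the slides do not show the actual file; the code points themselves are from the Unicode Telugu block.

```python
# Hypothetical property-file content: one "label=code points" entry per line.
SAMPLE_PROPERTIES = """
# label = hexadecimal Unicode code point(s)
cc_a=0C05
cc_ka=0C15
cc_aa_sign=0C3E
"""


def load_label_map(text):
    """Parse property-file text into a dict from label to Unicode string."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        label, _, codes = line.partition("=")
        mapping[label.strip()] = "".join(
            chr(int(code, 16)) for code in codes.split())
    return mapping


def render(labels, mapping):
    """Concatenate the Unicode strings for a sequence of classified labels."""
    return "".join(mapping[label] for label in labels)


label_map = load_label_map(SAMPLE_PROPERTIES)
# Letter KA followed by the vowel sign AA.
text = render(["cc_ka", "cc_aa_sign"], label_map)
```

The labels must arrive in logical Unicode order (base consonant before its dependent signs), which is why the character position has to be identified before rendering.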

(32)

Example

Figure: Example for rendered output text

(33)

Observations

Word spacing is not reproduced perfectly.

Paragraphs beginning with a space are not identified.

Headings in a large font are not handled.

(34)

Noise

Sources of Noise

Quality of printer

Quality of scanner

Age of the document

Advantages of Noise removal

Improves recognition results.

Facilitates better processing in next stages.

(35)

Noise Removal

Our Methodology

Found connected components (CC)

Identified three features for CC

Extracted the three features for all CC

Applied the Expectation Maximization (EM) algorithm for clustering

Using Weka's [4] EM algorithm API.
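To show what EM clustering of connected components does, here is a small pure-Python sketch. It is not the Weka API: it fits a two-component Gaussian mixture to a single feature, whereas this work used three features that are not named on the slide. Using component area as the feature, and treating the small-area cluster as noise, are assumptions made for illustration.

```python
import math


def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns a 0/1 cluster label per value; cluster 0 starts at the
    smallest value and cluster 1 at the largest.
    """
    n = len(values)
    mean = [min(values), max(values)]
    overall = sum(values) / n
    spread = sum((v - overall) ** 2 for v in values) / n or 1.0
    var = [spread, spread]
    weight = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in values]
    for _ in range(iterations):
        # E-step: responsibility of each cluster for each value.
        resp = []
        for v in values:
            dens = [weight[k]
                    * math.exp(-(v - mean[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k])
                    for k in (0, 1)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weight[k] = nk / n
            mean[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var[k] = max(
                sum(r[k] * (v - mean[k]) ** 2
                    for r, v in zip(resp, values)) / nk,
                1e-6)  # floor keeps a tight cluster from collapsing
    return [0 if r[0] >= r[1] else 1 for r in resp]


# Hypothetical component areas in pixels: specks first, then characters.
areas = [1, 2, 1, 3, 2, 50, 55, 60, 52, 58]
labels = em_two_clusters(areas)
kept = [a for a, label in zip(areas, labels) if label == 1]
```

Components assigned to the small-area cluster would be dropped as noise. With real scans a single feature is unlikely to separate noise from text cleanly.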

(36)

Experiments and Results

Experiments

EM algorithm with two clusters

EM algorithm with three clusters

K-means clustering with two clusters

Results

| Method | Noise clusters removed | Text clusters removed |
|---|---|---|
| EM 2-cluster | 70% | 0.01% |

Table: Noise Removal statistics

(37)

Example

(38)

Example

Figure: Before noise removal

Figure: After noise removal

(39)

Observations

Underlines and handwritten characters are not removed

Non-Telugu characters are not completely removed

In some documents border noise is not removed

Joined Telugu characters are removed

(40)

Future Work

Planned Improvements:

Including more features, such as the thickness of a character

Including the structural features of character shapes

Including a position feature to remove border noise

Trying to implement methods to find overlapping characters

(41)

Conclusion

Created end to end functioning basic OCR.

Binarization with global thresholding gives above 90% accuracy

Able to correct skew angles of up to ±20 degrees

Able to remove above 70% noise from document

Able to segment lines with more than 95% accuracy

Able to classify characters with more than 30% accuracy

Able to render the characters by finding matching Unicode

(42)

Future Work

Consider more structural features for improving accuracy and better noise removal.

Apply CRF for line segmentation

Use language models for correcting confusion characters and broken characters

Improve classification models.

Implement page layout analysis module

Implement a web based Telugu OCR

(43)

References

[1] Java Advanced Imaging API. http://java.sun.com/javase/technologies/desktop/media/

[2] Java Deskew by Hough Transformation. http://www.jdeskew.com

[3] Flood fill algorithm. http://en.wikipedia.org/wiki/Flood_fill

[4] WEKA Java API. http://www.cs.waikato.ac.nz/ml/weka/

[5] Digital Library of India. http://www.dli.gov.in/

[6] Drishti Telugu OCR. http://www.ildc.in/Telugu/htm/lin_ocr_spell.htm

(44)

References

[1] Rajasekaran S. N. S. and Deekshatulu B. L. (1977). Recognition of printed Telugu characters. Comput. Graphics Image Processing, 6, pp. 335-360.

[2] Rao P. V. S. and Ajitha T. M. (1995). Telugu Script Recognition - a Feature Based Approach. Proc. of ICDAR, IEEE, pp. 323-326.

[3] Pujari Arun K., Naidu C. Dhanunjaya and Jinaga B. C. (2002). An Adaptive Character Recognizer for Telugu Scripts using Multiresolution Analysis and Associative Memory. ICVGIP, Ahmedabad.

[4] Negi Atul, Bhagvati Chakravarthy and Krishna B. (2001). An OCR system for Telugu. Proc. of 6th Int. Conf. on Document Analysis and Recognition, IEEE Comp. Soc. Press, USA, pp. 1110-1114.

[5] Lakshmi C. V. and Patvardhan C. (2003). A high accuracy OCR for printed Telugu text. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Volume 2, 15-17, pp. 725-729.

(45)

Thank You
