• No results found

Novel algorithms and their applications for mining static, structured, temporal data and data streams

N/A
N/A
Protected

Academic year: 2023

Share "Novel algorithms and their applications for mining static, structured, temporal data and data streams"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

NOVEL ALGORITHMS AND THEIR APPLICATIONS FOR MINING STATIC, STRUCTURED, TEMPORAL DATA

AND DATA STREAMS

SHALINI BHASKAR

AMARNATH AND SHASHI KHOSLA SCHOOL OF INFORMATION TECHNOLOGY

INDIAN INSTITUTE OF TECHNOLOGY DELHI

AUGUST 2012

(2)

© Indian Institute of Technology Delhi (IITD), New Delhi, 2012

(3)

NOVEL ALGORITHMS AND THEIR APPLICATIONS FOR MINING STATIC, STRUCTURED, TEMPORAL DATA

AND DATA STREAMS

By

SHALINI BHASKAR

AMARNATH AND SHASHI KHOSLA SCHOOL OF INFORMATION TECHNOLOGY

Submitted

in fulfillment of the requirements of the degree of

Doctor of Philosophy

to the

INDIAN INSTITUTE OF TECHNOLOGY DELHI NEW DELHI-110016, INDIA

AUGUST 2012

(4)

i

Certificate

This is to certify that the thesis entitled “Novel Algorithms and their Applications for Mining Static, Structured, Temporal Data and Data Streams”, which is being submitted by Shalini Bhaskar for the award of the degree of Doctor of Philosophy in Information Technology to the Indian Institute of Technology Delhi, is a bona fide research work done under my guidance and supervision.

The thesis has reached the standard fulfilling the requirements of the regulations relating to the degree. The results obtained in the thesis have not been submitted to any other University or Institute for the award of any degree or diploma.

Dr (Mrs.) B.Chandra Professor

Department of Mathematics

Indian Institute of Technology Delhi Hauz Khas, New Delhi -110016 India

August 2012

(5)

iii

Acknowledgements

I convey my deepest gratitute to my supervisor Prof. B. Chandra for patient discussions, active support, constant encouragement and extra ordinary guidance in completing this thesis. Her in-depth knowledge, valuable comments and suggestions provided high-quality foundation for the present thesis. I am indebited to her for her untiring efforts, constant support and valuable guidance. No words can ever describe the efforts of Prof. B. Chandra for motivating me to set yardsticks and inspired me to achieve higher goals.

I wish to acknowledge the Indian Institute of Technology Delhi for providing me with outstanding research facilities. I extend my sincere thanks to my SRC members, Prof. Sanjiva Prasad, Prof. M.P. Gupta and Prof. Prem Kumar Kalra for their time, advice and suggestions during the course of my research work. I would also like to thank the Head of the Department for all the help and support provided to me.

I am extremely gratified to my parents for their blessings and prayers. I am grateful to my sister and brother for listening to me patiently and encouraging me for performing better. I am eternally thankful to my mother-in-law, my husband and my children for endless serenity, unconditional love and burly support without which it would have been difficult to accomplish the task.

Above all, I am highly grateful to the Almighty GOD for His blessings in completing my objective.

Shalini Bhaskar

(6)

v

Abstract

Data mining provides efficient tools for knowledge discovery from data. There are broadly three techniques in data mining viz. classification, clustering and association rule mining. The thesis is devoted to developing new techniques in association rule mining and illustrates how they can be used in important applications. Various aspects of association rule mining covered in the thesis include weighted association rule mining, data stream mining, structured data mining, sequence mining and generating an efficient sample from market basket data.

In the area of weighted association rule mining, Hub-Averaging of link analysis has been applied to assign weights to items and transactions. This has been used in conjunction with pattern growth to find frequent itemsets from market basket data.

In data stream mining, traditionally, three types of windows are being used, namely landmark, damped and sliding windows. It is computationally expensive to use these types of windows. Hence, a novel approach of using jumping window has been attempted for data stream mining. In order to preserve the frequent itemsets in the previous windows, fuzzification of support of closed frequent itemsets has been attempted.

In the area of tree mining an algorithm Mine-Tree is proposed that mines induced subtrees from labeled ordered rooted trees. It employs efficient candidate generation procedure and uses binary search tree to avoid generating redundant candidates, thereby increasing the efficiency of finding frequent substructures.

In graph mining EFSGM (Efficient Frequent SubGraph Mining) algorithm is proposed for efficiently mining frequent subgraphs from the database of directed graphs. It uses equivalence class principle to identify isomorphic candidates. This reduces the number of candidates to be explored. An important application of the

(7)

vi

graph mining algorithm EFSGM for finding frequent substructures from carcinogenic chemical compound database has also been presented in this chapter of the thesis which helps in reducing the search space in the drug discovery process.

Another important research area explored in the thesis is on sequence mining. An efficient storage structure has been proposed for storing candidate itemsets/sequences. The candidate generation procedure suggested for the proposed algorithm captures all instances of an itemset/sequence which helps in not missing any itemset/sequence.

With the increase in the volume of data getting generated, designing algorithms for generating an efficient sample has become an important research area. Hence, an algorithm to generate an efficient sample from market basket data has been proposed in the thesis. The algorithm generates a sample that can represent the entire database by giving weights to items and transactions. A new decision criterion has been evolved to efficiently generate the final sample.

The effectiveness of all the newly developed algorithms has been illustrated on synthetic as well as benchmark datasets.

(8)

vii

Contents

Certificate…..…..….………...………..…..….……...….…….……….……...i

Acknowledgements...………...………..…..….….……..…….iii

Abstract ...……….………..…………...….…..……….…………....v

List of Figures..….………..………...………...xi

List of Tables .………..………..…………..……..……..………....……….xv

1. Introduction ... 1

1.1 Motivation ... 8

1.2 Thesis Organization ... 9

2. Patterned Growth Algorithm using Hub-Averaging without Pre-assigned Weights ... 13

2.1 Overview of Existing Algorithms for Weighted Association Rule Mining .... 14

2.2 Overview of Hub-Averaging Algorithm ... 16

2.3 Proposed Algorithm – HAP-Growth (Hub-Averaging Pattern Growth) for Finding Frequent Itemsets using Hub-Averaging ... 18

2.4 Performance Evaluation ... 22

2.4.1 Comparative Performance Evaluation on IBM Synthetic Datasets .. 23

2.4.2 Comparative Performance Evaluation on Benchmark Datasets ... 27

2.5 Concluding Remarks ... 29

3. A Novel Approach for Finding Frequent Itemsets in Data Streams ... 31

3.1 Overview of Related Work in Data Stream Mining ... 33

(9)

viii

3.2 Proposed Algorithm for Data Stream Mining ... 35

3.2.1 Defuzzification ... 42

3.3 Illustrations of the Proposed Algorithm ... 43

3.4 Performance Evauation ... 46

3.4.1 Performance Evaluation on Synthetic Datasets generated using IBM Synthetic Data Generator ... 48

3.4.2 Performance Evaluation on Synthetic Datasets used by Chi et al. (generated using IBM Synthetic Data Generator) ... 54

3.4.3 Results on Benchmark Datasets taken from the UCI Machine Learning Repository ... 60

3.5 Concluding Remarks ... 63

4. Mining Induced Subtrees from Labeled Ordered Rooted Trees ... 65

4.1 Basic terminologies used in Tree Mining ... 66

4.2 Overview of Existing Algorithms in Tree Mining ... 68

4.3 Proposed Algorithm---Mine-Tree ... 70

4.3.1 Candidate Subtree Generation... 71

4.3.2 Identifying Frequent Subtrees from Candidate Subtrees ... 73

4.3.3 Pseudo code and Complexity of the Proposed Algorithm ... 75

4.4 Illustrations ... 79

4.5 Performance Evaluation on Synthetic and Benchmark Datasets ... 85

4.5.1 Comparative Performance Evaluation on Synthetic Datasets... 85

4.5.2 Comparative Performance Evaluation on Benchmark Datasets ... 89

4.6 Concluding Remarks………..………...92

5. Mining Frequent Subgraphs from Graphs ... …………..95

5.1 Basic terminologies used in Graph Mining ... 96

5.2 Overview of Related Work in Graph Mining ... 97

5.3 Proposed Approach – EFSGM (Efficient Frequent SubGraph Mining) .. 100

(10)

ix

5.3.1 Generate Candidate Subgraphs using L-R join, Serial Extensions

and Mixed Extensions... 101

5.3.2 Storing Distinguishing Features of the Vertices of a Candidate Graph/Subgraph ... 104

5.3.3 Building Canonical Representation ... 107

5.3.4 Identifying Isomorphic Graph/Subgraph ... 109

5.3.5 Computational Complexity of EFSGM ... 110

5.4 Performance Evaluation ... 112

5.5 Identifying Frequent Toxic Substructures in Chemical Compounds using EFSGM……….119

5.5.1 Introduction to Drug Discovery Process and use of Graphs for representing Chemical Compounds………....119

5.5.2 Results of Graph Mining……….121

5.6 Concluding Remarks………..126

6. A New Approach for Mining Frequent Sequences from Temporal Database ... 127

6.1 Overview of Existing Algorithms in Sequence Mining ... 129

6.2 Proposed Algorithm for Sequence Mining ... 131

6.2.1 Itemset Extension ... 134

6.2.2 Sequence Extension ... 137

6.2.3 Complexity of the Proposed Algorithm ... 139

6.3 Illustration of the Proposed Algorithm and PRISM Algorithm ... 140

6.3.1 Working of PRISM Algorithm ... 141

6.3.2 Working of the Proposed Algorithm ... 143

6.3.2.1 Itemset Extension and Sequence Extension ... 145

6.4 Performance Evaluation ... 149

6.4.1 Comparative Performance Evaluation on Synthetic Datasets ... 149

(11)

x

6.4.1.1 Performance Evaluation in terms of Execution Time

and Number of Frequent Sequences ... 150

6.4.1.2 Performance of Proposed Algorithm in terms of Scalability……….……….152

6.4.2 Comparative Performance Evaluation on Benchmark Datasets .. .154

6.5 Concluding Remarks ... 157

7. A New Approach for Generating an Efficient Sample from Market Basket Data ... 159

7.1 Overview of Existing Sampling Algorithms ... 160

7.2 Proposed Algorithm for generating an Efficient Sample from Market Basket Data ... 162

7.3 Performance Evauation ... 168

7.4 Concluding Remarks ... ………..171

8. Conclusions ... 173

References ... 179

List of Publications ... 189

Bio-Data ... 191

References

Related documents

Journal of Infrared Physics and Technology, Elsevier Inc.. Rajesh, Gaurav Singhal and A. Loan, Mainuddin, 2019 EEE SOI-3D-Subthreshold Microelectronics Technology

 Automated data collection tools and mature database technology lead to tremendous amounts of data stored data warehouses. and other

Graph Mining, Social Network analysis and multi- relational data mining, Spatial data mining, Multimedia data mining, Text mining, Mining the world wide

Basics of data mining, Knowledge Discovery in databases, KDD process, data mining tasks primitives, Integration of data mining systems with a database or data

The significant components of data mining systems are a data source, data mining engine, data warehouse server, the pattern evaluation module, graphical user interface,

One particularly important problem in the area of sequential rate mining is the problem of discovering all subsequences that appear on a given sequence database

The performance of supervised Kohone n Architecture using linear vector qu anti za tion (L VQ) is s hown in Table 12. As the number of epoc hs increa1es the netw

Classification and Regression Tree (CART) is a data exploration and prediction algorithm with a structure of a simple binary tree .. Classification and Regression