Data mining task primitives

Academic year: 2022

Basics of data mining, Knowledge Discovery in Databases (KDD), the KDD process, data mining task primitives, integration of data mining systems with a database or data warehouse system, major issues in data mining, and data pre-processing: data cleaning, data integration and transformation, data reduction, etc.

Integration of data mining systems with a database or data warehouse system

If a data mining system is not integrated with a database or a data warehouse system, then there is no system for it to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.

The list of integration schemes is as follows −

(3)

1. No Coupling − In this scheme, the data mining system does not utilize any database or data warehouse functions. It fetches data from a particular source, processes that data using data mining algorithms, and stores the result in another file.


2. Loose Coupling − In this scheme, the data mining system may use some functions of the database and data warehouse system. It fetches data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.


3. Semi-tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided in the database.


4. Tight Coupling − In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of the information system.


Major issues in data mining

Data mining faces major issues regarding mining methodology, user interaction, performance, and diverse data types. These issues are the following:

Mining Methodology and User Interaction

Performance Issues

Diverse Data Types Issues


1. Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining systems to cover a broad range of knowledge discovery tasks.


Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.


Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.


Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation − The patterns discovered may be uninteresting to a given user, either because they represent common knowledge or because they lack novelty; interestingness measures are needed to filter them.


2. Performance Issues

There can be performance-related issues such as the following −

Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.


Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mining results as the database changes, without mining the data again from scratch.


3. Diverse Data Types Issues

Handling of relational and complex types of data − A database may contain complex data objects, multimedia data, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds further challenges to data mining.


Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query. This query is input to the system. A data mining query is defined in terms of data mining task primitives, which allow us to communicate with the data mining system in an interactive manner.


List of Data Mining Task Primitives −

Set of task-relevant data to be mined.

Kind of knowledge to be mined.

Background knowledge to be used in the discovery process.

Interestingness measures and thresholds for pattern evaluation.

Representation for visualizing the discovered patterns.
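As a concrete sketch, the five primitives above can be gathered into a single query object before being handed to a mining system. All field names and values below are illustrative assumptions, not the API of any real system.

```python
# Hypothetical sketch: one key per data mining task primitive.
# Every name and value here is illustrative only.
query = {
    "task_relevant_data": {"table": "sales", "attributes": ["customer", "item", "region"]},
    "kind_of_knowledge": "association",
    "background_knowledge": {"concept_hierarchy": {"city": "country"}},
    "interestingness": {"min_support": 0.05, "min_confidence": 0.7},
    "presentation": "rules",
}

# Five keys, matching the five primitives listed above.
print(len(query))  # 5
```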


A data mining query is defined in terms of the following primitives:

1. Task-relevant data: This is the database portion to be investigated. For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify only the portion of the data relevant to this task, such as the attributes describing Canadian customers and their purchases. These are referred to as relevant attributes.


2. The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. For instance, if studying the buying habits of customers in Canada, you may choose to mine associations between customer profiles and the items that these customers like to buy.
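To make the association example concrete, here is a minimal sketch of the two standard measures behind association mining, support and confidence, computed over toy transaction data. The item names are invented; real systems use dedicated algorithms such as Apriori rather than this brute-force check.

```python
# Toy transactions; the item names are invented for illustration.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"keyboard"},
    {"laptop", "keyboard"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"laptop", "mouse"}, transactions))       # 0.5
print(confidence({"laptop"}, {"mouse"}, transactions))  # ~0.667
```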


3. Background knowledge: Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process, and for evaluating the patterns found. There are several kinds of background knowledge.


4. Interestingness measures: These functions are used to separate uninteresting patterns from those that represent knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.

5. Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for knowledge presentation, such as rules, tables, charts, graphs, decision trees, and cubes.


Data pre-processing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Pre-processing:

1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning is done to handle this. It involves handling missing data, noisy data, etc.


(a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways.

Some of them are:

Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

Fill the missing values: There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
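As a small sketch of the attribute-mean strategy, assuming the column is a plain Python list with `None` marking each missing entry:

```python
from statistics import mean

# Mean imputation: replace each None with the mean of the observed values.
ages = [25, None, 31, None, 28]

observed = [v for v in ages if v is not None]
fill = mean(observed)  # mean of 25, 31, 28 is 28
cleaned = [fill if v is None else v for v in ages]
print(cleaned)  # [25, 28, 31, 28, 28]
```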


(b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method: This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or the segment's boundary values can be used to smooth them.
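The bin-mean variant can be sketched in a few lines; the sample values and bin size are arbitrary choices for illustration:

```python
# Bin-mean smoothing: sort the data, split it into equal-size bins,
# and replace each value with the mean of its bin.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3

data.sort()
smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```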


2. Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering: This approach groups similar data into clusters; values that fall outside all clusters can be treated as outliers (noise).
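A minimal sketch of the clustering idea, assuming the cluster centres were already found by an earlier clustering step and using an arbitrary distance threshold to flag values that fall outside every cluster:

```python
# Clustering-based noise detection sketch. The centres are assumed
# outputs of a prior clustering step; the threshold is arbitrary.
data = [10, 11, 12, 50, 51, 200]
centres = [11, 50.5]
threshold = 5

# A value farther than the threshold from every centre is an outlier.
outliers = [x for x in data if all(abs(x - c) > threshold for c in centres)]
print(outliers)  # [200]
```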


2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:

1. Normalization: This is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

2. Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.


3. Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
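Min-max normalization (step 1 above) can be sketched directly; the values and the target range of 0.0 to 1.0 are illustrative:

```python
# Min-max normalization: rescale values into the range [0.0, 1.0].
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```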


3. Data Reduction: Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Strategies for data reduction include the following.


1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.

2. Dimension reduction, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.

3. Data compression, where encoding mechanisms are used to reduce the data set size.


4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.

5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction, and are a powerful tool for data mining.
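As a sketch of numerosity reduction by sampling, here is simple random sampling without replacement that keeps 10% of the records; the data, sampling rate, and fixed seed are all illustrative assumptions:

```python
import random

# Numerosity reduction by simple random sampling without replacement.
random.seed(42)  # fixed seed so this sketch is reproducible
records = list(range(1000))
sample = random.sample(records, k=len(records) // 10)
print(len(sample))  # 100
```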
