
Indian Journal of Engineering & Materials Sciences Vol. 7, June 2000, pp. 107-121

Design of soft computing models for data mining applications

S Sumathi, S N Sivanandam & Jagadeeswari

Department of Electrical and Electronics Engineering, PSG College of Technology, Coimbatore 641 004, India

Received 18 May 1999; revised received 27 March 2000

Although modern technologies enable the storage of large streams of data, there is no technology that can help to understand, analyze and visualize the hidden information in the data. Data mining, also called data or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for analyzing data: it allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Pattern classification is one particular category of data mining, which enables the discovery of knowledge from very large databases (VLDB). Data mining can be applied to a wide range of applications such as business forecasting, decision support systems, SONAR, RADAR, SEISMIC and medical diagnosis.

Artificial neural networks, which have better noise immunity and shorter training time, are used to mine the database. A self-organizing neural network architecture called predictive ART, or ARTMAP, is introduced that is capable of fast, stable learning and hypothesis testing in response to an arbitrary stream of input patterns. A generalization of binary ARTMAP is the fuzzy ARTMAP, which learns to classify inputs by a pattern of fuzzy membership values between 0 and 1, indicating the extent to which each feature is present. A generalization of fuzzy ARTMAP is the cascade ARTMAP, in which pre-existing symbolic rules are used to initialize the network before learning so that the network efficiency is increased. This rule insertion also provides knowledge to the network that cannot be captured by training examples. Interpretation of the knowledge learned by this neural network leads to compact and simpler rules compared to the back propagation approach. Another self-organizing algorithm is proposed using the Kohonen architecture, which also requires less time and gives high prediction accuracy compared to BPN. Moreover, the rules extracted from this network are very simple compared to the BPN approach. Finally, the extracted rules have been validated for their correctness. This approach is most widely used in the medical industry for correct prediction when the database is large in size, since manual mining of such voluminous data is very difficult, very time consuming, and may sometimes lead to incorrect predictions; hence the data mining software is developed. The performance of all three networks, namely cascade ARTMAP, fuzzy ARTMAP and Kohonen, has been evaluated and compared with conventional methods. Simulation is carried out using the medical databases taken from the UCI repository of machine learning databases. The developed data mining software can also be used for other applications like the Web, communications, and pattern recognition.

With the wide use of the advanced database technology developed during the past decades, it is not difficult to efficiently store huge volumes of data in computers and retrieve them whenever needed. Although the stored data are a valuable asset of any application, sooner or later more people face the problem of being data rich but knowledge poor. This situation aroused the recent surge of research interest in the area of data mining. One of the data mining problems is classification1,2. Classification is the process of finding the common properties among different entities and classifying them into classes. The results are often expressed in the form of rules.

Of the two approaches widely used by researchers in the artificial intelligence (AI) field, the symbolic approach based on decision trees and the connectionist approach based mainly on neural networks, the latter has been used in this paper for pattern classification. The connectionist approach is widely used as it is well suited for classification problems. This paper utilizes artificial neural networks (ANN) for classification as they outperform symbolic learning methods.

Computer processing and storage technologies have advanced to a great extent, with the ability to keep hundreds of gigabytes or even terabytes of data on line. Even though this development in technology has provided the facility of having historical data for decision support applications, the available analysis methods are not capable enough to discover knowledge from them. Information discovery through data mining algorithms like neural networks provides the solution to this problem2.

Artificial neural networks are predictive models that are extensively used for pattern classification and decision making. Artificial neural networks provide the rapid learning and accurate decision making needed for real-life applications. The network's intelligence is obtained through encoding of domain knowledge, which enables the network to have fast learning and better classification capability.

Ever since the advent of the internet and its widespread acceptance, databases have been assuming gargantuan proportions. Though these databases are storehouses of knowledge, effective tools for tapping these resources are the need of the hour. This serves as the catalyst to spur research in data mining. Pattern classification and decision making based on the available data is the key area of interest among the research communities.

The emerging field of knowledge discovery in databases (KDD) has grown significantly in the past few years, as computing and storage technology has enabled people to collect and store information on an ever larger scale.

Modern database technology allows the storage of large streams of data, which have to be analyzed and visualized; this can be done using artificial neural networks. An artificial neural network approach is used as a data mining tool for analyzing the data. The basic approach is the back propagation network (BPN), which has the following limitations:

(i) Symbolic rules inserted into the network quickly lose their original meanings.

(ii) It is neither able to create nodes during learning nor able to reduce nodes during pruning.

(iii) Learning time is large.

For these reasons, a new neural network architecture called ARTMAP, which overcomes the above-mentioned limitations, forms an efficient network for classification. The use of neural networks in classification gives a lower classification error rate and is robust to noise3,4.

This paper concentrates on reducing the learning time by using self-organizing neural networks that increase the classification accuracy. With the rule extraction procedure, the symbolic rules extracted from the network are much simpler and easier to understand. Moreover, rules can be extracted at any stage of the learning process, which is impossible in the case of BPN.

A supervised learning system is built up in ARTMAP from a pair of adaptive resonance theory modules (ARTa and ARTb) that are capable of self-organizing stable recognition categories in response to arbitrary sequences of input patterns. During training trials, the ARTa module receives a stream of input patterns a[p] and ARTb receives a stream of input patterns b[p], where b[p] is the correct prediction given a[p]. These ART modules are linked together by an internal controller that ensures autonomous system operation in real time. During test trials, the remaining patterns a[p] are presented without b[p], and their predictions at ARTb are compared with b[p]. The internal controller increases the vigilance parameter of ARTa by the minimal amount needed to correct a predictive error at ARTb.

The generalization of the binary ARTMAP system, called the fuzzy ARTMAP system, learns to classify inputs by a pattern of fuzzy membership values between 0 and 1 indicating the extent to which each feature is present. This generalization is achieved by replacing the ART modules in the binary ARTMAP system with fuzzy ART modules. The vigilance parameter ρa calibrates the minimum confidence that ARTa must have in a recognition category, or hypothesis, activated by an input a[p] in order for ARTa to accept that category, rather than search for a better one through an automatically controlled process of hypothesis testing. Hypothesis testing leads to the creation of a new ARTa category, which focuses attention on a new cluster of a[p] input features that is better able to predict b[p]. Another algorithm and architecture suggested for the artificial neural network is the self-organizing map, which has the special property of effectively creating spatially organized internal representations of various features of input signals and their abstractions. After training the network, the rule extraction process is carried out, and finally the extracted rules are checked for their validity and correctness.

Cascade ARTMAP Architecture3

A new hybrid system termed cascade adaptive resonance theory mapping (cascade ARTMAP) incorporates symbolic knowledge into the neural network. This is done by a rule insertion algorithm that translates if-then symbolic rules into the cascade ARTMAP architecture. The inserted symbolic knowledge is refined and enhanced by the cascade ARTMAP learning algorithm. During learning, the symbolic rule forms are preserved, so rules extracted from cascade ARTMAP can be compared with the originally inserted rules. The rule insertion, refinement and extraction paradigm for symbolic knowledge refinement and evaluation is shown in Fig. 1.

During learning, new recognition categories (rules) can be created dynamically to cover deficiencies of the domain theory. Also, each extracted rule is associated with a confidence factor that indicates its importance or usefulness; this allows ranking and evaluation of the extracted knowledge.

Fig. 1 - Cascade ARTMAP overall processing: rule insertion, refinement, extraction

The cascade ARTMAP can be explained in five stages:

Stage 1: Rule Cascade Representation
Stage 2: Rule Insertion
Stage 3: Rule Chaining and Inferencing
Stage 4: Rule Refinement and Learning
Stage 5: Rule Extraction

Cascade ARTMAP Algorithm

The different steps in the cascade ARTMAP algorithm are listed below.

A and B denote the F1a and F1b input vectors respectively.

x^a = (x_i^a, x_h^a, x_o^a, x_i^ac, x_h^ac, x_o^ac) and x^b = (x_i^b, x_h^b, x_o^b, x_i^bc, x_h^bc, x_o^bc) denote the 2M-dimensional F1a and F1b activity vectors respectively.

x_i^a and x_i^b denote the M_i-dimensional input attribute vectors. x_h^a and x_h^b denote the M_h-dimensional intermediate attribute vectors. x_o^a and x_o^b denote the M_o-dimensional output attribute vectors. x_i^ac, x_h^ac, x_o^ac, x_i^bc, x_h^bc and x_o^bc are the complement attribute vectors.

y^a and y^b denote the F2a and F2b activity vectors. x^ab denotes the map field F^ab activity vector.

Initially, the weights of the jth category node in F2a and F2b, i.e., w_j^a and w_j^b, and the weight vector w_j^ab from the jth F2a node to F^ab, contain all 1's, i.e., all the nodes are uncommitted.

Step 1: (Input Presentation)

The input is presented to ARTa with choice parameters α_a > 0 and α_b > 0, learning rates β_a ∈ [0, 1] and β_b ∈ [0, 1], and vigilance parameters ρ_a ∈ [0, 1] and ρ_b ∈ [0, 1]. During learning, ρ_a = ρ_b = 1 and x^a = A.

Step 2: (Rule Selection)

Given x^a, the choice function T_j^a for each F2a node j is calculated as

T_j^a = |x^a ∧ w_j^a| / (α_a + |w_j^a|)

where the fuzzy AND operator ∧ is defined by (p ∧ q)_i = min(p_i, q_i).

The category choice is indexed at J, where T_J^a = max{T_j^a : for all F2a nodes j}. If more than one T_j^a is maximal, the F2a category with the smallest index is selected. When the Jth category is chosen, y_J^a = 1 and y_j^a = 0 for j ≠ J.

Resonance occurs if the match function m_J^a of node J meets the vigilance criterion

m_J^a = |x^a ∧ w_J^a ∧ s_J| / |x^a ∧ s_J| ≥ ρ_a

where s_J is the scope vector, with s_Ji = 1 if i indexes an input attribute and 0 otherwise. If m_J^a < ρ_a, a mismatch reset occurs: T_J^a is set to zero and Step 2 is repeated for another index J.
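To make Step 2 concrete, the following minimal sketch (an illustration only, not the authors' implementation; the array layout, function names and NumPy dependency are assumptions) computes the choice function and the vigilance test:

```python
import numpy as np

def fuzzy_and(p, q):
    # Fuzzy AND operator: (p ^ q)_i = min(p_i, q_i)
    return np.minimum(p, q)

def choose_category(x_a, weights, alpha_a):
    # T_j^a = |x^a ^ w_j^a| / (alpha_a + |w_j^a|) for every F2a node j;
    # np.argmax breaks ties in favour of the smallest index, as required.
    T = np.array([fuzzy_and(x_a, w).sum() / (alpha_a + w.sum())
                  for w in weights])
    return int(np.argmax(T)), T

def vigilance_ok(x_a, w_J, s_J, rho_a):
    # Match function m_J^a = |x^a ^ w_J ^ s_J| / |x^a ^ s_J| >= rho_a,
    # where the scope vector s_J masks in only the input attributes.
    num = fuzzy_and(fuzzy_and(x_a, w_J), s_J).sum()
    den = fuzzy_and(x_a, s_J).sum()
    return num / den >= rho_a
```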

Step 3: (No Prediction)

If the selected F2a node J has no prediction, i.e., w_Jk^ab = 1 for all F^ab nodes k, each node j in the precursor set ψ(J) learns

w_j^a(new) = (1 − β_a) w_j^a(old) + β_a (x^a ∧ w_j^a(old))

If the input vector B is presented, the category node K is selected as in Step 2 and the F2b node K learns

w_K^b(new) = (1 − β_b) w_K^b(old) + β_b (x^b ∧ w_K^b(old))

J is then associated with K by setting w_Jk^ab = 1 if k = K and 0 otherwise, and the system halts.
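The learning rule above moves a weight vector toward its fuzzy intersection with the current activity; with fast learning (β = 1) the update reduces to the fuzzy AND itself. A hedged sketch (names assumed):

```python
import numpy as np

def learn(w_old, x, beta):
    # w(new) = (1 - beta) * w(old) + beta * (x ^ w(old))
    return (1.0 - beta) * w_old + beta * np.minimum(x, w_old)

# Every node j in the precursor set psi(J) is updated the same way, e.g.:
# for j in precursor_set:
#     weights[j] = learn(weights[j], x_a, beta_a)
```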


Step 4: (Inferencing)

If the F2a node J has learned to make a prediction, then w_J^ab activates F^ab:

x^ab = w_J^ab

Once the map field is active, F2b is activated through the one-to-one pathways between F^ab and F2b, i.e., T_k^b = x_k^ab. The category choice K is then made as T_K^b = max{T_k^b : for all F2b nodes k}. If the Kth category is chosen, y_K^b = 1 and y_k^b = 0 for k ≠ K; then x^b = w_K^b (top-down priming occurs).

Termination is checked by calculating a goal signal g over the M_o output attributes:

g = Σ_{i=1}^{M_o} (x_oi^b + x_oi^bc)

A conclusion is reached if g > 0 (an output attribute is known).
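The goal signal simply checks whether any output-attribute activity is present. A minimal sketch, assuming the output attributes and their complements occupy known index lists (an assumption for illustration):

```python
def goal_signal(x_b, out_idx, out_c_idx):
    # g = sum over output attributes of (x_oi^b + x_oi^bc);
    # g > 0 means an output attribute is known and a conclusion is reached.
    return sum(x_b[i] for i in out_idx) + sum(x_b[i] for i in out_c_idx)
```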

Step 5: (Update Memory State)

If g = 0, a conclusion is not reached and x^a is updated as

x^a(new) = x^a(old) ∨ x^b(old)

where the fuzzy OR operator ∨ is defined by (p ∨ q)_i = max(p_i, q_i). Step 2 is then repeated.

Step 6: (Prediction Matching)

If a conclusion is reached (i.e., g > 0), the match function m_K^b of the prediction x^b is calculated as

m_K^b = |B ∧ x^b| / |B|

Step 7: (Resonance)

Resonance occurs if m_K^b ≥ ρ_b. Then F2a and F2b learn their weights and the system halts.

Step 8: (Match Tracking)

If m_K^b < ρ_b, a prediction mismatch occurs. The ARTa vigilance ρ_a is then raised slightly:

ρ_a(new) = max{ρ_a(old), min{m_j^a : j ∈ ψ(J)} + ε}

and the process is repeated from Step 2.
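Match tracking only ever raises the vigilance, and only just past the worst match among the fired rules. A hedged one-function sketch (ε and the list of match values are assumed arguments):

```python
def raise_vigilance(rho_a_old, match_values, epsilon=1e-3):
    # rho_a(new) = max(rho_a(old), min{m_j^a : j in psi(J)} + epsilon);
    # the worst-matching fired rule is assumed to have caused the error.
    return max(rho_a_old, min(match_values) + epsilon)
```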

ARTMAP thus handles a class of if-then rules (symbolic rule-based knowledge) that often involves rule cascades and intermediate attributes. A set of rules is said to form a rule cascade when a consequent of one rule also serves as an antecedent of another rule. Attributes that play such dual roles are called intermediate attributes; input attributes serve as antecedents and output attributes serve as consequents. The rule cascade representation shown in Fig. 2 corresponds to the rule form

Rule 1. If A and B then C

Rule 2. If C and D then E.

Fig. 2 - Rule cascade representation


A and B are the input attributes in Rule 1 and C is the output attribute. In Rule 2, C and D are the input attributes and E is the output attribute; hence C is called an intermediate attribute. The rule insertion process proceeds in two phases. The first phase parses all rules for attribute names to set up a symbol table in which each attribute has a unique entry. The second phase translates each rule into 2M-dimensional vectors A and B, where M is the total number of attributes in the symbol table. The process of rule insertion into the network gives additional knowledge that cannot be gained during the training process.

These pre-existing symbolic rules can be used to initialize the network before learning. Complement coding is used only if the table contains negative attributes. After rule insertion, the network is trained with examples from the input data set. When training is sufficient, the network is tested for a specified test ratio. The network is then refined by a pruning process that removes unused nodes, and rules are extracted from the refined network. The overall processing in the cascade ARTMAP system is shown in Fig. 3.

Rule cascade representation

A set of rules based on the database is formed such that a consequent of one rule also serves as the antecedent of another rule, i.e., a rule cascade. Attributes that play a dual role, both as input and as output attributes, are called intermediate attributes. The input, intermediate and output attributes are all presented to both ART networks.

For example, let a database contain four input attributes and one output attribute. The rule form representation is given below.

Rule 1: If (attribute 1) and (attribute 2) then (attribute 3).

Rule 2: If (attribute 3) and (attribute 4) then (attribute 5).

Here attributes 1, 2 and 4 are input attributes, attribute 3 is an intermediate attribute and attribute 5 is the actual output attribute.

Rule insertion

The rules formed above are inserted into the cascade ARTMAP network: the if-then rules are translated into the recognition categories of the system. Each attribute is coded using the thermometer coding principle1. Each rule is translated into 2M-dimensional vectors, where M is the total number of attributes, as the inputs to the ARTa and ARTb modules.

The algorithm derives a pair of vectors, attribute 1 = (a, a^c) and attribute 2 = (b, b^c), where a^c and b^c are the complement vectors corresponding to the respective attributes. If the inserted rules contain no negative attributes, the complement vectors may be eliminated. These vector pairs are used as training patterns to initialize the network.
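A minimal sketch of this translation, assuming a simple thermometer code over attribute thresholds followed by complement coding (the granularity and helper names are illustrative, not taken from the paper):

```python
import numpy as np

def thermometer(value, thresholds):
    # Thermometer coding: one bit per threshold, set for every threshold
    # the value reaches; e.g. thresholds (40, 60) map age 45 -> [1, 0].
    return np.array([1.0 if value >= t else 0.0 for t in thresholds])

def complement_code(a):
    # Pair each bit vector a with its complement a^c (a^c_i = 1 - a_i),
    # needed when the inserted rules contain negative attributes.
    return np.concatenate([a, 1.0 - a])

# Example: code the age condition of a rule such as 40 <= age < 60.
age_bits = thermometer(45, thresholds=(40, 60))   # -> [1., 0.]
vector = complement_code(age_bits)                # -> [1., 0., 0., 1.]
```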

Rule chaining and inferencing

In cascade ARTMAP, the attribute fields F1a of ARTa and F1b of ARTb are identified as the working memory.

Fig. 3 - Cascade ARTMAP overall processing

F1a maintains the current memory state x^a and provides the antecedents for matching and rule firing, while F1b stores the next memory state x^b derived through rule firing. Each inferencing cycle has three phases.

1. Match Phase: A choice function T_j^a for each F2a node (rule) is calculated based on the memory state vector x^a.

2. Select Phase: The winner of all F2a nodes is selected, i.e., the F2a node with the largest choice function.

3. Execute Phase: The results of the selected rule are read out into F1b. At the end of the cycle, the new memory state x^b is used to update x^a in F1a to prepare for the next inferencing cycle, as in Fig. 3.

Learning and rule refinement

In cascade ARTMAP, a chain of rule firings is involved in making a prediction. If the prediction made by the F2a node is correct, the weight vector w_J^a is reduced towards its fuzzy intersection with the F1a activity vector x^a. As the input is presented as a binary pattern, a fired rule ignores those features that are absent in the current input. This results in generalization by reducing the number of features.

When a prediction error occurs, a mini match tracking process raises the ARTa vigilance ρ_a by slightly more than the minimum match achieved by the fired rules. The rule with the worst match is most likely to be the one which causes the prediction error.

Rule extraction process5,6

Rule extraction from the database forms a part of knowledge discovery research. The neural network is first trained with the coded attributes from the database. The trained network is then used to identify those attributes that are irrelevant to the classification decision.

Rule extraction derives rules from the training database. Some inputs also get deleted due to network pruning, which may result in duplication of some rows in the database. These rows are removed before rules are extracted, as they do not contain any new information about the output class. The rule pruning procedure selects small sets of rules for cascade ARTMAP. Rule extraction proceeds in two stages.

Rule pruning

Pruning removes those recognition nodes whose confidence factor falls below a selected threshold. The threshold can be set according to the database accuracy obtained; the general value of the threshold is 0.5.

The rule pruning algorithm derives a confidence factor for each F2a category node in terms of its usage, i.e., its frequency in the training set, and its predictive accuracy on a predicting set. To calculate usage and accuracy, each F2a category node j maintains three counters: (i) an encoding counter C_j that records the number of training patterns encoded by node j; (ii) a predicting counter P_j that records the number of predicting-set patterns predicted by node j; and (iii) a success counter S_j that records the number of predicting-set patterns predicted correctly by node j.

For each training pattern, the encoding counter C_j of each F2a node j in the precursor set ψ(J), where J is the last F2a node (rule) fired that makes the prediction, is incremented by one. For each predicting-set pattern, the predicting counter P_j of each F2a node j in ψ(J) is incremented by one; if the prediction is correct, the success counter S_j of each node j in ψ(J) is also incremented by one. Based on these three counters, the usage U_j and accuracy A_j of an F2a node j are computed.

The usage U_j equals the number of training-set patterns coded by node j, C_j, divided by a normalizing factor:

U_j = C_j / max{C_j : node j predicts outcome K}

The accuracy A_j equals the per cent of predicting-set patterns predicted correctly by node j divided by a normalizing factor:

A_j = p_j / max{p_j : node j predicts outcome K}

where p_j, the per cent of predicting-set patterns predicted correctly by node j, is computed by p_j = S_j / P_j.

U_j and A_j are then used to compute the confidence factor of each node j:

CF_j = γ U_j + (1 − γ) A_j

where CF_j is the confidence factor of node j and γ ∈ [0, 1] is a weighting factor, normally set to 0.5.
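Given the three counters, the pruning statistics follow directly. A sketch under the stated definitions (grouping the counters by the outcome each node predicts is assumed):

```python
import numpy as np

def confidence_factors(C, P, S, gamma=0.5):
    # C, P, S: encoding, predicting and success counters for the group
    # of F2a nodes that predict the same outcome K.
    C, P, S = (np.asarray(v, dtype=float) for v in (C, P, S))
    usage = C / C.max()                                       # U_j
    pct = np.divide(S, P, out=np.zeros_like(S), where=P > 0)  # S_j / P_j
    accuracy = pct / max(pct.max(), 1e-12)                    # normalized A_j
    return gamma * usage + (1.0 - gamma) * accuracy           # CF_j
```

Nodes whose CF_j falls below the threshold τ (normally 0.5) then become candidates for removal.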

After the confidence factor of each node is determined, recognition categories can be pruned from the network using the following strategies.

Threshold pruning

This is the simplest type of pruning, in which the F2a nodes with confidence factors below a given threshold τ are removed from the network. The value of τ is normally selected as 0.5. This method of pruning is the fastest and provides a first-cut elimination of unwanted nodes. To avoid over-pruning, it is sometimes useful to specify the minimum number of recognition categories to be retained in the system.

Local pruning

Local pruning removes recognition categories one at a time from an ARTMAP network. The baseline system performance on the training and predicting sets is first determined. The algorithm then deletes the recognition category with the lowest confidence factor; i.e., the hybrid system is first pruned using threshold pruning, and local pruning is then applied to the remaining smaller set of rules. The category is replaced, however, if its removal degrades system performance on the training and predicting sets.

Simulation Results

Simulation was carried out on a set of medical databases7, i.e., the echocardiogram database.

Echocardiogram database

The database gives a data set of 113 persons who have suffered a heart attack at some point in the past; some are still alive and some are not. The two input attributes "survival months" and "still alive", taken together, indicate whether a patient survived for at least one year following the heart attack. This data set is used to predict whether a patient will survive (ALIVE) or not (DEAD) for at least one year after the heart attack.

The database contains 9 input attributes and 2 output attributes. The initial rule base of 8 rules distinguishes between persons who are alive and dead. Input attributes (9): Survival months, Still alive, Age, Pericardial fluid, Fractional shortening, Epss, Lvdd, Wall-motion score, Wall-motion index. Output attributes (2): Alive, Dead. The initial rule base set is shown in Table 1.

The first stage of operation is the rule insertion stage. This stage sets up a symbol table which has a unique entry for each attribute. Initially, the rule base set is inserted into the network by coding the attributes as shown in Table 2.

The thermometer type of coding method is used for input presentation1,8. After inserting the rules, the network is trained using different train-to-test ratios. The simulation results showed that, with all rules inserted or with fewer rules inserted, the network attains 100% test set prediction for the maximum training set in one iteration, as indicated in Tables 3a and 3b.

From the simulation results shown in Table 3b, it is seen that with fewer than 50% training samples the network gains more than 80% test set prediction. With a little more increase in the training samples, the network reaches 100% test set prediction.

The simulated results for the echocardiogram samples for two different rule insertions are plotted in Graphs 1 and 2.

With ρa set to 0.5, a limited number of nodes is created and the network gains 100% accuracy. If ρa is selected too low, many output classes combine into a single class due to the smaller number of nodes; as a result, the test set prediction is lower. If ρa is increased, identical output attributes select different nodes, and the accuracy of the test set prediction thereby increases. The simulation is carried out with 50 training samples, as shown in Table 4.

Table 1 - Sample of rule base inserted into the network for echocardiogram database

R1: If 0≤Sm<22 and Still alive is 1 and age<40 and Fluid is 0 and 0≤Fs<0.23 and 11≤Epss<31 and 3≤Lvdd<5 and 8≤Wms<15 and 1≤Wmi<2 then alive
R2: If 0≤Sm<22 and Still alive is 0 and age<40 and Fluid is 0 and 0.23≤Fs≤0.6 and 0≤Epss<11 and 3≤Lvdd<5 and 8≤Wms<15 and 1≤Wmi<2 then dead
R3: If 0≤Sm<22 and Still alive is 1 and 40≤age<60 and Fluid is 1 and 0≤Fs<0.23 and 11≤Epss<31 and 5≤Lvdd≤7 and 15≤Wms≤40 and 2≤Wmi≤3 then alive
R4: If 0≤Sm<22 and Still alive is 1 and 40≤age<60 and Fluid is 0 and 0≤Fs<0.23 and 11≤Epss<31 and 3≤Lvdd<5 and 8≤Wms<15 and 1≤Wmi<2 then alive
R5: If 22≤Sm<58 and Still alive is 0 and age<40 and Fluid is 0 and 0.23≤Fs≤0.6 and 0≤Epss<11 and 3≤Lvdd<5 and 8≤Wms<15 and 1≤Wmi<2 then dead
R6: If 22≤Sm<58 and Still alive is 1 and age<40 and Fluid is 0 and 0.23≤Fs≤0.6 and 0≤Epss<11 and 5≤Lvdd≤7 and 15≤Wms≤40 and 2≤Wmi≤3 then dead
R7: If 22≤Sm<58 and Still alive is 0 and 60≤age≤90 and Fluid is 0 and 0.23≤Fs≤0.6 and 0≤Epss<11 and 3≤Lvdd<5 and 8≤Wms<15 and 1≤Wmi<2 then dead
R8: If 0≤Sm<22 and Still alive is 0 and 40≤age<60 and Fluid is 1 and 0.23≤Fs≤0.6 and 0≤Epss<11 and 5≤Lvdd≤7 and 15≤Wms≤40 and 2≤Wmi≤3 then dead


When the value of ρa is low, the number of nodes is small, so output classes combine together, reducing the test set accuracy. With a high value of ρa, the number of nodes increases and the same classes are associated with different nodes, thereby increasing the test set prediction efficiency.

The variation of accuracy and of the number of nodes created with the vigilance parameter is plotted in Graphs 3 and 4.

After testing, pruning is done to remove the unused nodes in the F2a layer, which does not affect the classification accuracy. Pruning thus gives a limited number of nodes, removing unwanted nodes from the network and thereby reducing a complex network to a simple one. The rules extracted from the pruned network are listed in Table 5.

From Table 5, it is seen that Rule 1 is a generalization over the missing rule R6, as the attributes in the rule are sufficient to identify the output class.

Table 2 - The coding of input attributes for echocardiogram data samples

S.No.  Attribute name          Number of bits  Interval
1.     Survival months         2               (0-22), (22-57)
2.     Still alive             2               0, 1
3.     Age                     3               <40, 40-60, >60
4.     Pericardial fluid       2               0, 1
5.     Fractional shortening   2               (0-0.23), (0.23-0.6)
6.     EPSS                    2               (0-11), >11
7.     LVDD                    2               (3-5), (5-7)
8.     Wall-motion score       2               (8-15), (15-40)
9.     Wall-motion index       2               (1-2), (2-3)

Graph 1 - Training samples versus test set accuracy for echocardiogram data with 8 rules inserted

Graph 2 - Training samples versus test set accuracy for echocardiogram data with 4 rules inserted

Table 3a - Simulated results for echocardiogram data set with 8 rules inserted, ρa = 0.5

S.No.  Train/Test set  % Test set prediction  Nodes/Rules
1.     0/113           30.9725                5/2
2.     25/88           76.1364                6/3
3.     50/63           80.9524                8/4
4.     75/38           84.2105                8/4
5.     100/13          100                    10/5

Table 3b - Simulated results for echocardiogram data set with 4 rules inserted, ρa = 0.5

S.No.  Train/Test set  % Test set prediction  Nodes/Rules
1.     0/113           56.6372                5/2
2.     25/88           68.2540                5/3
3.     50/63           76.1364                7/3
4.     75/38           76.3158                9/3
5.     100/13          100                    11/4

Table 4 - Variation of test set prediction and nodes with vigilance parameter for echo samples

S.No.  Training samples/ρa  Test set prediction (%)
1.     50/0.5               79.3651
2.     50/0.7               93.6308

Graph 3 - Variation of accuracy with increase in vigilance parameter for echocardiogram dataset

Graph 4 - Creation of nodes in the F2a layer with increase in vigilance parameter for echocardiogram data samples


From the simulation carried out using the cascade ARTMAP architecture, the network training time is low, as it gains 100% prediction within one iteration. The accuracy also reaches 100 per cent, irrespective of the other neural network architectures compared.

The simulation results of the cascade ARTMAP architecture with rule insertion, with the input data converted to fuzzy membership values ranging between 0 and 1, are shown for the heart disease training samples in Table 6.

According to row 1 in Table 6, the interpretation of the rule is shown below.

If age is very low to very high and sex is very low to very high and CP is very low
and TBPS is very low to very high and Chol is very low to medium and BS is very low
and Recg is very low to very high and Thalac is very low to medium and Exang is very low
and Old pk is very low to high
and Slope is very low to high and Ca is very low to very high and Thal is very low
then heart may be healthy

Table 5 - Extracted rules from echocardiogram data set

Rule 1: If 0<SM<22 and Still Alive is 0 and AGE<90 and Fluid is 0 and 0≤Short<0.23 and 3<LVDD<7 and 8≤WS<40 and 1<WI<3 then Dead
Rule 2: If 0<SM<22 and Still Alive is 1 and 40≤AGE<90 and Fluid is 0 and 0≤Short<0.23 and 3<LVDD<7 and 8≤WS<40 and 1<WI<3 then Alive
Rule 3: If 0<SM<22 and Still Alive is 1 and AGE<90 and Fluid is 1 and 0≤Short<0.23 and 3<LVDD<7 and 8≤WS<40 and 1<WI<3 then Alive
Rule 4: If 22≤SM<57 and Still Alive is 0 and 40≤AGE<90 and Fluid is 0 and 0.23≤Short<0.6 and 0≤EPSS<11 and 3<LVDD<7 and 8≤WS<40 and 1<WI<3 then Dead

Fuzzy ARTMAP Architecture9,10

Fuzzy ARTMAP, a generalization of binary ARTMAP, learns to classify inputs by a pattern of fuzzy membership values between zero and one, indicating the extent to which each feature is present.

The fuzzy ARTMAP system incorporates two fuzzy ART modules, ARTa and ARTb, that are linked together via an inter-ART module F^ab called the map field. The map field forms predictive associations between ARTa and ARTb categories. In classification tasks, each node in the ARTa field F2a codes a recognition category of the ARTa input patterns. During training, each such node learns to predict an ARTb category, as shown in Fig. 4. The interactions mediated by the map field F^ab are characterized as follows.

Inputs to ARTa and ARTb are in complement-coded form: for ARTa, I = A = (a, a^c), and for ARTb, I = B = (b, b^c). This process of complement coding is called normalization. Normalization of fuzzy ART inputs prevents category proliferation: it uses on-cells and off-cells to represent the input pattern and preserves individual feature amplitudes while normalizing the total on-cell/off-cell vector.

The complement-coded F0-to-F1 input I is the 2M-dimensional vector

I = A = (a, a^c) = (a_1, a_2, ..., a_M, a_1^c, ..., a_M^c), where a_i^c = 1 − a_i

A complement-coded input is automatically normalized because

|I| = |(a, a^c)| = Σ_{i=1}^{M} a_i + (M − Σ_{i=1}^{M} a_i) = M
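This normalization property is easy to check numerically (a minimal sketch, not from the paper):

```python
import numpy as np

a = np.array([0.2, 0.9, 0.5])       # any M-dimensional input in [0, 1]
I = np.concatenate([a, 1.0 - a])    # complement coding I = (a, a^c)
assert np.isclose(I.sum(), len(a))  # |I| = M regardless of a
```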

For ARTa, x^a denotes the F1a output vector, y^a denotes the F2a output vector, and w_j^a denotes the jth ARTa weight vector. For ARTb, x^b denotes the F1b output vector, y^b denotes the F2b output vector, and w_k^b denotes the kth ARTb weight vector. For the map field, x^ab denotes the F^ab output vector and w_j^ab denotes the weight vector from the jth F2a node to F^ab.

Table 6 - Sample set of 5 rules extracted from heart disease samples
[Interpretation of quantized weight values: 1 - very low, 2 - low, 3 - medium, 4 - high, 5 - very high]

Pred.  Age  Sex  CP   TBPS  Chol  BS   Recg  Thalac  Exang  Old pk  Slope  Ca   Thal
+      1-5  1-5  1-1  1-5   1-3   1-1  1-5   1-3     1-1    1-4     1-4    1-5  1-1
+      1-4  1-1  1-1  1-4   1-4   1-5  1-1   1-3     1-5    1-5     1-4    1-5  1-1
-      1-4  1-1  1-5  1-5   1-1   1-5  1-1   1-4     1-1    1-5     1-4    1-5  1-1
-      1-1  1-1  1-1  1-5   1-5   1-5  1-5   1-4     1-4    1-5     1-4    1-1  1-1
-      1-4  1-1  1-1  1-4   1-4   1-5  1-1   1-3     1-5    1-5     1-4    1-5  1-


Fig. 4 - Fuzzy ARTMAP processing

ART Field Activity Vector

Each system includes a field F0 of nodes that represents the current input vector and a field F1 that receives both bottom-up input from F0 and top-down input from a field F2 that represents the active code, or category.

Vector I denotes the F0 activity, vector x denotes the F1 activity and vector y denotes the F2 activity. Associated with each F2 category node j is a vector w_j of adaptive weights, or long-term memory (LTM) traces.

Initially, w_j1(0) = ... = w_jM(0) = 1, which means that each category is uncommitted. After a category codes its first input, it becomes committed. The dynamics depend on a choice parameter α > 0, a learning rate parameter β ∈ [0, 1] and a vigilance parameter ρ ∈ [0, 1]. For each input I and F2 node j, the choice function T_j is defined by

T_j(I) = |I ∧ w_j| / (α + |w_j|)

where the fuzzy intersection ∧ is defined by (p ∧ q)_i = min(p_i, q_i) and the norm |·| is defined by |p| = Σ_{i=1}^{M} |p_i|.

The system makes a category choice when at most one F2 node can become active at a given time. The index J denotes the chosen category, where T_J = max{T_j : for all F2 nodes j}. If more than one T_j is maximal, the F2 category with the smallest index is selected. When the Jth category is chosen, y_J = 1 and y_j = 0 for j ≠ J.

Resonance occurs if the match function m_J of node J meets the vigilance criterion

m_J = |I ∧ w_J| / |I| ≥ ρ

If this criterion is not satisfied, a mismatch reset occurs, and the value of the choice function T_J is set to 0 for the duration of the input presentation. The search process continues until a chosen category J satisfies the matching criterion. Once the search ends, the weight vector w_J learns according to the equation

w_J(new) = (1 − β) w_J(old) + β (I ∧ w_J(old))

Map field activation

The map field is activated whenever one of the ARTa or ARTb categories is active. The F^ab output vector x^ab obeys:

x^ab = y^b ∧ w_J^ab   if the Jth F2a node is active and F2b is active
x^ab = w_J^ab         if the Jth F2a node is active and F2b is inactive
x^ab = y^b            if no F2a node is active and F2b is active
x^ab = 0              if no F2a node is active and F2b is inactive

Thus, if the prediction made by w_J^ab is disconfirmed by y^b, then x^ab = 0. This mismatch triggers an ARTa search for a better category through the match tracking process listed below.
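The four-way map field rule translates directly into code. A hedged sketch (the boolean activity flags are an assumed calling convention):

```python
import numpy as np

def map_field_activity(y_b, w_J_ab, a_active, b_active):
    # x^ab = y^b ^ w_J^ab  if the F2a node J and F2b are both active
    #      = w_J^ab        if only node J is active
    #      = y^b           if only F2b is active
    #      = 0             otherwise
    if a_active and b_active:
        return np.minimum(y_b, w_J_ab)
    if a_active:
        return w_J_ab.copy()
    if b_active:
        return y_b.copy()
    return np.zeros_like(y_b)
```

An all-zero x^ab signals a disconfirmed prediction and triggers the match tracking search.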

Match tracking

At the beginning of each input presentation, the ARTa vigilance ρ_a equals a baseline vigilance, and ρ_ab denotes the map field vigilance. If

|x^ab| / |y^b| < ρ_ab

then ρ_a is increased until it is slightly greater than |A ∧ w_J^a| / |A|, where A is the input to F1a in complement-coded form (a, a^c). This search leads to the activation of another F2a node J with

|A ∧ w_J^a| / |A| ≥ ρ_a and |y^b ∧ w_J^ab| / |y^b| ≥ ρ_ab

If no such node exists, F2a is shut down for the rest of the input presentation.

Map field learning

Initially, the weights w_jk^ab from F2a to F^ab satisfy w_jk^ab(0) = 1. During resonance with the ARTa category J active, w_J^ab approaches the map field vector x^ab. With fast learning, once J learns to predict the ARTb category K, that association is permanent, i.e., w_JK^ab = 1 for all time.

Rule Extraction Procedure5,6

Rule extraction in fuzzy ARTMAP proceeds in two phases:


Quantization of weight values

Quantization is the method used to describe the rules in words rather than real numbers; for this purpose, the feature values represented by the weights w_ij are quantized.

A quantization level Q is defined as the number of feature values used in the extracted fuzzy rules. For example, with Q = 3, the feature values are described as low, medium and high in the fuzzy rules. There are two methods of quantization.

(i) Quantization by Truncation

Divide the range [0, 1] into Q intervals and assign a quantization point to the lower bound of each interval, i.e., for q = 1 ... Q, where q is the quantization point, let V_q = (q − 1)/Q. When a weight w falls in interval q, reduce the value of w to V_q, as shown in Fig. 5a.

(ii) Quantization by Round-off

In this method of quantization, the Q quantization points are evenly distributed in the range [0, 1], with one at each end point, i.e., for q = 1 ... Q, V_q = (q − 1)/(Q − 1). When a weight w falls in interval q, the weight value w is rounded off to the nearest value V_q, as shown in Fig. 5b.

According to simulations done earlier, the results obtained using the two methods were similar; hence, for the simulations reported here, quantization by truncation is followed for the two sets of data samples taken.
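Both schemes map a weight in [0, 1] onto one of Q symbolic levels. A minimal sketch of the two methods (function names are illustrative):

```python
import numpy as np

def quantize_truncate(w, Q):
    # Assign each weight to the lower bound of its interval:
    # V_q = (q - 1) / Q for q = 1 .. Q.
    idx = np.minimum(np.floor(np.asarray(w, dtype=float) * Q), Q - 1)
    return idx / Q

def quantize_round(w, Q):
    # Q points spread evenly over [0, 1] including both end points:
    # V_q = (q - 1) / (Q - 1); round each weight to the nearest point.
    return np.round(np.asarray(w, dtype=float) * (Q - 1)) / (Q - 1)

# Example with Q = 3 (low / medium / high):
print(quantize_truncate([0.1, 0.5, 0.95], 3))  # [0.  0.3333  0.6667]
print(quantize_round([0.1, 0.5, 0.95], 3))     # [0.  0.5  1.]
```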

Fuzzy ARTMAP results

Echocardiogram data

This data set contains 113 samples, of which 30 samples decide whether a person is alive for at least one year following the heart attack and 83 samples decide whether a person is not alive for one year following the attack. The data set partitions are 72/40 and 40/40/33.

Fig. 5a - Truncation process

Table 7 shows the increase in the accuracy of the test set prediction as the number of training epochs is increased. The value of α is set to 0.01, a small value, so that learning is fast, and β is set to 1. Initially, the vigilance parameter is set to the low value 0; it is then automatically increased during match tracking. If the vigilance is initially set to a large value, the number of nodes created for a smaller number of training samples is high. The input data are converted into fuzzy membership values between 0 and 1 based on the π function (Ref. 11).

Table 8 indicates that pruning yields 72.5% training set prediction and 72.5% and 62.5% test set prediction, and that with quantization (Q=5) the performance slightly degrades but remains tolerable. Pruning alone thus has better accuracy than pruning with quantization.

Thus, fuzzy ARTMAP takes a larger training time, about 3-5 epochs, to gain a maximum accuracy of 91.66%, which is lower than that of cascade ARTMAP, and its training time is also higher. When compared with the BP approach, however, the training time is much shorter, the overall time taken for the execution of the samples is also much smaller, and the rules extracted from the network are much simpler.

The performance of fuzzy ARTMAP for an increasing number of epochs is shown in Graph 5.

Table 9 shows the quantized weights of the echocardiogram samples taken. According to this table, row 3 can be translated into rule form as below:

If SM is very low to very high and SA is very high

and age is very low to very high and fluid is very low to very high and short is low to very high and EPSS is very low to medium

Fig. 5b - Round-off process (arrows indicate the direction of quantization)

References

Related documents

SaLt MaRSheS The latest data indicates salt marshes may be unable to keep pace with sea-level rise and drown, transforming the coastal landscape and depriv- ing us of a

Although a refined source apportionment study is needed to quantify the contribution of each source to the pollution level, road transport stands out as a key source of PM 2.5

In a slightly advanced 2.04 mm stage although the gut remains tubular,.the yent has shifted anteriorly and opens below the 11th myomere (Kuthalingam, 1959). In leptocephali of

INDEPENDENT MONITORING BOARD | RECOMMENDED ACTION.. Rationale: Repeatedly, in field surveys, from front-line polio workers, and in meeting after meeting, it has become clear that

With an aim to conduct a multi-round study across 18 states of India, we conducted a pilot study of 177 sample workers of 15 districts of Bihar, 96 per cent of whom were

With respect to other government schemes, only 3.7 per cent of waste workers said that they were enrolled in ICDS, out of which 50 per cent could access it after lockdown, 11 per

Of those who have used the internet to access information and advice about health, the most trustworthy sources are considered to be the NHS website (81 per cent), charity

Angola Benin Burkina Faso Burundi Central African Republic Chad Comoros Democratic Republic of the Congo Djibouti Eritrea Ethiopia Gambia Guinea Guinea-Bissau Haiti Lesotho