CS 621 Artificial Intelligence Lecture 12 – 30/08/05
Prof. Pushpak Bhattacharyya
Fundamentals of Information Theory
Training data (examples 1–6):

Weather (O)  Temp  Humidity  Windy  Decision
Sunny        High  High      F      N
Sunny        High  High      T      N
Cloudy       High  High      F      Y
Rain         Med   High      F      Y
Rain         Cold  Low       F      Y
Rain         Cold  Low       T      N
Training data (examples 7–11):

Weather (O)  Temp  Humidity  Windy  Decision
Sunny        Med   High      F      N
Sunny        Cold  Low       F      Y
Rain         Med   Low       F      Y
Sunny        Med   Low       T      Y
Cloudy       Med   High      T      Y
Decision tree learned from the data:

Outlook
├─ Sunny  → Humidity (High → No, Low → Yes)
├─ Cloudy → Yes
└─ Rain   → Windy (T → No, F → Yes)
Rule Base
R1: If outlook is sunny and humidity is high, then the decision is No.
R2: If outlook is sunny and humidity is low, then the decision is Yes.
R3: If outlook is cloudy, then the decision is Yes.
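The rule base above can be expressed directly as a small function. A minimal sketch (the function name and signature are my own; rules R1–R3 do not cover the rain branch, so unmatched cases return None):

```python
def decide(outlook, humidity):
    """Apply rules R1-R3 of the rule base. Returns 'Yes', 'No',
    or None when no rule fires (e.g. the rain branch is not covered)."""
    if outlook == 'sunny' and humidity == 'high':
        return 'No'    # R1
    if outlook == 'sunny' and humidity == 'low':
        return 'Yes'   # R2
    if outlook == 'cloudy':
        return 'Yes'   # R3
    return None
```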
Making Sense of Information
• Classification
• Clustering
• Giving a short and nice description
Short Description
Occam's Razor principle
(The shortest/simplest description is the best for generalization.)
Representation Language
• Decision tree.
• Neural network.
• Rule base.
• Boolean expression.
The example data presented as rows of attribute values and labels carries little
order or structure compared with the succinct descriptions (the decision tree and the rule base).
How do we define "information"?
The lack of structure in information is measured by "entropy".
Information & Entropy
Define the entropy of S (labeled data):

E(S) = - ( P+ log2 P+  +  P- log2 P- )

where
P+ = proportion of positively labeled data,
P- = proportion of negatively labeled data.
Example
P+ = 9/14, P- = 5/14
E(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
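The entropy computation above can be checked with a short script (the function name is my own):

```python
import math

def entropy(p_pos, p_neg):
    """Entropy in bits of a binary-labeled set, given the two proportions."""
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:            # 0 log 0 is taken as 0
            total -= p * math.log2(p)
    return total

print(round(entropy(9/14, 5/14), 3))  # ≈ 0.940
```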
Partitioning the Data
Take "Windy" as the attribute; Windy ∈ {T, F}.
Windy = T: partition the data.
Partitioning by focusing on a particular attribute produces an
"information gain", i.e., a
"reduction in entropy".
Partitioning the Data (Contd)
Information gain when we partition on Windy ∈ {T, F}:
Windy = T: P+ = 6, P- = 2
Windy = F: P+ = 3, P- = 3
Partitioning the Data (Contd)

Windy
├─ T: 6+, 2-
└─ F: 3+, 3-
Partitioning the Data (Contd)

Gain(S, A) = E(S) - Σ_{v ∈ values(A)} ( |Sv| / |S| ) E(Sv)

E(S) ≈ 0.940
E(Windy=T) = - (6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811
Partitioning the Data (Contd)
E(Windy=F) = - (3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
Partitioning the Data (Contd)
Gain(S, Windy) = 0.940 - ( (8/14) × 0.811 + (6/14) × 1.0 ) ≈ 0.048
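This arithmetic can be verified with a small sketch, using the counts given in the slides (the helper names are my own):

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            e -= p * math.log2(p)
    return e

def gain(splits):
    """Information gain of a split, given (pos, neg) counts per attribute value."""
    pos = sum(p for p, n in splits)
    neg = sum(n for p, n in splits)
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder

# Windy = T: 6+, 2-;  Windy = F: 3+, 3-  (counts from the slides)
print(round(gain([(6, 2), (3, 3)]), 3))  # ≈ 0.048
```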
Exercise: Find the information gain for each attribute:
Outlook, Temp, Humidity, and Windy.
ID3 Algorithm
Calculating the gain for every attribute, choosing the one with maximum gain at each node, and recursing until the partitions are pure is the "ID3" algorithm for building a decision-tree classifier.
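The recursive scheme just described can be sketched in a few lines of Python. This is a minimal, illustrative version (all names and the data representation are my own; ties in gain are broken by attribute order):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(rows, labels, attr):
    """Information gain of splitting `rows` (dicts of attribute values) on `attr`."""
    total = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += len(sub) / total * entropy(sub)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Return a decision tree: either a label, or (attribute, {value: subtree})."""
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        branches[v] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return (best, branches)

# Tiny illustrative dataset (not the lecture's full table):
rows = [
    {'Outlook': 'Sunny',  'Humidity': 'High'},
    {'Outlook': 'Sunny',  'Humidity': 'Low'},
    {'Outlook': 'Cloudy', 'Humidity': 'High'},
    {'Outlook': 'Cloudy', 'Humidity': 'Low'},
    {'Outlook': 'Rain',   'Humidity': 'High'},
]
labels = ['N', 'N', 'Y', 'Y', 'Y']
tree = id3(rows, labels, ['Outlook', 'Humidity'])
print(tree)  # root splits on Outlook for this toy data
```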
Origin of Information Theory
1) C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948.
2) T. M. Cover and J. A. Thomas, "Elements of Information Theory", 1991.
Motivation with the example of a horse race: 8 horses, h1, h2, ..., h8.
Person P would like to bet on one of the horses.
The horses have the following probabilities of winning:
Example
• h1 = 1/2    h5 = 1/64
• h2 = 1/4    h6 = 1/64
• h3 = 1/8    h7 = 1/64
• h4 = 1/16   h8 = 1/64
Example (Contd 1)
P must send a message specifying the horse on which to bet.
If the situation is "unbiased", i.e., all horses have equal probability of winning, then we need 3 binary digits (bits), since 2^3 = 8.
Example (Contd 2)
Compute the bias:

E(S) = - Σ_{i=1..8} Pi log2 Pi,  where Pi = probability of hi winning

E(S) = - ( (1/2) log2(1/2) + (1/4) log2(1/4) + (1/8) log2(1/8)
         + (1/16) log2(1/16) + 4 × (1/64) log2(1/64) )
     = 1/2 + 2/4 + 3/8 + 4/16 + 24/64
     = 2 bits
Example (Contd 3)
On the average we do not need more than 2 bits to communicate the desired horse.
What is the actual length of the code?
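One prefix code achieving this bound assigns each horse a codeword of length -log2 Pi. A sketch (the specific codewords are one valid choice, not given in the lecture):

```python
# A prefix code whose codeword lengths equal -log2(p) for each horse:
code = {
    'h1': '0',        # p = 1/2  -> 1 bit
    'h2': '10',       # p = 1/4  -> 2 bits
    'h3': '110',      # p = 1/8  -> 3 bits
    'h4': '1110',     # p = 1/16 -> 4 bits
    'h5': '111100',   # p = 1/64 -> 6 bits
    'h6': '111101',
    'h7': '111110',
    'h8': '111111',
}
prob = {'h1': 1/2, 'h2': 1/4, 'h3': 1/8, 'h4': 1/16,
        'h5': 1/64, 'h6': 1/64, 'h7': 1/64, 'h8': 1/64}

# Expected code length = sum of p_i * length_i
avg_len = sum(prob[h] * len(code[h]) for h in code)
print(avg_len)  # 2.0 -- matches the entropy of the distribution
```

No codeword is a prefix of another, so a sequence of bets can be decoded unambiguously.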
Example 2 (Letter Guessing Game)
A "20 Questions"-style game over the letters:

Letter:       p    t    k    a    i    u
Probability:  1/8  1/4  1/8  1/8  1/4  1/8
E(S) = - Σ_{i ∈ {p, t, k, a, i, u}} Pi log2 Pi = 2.5

On the average we need no more than 2.5 questions.
Design a code:

Letter: p    t   k    a    i   u
Code:   100  00  101  110  01  111

Q1) Is the letter t or i?
Q2) Is it a consonant?
Expected number of questions = Σ Pi × Ni,
where Ni = number of questions for situation i.
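Taking Ni as the length of each letter's codeword in the code designed above, the expected number of questions can be checked directly:

```python
# Expected number of questions (= expected code length) for the letter game
probs = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/8, 'i': 1/4, 'u': 1/8}
code  = {'p': '100', 't': '00', 'k': '101', 'a': '110', 'i': '01', 'u': '111'}

expected = sum(probs[c] * len(code[c]) for c in probs)
print(expected)  # 2.5 -- equals the entropy of the distribution
```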
What has all this got to do with AI? Why entropy?
Why design codes?
Why communicate?
Bridge
Intelligent information processing involves multiparty participation.
Information theory sets up theoretical limits on communicability.
Summary
• A haphazard presentation of data is not acceptable to the mind.
• Focusing attention on an attribute automatically leads to information gain.
• Defined entropy.
• In parallel, defined information gain.
• Related this to message communication.