Basics of Decision Tree
Parag Verma
9th April, 2021
Introduction to Decision Trees
A Decision Tree is a supervised machine learning algorithm that can be used for classification as well as regression problems. It uses a tree representation of the dataset at hand, where each record/case is assigned to a node of the tree. The arrangement is a tree-like structure, but with the root at the top. The tree is read from the root down to a leaf node, which contains the classification of the instance.
Simple Decision Tree
Let's consider a situation where we want to play tennis and we have a model to help us out here. Depending on the weather conditions, we will use a decision tree to predict whether we can play or not. A simple illustration is presented below.
- The tree classifies Sunday mornings according to whether they are suitable for playing tennis
- Classification is made by traversing from root node to leaf node
- Each node represents an attribute
- Each branch descending from the node represents one of the possible values of the attribute
- Sunny, Overcast and Rain are the unique values for 'Outlook'
- If Outlook = Sunny and Humidity = High, the instance is classified as a negative example
- The output expressions are presented only with respect to the yes cases:
- (Outlook = Sunny; Humidity = Normal )
- (Outlook = Overcast)
- (Outlook = Rain ; Wind = Weak)
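The traversal rules above can be hard-coded as a small function (a sketch; the function name `play_tennis` is our own, not from the article):

```python
def play_tennis(outlook, humidity, wind):
    """Classify an instance by walking the tree from root to leaf."""
    if outlook == "Sunny":
        # Sunny branch splits on Humidity
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        # Overcast is always a positive leaf
        return "Yes"
    if outlook == "Rain":
        # Rain branch splits on Wind
        return "Yes" if wind == "Weak" else "No"

print(play_tennis("Sunny", "High", "Weak"))   # -> No (negative example from above)
print(play_tennis("Overcast", "High", "Strong"))  # -> Yes
```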
ID3 Algorithm
Most implementations of decision trees are based on the ID3 algorithm. Salient features of ID3 are:
- A decision tree is constructed from top to bottom
- The root node is determined by the attribute that best splits the data
- A descendant of the root node is created for each possible value of the attribute, and the entire process is repeated
- It is called greedy as it never goes back to revisit previous choices
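The steps above can be sketched as a short recursion (a minimal illustration, not a production implementation; all function names are our own, and the attribute-selection criterion it uses is explained in the next section):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    """Build a tree top-down; rows are dicts of attribute -> value."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)   # greedy choice, never revisited
    tree = {attr: {}}
    for value in set(row[attr] for row in rows):      # one descendant per value
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        tree[attr][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != attr])
    return tree
```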
Which attribute to pick ?
The attribute that is most useful for classifying the examples should be used at a node. This is determined quantitatively by the Information Gain criterion. To understand Information Gain, we should first study Entropy, which measures the impurity of an arbitrary collection of examples.
Entropy
To understand entropy, keep the following points in mind:

- Given a sample S containing positive and negative examples, Entropy is given by
- Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)
- p+ is the proportion of positive examples in S and p- is the proportion of negative examples
- Suppose a sample of 14 cases contains 9 positive and 5 negative instances
- Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)
- Entropy(S) ≈ 0.94
- Entropy(S) = 0 when all members belong to the same class
- Entropy(S) = 1 when the two classes are present in equal proportion
- Entropy(S) lies between 0 and 1 when the sample contains unequal numbers of positive and negative cases
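The formula and the boundary cases above can be checked numerically (a small sketch; the `entropy` helper is our own):

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative cases."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 2))  # -> 0.94, matching the worked example
print(entropy(7, 0))            # -> 0.0, all members in one class
print(entropy(7, 7))            # -> 1.0, equal proportions
```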
Information Gain
Information Gain measures the expected reduction in entropy from partitioning a sample on an attribute. Entropy measures the impurity of a sample: the higher the impurity, the higher the entropy of the system. The algorithm uses Information Gain to determine which attribute to pick for splitting.
As a general rule of thumb, if we partition the sample on an attribute, there should be a reduction in entropy. The attribute that results in the maximum reduction in entropy (i.e. the maximum Information Gain) is used for splitting.
Gain(S, A) = Entropy(S) - Σv∈Values(A) (|Sv|/|S|) · Entropy(Sv), where A is an attribute and Sv is the subset of S for which A takes the value v
Dataset for our Analysis
The example that we discussed above about playing tennis on a Sunday morning is based on the dataset shown below. There are a total of 9 instances where we can play tennis and 5 instances where we can't.
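Assuming the table shown is the classic play-tennis dataset from Mitchell's Machine Learning (the 9 yes / 5 no split matches), it can be written out as follows (the variable names are our own):

```python
# Classic play-tennis dataset (assumed to match the table shown):
# each row is (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]

labels = [row[-1] for row in DATA]
print(labels.count("Yes"), labels.count("No"))  # -> 9 5
```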
Select Attribute for root node
Now let's determine which attribute should be selected to split the data first. The selected attribute should result in the maximum reduction in entropy or, equivalently, the maximum Information Gain. To start, we need the reference entropy of the system: the value of 0.94 calculated in the Entropy section.
Wind as Root Node
Let's select Wind as a candidate and calculate its information gain.
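As a sketch, assuming the classic play-tennis dataset (where Wind = Weak covers 6 yes / 2 no cases and Wind = Strong covers 3 yes / 3 no), the calculation from the gain formula looks like this:

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative cases."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c)

# Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
# assuming Weak has 6 yes / 2 no and Strong has 3 yes / 3 no
gain_wind = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(gain_wind, 3))  # -> 0.048
```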
Humidity as Root Node
Let's select Humidity as a candidate and calculate its information gain.
Outlook as Root Node
Let's select Outlook as a candidate and calculate its information gain.
Temp as Root Node
Let's select Temp as a candidate and calculate its information gain.
The information gain for each attribute is summarised in the table below. As shown, Outlook has the highest information gain and hence should be chosen as the root node.
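The per-attribute comparison can be reproduced end to end (a sketch, again assuming the classic play-tennis dataset; the column layout and names are our own):

```python
import math
from collections import Counter

# Classic play-tennis dataset (assumed), laid out column-wise
COLUMNS = {
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
}
LABELS = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(column, labels):
    """Gain(S, A) = Entropy(S) - sum over values of (|Sv|/|S|) * Entropy(Sv)."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

for attr, column in COLUMNS.items():
    print(attr, round(gain(column, LABELS), 3))
# Outlook ~0.247, Humidity ~0.152, Wind ~0.048, Temp ~0.029 -> Outlook wins
```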
First Layer of Decision Tree
Based on the selection of Outlook as the root node, we can create the first layer of branching as shown below.
Subsequent Branching
Let's say we want to further split the left-most branch (Sunny) into subsequent branches. We will use the same criteria for selecting attributes and calculating their Information Gain. Let's look at the calculations and identify the attribute to shortlist here.
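For the Sunny branch, the same calculation runs over just the five Sunny rows (a sketch, assuming the classic play-tennis dataset; 2 yes / 3 no in this branch):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The five Sunny rows (assumed dataset): (Temp, Humidity, Wind, PlayTennis)
SUNNY = [
    ("Hot",  "High",   "Weak",   "No"),
    ("Hot",  "High",   "Strong", "No"),
    ("Mild", "High",   "Weak",   "No"),
    ("Cool", "Normal", "Weak",   "Yes"),
    ("Mild", "Normal", "Strong", "Yes"),
]
labels = [row[-1] for row in SUNNY]

gains = {}
for idx, attr in [(0, "Temp"), (1, "Humidity"), (2, "Wind")]:
    column = [row[idx] for row in SUNNY]
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    gains[attr] = round(entropy(labels) - remainder, 3)

print(gains)
# Humidity splits the branch into pure groups (gain ~0.971), so it is picked next
```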
So by picking Humidity to split the left-most branch, the new tree would look as shown below.
As can be seen, the split resulted in branches whose cases all belong to only one category: either yes or no. This illustrates that the branching produces groups that are as homogeneous as possible.
Final Note
In this blog we looked at a general introduction to Decision Trees and how they use entropy and information gain to branch. We also used a working example to understand the splitting process in some detail. In the next blog, we will look at the implementation of a Decision Tree in R on a credit risk dataset.
Link to Previous Blogs
My Youtube Channel
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html