An introduction to decision trees

Viswam PC
5 min read · Sep 18, 2022

A decision tree computes a Boolean function: its output can be either '0 or 1', '-1 or +1', or '-1, 0 or +1'. The size of a decision tree is the number of nodes in it, and its depth is the length of the longest path from the root to a leaf.
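As a quick illustration (not from the original article), here is a minimal Python sketch of a Boolean decision tree with its size and depth computed recursively; the Node class and its field names are my own choices:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None   # index of the feature to test (None for a leaf)
    left: Optional["Node"] = None   # subtree followed when the feature is 0
    right: Optional["Node"] = None  # subtree followed when the feature is 1
    label: Optional[int] = None     # output at a leaf, e.g. 0/1 or -1/+1

def size(node: Optional[Node]) -> int:
    """Size = number of nodes in the tree."""
    if node is None:
        return 0
    return 1 + size(node.left) + size(node.right)

def depth(node: Optional[Node]) -> int:
    """Depth = length of the longest root-to-leaf path (a single leaf has depth 0)."""
    if node is None or node.feature is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

# x[0] AND x[1] as a tree: test x[0]; if it is 1, test x[1].
tree = Node(feature=0,
            left=Node(label=0),
            right=Node(feature=1, left=Node(label=0), right=Node(label=1)))
print(size(tree), depth(tree))  # 5 nodes, depth 2
```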

Error rate:

A training set is a collection of labeled examples given to a model; the more a model is trained on it, the lower its error rate.

Training set S = { (x¹, y¹), (x², y²), …, (xᵐ, yᵐ) }
where m is the number of training samples and each xⁱ is a vector of n features.

Error rate (training error) is mathematically equal to: number of samples the model got wrong while training / total number of samples.
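A minimal sketch of this formula in Python, assuming a model is just a function from a feature vector to a predicted label (the toy samples below are made up for illustration):

```python
def training_error(model, samples):
    """samples is a list of (x, y) pairs; returns #wrong / total."""
    wrong = sum(1 for x, y in samples if model(x) != y)
    return wrong / len(samples)

# Toy training set: each x is a tuple of n features, y is its label.
S = [((0, 1), 1), ((1, 0), 1), ((1, 1), 0), ((0, 0), 0)]

def always_one(x):  # a trivial constant model
    return 1

print(training_error(always_one, S))  # 2 wrong out of 4 -> 0.5
```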

The root of the tree and error rate:

Here, 'root' refers to the node whose label we are currently deciding; in a trivial tree consisting of a single node, that node's label is the output of the entire model.

The potential function (Ø, phi) is used to decide what value should be placed at the root of the tree.

Ø(a) = min(a, 1 - a)

Suppose we provide a model with 10 positive examples and 5 negative examples, and we want the error rate of labeling node X¹ with Y = 0.

To output Y = 0, we must get the 10 positive examples wrong. Since the error rate is (number of samples got wrong / total samples), it is 10/(10 + 5) = 2/3.

And Ø(2/3) = min(2/3, 1/3) = 1/3.

This is the error rate of the best trivial decision tree with a single node: labeling the node with the majority class (Y = 1 here) misclassifies only the 5 negatives, for an error rate of 1/3.
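The same calculation as a short Python sketch (the name phi is just a stand-in for Ø):

```python
def phi(a):
    """Potential function: error of the best constant prediction."""
    return min(a, 1 - a)

pos, neg = 10, 5
p = pos / (pos + neg)   # error rate of always predicting Y = 0: 10/15 = 2/3
print(phi(p))           # min(2/3, 1/3) = 1/3, the best single-node error
```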

Generally, error rate = Pr( Y = 0 ) taken over samples (x, y) ~ S, and the trivial tree's error is Ø of this probability.

Eg 2: Suppose a decision tree node X¹ is supplied with 12 positive and 4 negative examples, and we want the error rate of labeling it Y = 0.

To output Y = 0, we must get the 12 positive examples wrong. So, rate = 12/16 = 3/4.

And finally, using the potential function, Ø(3/4) = min(3/4, 1/4) = 1/4.

Such a calculation of the error rate is based on a single literal. If instead we split the node on a literal, say X¹, then the new error rate is:

Pr(X¹ = 0) · Ø( Pr( Y = 0 | X¹ = 0 ) ) + Pr(X¹ = 1) · Ø( Pr( Y = 0 | X¹ = 1 ) )

with all probabilities taken over (x, y) ~ S.
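A sketch of this weighted, conditional error rate in Python, assuming samples is a list of (x, y) pairs with Boolean features (the names here are illustrative):

```python
def new_error_rate(samples, i, phi=lambda a: min(a, 1 - a)):
    """Error rate after splitting on feature i, weighted by branch size."""
    total = len(samples)
    value = 0.0
    for v in (0, 1):                        # the two branches X_i = 0 and X_i = 1
        branch = [(x, y) for x, y in samples if x[i] == v]
        if not branch:
            continue
        p_branch = len(branch) / total      # Pr[X_i = v]
        p_y0 = sum(1 for _, y in branch if y == 0) / len(branch)  # Pr[Y=0 | X_i=v]
        value += p_branch * phi(p_y0)
    return value
```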

So, the gain of X¹ is calculated as the old error rate minus the new error rate after splitting on X¹.

We always place the literal with the highest gain at the root of the tree (or subtree).

Deeper into the potential function, the Gini function:

The structure of a tree is determined by its potential function, Ø (phi). The potential function Ø(a) = min(a, 1 - a) corresponds to the error rate of the model.

The Gini function is 2 · a · (1 - a). It is an upper bound on the potential function: 2 · a · (1 - a) ≥ min(a, 1 - a) for every a in [0, 1]. The smaller its value, the smaller the error in training.
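A quick numeric check of this upper-bound claim, sampling a across [0, 1]:

```python
# The Gini function 2a(1-a) upper-bounds min(a, 1-a) everywhere on [0, 1].
for k in range(11):
    a = k / 10
    assert 2 * a * (1 - a) >= min(a, 1 - a) - 1e-12
    print(f"a={a:.1f}  min={min(a, 1 - a):.3f}  gini={2 * a * (1 - a):.3f}")
# The two coincide at a = 0, 1/2, and 1.
```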

Calculation and examples:

Sample model, S =

+----+----+---------+---------+
| X¹ | X² | +ve Eg. | -ve Eg. |
+----+----+---------+---------+
| 0  | 0  |    1    |    1    |
| 0  | 1  |    2    |    1    |
| 1  | 0  |    3    |    1    |
| 1  | 1  |    4    |    2    |
+----+----+---------+---------+

Ø(a) = 2 · a · (1 - a)

1. Calculate Ø( Pr( -ve Eg ) ) for the whole sample S:

Total number of negative examples = 5
Total number of examples = 15
Probability of a negative example = 5/15 = 1/3
Now, Ø(1/3) = 2 · 1/3 · 2/3 = 4/9.

This is the potential of the trivial tree: 4/9.
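The same computation in Python (the rows list simply transcribes the table above):

```python
rows = [  # (x1, x2, positives, negatives)
    (0, 0, 1, 1),
    (0, 1, 2, 1),
    (1, 0, 3, 1),
    (1, 1, 4, 2),
]

def gini(a):
    return 2 * a * (1 - a)

neg = sum(r[3] for r in rows)            # 5 negative examples
total = sum(r[2] + r[3] for r in rows)   # 15 examples in all
print(gini(neg / total))                 # 2 · 1/3 · 2/3 = 4/9 ≈ 0.444
```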

In order to decide whether X¹ or X² should be placed at the root of the tree, we must compute the new potential value for each of them individually.

First, for X¹:

Pr(X¹ = 0) · Ø( Pr( -ve Eg | X¹ = 0 ) ) + Pr(X¹ = 1) · Ø( Pr( -ve Eg | X¹ = 1 ) )

1. Pr(X¹ = 0) =
(number of samples where X¹ is 0) / (total number of samples) = 5/15 = 1/3.

2. Pr( -ve Eg | X¹ = 0 ) =
(number of -ve Eg where X¹ is 0) / (number of Eg where X¹ is 0) = 2/5.
Ø(2/5) = 2 · 2/5 · 3/5 = 12/25

3. Pr(X¹ = 1) = 10/15 = 2/3
4. Pr( -ve Eg | X¹ = 1 ) = 3/10
Ø(3/10) = 2 · 3/10 · 7/10 = 21/50

Now, the entire value comes to
1/3 · 12/25 + 2/3 · 21/50 = 4/25 + 7/25 = 11/25

Comparing this 11/25 with the 4/9 obtained earlier, we see that 11/25 is slightly smaller than 4/9.

Now, computing the same thing for X²:

Pr(X² = 0) · Ø( Pr( -ve Eg | X² = 0 ) ) + Pr(X² = 1) · Ø( Pr( -ve Eg | X² = 1 ) )

1. Pr(X² = 0) =
(number of samples where X² is 0) / (total number of samples) = 6/15 = 2/5.

2. Pr( -ve Eg | X² = 0 ) =
(number of -ve Eg where X² is 0) / (number of Eg where X² is 0) = 2/6 = 1/3.
Ø(1/3) = 2 · 1/3 · 2/3 = 4/9

3. Pr(X² = 1) = 9/15 = 3/5
4. Pr( -ve Eg | X² = 1 ) = 3/9 = 1/3
Ø(1/3) = 2 · 1/3 · 2/3 = 4/9

Now, the entire value comes to
2/5 · 4/9 + 3/5 · 4/9 = 4/9

This is exactly the same value, 4/9, that we obtained earlier.

Now, if we calculate the gain:
Gain = older value - newer value

Gain of X¹ = 4/9 - 11/25 = 1/225 > 0
Gain of X² = 4/9 - 4/9 = 0

So, the literal we must pick for the root is X¹, since splitting on X¹ gives our model some gain.
The higher the gain, the lower the error rate.
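Putting the whole gain calculation together in one sketch; split_value is a hypothetical helper name, and the numbers reproduce the hand calculation above:

```python
rows = [(0, 0, 1, 1), (0, 1, 2, 1), (1, 0, 3, 1), (1, 1, 4, 2)]  # (x1, x2, +ve, -ve)

def gini(a):
    return 2 * a * (1 - a)

def split_value(rows, i):
    """Weighted Gini value after splitting on feature i (0 for x1, 1 for x2)."""
    total = sum(p + n for _, _, p, n in rows)
    value = 0.0
    for v in (0, 1):
        branch = [r for r in rows if r[i] == v]
        size = sum(p + n for _, _, p, n in branch)
        p_neg = sum(n for _, _, _, n in branch) / size
        value += (size / total) * gini(p_neg)
    return value

old = gini(5 / 15)                 # 4/9, the trivial tree's value
print(old - split_value(rows, 0))  # gain of x1: 4/9 - 11/25 = 1/225 > 0
print(old - split_value(rows, 1))  # gain of x2: 4/9 - 4/9 = 0 (up to float rounding)
```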

When does our algorithm end?

Since our algorithm is recursion based, it would keep on splitting nodes forever unless we give it a stopping condition.

To give our algorithm an endpoint, we have several options:

1. Stop the algorithm when the gain for every literal becomes extremely small. This is the most common way of ending the algorithm.

2. By pruning: pruning is the process of removing certain nodes from the tree. It lets us re-evaluate the tree and recalculate its output.

3. By smart construction of a decision tree:

○ By constructing a tree using the random-forest algorithm.

In a random forest, we construct many small independent decision trees and take the most common output among all the small trees.

We create each small tree by randomly subsampling the original data along with a random subset of the features, so each small tree sees randomly selected points and randomly chosen features.
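A minimal sketch of this random-forest idea, assuming train_tree is some decision-tree learner supplied by the caller (a hypothetical stand-in; any learner, such as the greedy gain-based one above, would do):

```python
import random
from collections import Counter

def random_forest_predict(samples, x, train_tree, n_trees=25):
    """Train n_trees small trees on random rows/features and majority-vote on x."""
    votes = []
    n, d = len(samples), len(samples[0][0])
    for _ in range(n_trees):
        rows = [random.choice(samples) for _ in range(n)]   # bootstrap the rows
        feats = random.sample(range(d), max(1, d // 2))     # random feature subset
        tree = train_tree(rows, feats)                      # one small tree (a callable)
        votes.append(tree(x))
    return Counter(votes).most_common(1)[0][0]              # most common output wins
```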
