
Random forest

What it is

A *Random (decision) Forest* is an ensemble learning method used for both classification and regression. It is based on a brilliant yet simple idea: training a bunch of decision trees together to improve on the accuracy reachable with a single tree and to address its problems, for example overfitting.

How does it work

In the case of a *classification* problem, the output of a forest is the mode of the classes output by the individual trees, that is, the most frequent one; in the case of a *regression* problem, the output is the average of the predictions of the individual trees.
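The two aggregation rules can be sketched in a few lines; the five "tree predictions" below are made-up values for illustration:

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_predictions):
    """Classification: majority vote, i.e. the mode of the trees' outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Regression: the average of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs from five trees:
print(aggregate_classification(["cat", "dog", "cat", "cat", "dog"]))  # cat
print(aggregate_regression([1.0, 1.2, 0.8, 1.1, 0.9]))               # 1.0
```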

The idea is to fit a number of decision trees, each on a sampled replica of the training set, and use averaging to improve the accuracy and to control overfitting. Each of these samples is chosen so as to have the same size as the original training set, but data is sampled with replacement, so each tree effectively sees a bootstrap replica of the whole training set. This methodology is called *bagging* (short for bootstrap aggregating) and it's a good way to decrease variance without increasing the bias.
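The sampling step of bagging can be sketched as follows (a minimal version using the standard library; real implementations work on feature matrices rather than a plain list):

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) points uniformly *with replacement*:
    a bootstrap replica the same size as the original training set."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(10))
replica = bootstrap_sample(data, rng)
# Same size as the original; some points repeat, others are left out.
# The left-out points are that tree's out-of-bag samples.
```

On average about 63% of the distinct points appear in a given replica, so each tree is trained on a slightly different view of the data.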

Bagging reduces overfitting compared to a single decision tree. Also, note that if we were to train each tree on the same set, the trees would be correlated; bagging helps de-correlate the trees.


A parameter of the forest is the number of decision trees to use. Its optimal value can be found by observing the *out-of-bag error* (OOB), which is the mean prediction error on each training sample $x_i$, computed using only the trees that did not have $x_i$ in their bootstrap sample. Looking at the OOB error is roughly equivalent to running cross-validation. The optimal number of trees can be pinpointed by looking at the OOB error as a function of the number of trees: when the curve stabilises, it is a good indication that that many trees are sufficient.

Ranking the importance of features with a Random Forest
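The OOB bookkeeping can be sketched as below. The constant "trees" and the `predict` stub are stand-ins for real fitted trees, and `bootstrap_indices` (the set of row indices each tree was trained on) is an assumption of this sketch:

```python
from collections import Counter

def oob_error(X, y, trees, bootstrap_indices, predict):
    """Mean prediction error on each sample x_i, voting only with the
    trees whose bootstrap replica did not contain x_i."""
    errors, counted = 0, 0
    for i, (x, target) in enumerate(zip(X, y)):
        votes = [predict(trees[t], x)
                 for t in range(len(trees))
                 if i not in bootstrap_indices[t]]
        if not votes:  # sample landed in every bootstrap replica: skip it
            continue
        counted += 1
        majority = Counter(votes).most_common(1)[0][0]
        if majority != target:
            errors += 1
    return errors / counted

# Toy check with three constant "trees", each always predicting one class:
trees = ["A", "A", "B"]
predict = lambda tree, x: tree
bootstrap_indices = [{0}, {1}, set()]
X, y = [10, 20], ["A", "B"]
print(oob_error(X, y, trees, bootstrap_indices, predict))  # 0.5
```

To find the number of trees, one would run this for growing forests and watch where the returned error flattens out.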

A Random Forest can also be used to rank the importance each feature has in a task. The steps are:

1. Fit a forest to the data and record the OOB error.
2. Compute the importance of the j-th feature:
   - randomly permute the values of feature j in the training set (building a perturbed dataset);
   - compute the OOB error again;
   - the difference between the OOB errors before and after the permutation, averaged and normalised by the standard deviation of these differences, quantifies the importance of the feature.
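The steps above can be sketched as follows. Here `error_fn(X, y)` stands in for the forest's OOB error; in this sketch it is just any function measuring prediction error on a dataset:

```python
import random

def permutation_importance(X, y, error_fn, j, rng, n_repeats=10):
    """Importance of feature j: shuffle column j, re-measure the error,
    average the increase and normalise by its standard deviation."""
    baseline = error_fn(X, y)
    diffs = []
    for _ in range(n_repeats):
        perturbed = [row[:] for row in X]    # copy the dataset
        column = [row[j] for row in perturbed]
        rng.shuffle(column)                  # break feature j's link to y
        for row, value in zip(perturbed, column):
            row[j] = value
        diffs.append(error_fn(perturbed, y) - baseline)
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    # normalise by the standard deviation of the differences
    return mean / var ** 0.5 if var > 0 else mean
```

A feature whose permutation never changes the error scores 0; a feature the model relies on scores high, because perturbing it inflates the error on every repeat.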

Features with a high importance score are the important predictors: permuting their values is what degrades the model's performance the most.

References

- 1. T. K. Ho, **Random Decision Forests**, *Proceedings of the Third International Conference on Document Analysis and Recognition*, vol. 1, IEEE, 1995
