Fundamentals of Classification and Regression Trees (CART)
Whether you have been using tree models for a long time or are completely new to them, have you ever wondered how they actually work and how they differ from other algorithms? Here I share a brief summary of my understanding.
CART is a predictive model that predicts a target variable based on other labeled variables. To be more clear, tree models predict the outcome by asking a series of if-else questions. There are two major advantages of using tree models:
- They are able to capture non-linearity in the dataset.
- No need to standardize the data. Tree models don't compute Euclidean distances or any other measures between data points; they only ask if-else questions.
Nuts and Bolts of Trees
Shown above is an image of a Decision Tree Classifier; each circle is known as a node. Each node asks an if-else question about a labeled variable, and based on the answer each input instance is routed to a specific leaf node, which gives the final prediction. There are three types of nodes:
- Root node: has no parent node, and produces two child nodes based on its question
- Internal node: has a parent node, and produces two child nodes
- Leaf node: has a parent node, but no child nodes
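To make the routing concrete, here is a hand-written sketch of such a tree as plain if-else statements. The feature names and thresholds are made up for illustration, not learned from any data:

```python
# A toy decision tree written out as if-else questions.
# Thresholds and class names here are invented for illustration.
def predict(petal_length, petal_width):
    if petal_length < 2.5:          # root node question
        return "setosa"             # leaf node
    else:
        if petal_width < 1.75:      # internal node question
            return "versicolor"     # leaf node
        else:
            return "virginica"      # leaf node

print(predict(1.4, 0.2))  # setosa
```

Every input instance flows from the root through at most one internal node and lands in exactly one leaf, which is the prediction.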
The number of levels a tree has is known as its max_depth. In the above diagram, max_depth = 3. As max_depth increases, model complexity increases as well. During training, increasing it will make the training error go down or stay the same, but it may increase the testing error. So we have to be careful when selecting the max_depth for a model.
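A quick way to see this trade-off is to train the same tree at several depths and compare train and test accuracy. This sketch uses scikit-learn's built-in breast cancer dataset purely as an example:

```python
# Sketch: how train/test accuracy moves as max_depth grows.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in (1, 3, 6, 12):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dt.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, dt.predict(X_train))
    test_acc = accuracy_score(y_test, dt.predict(X_test))
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Training accuracy never drops as depth grows, while test accuracy typically flattens or falls once the tree starts memorizing the training set.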
Another interesting concept at each node is information gain (IG). This is the criterion used to measure the purity of a split. Purity reflects how well a node separates the classes. Say you are at a node and must send items left or right, but each side ends up with items from both classes in equal amounts (50-50). Then purity is low, because you can't tell which direction is better; one class has to dominate the other for the split to be useful. This is what IG measures:
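A minimal sketch of information gain for a binary split, using entropy as the impurity measure: IG is the parent's entropy minus the weighted average entropy of the two children. The toy labels below are made up for illustration:

```python
# Information gain (IG) for one binary split, using entropy as impurity.
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["A", "A", "A", "A", "B", "B", "B", "B"]

# A split that fully separates the classes: maximum gain.
pure_split = information_gain(parent, ["A"] * 4, ["B"] * 4)

# A split that leaves both children 50-50: nothing gained.
bad_split = information_gain(parent, ["A", "A", "B", "B"], ["A", "A", "B", "B"])

print(pure_split)  # 1.0
print(bad_split)   # 0.0
```

At each node, CART greedily picks the question whose split yields the highest gain (scikit-learn uses Gini impurity by default, but the idea is the same).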
As the name itself says, the goal of CART classification is to predict which class an input instance belongs to based on its labeled feature values. To accomplish this, it carves the feature space into decision regions separated by decision boundaries. Imagine we have a 2D dataset:
Like this, it will separate our multidimensional dataset into decision regions based on the if-else questions at each node. CART models can find more accurate decision regions than linear models, and the decision regions produced by CART are typically rectangular, because only one feature is involved in the decision at each node. You can visualize it below:
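The axis-aligned nature of the boundary is easy to check: a depth-1 tree (a "decision stump") can only split on a single feature, so its prediction ignores the other one entirely. The toy 2D data below is made up for illustration:

```python
# Sketch: a depth-1 tree splits on one feature, so its boundary is
# axis-aligned. Toy data: class 0 left of x=0.5, class 1 right of it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1, 0.1], [0.2, 0.9], [0.3, 0.5],
              [0.7, 0.2], [0.8, 0.8], [0.9, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X, y)

# Predictions depend only on the first feature: points that differ
# only in the second feature land in the same region.
print(stump.predict([[0.25, 0.0], [0.25, 1.0]]))  # same class twice
print(stump.predict([[0.75, 0.0], [0.75, 1.0]]))  # same class twice
```

Stacking several such splits produces the nested rectangles you see in decision-region plots of deeper trees.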
I think that is enough introduction; let's see some examples of how to build CART models with scikit-learn.
Classification Tree
```python
# Use a seed value for reproducibility
SEED = 1

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Instantiate a DecisionTreeClassifier
# You can specify other parameters such as criterion; refer to the
# sklearn documentation for decision trees, or try dt.get_params()
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))
```
Regression Tree
```python
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE

# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
                           min_samples_leaf=0.13,
                           random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute mse and rmse
mse = MSE(y_test, y_pred)
rmse = mse ** (1 / 2)

# Print rmse
print('Regression Tree test set RMSE: {:.2f}'.format(rmse))
```
I hope this article is useful. If you have any questions or suggestions, please leave a private note.