Ensemble Methods and Random Forests
Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
As an example of an Ensemble method, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes (see the last exercise in Chapter 6). Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.
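To make that idea concrete, here is a minimal sketch on a toy dataset (everything in it, including the make_moons data, is illustrative and separate from the Titanic example that follows): ten Decision Trees, each trained on a different random subset, aggregated by majority vote.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Train each tree on a different random subset of the training set
rng = np.random.default_rng(42)
trees = []
for _ in range(10):
    idx = rng.choice(len(X), size=200, replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: predict the class that gets the most votes
all_preds = np.array([tree.predict(X) for tree in trees])    # shape (10, 500)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)  # majority vote for 0/1 labels
print((majority_vote == y).mean())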
Let's see the code:
import numpy as np
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()
# Keep Sex, Age and Fare as features; Survived is the target
inputs = df.drop(['PassengerId','Pclass','Name','SibSp','Parch','Ticket','Cabin','Embarked','Survived'], axis='columns')
target = df['Survived']
inputs.head()
from sklearn.preprocessing import LabelEncoder

# Encode 'Sex' as 0/1
encoder_Sex = LabelEncoder()
inputs['Sex'] = encoder_Sex.fit_transform(inputs['Sex'])

# Fill the missing 'Age' values with 0 (fillna returns a new DataFrame,
# so assign the result back)
inputs = inputs.fillna(0)
inputs.head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log = LogisticRegression()
rf = RandomForestClassifier()
svm = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log), ('forest', rf), ('SVM', svm)],
    voting='hard')
voting_clf.fit(inputs,target)
from sklearn.metrics import accuracy_score
for clf in (log, rf, svm, voting_clf):
    clf.fit(inputs, target)
    # Predict for one passenger: Sex=1, Age=24.0, Fare=7.925
    target_predict = clf.predict([[1, 24.0, 7.9250]])
    print(clf.__class__.__name__, accuracy_score(target, clf.predict(inputs)))
Here the voting classifier scores 0.8922558922558923. Note that this accuracy is measured on the same data we trained on, so it is an optimistic estimate.
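If every classifier in the ensemble can estimate class probabilities, you can also try soft voting, which averages the predicted probabilities instead of counting class votes and often performs better than hard voting. A minimal sketch, reusing log and rf from above (SVC needs probability=True to expose predict_proba):

svm_prob = SVC(probability=True)  # enables probability estimates (slower to train)
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log), ('forest', rf), ('SVM', svm_prob)],
    voting='soft')
soft_voting_clf.fit(inputs, target)
soft_voting_clf.score(inputs, target)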
Bagging and Pasting
One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
We will continue with the previous code:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# bootstrap=False samples without replacement, i.e. pasting
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=False, n_jobs=-1)
bag_clf.fit(inputs,target)
y_pred = bag_clf.predict([[1, 24.0, 7.9250]])  # same example passenger as before
bag_clf.score(inputs,target)
Here we get a score of 0.819304152637486, again measured on the training data.
For bagging, we just need to change bootstrap=True (the code above, with bootstrap=False, performs pasting), as sketched below.
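Here is a minimal sketch of that bagging variant, assuming the imports and the inputs/target variables from above. The optional oob_score=True asks scikit-learn to evaluate each predictor on the training instances it did not sample (its out-of-bag instances), giving a validation-style estimate without a separate test set:

# Bagging: same as above but sampling with replacement (bootstrap=True).
# oob_score=True evaluates each tree on its unused (out-of-bag) instances.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True,
                            oob_score=True, n_jobs=-1)
bag_clf.fit(inputs, target)
print(bag_clf.oob_score_)  # out-of-bag accuracy estimate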