Ensemble Methods and Random Forests
Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
As an example of an Ensemble method, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes (see the last exercise in Chapter 6). Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.
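To make that idea concrete, here is a minimal sketch on a toy dataset (everything in it, including the make_moons data, is illustrative and separate from the Titanic example that follows): ten Decision Trees, each trained on a different random subset, aggregated by majority vote.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Train each tree on a different random subset of the training set
rng = np.random.default_rng(42)
trees = []
for _ in range(10):
    idx = rng.choice(len(X), size=200, replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: predict the class that gets the most votes
all_preds = np.array([tree.predict(X) for tree in trees])    # shape (10, 500)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)  # majority vote for 0/1 labels
print((majority_vote == y).mean())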
Let's see the code:
import numpy as np
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()
# Keep Sex, Age and Fare as features; Survived is the target
inputs = df.drop(['PassengerId','Pclass','Name','SibSp','Parch','Ticket','Cabin','Embarked','Survived'], axis='columns')
target = df['Survived']
inputs.head()
from sklearn.preprocessing import LabelEncoder

# Encode 'Sex' as 0/1
encoder_Sex = LabelEncoder()
inputs['Sex'] = encoder_Sex.fit_transform(inputs['Sex'])

# Fill the missing 'Age' values with 0 (fillna returns a new DataFrame,
# so assign the result back)
inputs = inputs.fillna(0)
inputs.head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log = LogisticRegression()
rf = RandomForestClassifier()
svm = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log), ('forest', rf), ('SVM', svm)],
    voting='hard')
voting_clf.fit(inputs,target)
from sklearn.metrics import accuracy_score
for clf in (log, rf, svm, voting_clf):
    clf.fit(inputs, target)
    # Predict for one passenger: Sex=1, Age=24.0, Fare=7.925
    target_predict = clf.predict([[1, 24.0, 7.9250]])
    print(clf.__class__.__name__, accuracy_score(target, clf.predict(inputs)))
Here the voting classifier scores 0.8922558922558923. Note that this accuracy is measured on the same data we trained on, so it is an optimistic estimate.
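If every classifier in the ensemble can estimate class probabilities, you can also try soft voting, which averages the predicted probabilities instead of counting class votes and often performs better than hard voting. A minimal sketch, reusing log and rf from above (SVC needs probability=True to expose predict_proba):

svm_prob = SVC(probability=True)  # enables probability estimates (slower to train)
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log), ('forest', rf), ('SVM', svm_prob)],
    voting='soft')
soft_voting_clf.fit(inputs, target)
soft_voting_clf.score(inputs, target)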
Bagging and Pasting
One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
We will continue with the previous code:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# bootstrap=False samples without replacement, i.e. pasting
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=False, n_jobs=-1)
bag_clf.fit(inputs,target)
y_pred = bag_clf.predict([[1, 24.0, 7.9250]])  # same example passenger as before
bag_clf.score(inputs,target)
Here we get a score of 0.819304152637486, again measured on the training data.
For bagging, we just need to change bootstrap=True (the code above, with bootstrap=False, performs pasting), as sketched below.
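Here is a minimal sketch of that bagging variant, assuming the imports and the inputs/target variables from above. The optional oob_score=True asks scikit-learn to evaluate each predictor on the training instances it did not sample (its out-of-bag instances), giving a validation-style estimate without a separate test set:

# Bagging: same as above but sampling with replacement (bootstrap=True).
# oob_score=True evaluates each tree on its unused (out-of-bag) instances.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True,
                            oob_score=True, n_jobs=-1)
bag_clf.fit(inputs, target)
print(bag_clf.oob_score_)  # out-of-bag accuracy estimate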