Monday, May 16, 2016

Decision Tree Classification using scikit-learn



Please visit the Preparing-machine-learning-developing blog post if you haven't prepared your development environment yet.

First we have to load data from a dataset. We can use a dataset we already have at hand or an online dataset for this. Use the following Python function to load the titanic.csv data file into a Pandas[1] dataframe.

Here I have used a CSV file I downloaded earlier. You can download it from https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets or Google it.


def load_data():
  # Read the Titanic CSV file into a Pandas dataframe
  df = pand.read_csv("/home/malintha/projects/ML/datasets/titanic.csv")
  return df

Before you use any Pandas functions you have to import the module.

import pandas as pand
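
With the import in place, we can call the loader and keep the returned dataframe. A minimal sketch, assuming the load_data function defined above:

df = load_data()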

Now we can print the first few rows of the dataframe using

print(df.head(), end = "\n\n")

And it will output

   PassengerId  Survived  Pclass                                                  Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                               Braund, Mr. Owen Harris    male   22      1      0         A/5 21171     7.25   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                                Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282    7.925   NaN        S
3            4         1       1          Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803     53.1  C123        S

We can remove the “Name”, “Ticket” and “PassengerId” features from the dataset, as they are not as useful for prediction as the other features. We can use Pandas' drop function to remove columns from a dataframe.

df.drop(['Name', 'Ticket', 'PassengerId'], inplace=True, axis=1)

The next task is mapping nominal data to integers, since scikit-learn models expect numerical input.

Here we have three nominal features in our dataset.
  1. Sex
  2. Cabin
  3. Embarked

We can replace the original values with integers using the following code segment.


def map_nominal_to_integers(df):
  df_refined = df.copy()
  # Collect the distinct values of each nominal feature
  sex_types = df_refined['Sex'].unique()
  cabin_types = df_refined['Cabin'].unique()
  embarked_types = df_refined['Embarked'].unique()
  # Build a value -> integer mapping for each feature
  sex_types_to_int = {name: n for n, name in enumerate(sex_types)}
  cabin_types_to_int = {name: n for n, name in enumerate(cabin_types)}
  embarked_types_to_int = {name: n for n, name in enumerate(embarked_types)}
  # Replace the original nominal values with their integer codes
  df_refined['Sex'] = df_refined['Sex'].replace(sex_types_to_int)
  df_refined['Cabin'] = df_refined['Cabin'].replace(cabin_types_to_int)
  df_refined['Embarked'] = df_refined['Embarked'].replace(embarked_types_to_int)
  return df_refined
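
To actually apply the mapping, call the function on the dataframe and keep the returned copy. A minimal sketch, assuming the df loaded earlier:

df = map_nominal_to_integers(df)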

We have one more step to shape up our dataset. If you look at the refined dataset carefully, you may see that some of the Age values are NaN. We should replace these NaN values with an appropriate numeric value, and Pandas provides a built-in function for this. I will use 0 as the replacement for NaN.

df["Age"].fillna(0, inplace=True)

Now we are all set to build the decision tree from our refined dataset. We have to choose the features and the target column for the decision tree.

features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']
X = df[features]
Y = df["Survived"]

Here, X is the feature set and Y is the target set. Now we build the decision tree. For this you should import the scikit-learn decision tree classifier into your Python module.
from sklearn.tree import DecisionTreeClassifier

Then build the decision tree with our feature set and target set.

dt = DecisionTreeClassifier(min_samples_split=20, random_state=9)
dt.fit(X,Y)
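
As an optional sanity check, you can look at the accuracy of the fitted tree on the training data itself using the classifier's score method. Keep in mind this is training accuracy, not a proper evaluation on held-out data; a minimal sketch:

print(dt.score(X, Y))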

Now it is time to make a prediction with our trained decision tree. We can take a sample feature vector and predict the target for it. Note that predict expects a 2D array, so we pass a list containing one feature vector.

Z = [[1, 1, 22.0, 1, 0, 7.25, 0, 0]]
print(dt.predict(Z))
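
If you want to see which features the tree relied on most, scikit-learn exposes the feature_importances_ attribute on a fitted tree. A minimal sketch, pairing it with the feature names we chose above:

for name, importance in zip(features, dt.feature_importances_):
  print(name, importance)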

