Monday, May 16, 2016

Decision Tree Classification using scikit-learn



Please visit the Preparing-machine-learning-developing blog post if you haven't prepared your development environment yet.

First we have to load data from a dataset. We can use a dataset we already have at hand or an online dataset for this. Use the following Python function to load the titanic.csv data file into a Pandas[1] dataframe.

Here I have used a CSV file I downloaded earlier. You can download it from https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets or Google it.


def load_data():
  # Read the Titanic CSV file into a Pandas dataframe
  df = pand.read_csv("/home/malintha/projects/ML/datasets/titanic.csv")
  return df

Before you use any Pandas functions you have to import the module.

import pandas as pand
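
With the import in place, we can call the loader and keep the returned dataframe. A minimal sketch, assuming the load_data function defined above:

df = load_data()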

Now we can print the first few rows of the dataframe using

print(df.head(), end = "\n\n")

And it will output

   PassengerId  Survived  Pclass                                                  Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                               Braund, Mr. Owen Harris    male   22      1      0         A/5 21171     7.25   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                                Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282    7.925   NaN        S
3            4         1       1          Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803     53.1  C123        S

We can remove the “Name”, “Ticket” and “PassengerId” features from the dataset, as they are not as useful for prediction as the other features. We can use Pandas' drop function to remove columns from a dataframe.

df.drop(['Name', 'Ticket', 'PassengerId'], inplace=True, axis=1)

The next task is mapping nominal data to integers, since scikit-learn models expect numerical input.

Here we have three nominal features in our dataset.
  1. Sex
  2. Cabin
  3. Embarked

We can replace the original values with integers using the following code segment.


def map_nominal_to_integers(df):
  df_refined = df.copy()
  # Collect the distinct values of each nominal feature
  sex_types = df_refined['Sex'].unique()
  cabin_types = df_refined['Cabin'].unique()
  embarked_types = df_refined['Embarked'].unique()
  # Build a value -> integer mapping for each feature
  sex_types_to_int = {name: n for n, name in enumerate(sex_types)}
  cabin_types_to_int = {name: n for n, name in enumerate(cabin_types)}
  embarked_types_to_int = {name: n for n, name in enumerate(embarked_types)}
  # Replace the original nominal values with their integer codes
  df_refined['Sex'] = df_refined['Sex'].replace(sex_types_to_int)
  df_refined['Cabin'] = df_refined['Cabin'].replace(cabin_types_to_int)
  df_refined['Embarked'] = df_refined['Embarked'].replace(embarked_types_to_int)
  return df_refined
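
To actually apply the mapping, call the function on the dataframe and keep the returned copy. A minimal sketch, assuming the df loaded earlier:

df = map_nominal_to_integers(df)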

We have one more step to shape up our dataset. If you look at the refined dataset carefully, you may see that some of the Age values are NaN. We should replace these NaN values with an appropriate numeric value, and Pandas provides a built-in function for this. I will use 0 as the replacement for NaN.

df["Age"].fillna(0, inplace=True)

Now we are all set to build the decision tree from our refined dataset. We have to choose the features and the target column for the decision tree.

features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']
X = df[features]
Y = df["Survived"]

Here, X is the feature set and Y is the target set. Now we build the decision tree. For this you should import the scikit-learn decision tree classifier into your Python module.
from sklearn.tree import DecisionTreeClassifier

Then build the decision tree with our feature set and target set.

dt = DecisionTreeClassifier(min_samples_split=20, random_state=9)
dt.fit(X,Y)
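
As an optional sanity check, you can look at the accuracy of the fitted tree on the training data itself using the classifier's score method. Keep in mind this is training accuracy, not a proper evaluation on held-out data; a minimal sketch:

print(dt.score(X, Y))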

Now it is time to make a prediction with our trained decision tree. We can take a sample feature vector and predict the target for it. Note that predict expects a 2D array, so we pass a list containing one feature vector.

Z = [[1, 1, 22.0, 1, 0, 7.25, 0, 0]]
print(dt.predict(Z))
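
If you want to see which features the tree relied on most, scikit-learn exposes the feature_importances_ attribute on a fitted tree. A minimal sketch, pairing it with the feature names we chose above:

for name, importance in zip(features, dt.feature_importances_):
  print(name, importance)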

