Please visit the Preparing-machine-learning-developing blog post if you haven't prepared your development environment yet.
First we have to load the data from a dataset. We can use a dataset we already have at hand, or an online one. Use the following Python method to load the titanic.csv data file into a Pandas dataframe.
Here I have used my downloaded CSV file. You can download it from https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets or Google for it.
def load_data():
    # Read the Titanic CSV into a Pandas dataframe
    df = pand.read_csv("/home/malintha/projects/ML/datasets/titanic.csv")
    return df
Before you use Pandas functions you have to import the module.
import pandas as pand
Now we can print the first few rows of the dataframe using
print(df.head(), end = "\n\n")
And it will output:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
We can remove the “Name”, “Ticket” and “PassengerId” features from the dataset, as they are less informative than the other features. We can use Pandas' ‘drop’ facility to remove columns from a dataframe.
df.drop(['Name','Ticket','PassengerId'], inplace=True, axis=1)
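For illustration, here is the same drop call on a toy frame (hypothetical values, not the real dataset), showing that axis=1 removes columns:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (illustrative values only)
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Survived": [0, 1],
})

# axis=1 drops columns (axis=0 would drop rows); inplace=True mutates df directly
df.drop(['Name', 'Ticket', 'PassengerId'], inplace=True, axis=1)
print(list(df.columns))  # ['Survived']
```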
The next task is mapping the nominal data to integers in order to create the model in scikit-learn.
Here we have three nominal features in our dataset:
- Sex
- Cabin
- Embarked
We can replace the original values with integers using the following code segment.
def map_nominal_to_integers(df):
    df_refined = df.copy()
    # Collect the distinct values of each nominal feature
    sex_types = df_refined['Sex'].unique()
    cabin_types = df_refined['Cabin'].unique()
    embarked_types = df_refined['Embarked'].unique()
    # Map each distinct value to an integer code
    sex_types_to_int = {name: n for n, name in enumerate(sex_types)}
    cabin_types_to_int = {name: n for n, name in enumerate(cabin_types)}
    embarked_types_to_int = {name: n for n, name in enumerate(embarked_types)}
    # Replace the original values with their integer codes
    df_refined['Sex'] = df_refined['Sex'].replace(sex_types_to_int)
    df_refined['Cabin'] = df_refined['Cabin'].replace(cabin_types_to_int)
    df_refined['Embarked'] = df_refined['Embarked'].replace(embarked_types_to_int)
    return df_refined
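To see what this mapping does, here is a minimal standalone sketch for the Sex column alone (toy data, not the real dataset). unique() returns values in order of first appearance, so each distinct value gets the next integer code:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Same idea as map_nominal_to_integers, for one column: the first value
# seen gets code 0, the next new value gets 1, and so on.
sex_types = df["Sex"].unique()
sex_types_to_int = {name: n for n, name in enumerate(sex_types)}
df["Sex"] = df["Sex"].replace(sex_types_to_int)
print(df["Sex"].tolist())  # [0, 1, 1, 0]
```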
We have one more step to shape up our dataset. If you look at the refined dataset carefully, you may see that some of the age values are “NaN”. We should replace these NaN values with an appropriate integer value. Pandas provides a built-in function for this; I will use 0 as the replacement for NaN.
df["Age"].fillna(0, inplace=True)
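As a quick standalone illustration of what fillna does (toy data, not the post's dataset): recent pandas versions warn when fillna is called with inplace=True on a single column, so this sketch assigns the result back instead.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value, standing in for the Titanic "Age" feature
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})

# fillna replaces every NaN with the given value; assigning back avoids the
# chained inplace=True pattern that newer pandas versions warn about
df["Age"] = df["Age"].fillna(0)
print(df["Age"].tolist())  # [22.0, 0.0, 26.0]
```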
Now we are all set to build the decision tree from our refined dataset. We have to choose the features and the target value for the decision tree.
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']
X = df[features]
Y = df["Survived"]
Here, X is the feature set and Y is the target set. Now we can build the decision tree. For this you should import the scikit-learn decision tree classifier into your Python module.
from sklearn.tree import DecisionTreeClassifier
And build the decision tree with our feature set and target set:
dt = DecisionTreeClassifier(min_samples_split=20, random_state=9)
dt.fit(X,Y)
Now it is time to do a prediction with our trained decision tree. We can take a sample feature value set and predict the target for it. Note that predict expects a 2D array (one row per sample), so we wrap the sample in a list.
Z = [1,1,22.0,1,0,7.25,0,0]
print(dt.predict([Z]))
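Putting it together, here is a minimal end-to-end sketch with a tiny synthetic feature matrix (made-up values loosely shaped like our refined features, and default tree parameters instead of min_samples_split=20 so the small sample fits cleanly):

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for the refined Titanic features
# (columns loosely: Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked)
X = [
    [3, 0, 22.0, 1, 0, 7.25, 0, 0],
    [1, 1, 38.0, 1, 0, 71.28, 1, 1],
    [3, 1, 26.0, 0, 0, 7.92, 0, 0],
    [1, 1, 35.0, 1, 0, 53.10, 2, 0],
]
Y = [0, 1, 1, 1]

dt = DecisionTreeClassifier(random_state=9)
dt.fit(X, Y)

# predict expects a 2D array: one row per sample
Z = [3, 0, 22.0, 1, 0, 7.25, 0, 0]
print(dt.predict([Z]))  # [0]
```

With default parameters the tree memorizes these four distinct rows, so predicting on a training row returns its own label.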
Please find the source code in my GitHub repository: https://github.com/malinthaadikari/machine-learning/tree/master/DecisionTreeClassification