Saturday, May 21, 2016

K-Means clustering with Scikit-learn


K-Means clustering is a popular unsupervised learning algorithm. In simple terms, we start with an unlabeled dataset: we have the data, but no clue how to categorize each row. Below are a few example rows from an unlabeled dataset of crime data in the USA. There is one row per state, with a set of features describing crime in that state. We have this data but don't yet know what to do with it. One thing we can do is look for similarities between the states. In other words, we can prepare a few buckets and put states into them based on how similar their crime figures are.


State       Murder  Assault  UrbanPop  Rape
Alabama     13.2    236      58        21.2
Alaska      10      263      48        44.5
Arizona     8.1     294      80        31
Arkansas    8.8     190      50        19.5
California  9       276      91        40.6
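To make "similarity" concrete: K-Means groups rows by Euclidean distance between their feature vectors. The sketch below is only an illustration using three of the sample rows above (the variable names are mine), comparing how far Alabama is from Arkansas and from California.

```python
import numpy as np

# Feature order: Murder, Assault, UrbanPop, Rape (values from the sample rows).
alabama = np.array([13.2, 236, 58, 21.2])
arkansas = np.array([8.8, 190, 50, 19.5])
california = np.array([9.0, 276, 91, 40.6])

# Euclidean distance: a smaller value means the two states are more similar.
d_arkansas = np.linalg.norm(alabama - arkansas)
d_california = np.linalg.norm(alabama - california)

print("Alabama vs Arkansas:  ", round(float(d_arkansas), 2))
print("Alabama vs California:", round(float(d_california), 2))
```

By this measure Alabama is closer to Arkansas than to California, so the two are more likely to end up in the same bucket. Note that the Assault column has the largest values and therefore dominates the distance.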


Now let's discuss how to implement K-Means clustering for our dataset with Scikit-learn. You can download the USA crime dataset from my GitHub location.


Import KMeans from Scikit-learn.


from sklearn.cluster import KMeans


Load your data file into a Pandas dataframe.


df = Utils.get_dataframe("crime_data.csv")
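Utils.get_dataframe is a helper of mine that is not shown in this post. If you don't have it, a minimal stand-in, assuming it does nothing more than read the CSV file, could look like the sketch below. The in-memory sample is only there so the snippet runs without the real file; its column layout is an assumption based on the table above.

```python
import io
import pandas as pd

def get_dataframe(source):
    """Hypothetical stand-in for Utils.get_dataframe: read a CSV into a dataframe."""
    return pd.read_csv(source)

# In-memory sample so the sketch runs on its own.
# With the downloaded dataset you would pass "crime_data.csv" instead.
sample_csv = io.StringIO(
    "State,Murder,Assault,UrbanPop,Rape\n"
    "Alabama,13.2,236,58,21.2\n"
    "Alaska,10,263,48,44.5\n"
)
df = get_dataframe(sample_csv)
print(df.head())
```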


Create the KMeans model, providing the required number of clusters. Here I have set the number of clusters to 5.


KMeans_model = KMeans(n_clusters=5, random_state=1)
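The value 5 is my choice, not something the algorithm finds for you. One rough way to compare candidate values is to fit the model for several values of n_clusters and look at inertia_ (the sum of squared distances from each row to its cluster centre). The sketch below does this on the five sample rows only, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five sample rows: Murder, Assault, UrbanPop, Rape.
X = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
])

inertias = []
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(model.inertia_)
    print(k, round(model.inertia_, 1))
```

Inertia shrinks as the number of clusters grows, so the usual approach is to look for the point where adding another cluster stops helping much, rather than simply picking the smallest value.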


Refine your data by removing non-numeric data, unimportant features, etc.


df.drop(['crime$cluster'], inplace=True, axis=1)
df.rename(columns={df.columns[0]: 'State'}, inplace=True)


Select only numeric data in your dataset.


# select_dtypes is the public pandas API; _get_numeric_data is a private method
numeric_columns = df.select_dtypes(include='number')


Train the K-Means clustering model.


KMeans_model.fit(numeric_columns)


Now you can see the label of each row in your training dataset.


labels = KMeans_model.labels_
print(labels)
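The printed labels are just an array of cluster numbers, in the same order as the rows. To see which state landed in which cluster, you can attach the labels back to the dataframe. The sketch below is self-contained and uses only the five sample rows with 2 clusters, so the grouping it prints is illustrative, not the result for the full dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
    "Assault": [236, 263, 294, 190, 276],
    "UrbanPop": [58, 48, 80, 50, 91],
    "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
})

numeric_columns = df.select_dtypes(include="number")
model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(numeric_columns)

# labels_ follows the row order of the training data,
# so it can be added as a new column directly.
labeled = df.assign(cluster=model.labels_)
for cluster_id, group in labeled.groupby("cluster"):
    print(cluster_id, list(group["State"]))
```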


Predict a new state’s crime cluster as follows.


print(KMeans_model.predict([[15, 236, 58, 21.2]]))
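Under the hood, predict assigns the new row to the cluster whose centre is nearest. The sketch below checks this by hand against cluster_centers_, again on the five sample rows with 2 clusters as an assumption to keep it small.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five sample rows: Murder, Assault, UrbanPop, Rape.
X = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
])
model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

new_state = np.array([[15, 236, 58, 21.2]])

# Distance from the new row to every cluster centre.
distances = np.linalg.norm(model.cluster_centers_ - new_state, axis=1)
nearest = int(np.argmin(distances))
predicted = int(model.predict(new_state)[0])

print("nearest centre:", nearest, "predict():", predicted)
```

Both values should agree, since picking the closest centre is exactly what predict does.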

