Software is hard: K-Means clustering with Scikit-learn

Saturday, May 21, 2016

K-Means clustering with Scikit-learn

K-Means clustering is a popular unsupervised learning algorithm. In simple terms, we start with an unlabeled dataset: we have the data, but no predefined categories telling us how each row should be grouped. Below are a few rows from an unlabeled dataset of crime statistics in the USA. There is one row per state, with a set of crime-related features. We have the data, but no obvious way to use it. One thing we can do is look for similarities between states. In other words, we can prepare a few buckets and put each state into a bucket based on how similar its crime statistics are to those of the other states.

State	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10	263	48	44.5
Arizona	8.1	294	80	31
Arkansas	8.8	190	50	19.5
California	9	276	91	40.6

Now let's discuss how to implement K-Means clustering for this dataset with Scikit-learn. You can download the USA crime dataset from my GitHub location.

Import KMeans from Scikit-learn.

from sklearn.cluster import KMeans

Load your data file into a Pandas DataFrame.

df = Utils.get_dataframe("crime_data.csv")
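
Utils.get_dataframe is a helper from my own code and is not shown here. As a rough stand-in, a minimal sketch with pandas.read_csv could look like the following. It assumes the file is a plain comma-separated CSV with a header row, and it inlines the five sample rows from the table above so that it runs without the file.

```python
import io

import pandas as pd

# The five sample rows shown above, inlined so the snippet runs without the CSV file.
# Assumption: the real file is comma-separated with a header row.
SAMPLE_CSV = """State,Murder,Assault,UrbanPop,Rape
Alabama,13.2,236,58,21.2
Alaska,10,263,48,44.5
Arizona,8.1,294,80,31
Arkansas,8.8,190,50,19.5
California,9,276,91,40.6
"""

# For the real file, pass the path instead: pd.read_csv("crime_data.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())
print(df.dtypes)  # State is text; the four crime features are numeric
```

Printing the dtypes early is a cheap way to confirm which columns pandas parsed as numbers before they go into the model.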

Create the KMeans model, providing the required number of clusters. Here I have set the number of clusters to 5.

KMeans_model = KMeans(n_clusters=5, random_state=1)

Clean up your data by removing non-numeric data, unimportant features, etc.

df.drop(['crime$cluster'], inplace=True, axis=1)

df.rename(columns={df.columns[0]: 'State'}, inplace=True)

Select only the numeric data in your dataset.

numeric_columns = df.select_dtypes(include='number')
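
The cleanup and numeric selection steps above can be tried on a small frame. This is only a sketch: the column name "Unnamed: 0" and the values in the 'crime$cluster' column are made up for illustration, and it uses the public select_dtypes method rather than the private _get_numeric_data.

```python
import pandas as pd

# Hypothetical raw frame: an unnamed first column holding state names and a
# leftover 'crime$cluster' column. The cluster values here are invented.
raw = pd.DataFrame(
    {
        "Unnamed: 0": ["Alabama", "Alaska", "Arizona"],
        "Murder": [13.2, 10.0, 8.1],
        "Assault": [236, 263, 294],
        "UrbanPop": [58, 48, 80],
        "Rape": [21.2, 44.5, 31.0],
        "crime$cluster": [4, 4, 2],
    }
)

# Drop the column we do not want; errors="ignore" keeps this safe if it is absent.
clean = raw.drop(columns=["crime$cluster"], errors="ignore")

# Give the first column a readable name.
clean = clean.rename(columns={clean.columns[0]: "State"})

# Keep only numeric columns; the text column 'State' is left out automatically.
features = clean.select_dtypes(include="number")

print(list(features.columns))
```

Returning new frames instead of using inplace=True leaves the raw data untouched, which makes it easier to re-run the steps while experimenting.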

Train the K-Means clustering model.

KMeans_model.fit(numeric_columns)
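
After fitting, the model exposes the cluster centres and the total within-cluster squared distance (inertia). The sketch below shows how these could be inspected. It uses only the five sample rows from the table above, so it asks for 2 clusters instead of 5, and it sets n_init explicitly; both choices are mine for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Numeric features for the five sample states shown earlier.
features = pd.DataFrame(
    {
        "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
        "Assault": [236, 263, 294, 190, 276],
        "UrbanPop": [58, 48, 80, 50, 91],
        "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
    }
)

# Only 5 rows here, so use 2 clusters for the illustration.
model = KMeans(n_clusters=2, n_init=10, random_state=1)
model.fit(features)

# One row per cluster: the mean of each feature over that cluster's members.
centers = pd.DataFrame(model.cluster_centers_, columns=features.columns)
print(centers)

# Sum of squared distances from each row to its nearest centre (lower is tighter).
print(model.inertia_)
```

Because K-Means works with Euclidean distance, a feature with a large range such as Assault is likely to dominate the result; looking at the centres makes that easy to spot.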

Now you can see the cluster label assigned to each row in your training dataset.

labels = KMeans_model.labels_

print(labels)
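
The raw label array is easier to read once it is attached back to the state names. Here is a minimal sketch, again on the five sample rows with 2 clusters (my choice for this small sample); the column name "Cluster" is introduced only for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
        "Assault": [236, 263, 294, 190, 276],
        "UrbanPop": [58, 48, 80, 50, 91],
        "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
    }
)
features = df.select_dtypes(include="number")

model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(features)

# labels_ is in the same row order as the training data, so it can be
# stored as a new column next to the state names.
df["Cluster"] = model.labels_

# List the states that ended up in each cluster.
for cluster_id, group in df.groupby("Cluster"):
    print(cluster_id, list(group["State"]))
```

The cluster numbers themselves carry no meaning; only which states share a number matters.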

Predict a new state’s crime cluster as follows.

print(KMeans_model.predict([[15, 236, 58, 21.2]]))
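
When the model was trained on a DataFrame, recent Scikit-learn versions may warn if predict receives a plain list without column names. One possible way around that, sketched here on the five sample rows with 2 clusters (my choice for this small sample), is to wrap the new row in a DataFrame that reuses the training columns.

```python
import pandas as pd
from sklearn.cluster import KMeans

features = pd.DataFrame(
    {
        "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
        "Assault": [236, 263, 294, 190, 276],
        "UrbanPop": [58, 48, 80, 50, 91],
        "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
    }
)
model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(features)

# Same values as in the call above, in the same column order as the training data.
new_state = pd.DataFrame([[15, 236, 58, 21.2]], columns=features.columns)

# predict returns one cluster index per input row.
predicted = model.predict(new_state)
print(predicted)
```

The values must be given in the same feature order used for training (Murder, Assault, UrbanPop, Rape), otherwise the distances to the cluster centres are computed against the wrong features.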
