K-Means clustering is a popular unsupervised learning algorithm. In simple terms, we have an unlabeled dataset: a dataset with no indication of how each row should be categorized. Below are a few example rows from an unlabeled dataset about crime data in the USA. There is one row per state, with a set of features describing crime statistics. We have this dataset, but we don't yet know what to do with it. One thing we can do is look for similarities between the states. In other words, we can prepare a few buckets and put each state into a bucket based on how similar its crime figures are to those of the other states.
| State      | Murder | Assault | UrbanPop | Rape |
|------------|--------|---------|----------|------|
| Alabama    | 13.2   | 236     | 58       | 21.2 |
| Alaska     | 10     | 263     | 48       | 44.5 |
| Arizona    | 8.1    | 294     | 80       | 31   |
| Arkansas   | 8.8    | 190     | 50       | 19.5 |
| California | 9      | 276     | 91       | 40.6 |
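Here, "similarity" simply means that two states have feature values that are close together. As a rough illustrative sketch (my addition, not part of the original post), the Euclidean distance that K-Means uses by default can be computed for the first two rows of the table above; the smaller the distance, the more similar the two states are.

import numpy as np

# Feature vectors (Murder, Assault, UrbanPop, Rape) for Alabama and Alaska
alabama = np.array([13.2, 236, 58, 21.2])
alaska = np.array([10, 263, 48, 44.5])

# Euclidean distance between the two states' crime profiles
print(np.linalg.norm(alabama - alaska))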
Now let's discuss how we can implement K-Means clustering for our dataset with Scikit-learn. You can download the USA crime dataset from my GitHub location.
Import KMeans from Scikit-learn.
from sklearn.cluster import KMeans
Load your data file into a Pandas dataframe.
import pandas as pd

# Read the CSV file into a dataframe
df = pd.read_csv("crime_data.csv")
Create a KMeans model, providing the required number of clusters. Here I have set the number of clusters to 5.
KMeans_model = KMeans(n_clusters=5, random_state=1)
Refine your data by removing non-numeric data, unimportant features, etc.
# Drop the unneeded 'crime$cluster' column and give the state-name column a proper header
df.drop(['crime$cluster'], inplace=True, axis=1)
df.rename(columns={df.columns[0]: 'State'}, inplace=True)
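As a quick sanity check (my addition, not in the original post), you can print the first few rows to confirm what is left after these clean-up steps.

print(df.head())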
Select only the numeric columns in your dataset.
numeric_columns = df.select_dtypes(include='number')
Train the K-Means clustering model.
KMeans_model.fit(numeric_columns)
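Once the model is fitted, you could also inspect the learned cluster centres (a quick check I am adding here, not part of the original post). Each row of cluster_centers_ is the centroid of one cluster, expressed in the same feature order as the training data.

# One centroid per cluster, columns in the order Murder, Assault, UrbanPop, Rape
print(KMeans_model.cluster_centers_)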
Now you can see the cluster label assigned to each row of your training dataset.
labels = KMeans_model.labels_
print(labels)
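A bare array of label numbers is hard to read, so one option (a small addition of mine, using the 'State' column created by the rename step above) is to attach each label back to its state name.

# Attach the cluster label of each row to the corresponding state name
df['Cluster'] = labels
print(df[['State', 'Cluster']].sort_values('Cluster'))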
Predict a new state's crime cluster as follows.
print(KMeans_model.predict([[15, 236, 58, 21.2]]))
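Note that the input values must be given in the same order as the numeric columns the model was trained on (Murder, Assault, UrbanPop, Rape in this dataset); the output is the index of the cluster the new observation is assigned to.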