K-Means and MeanShift Clustering in Practice, with a KNN Comparison

Posted: 2023-08-03 | Category: Machine Learning & AI

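The sklearn.cluster module can cluster unlabeled data. Because these unsupervised algorithms never see the labels, the cluster indices a trained model assigns may not line up with the true labels, so the results have to be remapped by hand. K-Means requires the number of clusters up front. Mean-shift does not: it only needs a bandwidth (search radius), which can be estimated automatically from a given number of samples. (A "cluster" here is simply a class; the scikit-learn documentation calls them clusters throughout.) At the end we compare both against KNN, a supervised algorithm, to see how it performs.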

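1. Use K-Means to cluster the 2D data automatically and predict the class for V1=80, V2=60.

2. Compute the prediction accuracy and correct the results.

3. Repeat steps 1-2 with KNN and MeanShift.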

Clustering with K-Means

The core steps are: load the data, train the model, correct the results.

# load the data
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
data.head()

	V1	V2
0	2.072345	-3.241693
1	17.936710	15.784810
2	1.083576	7.319176
3	11.120670	14.406780
4	23.711550	2.557729

# define X and y; X holds only the features, so the label column is dropped
X = data.drop(['labels'],axis=1)
y = data.loc[:,'labels']
# X.head()
y.head()

0    0
1    0
2    0
3    0
4    0
Name: labels, dtype: int64

# how many samples fall in each class of y
y.value_counts()

labels
2    1156
1     954
0     890
Name: count, dtype: int64

%matplotlib inline
from matplotlib import pyplot as plt
fig1 = plt.figure()
plt.scatter(X.loc[:,'V1'],X.loc[:,'V2'])
plt.title("unlabeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.show()

This view is not very informative, so let's also color the points by their original labels:

fig2 = plt.figure()
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.show()

# import and create the model; the number of clusters is set to 3
from sklearn.cluster import KMeans
KM = KMeans(n_clusters=3,random_state=0)

# train the model
KM.fit(X)

KMeans(n_clusters=3, random_state=0)
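To make it clearer what fitting does with the requested number of clusters, here is a minimal NumPy sketch of the classic Lloyd iteration. It is a simplified illustration, not scikit-learn's implementation (which also handles initialization and convergence checks); the function `lloyd_kmeans`, the toy points, and the hand-picked starting centers are all assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, n_iter=10):
    """Plain Lloyd iterations: assign each point to its nearest center,
    then move every center to the mean of its assigned points."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: two obvious groups, with one starting center placed in each.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_labels, toy_centers = lloyd_kmeans(pts, init_centers=[[0.0, 0.0], [10.0, 10.0]])
print(toy_labels)   # [0 0 1 1]
print(toy_centers)  # centers settle at (0, 1) and (10, 11)
```

The number of rows in `init_centers` plays the role of `n_clusters`: the algorithm never creates or removes clusters on its own, which is why K-Means needs that number up front.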

# get the cluster centers and overlay them on the labeled plot
centers = KM.cluster_centers_

fig3 = plt.figure()
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()

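Predict the cluster for V1=80, V2=60:

# K-Means returns a cluster index, not one of the original labels
y_predict_test = KM.predict([[80,60]])
print(y_predict_test)

This prints a single cluster index such as [2], but that index is only K-Means' own arbitrary numbering, so we cannot yet tell which original class it stands for.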

So let's feed the original data back into the model and look at what it predicts:

# predict on the training data and compare class counts
y_predict = KM.predict(X)
print(pd.Series(y_predict).value_counts(),y.value_counts())

0    1149
1     952
2     899
Name: count, dtype: int64 labels
2    1156
1     954
0     890
Name: count, dtype: int64

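Comparing the counts, predicted cluster [2] corresponds to original class [0], predicted cluster [0] corresponds to original class [2], and predicted cluster [1] corresponds to original class [1]. The next step is therefore to correct the results.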
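Matching clusters to classes by comparing counts works for three clusters, but the mapping can also be derived from a contingency table. The following is a minimal sketch rather than the method used in this post: the arrays `y_true` and `y_pred` are small made-up stand-ins, and it assumes the labels are the contiguous integers 0..k-1 so that matrix indices equal label values. It uses SciPy's `linear_sum_assignment` to pick the one-to-one mapping with the largest total overlap.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# Made-up stand-in data: the cluster numbering is permuted relative to
# the true labels (cluster 2 <-> class 0, cluster 0 <-> class 2).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([2, 2, 2, 1, 1, 1, 0, 0, 0, 1])

# Rows are true classes, columns are predicted cluster indices.
cm = confusion_matrix(y_true, y_pred)

# One-to-one assignment of clusters to classes with maximum overlap.
true_idx, pred_idx = linear_sum_assignment(cm, maximize=True)
mapping = {int(p): int(t) for t, p in zip(true_idx, pred_idx)}
print(mapping)  # cluster index -> original class

y_mapped = np.array([mapping[p] for p in y_pred])
print((y_mapped == y_true).mean())  # 0.9
```

Here the recovered mapping sends cluster 0 to class 2, cluster 1 to class 1, and cluster 2 to class 0, the same shape of correction that the counts suggest for the real data.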

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y,y_predict)
print(accuracy)

0.31966666666666665

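Without correction the accuracy is only about 0.32. It is even that high only because predicted cluster [1] happens to coincide with original class [1]; under a different arbitrary numbering it could easily have been below 0.01. Once the labels are remapped the accuracy should be high. Before doing that, compare the predictions with the original data: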

# visualize the predictions (left) next to the original labels (right)
fig4 = plt.subplot(121)
label0 = plt.scatter(X.loc[:,'V1'][y_predict==0],X.loc[:,'V2'][y_predict==0])
label1 = plt.scatter(X.loc[:,'V1'][y_predict==1],X.loc[:,'V2'][y_predict==1])
label2 = plt.scatter(X.loc[:,'V1'][y_predict==2],X.loc[:,'V2'][y_predict==2])

plt.title("predicted data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])

fig5 = plt.subplot(122)
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()
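The dependence of raw accuracy on cluster numbering can be shown on a tiny example. This sketch is an addition with made-up arrays (`y_true`, `y_perm`): it compares `accuracy_score` with `adjusted_rand_score`, a scikit-learn metric that only looks at how points are grouped and ignores what the groups are called.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = np.array([0, 0, 1, 1, 2, 2])
# Exactly the same grouping as y_true, but every cluster is renamed.
y_perm = np.array([1, 1, 2, 2, 0, 0])

acc = accuracy_score(y_true, y_perm)       # compares label values directly
ari = adjusted_rand_score(y_true, y_perm)  # compares groupings only
print(acc, ari)  # 0.0 1.0
```

A perfect clustering can therefore score an accuracy of zero before remapping, which is why the uncorrected score says little about clustering quality.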

Correct the labels:

# remap cluster indices to the original class labels: 0 -> 2, 2 -> 0, 1 -> 1
y_corrected = []
for i in y_predict:
    if i==0:
        y_corrected.append(2)
    elif i==2:
        y_corrected.append(0)
    else:
        y_corrected.append(1)
print(pd.Series(y_corrected).value_counts(),y.value_counts())

2    1149
1     952
0     899
Name: count, dtype: int64 labels
2    1156
1     954
0     890
Name: count, dtype: int64

The classes now line up, and the accuracy score reaches 0.997:

print(accuracy_score(y,y_corrected))

0.997
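The if/elif correction loop above can also be written as a single array lookup. This is a small sketch with a made-up `y_pred` array; `lookup` is a hypothetical name, and it encodes the same 0 -> 2, 1 -> 1, 2 -> 0 mapping used in this post.

```python
import numpy as np

# Made-up cluster indices, as a clustering model might return them.
y_pred = np.array([0, 2, 1, 0, 2])

# lookup[i] is the original class assigned to cluster i:
# cluster 0 -> class 2, cluster 1 -> class 1, cluster 2 -> class 0.
lookup = np.array([2, 1, 0])
y_fixed = lookup[y_pred]
print(y_fixed)  # [2 0 1 2 0]
```

Indexing one array with another applies the mapping to every element at once, and the result is already a NumPy array, so it can be used directly as a boolean-mask source when plotting.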

# convert the list to a NumPy array so it can be used as a boolean mask
y_corrected = np.array(y_corrected)
print(type(y_corrected))

<class 'numpy.ndarray'>

Plot the predictions against the original data again; this time they match:

fig6 = plt.subplot(121)
label0 = plt.scatter(X.loc[:,'V1'][y_corrected==0],X.loc[:,'V2'][y_corrected==0])
label1 = plt.scatter(X.loc[:,'V1'][y_corrected==1],X.loc[:,'V2'][y_corrected==1])
label2 = plt.scatter(X.loc[:,'V1'][y_corrected==2],X.loc[:,'V2'][y_corrected==2])

plt.title("corrected data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])

fig7 = plt.subplot(122)
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()

KNN vs. K-Means

Now let's train a KNN model on the same data and compare:

# build a KNN model (supervised: it is trained on both X and y)
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X,y)

KNeighborsClassifier(n_neighbors=3)

Use the trained model to predict the class for V1=80, V2=60:

# predict the test point V1=80, V2=60, then score on the training data
y_predict_knn_test = KNN.predict([[80,60]])
y_predict_knn = KNN.predict(X)
print(y_predict_knn_test)
print('knn accuracy:',accuracy_score(y,y_predict_knn))

[2]
knn accuracy: 1.0

print(pd.Series(y_predict_knn).value_counts(),y.value_counts())

2    1156
1     954
0     890
Name: count, dtype: int64 labels
2    1156
1     954
0     890
Name: count, dtype: int64

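Because KNN sees the class labels during training, its predictions use the same label values as the original data and need no correction. The accuracy reaches 100%, although note that it is measured on the same data the model was trained on.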
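Since the score above is computed on the training data, a held-out split gives a more honest estimate. The sketch below is an addition and does not use data.csv: it assumes synthetic blobs from `make_blobs` as a stand-in, and the names `X_demo`, `y_demo`, and `knn` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for data.csv: three 2D blobs with known labels.
X_demo, y_demo = make_blobs(n_samples=600, centers=3, cluster_std=2.0, random_state=0)

# Hold out 25% of the points; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
train_acc = accuracy_score(y_train, knn.predict(X_train))
test_acc = accuracy_score(y_test, knn.predict(X_test))
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
```

With a small `n_neighbors`, each training point is usually among its own nearest neighbors, so training accuracy tends to look better than accuracy on unseen points.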

fig6 = plt.subplot(121)
label0 = plt.scatter(X.loc[:,'V1'][y_predict_knn==0],X.loc[:,'V2'][y_predict_knn==0])
label1 = plt.scatter(X.loc[:,'V1'][y_predict_knn==1],X.loc[:,'V2'][y_predict_knn==1])
label2 = plt.scatter(X.loc[:,'V1'][y_predict_knn==2],X.loc[:,'V2'][y_predict_knn==2])

plt.title("knn results")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])

fig7 = plt.subplot(122)
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()

Clustering with MeanShift

# MeanShift: given a sample size, the bandwidth (search radius) is estimated automatically
from sklearn.cluster import MeanShift,estimate_bandwidth
# estimate the bandwidth from 500 sampled points
bw = estimate_bandwidth(X,n_samples=500)
print(bw)

30.84663454820215

# build the MeanShift model (unsupervised) and fit it
ms = MeanShift(bandwidth=bw)
ms.fit(X)

MeanShift(bandwidth=30.84663454820215)
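Note that the model above was never told how many clusters to find. The following sketch illustrates that on synthetic data; it is an addition that assumes three well-separated blobs from `make_blobs`, and the names `blob_centers`, `X_demo`, `bw_demo`, and `ms_demo` are hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three well-separated 2D blobs; MeanShift is not told there are three.
blob_centers = [[0, 0], [20, 20], [40, 0]]
X_demo, _ = make_blobs(n_samples=600, centers=blob_centers, cluster_std=1.5, random_state=0)

# Estimate the bandwidth from a sample of 500 points (quantile=0.3 is the default).
bw_demo = estimate_bandwidth(X_demo, quantile=0.3, n_samples=500, random_state=0)

ms_demo = MeanShift(bandwidth=bw_demo).fit(X_demo)
n_found = len(np.unique(ms_demo.labels_))
print('bandwidth:', bw_demo)
print('clusters found:', n_found)
```

The number of clusters falls out of the bandwidth: a larger bandwidth merges nearby modes into fewer clusters, a smaller one splits the data into more, so the estimate is worth inspecting before trusting the result.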

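The per-cluster counts show that MeanShift also needs its results corrected: like K-Means, it is an unsupervised method that works on unlabeled data, so its cluster numbering is arbitrary.

y_predict_ms = ms.predict(X)
print(pd.Series(y_predict_ms).value_counts(),y.value_counts())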

0    1149
1     952
2     899
Name: count, dtype: int64 labels
2    1156
1     954
0     890
Name: count, dtype: int64

fig6 = plt.subplot(121)
label0 = plt.scatter(X.loc[:,'V1'][y_predict_ms==0],X.loc[:,'V2'][y_predict_ms==0])
label1 = plt.scatter(X.loc[:,'V1'][y_predict_ms==1],X.loc[:,'V2'][y_predict_ms==1])
label2 = plt.scatter(X.loc[:,'V1'][y_predict_ms==2],X.loc[:,'V2'][y_predict_ms==2])

plt.title("ms results")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])

fig7 = plt.subplot(122)
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()

# remap cluster indices to the original class labels: 0 -> 2, 1 -> 1, 2 -> 0
y_corrected_ms = []
for i in y_predict_ms:
    if i==0:
        y_corrected_ms.append(2)
    elif i==1:
        y_corrected_ms.append(1)
    else:
        y_corrected_ms.append(0)
print(pd.Series(y_corrected_ms).value_counts(),y.value_counts())

2    1149
1     952
0     899
Name: count, dtype: int64 labels
2    1156
1     954
0     890
Name: count, dtype: int64

# convert the results to a NumPy array
y_corrected_ms = np.array(y_corrected_ms)
print(type(y_corrected_ms))

<class 'numpy.ndarray'>

Here are the results after correction:

fig6 = plt.subplot(121)
label0 = plt.scatter(X.loc[:,'V1'][y_corrected_ms==0],X.loc[:,'V2'][y_corrected_ms==0])
label1 = plt.scatter(X.loc[:,'V1'][y_corrected_ms==1],X.loc[:,'V2'][y_corrected_ms==1])
label2 = plt.scatter(X.loc[:,'V1'][y_corrected_ms==2],X.loc[:,'V2'][y_corrected_ms==2])

plt.title("ms corrected results")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])

fig7 = plt.subplot(122)
label0 = plt.scatter(X.loc[:,'V1'][y==0],X.loc[:,'V2'][y==0])
label1 = plt.scatter(X.loc[:,'V1'][y==1],X.loc[:,'V2'][y==1])
label2 = plt.scatter(X.loc[:,'V1'][y==2],X.loc[:,'V2'][y==2])

plt.title("labeled data")
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend((label0,label1,label2),('label0','label1','label2'))
plt.scatter(centers[:,0],centers[:,1])
plt.show()

Reference

scikit-learn User Guide, 2.3. Clustering

scikit-learn User Guide, Nearest Neighbors Classification