Dividing data into clusters using linear functions


I have many rows that form three distinct lines when plotted.

Sample data

line_position,queue_number,real_seq
0,2280,41171
55,3375,24999
55,733,11506
45,3939,29185
80,1522,14121
70,1022,10953
15,4687,24235
55,2072,14898
55,1755,12913
75,2014,17938
50,2178,14281
5,5612,36370
0,5689,38861
5,8023,40942
65,2777,21954
15,7384,39900
30,5241,35130
40,3554,19147
20,6663,37397
5,5134,28694
5,5273,32029
65,514,12791
10,7560,39851
25,6450,36909
50,1130,27140
20,4430,23025
0,5685,37094
0,5949,40905
20,6842,37547
5,5278,31231
15,7367,39031
40,4340,31534
35,3680,19437
5,5236,30761
5,2104,29053
0,5947,40685
45,3128,17475
40,4386,31495
50,3922,31394
15,7307,38805
55,3403,26704
70,2604,20509
5,5574,34118
55,733,11668
20,6663,37223
25,6430,37171
55,1815,12632
60,3094,23472
30,5798,36262
30,5293,34687
20,6554,37454
35,4767,34735
40,4411,31716
30,5427,35581
40,3350,18316
50,1075,14794
85,948,13668
80,1601,16079
5,4868,26220
20,6554,37075
5,2100,33351
75,666,5799
50,980,15290
95,387,7418
30,1715,20606
15,1980,25981
35,4759,30730
20,4603,24254
5,5059,28033
5,5257,32243
45,1308,16861
0,5849,38680
85,414,6927
0,2148,35148
70,2551,21015
35,4581,32535
80,561,6001
0,5672,35715
5,5152,33120
35,4984,34437
55,3574,27528
35,3762,19995
30,5798,39146
0,5911,40312
85,387,5917
35,4581,35933
55,754,11654
40,3610,25147
0,2252,39270
5,2042,34883
0,6032,41330
80,1826,20158
30,4075,21742
10,7517,40283
45,3029,19383
30,4933,32675
40,1479,21945
10,4826,25687
25,6380,37256
75,364,8215

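I need to split these rows into three clusters. I tried several clustering algorithms from sklearn.cluster (AgglomerativeClustering, Birch, DBSCAN, KMeans, MiniBatchKMeans, MeanShift), but, as expected, none of them split the data the way I need.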

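Looking at the plot, the simplest approach seems to be to "draw" two lines that divide the data into three clusters. However, I have not found any ready-made tool that lets me do this.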

So I would like to know whether any common data science library offers this. Or is there a better way to split my data into three clusters?

python pandas scikit-learn data-science cluster-analysis
1 Answer

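The first plot produced by the script in this answer shows the sample data after standardization. The sample you shared is too small for any of these models to detect the clusters. With a large enough dataset, standardized data, and well-chosen tuning parameters, a clustering model should be able to separate them.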

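Otherwise, you can draw the lines by hand, work out the slope and intercept of each separating line, and use those lines to split the data.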
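As a rough illustration of the manual approach, here is a minimal sketch in plain Python. It assumes the two separating lines are non-vertical, do not cross within the range of the data, and are each defined by two points read off the scatter plot. The helper names `line_through` and `assign_cluster`, the separator coordinates, and the toy points are all hypothetical and are not derived from the question's data.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the non-vertical line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def assign_cluster(x, y, lines):
    """Cluster index = number of separating lines the point lies above."""
    return sum(y > slope * x + intercept for slope, intercept in lines)


# Hypothetical separators, each given by two points picked by eye on the plot.
lower = line_through((0, 10), (10, 20))   # y = x + 10
upper = line_through((0, 30), (10, 40))   # y = x + 30
lines = [lower, upper]

# Toy points (not from the question): below both lines, between them, above both.
points = [(5, 5), (5, 25), (5, 50)]
labels = [assign_cluster(x, y, lines) for x, y in points]
print(labels)  # [0, 1, 2]
```

With real data, the same comparison can be applied to two columns of a NumPy array or pandas DataFrame in one vectorized expression. Which pair of columns to use depends on which plot shows the three lines.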

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np

s = [[0,2280,41171],[55,3375,24999],[55,733,11506],[45,3939,29185],[80,1522,14121],[70,1022,10953],
     [15,4687,24235],[55,2072,14898],[55,1755,12913],[75,2014,17938],[50,2178,14281],[5,5612,36370],
     [0,5689,38861],[5,8023,40942],[65,2777,21954],[15,7384,39900],[30,5241,35130],[40,3554,19147],
     [20,6663,37397],[5,5134,28694],[5,5273,32029],[65,514,12791],[10,7560,39851],[25,6450,36909],
     [50,1130,27140],[20,4430,23025],[0,5685,37094],[0,5949,40905],[20,6842,37547],[5,5278,31231],
     [15,7367,39031],[40,4340,31534],[35,3680,19437],[5,5236,30761],[5,2104,29053],[0,5947,40685],
     [45,3128,17475],[40,4386,31495],[50,3922,31394],[15,7307,38805],[55,3403,26704],[70,2604,20509],
     [5,5574,34118],[55,733,11668],[20,6663,37223],[25,6430,37171],[55,1815,12632],[60,3094,23472],
     [30,5798,36262],[30,5293,34687],[20,6554,37454],[35,4767,34735],[40,4411,31716],[30,5427,35581],
     [40,3350,18316],[50,1075,14794],[85,948,13668],[80,1601,16079],[5,4868,26220],[20,6554,37075],
     [5,2100,33351],[75,666,5799],[50,980,15290],[95,387,7418],[30,1715,20606],[15,1980,25981],
     [35,4759,30730],[20,4603,24254],[5,5059,28033],[5,5257,32243],[45,1308,16861],[0,5849,38680],
     [85,414,6927],[0,2148,35148],[70,2551,21015],[35,4581,32535],[80,561,6001],[0,5672,35715],
     [5,5152,33120],[35,4984,34437],[55,3574,27528],[35,3762,19995],[30,5798,39146],[0,5911,40312],
     [85,387,5917],[35,4581,35933],[55,754,11654],[40,3610,25147],[0,2252,39270],[5,2042,34883],
     [0,6032,41330],[80,1826,20158],[30,4075,21742],[10,7517,40283],[45,3029,19383],[30,4933,32675],
     [40,1479,21945],[10,4826,25687],[25,6380,37256],[75,364,8215]]

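# Standardize all three columns, then keep the first two
# (line_position, queue_number) for plotting and clustering.
X = StandardScaler().fit_transform(np.array(s))[:, :2]
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()

# Candidate models; the parameters are starting points and need tuning.
models = (
    DBSCAN(eps=0.5, min_samples=2),
    SpectralClustering(n_clusters=3, assign_labels="discretize"),
    KMeans(n_clusters=3),
    AgglomerativeClustering(n_clusters=3),
)

# Fit each model and color the points by the cluster label it assigns.
for m in models:
    m.fit(X)
    plt.scatter(X[:, 0], X[:, 1], s=20, c=m.labels_)
    plt.show()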
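
On choosing tuning parameters: one common heuristic for the DBSCAN `eps` value (not something the answer itself prescribes) is to look at the sorted distance from each point to its k-th nearest neighbour and pick `eps` near the point where that curve bends sharply upward. The sketch below is a plain-Python illustration under that assumption. The helper `k_distances` and the toy points are hypothetical, and `k` is usually taken close to `min_samples`.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Toy standardized points (hypothetical): two tight groups and one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (2.0, 2.0), (2.1, 2.0), (2.0, 2.1),
       (5.0, 5.0)]

kd = k_distances(pts, k=2)
# Small values come from points inside dense groups; the jump at the end
# comes from the outlier. An eps just above the small values keeps the
# groups together without absorbing the outlier.
print([round(d, 2) for d in kd])
```

Plotting `kd` for the real standardized data, for example with `plt.plot(kd)`, makes the bend easier to spot than reading the numbers.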