Python code for this algorithm to identify outliers in k-means clustering

Question

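I have an `input_df` whose index consists of strings rather than integers. An index value can be anything, such as "1234a" or "abcd".

I ran k-means on `input_df` with `k = 100` and received `centroid` and `labels` as output.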

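If I understand correctly, `centroid` holds 100 values, one per cluster, each being the mean of all points in that cluster. `labels` has the same length as `input_df` and shows which cluster each point/row belongs to.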
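To make that description concrete, here is a minimal sketch with made-up data: four rows, two clusters, and a string index. It does not run k-means. The `labels` array is written by hand, and the centroids are computed as per-cluster means, which is what k-means centroids are once the algorithm has converged. All values and the name `label_by_row` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with a string index, standing in for input_df.
input_df = pd.DataFrame(
    {"f1": [0.0, 0.2, 5.0, 5.2], "f2": [1.0, 1.0, 3.0, 3.0]},
    index=["1234a", "abcd", "x9", "zz"],
)

# Hand-written cluster ids, one per row, in row order (not from a real fit).
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the rows assigned to that cluster,
# giving an array of shape (n_clusters, n_features).
centroids = input_df.groupby(labels).mean().to_numpy()

# labels is positional; attaching the string index allows lookup by row name.
label_by_row = pd.Series(labels, index=input_df.index)
```

Because `labels` follows row order rather than the index, the string index does not affect the clustering itself; it only matters when mapping results back to rows.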

I now have to identify outliers in the k-means clustering by following the pseudocode below.

c_x : corresponding centroid of sample point x where x ∈ X

1. Compute the l2 distance of every point to its corresponding centroid.
2. t = the 0.05 or 95% percentile of the l2 distances.
3. for each sample point x in X do
4.     if || x - c_x ||2 > t then 
5.          mark x as outlier

Note: the 2 in line 4 is a subscript.

Now, I do not fully understand the condition in line 4.
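For reference, `|| x - c_x ||2` is the Euclidean (l2) norm of the difference between a point and the centroid of its cluster, that is, the straight-line distance between them. Line 4 therefore asks whether a point lies farther from its own centroid than the threshold `t`. A small worked example, with values made up for illustration:

```python
import numpy as np

x = np.array([3.0, 4.0])    # one sample point (made-up values)
c_x = np.array([0.0, 0.0])  # centroid of the cluster x was assigned to
t = 2.5                     # threshold from step 2 (made-up value)

# l2 norm of the difference: sqrt(3**2 + 4**2) = 5.0
distance = np.linalg.norm(x - c_x)

# The condition in line 4: farther from the centroid than t means outlier.
is_outlier = distance > t
```

Here the distance is 5.0, which exceeds 2.5, so this point would be marked as an outlier.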

Could someone provide equivalent Python code for the above algorithm?

Here is the structure of the code.

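from sklearn.cluster import KMeans

def remove_outliers(input_df, centroids, labels):
    # To be implemented from the pseudocode above.
    pass

# Cluster the data into 100 groups.
kmeans = KMeans(n_clusters=100)
kmeans.fit(input_df)
centroids = kmeans.cluster_centers_  # one centroid per cluster
labels = kmeans.labels_              # cluster id for each row of input_df

filtered_centroids, filtered_labels = remove_outliers(input_df, centroids, labels)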
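One possible way to turn the pseudocode into Python is sketched below. It is not a definitive implementation, and it rests on a few assumptions: the threshold in step 2 is read as the 95th percentile (so roughly the farthest 5% of points are flagged), all columns of `input_df` are numeric, and `labels` is in the same row order as `input_df`. The function name `mark_outliers` and its `percentile` parameter are hypothetical. It returns a boolean mask rather than filtered centroids, since removing points does not by itself change the centroids.

```python
import numpy as np
import pandas as pd


def mark_outliers(input_df, centroids, labels, percentile=95.0):
    """Return a boolean Series (indexed like input_df), True for outliers."""
    X = input_df.to_numpy(dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)

    # c_x for every row: pick each point's own centroid via its cluster label.
    own_centroids = centroids[labels]

    # Step 1: l2 distance of every point to its corresponding centroid.
    distances = np.linalg.norm(X - own_centroids, axis=1)

    # Step 2: threshold t (assumed here to mean the 95th percentile).
    t = np.percentile(distances, percentile)

    # Steps 3-5: a point is an outlier when its distance exceeds t.
    return pd.Series(distances > t, index=input_df.index, name="is_outlier")


# Tiny hand-made example with a string index (no k-means run needed).
input_df = pd.DataFrame(
    {"f1": [0.0, 0.1, 0.2, 5.0, 5.1, 9.0], "f2": [0.0, 0.1, 0.0, 5.0, 5.1, 9.0]},
    index=["1234a", "abcd", "x9", "q7", "zz", "far"],
)
centroids = np.array([[0.1, 0.033], [5.05, 5.05]])  # made-up centroids
labels = np.array([0, 0, 0, 1, 1, 1])               # made-up assignments

is_outlier = mark_outliers(input_df, centroids, labels)
filtered_df = input_df[~is_outlier]
filtered_labels = labels[~is_outlier.to_numpy()]
```

In this toy data only the row `"far"` is flagged. With real data, the same call should work on `kmeans.cluster_centers_` and `kmeans.labels_`. Because the mask carries the string index of `input_df`, flagged rows can be looked up by name.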
