Setting distance_threshold in agglomerative clustering

Question

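I have a dataset and I want to use AgglomerativeClustering to find clusters in it.

I tried it on a few sample arrays but could not work out how to set distance_threshold. I am considering this parameter because I do not know the number of clusters in advance for datasets like this one.

The sample code is below.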

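from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower', 'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# Ward linkage needs a dense array; convert the argument x (not the outer X) with toarray()
transformer = FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)
X_dense = transformer.transform(X)
AG = AgglomerativeClustering(n_clusters=None, distance_threshold=2, linkage='ward')
y_km = AG.fit_predict(X_dense)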

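My question is:

With distance_threshold set to 2, all records end up in one cluster. With 1, I get six clusters. With 1.5, I get 2 clusters, which is correct for this example. This is sample data I made up so that I could experiment and check the result, but how should I choose the value for production code?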
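The behaviour described above can be reproduced with a small sweep over candidate thresholds. This is only a sketch: the helper name `count_clusters` is hypothetical, and the exact counts may vary with the scikit-learn version and the TF-IDF weights, so no expected output is shown.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()


def count_clusters(X, threshold):
    """Hypothetical helper: fit Ward clustering and return the number of clusters found."""
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=threshold,
                                    linkage='ward')
    model.fit(X)
    return model.n_clusters_


# Number of clusters for each threshold mentioned in the question
counts = {t: count_clusters(X_dense, t) for t in (1.0, 1.5, 2.0)}
print(counts)
```

A larger threshold can only merge more clusters, so the counts should never increase as the threshold grows.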

Is there a better way to choose distance_threshold?

Answer 1

I have been looking for an answer to this myself, without luck. So I will put what I found here for the next person who searches.

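If

n_samples, n_features = X_dense.shape

then I would expect distance_threshold to be related to the largest Euclidean distance in X_dense. For simplicity, if the features are normalized, that would be sqrt(n_features) (a maximum distance of 1 along each dimension).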

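However, this does not match the distances that AgglomerativeClustering reports. I found that (I think) the correct formula for the maximum distance in the normalized case is

max_dist = np.sqrt(n_features * (2 - 2/n_samples))

There is probably a good reason for this, but I do not know it.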
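One way to check the formula against a concrete dataset is to build the full tree and read the merge distances directly from the fitted model's `distances_` attribute. The sketch below does that for the corpus from the question, which is an assumption on my part since the answer does not say which data it was tested on. A plausible explanation for the `2 - 2/n_samples` term is that Ward linkage scales the distance between the centroids of clusters of sizes a and b by sqrt(2ab / (a + b)), which for one sample merged with the remaining n_samples - 1 gives sqrt(2 - 2/n_samples). Merges between two larger clusters can give a bigger factor, so the observed maximum is probably the safer reference.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()
n_samples, n_features = X_dense.shape

# distance_threshold=0 forces the full tree to be built, which populates distances_
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=0,
                                linkage='ward').fit(X_dense)

merge_distances = model.distances_      # one entry per merge, n_samples - 1 in total
observed_max = merge_distances.max()    # distance of the final merge in practice
formula_max = np.sqrt(n_features * (2 - 2 / n_samples))

print("observed max merge distance:", observed_max)
print("formula value:", formula_max)
```

If the two numbers differ a lot on your data, scaling the threshold by `observed_max` may be more meaningful than scaling it by the formula.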

This way, you can set distance_threshold to the more intuitive normalized_distance_threshold * max_dist, where normalized_distance_threshold=0 gives single-sample clusters and normalized_distance_threshold=1 gives a single cluster.
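The scaling described above could be wrapped up roughly as follows. This is a minimal sketch: the function name `cluster_with_normalized_threshold` is hypothetical, `max_dist` uses the formula from this answer as-is, and the corpus from the question serves as test data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_with_normalized_threshold(X, normalized_distance_threshold):
    """Hypothetical helper: scale a threshold in [0, 1] by the answer's max_dist formula."""
    n_samples, n_features = X.shape
    max_dist = np.sqrt(n_features * (2 - 2 / n_samples))
    model = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=normalized_distance_threshold * max_dist,
        linkage='ward',
    )
    return model.fit(X)


corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()

# 0.0 keeps every sample separate, 1.0 merges everything; 0.4 is an arbitrary value in between
results = {t: cluster_with_normalized_threshold(X_dense, t).n_clusters_
           for t in (0.0, 0.4, 1.0)}
print(results)
```

The intermediate value still has to be tuned per dataset, but it is at least expressed on a fixed 0 to 1 scale.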
