I am doing hierarchical document clustering, and my workflow is roughly this:
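import pandas
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, fcluster
df = pandas.read_csv(file, delimiter='\t', index_col=0)  # document-term matrix (very sparse)
dist_matrix = cosine_similarity(df)
linkage_matrix = ward(dist_matrix)
labels = fcluster(linkage_matrix, 5, criterion='maxclust')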
I then expect to get 5 clusters, but when I plot the dendrogram with
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
fig, ax = plt.subplots(figsize=(15, 20))  # set size
dendro = dendrogram(linkage_matrix, orientation="right")
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off
plt.tight_layout()  # show plot with tight layout
plt.savefig('ward_clusters.png', dpi=200)  # save figure as ward_clusters.png
I get the following figure:
Judging by the colors, I can see 3 clusters, not 5! Am I misunderstanding what the dendrogram means?
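To narrow this down, here is a minimal sketch that compares the number of flat clusters from fcluster with the number of colored subtrees in the dendrogram. It rests on assumptions: the data is a synthetic random matrix standing in for my document-term matrix, and the names X, cut, n_flat, n_default and n_matched are made up for illustration. As far as I can tell from the SciPy documentation, dendrogram colors subtrees using its color_threshold parameter, which defaults to 0.7 times the largest merge height, so the colors need not match the maxclust value passed to fcluster.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster, dendrogram

# Synthetic stand-in for the real document-term matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))

Z = ward(X)  # Ward linkage computed on the raw observation vectors

# Flat clustering: at most 5 clusters, independent of any dendrogram coloring.
labels = fcluster(Z, 5, criterion='maxclust')
n_flat = len(np.unique(labels))

# Default coloring: only links below 0.7 * max merge height get a subtree color.
default_threshold = 0.7 * Z[:, 2].max()
d_default = dendrogram(Z, no_plot=True, above_threshold_color='grey')
n_default = len(set(d_default['color_list']) - {'grey'})

# A cut between the 6->5 merge (Z[-5]) and the 5->4 merge (Z[-4]) leaves
# 5 subtrees below it. Singleton clusters have no links, so they get no color.
cut = 0.5 * (Z[-5, 2] + Z[-4, 2])
d_matched = dendrogram(Z, no_plot=True, color_threshold=cut,
                       above_threshold_color='grey')
n_matched = len(set(d_matched['color_list']) - {'grey'})

print("flat clusters:", n_flat)
print("default color threshold:", default_threshold)
print("colored subtrees (default threshold):", n_default)
print("colored subtrees (threshold at 5-cluster cut):", n_matched)
```

If the two counts differ with the default threshold but agree once color_threshold is set to the 5-cluster cut height, then the colors in my figure would just reflect the default threshold rather than the result of fcluster.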