Hierarchical clustering with cosine distance in R

Question (votes: 1, answers: 1)

I want to run hierarchical clustering on a corpus of documents in R, using cosine similarity, but I get the following error:

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

What should I do?

Here is an example that reproduces it:

library(tm)
doc <- c( "The sky is blue.", "The sun is bright today.", "The sun in the sky is bright.", "We can see the shining sun, the bright sun." )
doc_corpus <- Corpus( VectorSource(doc) )
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)



tf <- as.matrix(tdm)
( idf <- log( ncol(tf) / ( 1 + rowSums(tf != 0) ) ) )
( idf <- diag(idf) )
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)

tf_idf

cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))
cluster1 <- hclust(cosine_dist, method = "ward.D")

Then I get the error:

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

r tm hierarchical-clustering
1 answer
1 vote

There are two problems:

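1: cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2)))) produces NaN values because you are dividing by 0. With this idf formula, a term that appears in three of the four documents gets idf = log(4 / (1 + 3)) = 0, so its tf-idf column is all zeros and its norm is 0.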

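2: hclust needs a dist object, not just a matrix. A plain matrix has no "Size" attribute, and that is what triggers the error above. See ?hclust for details.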
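Both problems can be seen in isolation with a small sketch in base R. The matrix `m` below is made-up toy data (not the question's tf-idf matrix), chosen so that one column is all zeros; `%o%` is just the outer product, equivalent to the `%*% t(...)` form used in the question.

```r
# Toy matrix (illustrative, not the question's data); columns play the role of terms.
# The third column is all zeros, like a term whose idf is 0.
m <- matrix(c(1, 2,
              2, 1,
              0, 0), nrow = 2)

norms <- sqrt(colSums(m^2))                     # the norm of column 3 is 0
cos_dist <- 1 - crossprod(m) / (norms %o% norms)

anyNA(cos_dist)                  # TRUE: 0 / 0 yields NaN for the zero column
is.null(attr(cos_dist, "Size"))  # TRUE: a plain matrix has no "Size" attribute

cos_dist[is.na(cos_dist)] <- 0   # same replacement the answer uses
d <- as.dist(cos_dist)
attr(d, "Size")                  # 3: the number of objects hclust will see
```

Once the NaN values are gone and the matrix is wrapped with as.dist, hclust has the attribute it looks for.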

Both can be fixed with the following code:

.....
cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))

# replace NaN values with 0
cosine_dist[is.na(cosine_dist)] <- 0

# create dist object
cosine_dist <- as.dist(cosine_dist)

cluster1 <- hclust(cosine_dist, method = "ward.D")

plot(cluster1)
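Note that tf_idf has documents in rows and terms in columns, so crossprod(tf_idf) compares columns and the dendrogram above groups terms. If the goal is to group the documents themselves, the same idea can be applied to rows. The following is a minimal sketch under stated assumptions: `cosine_dist_rows` is a hypothetical helper (not part of tm or stats), the matrix `x` is made-up toy data, and zero-norm rows are treated as having similarity 0 to everything. That last choice differs from the answer's NaN-to-0 replacement on the distance, and either is a modeling decision rather than a rule.

```r
# Hypothetical helper (not from the original post): cosine distance between
# the ROWS of a numeric matrix, returned as a "dist" object for hclust.
cosine_dist_rows <- function(x) {
  norms <- sqrt(rowSums(x^2))
  sim <- tcrossprod(x) / (norms %o% norms)  # row-by-row cosine similarity
  sim[!is.finite(sim)] <- 0  # assumption: a zero-norm row is similar to nothing
  sim[sim > 1] <- 1          # guard against floating-point overshoot
  d <- 1 - sim
  diag(d) <- 0               # every row is at distance 0 from itself
  as.dist(d)
}

# Toy document-by-term weights (illustrative only).
x <- rbind(
  doc1 = c(1, 1, 0, 0),
  doc2 = c(2, 2, 0, 0),
  doc3 = c(0, 0, 1, 1),
  doc4 = c(0, 0, 3, 2)
)

d_docs <- cosine_dist_rows(x)
hc <- hclust(d_docs, method = "ward.D")
groups <- cutree(hc, k = 2)  # cluster label for each document
groups
```

With the question's data, the call would be cosine_dist_rows(tf_idf), and plot(hc) draws the document dendrogram.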

