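I have a data frame, and I want to get the weight (count) of each word in a sentence via a DTM or TDM. From those weights I want the maximum weight and the word that carries it, and then I want to apply a calculation to each word weight.
My data frame looks like this: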
text
1. miralisitin manzoorpashteen
2. She is best of best.
3. Try again and again.
4. Beware of this woman. She is bad woman.
5. Hold! hold and hold it tight.
I would like it to look like this:
text                                          wordweight      maxword   maxcount
1. miralisitin manzoorpashteen                1 1             NA        NA
2. She is best of best.                       1 1 2 1         best      2
3. Try again and again.                       1 2 1           again     2
4. Beware of this woman. She is bad woman.    1 1 1 2 1 1 1   woman     2
5. Hold! hold and hold it tight.              3 1 1 1         hold      3
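How can I do this?
I tried the quanteda library, but I could not get this result, because its dfm() function works on a corpus rather than a data frame. It should also be possible with a DTM or TDM from the tm library, but I have not managed to get it in this form.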
The solution below gives you a word-frequency table for each sentence. You should be able to post-process it to get what you need.
library(stringr)
df <- structure(list(text = structure(c(3L, 4L, 5L, 1L, 2L),
.Label = c("Beware of this woman. She is bad woman.",
"Hold! hold and hold it tight.", "miralisitin manzoorpashteen",
"She is best of best.", "Try again and again."),
class = "factor")), class = "data.frame", row.names = c(NA, -5L))
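# For each sentence: replace every non-alphanumeric character with a space,
# collapse repeated whitespace and trim the ends, split on single spaces,
# lowercase the words, and tabulate them.
lapply(df$text, function(x) {
  cleaned <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "",
                  as.character(str_replace_all(x, "[^[:alnum:]]", " ")),
                  perl = TRUE)
  table(tolower(unlist(strsplit(cleaned, " "))))
})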
#> [[1]]
#> manzoorpashteen     miralisitin
#>               1               1
#>
#> [[2]]
#> best   is   of  she
#>    2    1    1    1
#>
#> [[3]]
#> again   and   try
#>     2     1     1
#>
#> [[4]]
#>    bad beware     is     of    she   this  woman
#>      1      1      1      1      1      1      2
#>
#> [[5]]
#>   and  hold    it tight
#>     1     3     1     1
Created on 2019-05-01 by the reprex package (v0.2.1)
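As for the post-processing step, here is one possible sketch in base R. It is not part of the solution above: the helper word_counts() is a hypothetical name introduced here, and it counts words in order of first appearance so the counts line up with the wordweight column in the question. It also assumes that maxword and maxcount should be NA when no word occurs more than once, as in row 1 of the desired output. The relative-frequency step at the end is only a stand-in for whatever calculation you want to apply to each weight.

```r
# Hypothetical helper: count the words of one sentence,
# keeping the order in which each word first appears.
word_counts <- function(x) {
  words <- tolower(unlist(strsplit(as.character(x), "[^[:alnum:]]+")))
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  table(factor(words, levels = unique(words)))
}

df <- data.frame(
  text = c("miralisitin manzoorpashteen",
           "She is best of best.",
           "Try again and again.",
           "Beware of this woman. She is bad woman.",
           "Hold! hold and hold it tight."),
  stringsAsFactors = FALSE
)

counts <- lapply(df$text, word_counts)

# One string of counts per sentence, e.g. "1 1 2 1"
df$wordweight <- vapply(counts, function(tb) paste(tb, collapse = " "),
                        character(1))

# Assumption: report a max word only if some word occurs more than once
df$maxword <- vapply(counts, function(tb) {
  if (length(tb) > 0 && max(tb) > 1) names(tb)[which.max(tb)] else NA_character_
}, character(1))

df$maxcount <- vapply(counts, function(tb) {
  if (length(tb) > 0 && max(tb) > 1) as.integer(max(tb)) else NA_integer_
}, integer(1))

# Illustrative calculation on each word weight: relative frequency
rel_freq <- lapply(counts, function(tb) tb / sum(tb))

print(df)
```

When several words tie for the maximum, which.max() returns the first one, so you would need a different rule if you want all tied words.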