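I have a data frame, and I want to get a weight for each word in a sentence through a DTM or TDM. From those weights, I want the maximum weight and the word that carries it, so that they can be reported alongside the per-word weights.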
My data frame looks like this:
text
1. miralisitin manzoorpashteen
2. She is best of best.
3. Try again and again.
4. Beware of this woman. She is bad woman.
5. Hold! hold and hold it tight.
I want it to look like this:
   text                                       wordweight     maxword  maxcount
1. miralisitin manzoorpashteen                1 1            NA       NA
2. She is best of best.                       1 1 2 1        best     2
3. Try again and again.                       1 2 1          again    2
4. Beware of this woman. She is bad woman.    1 1 1 2 1 1 1  woman    2
5. Hold! hold and hold it tight.              3 1 1 1        hold     3
How would I do this?
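I have tried this with the quanteda library, but because its dfm() function works on a corpus rather than on a data frame, I could not get a result. It might also be possible with a DTM or TDM from the tm library, but I have not managed to get it that way either.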
The solution below gives you a frequency table of the words in each sentence. You should be able to post-process it to get what you need.
library(stringr)
df <- structure(list(text = structure(c(3L, 4L, 5L, 1L, 2L),
.Label = c("Beware of this woman. She is bad woman.",
"Hold! hold and hold it tight.", "miralisitin manzoorpashteen",
"She is best of best.", "Try again and again."),
class = "factor")), class = "data.frame", row.names = c(NA, -5L))
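# For each sentence: replace non-alphanumeric characters with spaces,
# collapse repeated whitespace and trim both ends, split on single spaces,
# lowercase the words, then tabulate how often each word occurs.
lapply(df$text, function(x) {
  cleaned <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "",
                  as.character(str_replace_all(x, "[^[:alnum:]]", " ")),
                  perl = TRUE)
  table(tolower(unlist(strsplit(cleaned, " "))))
})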
#> [[1]]
#>
#> manzoorpashteen     miralisitin
#>               1               1
#>
#> [[2]]
#>
#> best   is   of  she
#>    2    1    1    1
#>
#> [[3]]
#>
#> again   and   try
#>     2     1     1
#>
#> [[4]]
#>
#>    bad beware     is     of    she   this  woman
#>      1      1      1      1      1      1      2
#>
#> [[5]]
#>
#>   and  hold    it tight
#>     1     3     1     1
Created on 2019-05-01 by the reprex package (v0.2.1)
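As a rough sketch of the post-processing step mentioned above, the per-sentence counts can be turned into the wordweight, maxword and maxcount columns from the question using base R only. The helper name count_words is hypothetical and not part of the original answer. Two further assumptions are inferred from the desired output: counts are listed in order of first appearance in the sentence (not alphabetically, as table() does by default), and maxword and maxcount are NA when no word occurs more than once.

```r
# Hypothetical helper (not from the original answer): lowercase a sentence,
# replace runs of non-alphanumeric characters with a space, and count the
# words in order of first appearance.
count_words <- function(sentence) {
  cleaned <- tolower(gsub("[^[:alnum:]]+", " ", sentence))
  words <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by leading spaces
  # factor() with explicit levels keeps first-appearance order in table()
  table(factor(words, levels = unique(words)))
}

df <- data.frame(
  text = c("miralisitin manzoorpashteen",
           "She is best of best.",
           "Try again and again.",
           "Beware of this woman. She is bad woman.",
           "Hold! hold and hold it tight."),
  stringsAsFactors = FALSE
)

counts <- lapply(as.character(df$text), count_words)

# One space-separated string of counts per sentence
df$wordweight <- vapply(counts, function(tb) {
  paste(as.integer(tb), collapse = " ")
}, character(1))

# Assumption: report NA when no word repeats, as in row 1 of the desired output
df$maxword <- vapply(counts, function(tb) {
  if (max(tb) > 1) names(tb)[unname(which.max(tb))] else NA_character_
}, character(1))

df$maxcount <- vapply(counts, function(tb) {
  if (max(tb) > 1) as.integer(max(tb)) else NA_integer_
}, integer(1))

print(df)
```

If two words tie for the highest count, which.max() picks the first one; whether that is the right tie-breaking rule depends on what the question intends.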