Showing the top 5 keywords for each category using group_by


I am trying to find the top 5 keywords in the reviews for each product category. I have the following code:

library(dplyr)

# Group by category and keyword, then count keyword frequencies
keyword_counts <- filtered_data %>%
  group_by(category, keyword) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Find the top 5 keywords in each category
top_keywords_by_category <- keyword_counts %>%
  group_by(category) %>%
  top_n(5, wt = n) %>%
  ungroup()  # Ungroup the data

# Print the table
print(top_keywords_by_category)

This produces the following output:

   category                                                        keyword     n
   <chr>                                                           <chr>   <int>
 1 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… product   354
 2 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… cable     277
 3 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… chargi…   200
 4 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… quality   179
 5 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… nice      147
 6 Electronics|WearableTechnology|SmartWatches                     watch     129
 7 Electronics|Mobiles&Accessories|Smartphones&BasicMobiles|Smart… phone     127
 8 Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions   tv        117
 9 Electronics|WearableTechnology|SmartWatches                     product   102
10 Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions   product    80

whereas the result I want is:

Category Computers&Accessories
Keyword             n
1 Product          354
2 Cable            277
3 Chargi...        200
4 Quality          179
5 Nice             147
r dplyr group-by tokenize summarize
1 Answer

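The sample data is not very interesting, but it should show you how to use tidyr::separate_rows to split the pipe-delimited category column into one row per category level: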

quux <- structure(list(category = c("Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Electronics|WearableTechnology|SmartWatches", "Electronics|Mobiles&Accessories|Smartphones&BasicMobiles|Smart…", "Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions",  "Electronics|WearableTechnology|SmartWatches", "Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions"), keyword = c("product", "cable", "chargi…", "quality", "nice", "watch", "phone", "tv", "product", "product"), n = c(354L, 277L, 200L, 179L, 147L, 129L, 127L, 117L, 102L, 80L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

library(dplyr)
quux %>%
  tidyr::separate_rows(category, sep = "\\|") %>%
  count(category, keyword) %>%
  arrange(desc(n))
# # A tibble: 32 × 3
#    category                keyword     n
#    <chr>                   <chr>   <int>
#  1 Electronics             product     2
#  2 Accessories&Peripherals cable       1
#  3 Accessories&Peripherals chargi…     1
#  4 Accessories&Peripherals nice        1
#  5 Accessories&Peripherals product     1
#  6 Accessories&Peripherals quality     1
#  7 Cables&Accessori…       cable       1
#  8 Cables&Accessori…       chargi…     1
#  9 Cables&Accessori…       nice        1
# 10 Cables&Accessori…       product     1
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
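
One thing to note: because quux already holds pre-computed counts in its n column, a plain count(category, keyword) counts rows rather than summing those counts, which is why most values above are 1. If summing is what you want, count() accepts a wt argument. A minimal sketch, assuming a small made-up data frame (toy and its values are hypothetical, not taken from the question):

```r
library(dplyr)

# Hypothetical stand-in for pre-aggregated data: category paths, keywords, counts
toy <- data.frame(
  category = c("A|B", "A|C", "A|B"),
  keyword  = c("product", "product", "cable"),
  n        = c(10L, 5L, 3L)
)

weighted <- toy %>%
  # One row per category level, e.g. "A|B" becomes rows for "A" and "B"
  tidyr::separate_rows(category, sep = "\\|") %>%
  # Sum the existing n column instead of counting rows; sort by the total
  count(category, keyword, wt = n, sort = TRUE)

print(weighted)
```

Here "product" under category "A" appears in two source rows (10 and 5), so its weighted total is their sum instead of a row count of 2.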

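From here, you can keep the top 5 rows and pivot to a wide layout (note that slice_max is ungrouped here, so it picks the top 5 rows overall):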

quux %>%
  tidyr::separate_rows(category, sep = "\\|") %>%
  count(category, keyword) %>%
  slice_max(n = 5, order_by = n, with_ties = FALSE) %>%
  tidyr::pivot_wider(names_from = category, values_from = n, values_fill = list(n = 0))
# # A tibble: 4 × 3
#   keyword Electronics `Accessories&Peripherals`
#   <chr>         <int>                     <int>
# 1 product           2                         1
# 2 cable             0                         1
# 3 chargi…           0                         1
# 4 nice              0                         1
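
If the goal is the top keywords within each category rather than overall, one option is to group before slicing and then split the result into one small table per category, which is close to the layout asked for in the question. This is a sketch under assumptions: counts and its category names are invented for illustration, and n = 3 is used only to keep the toy output short (use n = 5 for the question's case).

```r
library(dplyr)

# Hypothetical pre-aggregated counts: one row per category/keyword pair
counts <- data.frame(
  category = rep(c("Cables", "SmartWatches"), each = 4),
  keyword  = c("product", "cable", "quality", "nice",
               "watch", "product", "strap", "battery"),
  n        = c(354L, 277L, 179L, 147L, 129L, 102L, 40L, 12L)
)

top_per_category <- counts %>%
  group_by(category) %>%
  # Keep the 3 largest n within each category, dropping ties beyond 3 rows
  slice_max(order_by = n, n = 3, with_ties = FALSE) %>%
  ungroup()

# Split into a named list holding one keyword/n table per category
tables <- split(top_per_category[c("keyword", "n")], top_per_category$category)

print(tables[["Cables"]])
```

Each element of tables can then be printed or exported on its own, with the list names serving as the category headings.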