Counting words in strings, grouped by year


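I'm trying to use R to find the most popular words in a set of strings. It's probably easiest to explain with an example.

Take this as input (the real data has millions of entries, and each date can occur thousands of times):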

        IncorporationDate                          CompanyName
3007931        2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999         2013-03-28          AGB SERVICES ANGLIA LIMITED
2255234        2008-05-22           CIDA INTERNATIONAL LIMITED
310577         2017-09-19               FA IT SERVICES LIMITED
2020738        2012-09-03              THE SPARES SHOP LIMITED
2776144        2006-02-03         ANGELVIEW PROPERTIES LIMITED
2420435        2017-10-17                SHANE WARD TM LIMITED
2523165        2014-06-04      THE INDEPENDENT GIN COMPANY LTD
2594847        2015-05-05                  AIA ENGINEERING LTD
2701395        2015-05-27                LAURA BRIDGES LIMITED

I'd like to find the ten most popular words used in each year, with a result that looks something like this:

| Year | Top1    | Top1_Count | Top2 | Top2_Count | ...
| ---- | ------- | ---------- | ---- | ---------- | 
| 2017 | LIMITED | 2          | IT   | 1          |
| ...

The closest I've got so far is:

words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))

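But this loses the year information and only gives overall totals for the whole data frame.

I've also played around with dplyr's summarise, but haven't found a way to make it do what I want.

Edit: Using the answer from @maurits-evers I got a bit further, and found that this gives the top 10 per year: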

top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)

I'm just trying to work out how to get it into the shape I need.

Thanks

r dataframe top-n
1 answer

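You could do something like this:

library(tidyverse)
df %>%
    # Extract the four-digit year from the incorporation date
    mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
    group_by(year) %>%
    # Split each company name on spaces into a list-column of words
    mutate(words = strsplit(as.character(CompanyName), " ")) %>%
    # Expand the list-column to one row per word (recent tidyr versions expect the column to be named)
    unnest(words) %>%
    # Count how often each word occurs in each year
    count(year, words)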
#    year  words             n
#    <chr> <chr>         <int>
#  1 2003  BUSINESS          1
#  2 2003  CONSULTANTS       1
#  3 2003  LIMITED           1
#  4 2003  OUTLANE           1
#  5 2006  ANGELVIEW         1
#  6 2006  LIMITED           1
#  7 2006  PROPERTIES        1
#  8 2008  CIDA              1
#  9 2008  INTERNATIONAL     1
# 10 2008  LIMITED           1
# # ... with 26 more rows

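Explanation: extract the year from IncorporationDate into year, split CompanyName into words, unnest the resulting list-column so that each word gets its own row, and then count the number of occurrences of each word per year.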
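The question also asks for one row per year with the top words spread across columns. A minimal sketch of one way to get there is shown below, assuming a tidyr version that provides pivot_wider (1.0.0 or later). The cutoff top_n_words, the rank column, the alphabetical tie-breaking, and the resulting column names (Top1_words, Top1_n, and so on) are illustrative choices, not part of the original answer. A small subset of the sample data is used so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Illustrative subset of the sample data
df <- data.frame(
  IncorporationDate = c("2017-09-19", "2017-10-17", "2015-05-05", "2015-05-27"),
  CompanyName = c("FA IT SERVICES LIMITED", "SHANE WARD TM LIMITED",
                  "AIA ENGINEERING LTD", "LAURA BRIDGES LIMITED"),
  stringsAsFactors = FALSE
)

top_n_words <- 2  # hypothetical cutoff; the question asks for 10

# Same counting step as in the answer: one row per (year, word) with its count
words_by_year <- df %>%
  mutate(year = format(as.Date(IncorporationDate), "%Y"),
         words = strsplit(CompanyName, " ")) %>%
  unnest(words) %>%
  count(year, words)

top_words_wide <- words_by_year %>%
  group_by(year) %>%
  # Highest counts first within each year; ties are broken alphabetically (an assumption)
  arrange(desc(n), words, .by_group = TRUE) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= top_n_words) %>%
  ungroup() %>%
  # One row per year; one word column and one count column per rank
  pivot_wider(id_cols = year,
              names_from = rank,
              values_from = c(words, n),
              names_glue = "Top{rank}_{.value}")

print(top_words_wide)
```

With this subset, LIMITED is the top word for 2017 with a count of 2. Many words tie at a count of 1, so which of them lands in the remaining slots depends entirely on the tie-breaking rule; using row_number() guarantees exactly top_n_words columns per year, whereas top_n() keeps all tied rows. The columns can be reordered afterwards with select() if an interleaved word/count layout is preferred.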


Sample data

df <- read.table(text =
    "IncorporationDate                          CompanyName
3007931        2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999         2013-03-28          'AGB SERVICES ANGLIA LIMITED'
2255234        2008-05-22           'CIDA INTERNATIONAL LIMITED'
310577         2017-09-19               'FA IT SERVICES LIMITED'
2020738        2012-09-03              'THE SPARES SHOP LIMITED'
2776144        2006-02-03         'ANGELVIEW PROPERTIES LIMITED'
2420435        2017-10-17                'SHANE WARD TM LIMITED'
2523165        2014-06-04      'THE INDEPENDENT GIN COMPANY LTD'
2594847        2015-05-05                  'AIA ENGINEERING LTD'
2701395        2015-05-27                'LAURA BRIDGES LIMITED'", header = TRUE)