I have a data frame in R, `mydf`:
mydf <- data.frame(
  last_name <- c("Jay", "Kelly", "Mark", "Lisa", "Jay", "Kelly", "Mark", "Lisa", "Lisa", "Lisa", "Kelly", "Kelly"),
  first_name <- c("Lee", "Ty", "Ben", "Joe", "Lee", "Ty", "Ben", "Joe", "Joe", "Joe", "Ty", "Ty"),
  state_abbrevs <- c("KY", "UT", "OH", "IA", "KY", "UT", "OH", "IA", "IA", "IA", "UT", "UT"),
  tw_year <- c(1998, 2001, 2001, 2003, 1998, 2001, 2001, 2003, 2004, 2003, 2000, 2002),
  tw_month <- c(1, 3, 4, 5, 12, 1, 3, 4, 5, 3, 1, 10),
  text <- c("Thanks to everyone in Orange County!", "Several cities and townships in the state are flooded", "New Mexico's communities are suffering",
            "Today is the LAST day to register", "Ohio Senator Jay is great", "I love UT and KY - both are awesome",
            "On Monday, the prez will release a statement", "Ohio Senator Jay is great", "The villages in Iowa are nice",
            "The villages, cities and towns in iowa are nice", "Salt Late City is crazy this time of year", "I only drink S.Pellegrino"))
I also have a list of lists named `terms`:
terms <- list(
  NH = c("New Hampshire", "NH", "Village", "Villages", "villages", "County", "county", "City", "cities"),
  IA = c("iowa", "Iowa", "IA", "Village", "Villages", "villages", "County", "county", "City", "cities"),
  KY = c("Kentucy", "KY", "ky", "Village", "Villages", "villages", "County", "county", "City", "cities"),
  OH = c("Ohio", "OH", "oh", "Village", "Villages", "villages", "County", "county", "City", "cities"))
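I would like to create a new dataset that counts whether each observation in `mydf$text` contains at least one of the strings in a list element (not sure if "list element" is the right term?), but only if `mydf$state_abbrev` matches the label of that list element. For example, I only want to search for a state's strings when the list label ("NH", "IA", "KY", or "OH") matches the `mydf$state_abbrev` value that belongs to the `mydf$text` observation.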
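After trying several code variations that did not work, I have settled on the `dplyr` code below. I have been trying to solve this for days and need a fresh pair of eyes.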
library(dplyr)
df_count <- mydf %>%
  mutate(state_abbrev = tolower(state_abbrev)) %>% # convert state abbreviations to lowercase
  mutate(terms = terms[[state_abbrev]]) %>% # match st_terms to state_abbrev
  group_by(last_name, first_name, tw_year, tw_month) %>% # group by desired variables
  summarize(terms_count = sum(str_detect(text, regex(paste(terms, collapse = "|"), ignore_case = TRUE)) %>% as.integer())) # count the number of matches for each group
Error in `mutate()`:
ℹ In argument: `state_abbrev = tolower(state_abbrev)`.
Caused by error in `tolower()`:
! object 'state_abbrev' not found
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `state_abbrev = tolower(state_abbrev)`.
Caused by error in `tolower()`:
! object 'state_abbrev' not found
---
Backtrace:
▆
1. ├─... %>% ...
2. ├─dplyr::summarize(...)
3. ├─dplyr::group_by(., last_name, first_name, tw_year, tw_month)
4. ├─dplyr::mutate(., terms = terms[[state_abbrev]])
5. ├─dplyr::mutate(., state_abbrev = tolower(state_abbrev))
6. ├─dplyr:::mutate.data.frame(., state_abbrev = tolower(state_abbrev))
7. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
8. │ ├─base::withCallingHandlers(...)
9. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
10. │ └─mask$eval_all_mutate(quo)
11. │ └─dplyr (local) eval()
12. └─base::tolower(state_abbrev)
Run rlang::last_trace(drop = FALSE) to see 3 hidden frames.
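One possible reading of this error, offered as a sketch rather than a confirmed diagnosis: inside `data.frame()`, `<-` assigns a variable in the calling environment and leaves the column without the intended name, whereas `=` names the column. The column is also created from `state_abbrevs` (with a trailing "s"), while the pipeline asks for `state_abbrev`. The two-row data frames `bad` and `good` below are made up purely for illustration.

```r
# `<-` inside data.frame(): the column name is derived from the whole
# expression, so there is no column called "state_abbrevs" or "state_abbrev".
bad <- data.frame(state_abbrevs <- c("KY", "UT"))

# `=` inside data.frame(): the column gets exactly the name on the left.
good <- data.frame(state_abbrev = c("KY", "UT"))

print(names(bad))                  # a mangled name built from the expression
print(names(good))                 # the intended column name
"state_abbrev" %in% names(bad)     # FALSE: mutate() cannot find this column
"state_abbrev" %in% names(good)    # TRUE
```

If this is what happened, building `mydf` with `=` and a single consistent column name should at least get past the `object 'state_abbrev' not found` error.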
The new dataset should look like this:
last_name | first_name | state_abbrev | tw_month | tw_year | count_text |
---|---|---|---|---|---|
Jay | Lee | KY | 1 | 1998 | 2 |
Jay | Lee | KY | 1 | 2000 | 0 |
Lisa | Joe | IA | 5 | 2004 | 1 |
Lisa | Joe | IA | 4 | 2003 | 0 |
Lisa | Joe | IA | 3 | 2003 | 1 |
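A minimal sketch of one way to produce counts of this shape, under several assumptions: the columns are named with `=` (so `state_abbrev` exists), only a four-row stand-in for `mydf` and two entries of `terms` are used, and base R is swapped in for the `dplyr`/`stringr` pipeline. The helper `has_term()` is hypothetical. Two details of the attempt above are handled differently here: the state abbreviation is not lowercased (the names of `terms` are uppercase), and the lookup is done one row at a time, because `terms[[...]]` accepts a single name rather than a whole column.

```r
# Stand-in data: a subset of the rows from the question, columns named with `=`
mydf <- data.frame(
  last_name    = c("Jay", "Lisa", "Lisa", "Kelly"),
  first_name   = c("Lee", "Joe", "Joe", "Ty"),
  state_abbrev = c("KY", "IA", "IA", "UT"),
  tw_year      = c(1998, 2004, 2003, 2001),
  tw_month     = c(12, 5, 4, 3),
  text         = c("Ohio Senator Jay is great",
                   "The villages in Iowa are nice",
                   "Today is the LAST day to register",
                   "Several cities and townships in the state are flooded"),
  stringsAsFactors = FALSE
)

terms <- list(
  IA = c("iowa", "Iowa", "IA", "Village", "Villages", "villages", "County", "county", "City", "cities"),
  KY = c("Kentucy", "KY", "ky", "Village", "Villages", "villages", "County", "county", "City", "cities")
)

# Hypothetical helper: TRUE if `text` contains any term listed for `state`.
has_term <- function(text, state) {
  pats <- terms[[state]]
  if (is.null(pats)) return(FALSE)  # state has no entry in `terms`, e.g. "UT"
  grepl(paste(pats, collapse = "|"), text, ignore.case = TRUE)
}

# Row-wise lookup: each text is checked only against its own state's terms
mydf$hit <- mapply(has_term, mydf$text, mydf$state_abbrev, USE.NAMES = FALSE)

# Count matching rows per person, state, year and month
df_count <- aggregate(hit ~ last_name + first_name + state_abbrev + tw_year + tw_month,
                      data = mydf, FUN = sum)
names(df_count)[names(df_count) == "hit"] <- "count_text"
print(df_count)
```

Because the terms are joined into one case-insensitive pattern, short entries such as "IA" or "oh" also match inside longer words. Wrapping each term in `\\b` word boundaries may be closer to what is wanted.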