How to extract the word before a certain word in a string?

Question

I have a data frame in which each row of the column 'leg_activity' is a string of comma-separated words.

structure(list(id = c("100", "100060", "100073", "100098", "100102", 
"100104", "100125", "100128", "100149", "100217", "100220", "100271", 
"100464", "100465", "100520", "100607", "100653", "100745", "100757", 
"100760"), leg_activity = c("home", "home, car, work, car, leisure, car, other, car, leisure, car, work, car, shop, car, home", 
"home, walk, leisure, walk, leisure, walk, home", "home, car, other, car, shop, car, other, car, home", 
"home, car, work, car, home, car, home", "home", "home, walk, education, walk, home", 
"home, car, other, car, work, car, shop, car, shop, car, home", 
"home, car, shop, car, work, car, home", "home, bike, leisure, bike, home", 
"home, walk, shop, walk, home", "home, pt, leisure, car, leisure, pt, home", 
"home, car, education, car, home", "home, car, leisure, car, home", 
"home, walk, home, walk, shop, walk, home", "home, pt, work, walk, leisure, walk, work, pt, home", 
"home, pt, leisure, walk, leisure, walk, home", "home, walk, home, bike, shop, bike, home", 
"home, pt, work, pt, home, walk, work, walk, home", "home")), row.names = c(2L, 
15L, 20L, 24L, 31L, 33L, 40L, 43L, 48L, 70L, 73L, 93L, 147L, 
148L, 156L, 174L, 188L, 213L, 214L, 220L), class = "data.frame")

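In each string, I want to extract the word that appears before the word work. work can appear multiple times, and each time the preceding word needs to be extracted or counted.

Ultimately, I am interested in counting which word appears before work across the whole df.

I have tried this: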

library(dplyr)
library(tidyr)   # separate_rows() and pivot_wider() come from tidyr
library(stringr)

df%>%
  separate_rows(leg_activity, sep = "work, ") %>%
  group_by(id) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, values_from = leg_activity)

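Obviously this does not give the result I want; it only splits the df into several columns. So perhaps a different approach is more suitable.

Thank you very much for your help.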

Answer 1

First, a slightly smaller data set, to make it easier to follow the results of the code.

d = data.frame(id = 1:3, leg = c("home",
                                 "work, R, eat, work",
                                 "eat, work, R, work"), stringsAsFactors = FALSE)

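Split the strings (strsplit) on ", ". Loop over the resulting list (lapply). Get the index of "work" (which(x == "work")), and get the previous index (- 1). Use pmax to get an empty vector if "work" is the first word. Index the words (x[<the-index>]). Unlist and count the items (table(unlist(...))).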

table(unlist(lapply(strsplit(d$leg, ", "), function(x) x[pmax(0, which(x == "work") - 1)])))
# eat   R 
#   2   1
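
To see what each step contributes, the one-liner can be unpacked on a single string. This is only an illustrative sketch; the variable names (x, idx, prev, first, prev_first) are mine and not part of the answer.

```r
# One string from the small example, split into words
x <- strsplit("eat, work, R, work", ", ")[[1]]   # "eat" "work" "R" "work"

idx  <- which(x == "work")    # positions of "work": 2 4
prev <- x[pmax(0, idx - 1)]   # words just before each "work": "eat" "R"

# When "work" is the first word, idx - 1 is 0, and indexing with 0
# selects nothing, so that occurrence contributes no word.
first <- strsplit("work, R, eat, work", ", ")[[1]]
prev_first <- first[pmax(0, which(first == "work") - 1)]   # "eat"
```

Applying table(unlist(...)) to such per-string results then gives the overall tally.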

Given "Ultimately, I am interested in counting which word appears before work across the whole df", it seems that grouping is not necessary.

Answer 2

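You can use separate_rows, splitting only on the comma, to put your words in separate rows. Then, after grouping by id, you can filter for the rows where the following row is "work".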

library(dplyr)
library(tidyr)   # separate_rows() comes from tidyr

df %>%
  separate_rows(leg_activity, sep = ",") %>%
  mutate(leg_activity = trimws(leg_activity)) %>%
  group_by(id) %>%
  filter(lead(leg_activity) == "work") %>%
  summarise(count = n())

Output

# A tibble: 6 x 2
  id     count
  <chr>  <int>
1 100060     2
2 100102     1
3 100128     1
4 100149     1
5 100607     2
6 100757     2
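
The output above counts, per id, how many words are followed by "work". Since the question asks which word precedes work over the whole data frame, the same "look at the next element" idea can also be tallied by word. The following is a minimal base R sketch under my own assumptions: words_before is a hypothetical helper, and legs is a three-string excerpt rather than the full data.

```r
# Hypothetical helper (not from the answer): words directly followed by `target`
words_before <- function(s, target = "work") {
  x <- trimws(strsplit(s, ",")[[1]])
  # compare each word with its successor, which is what lead() does in dplyr
  head(x, -1)[tail(x, -1) == target]
}

legs <- c("home, pt, work, walk, leisure, walk, work, pt, home",
          "home, car, work, car, home",
          "home")

# Tally the preceding words over all strings
counts <- table(unlist(lapply(legs, words_before)))
```

In the dplyr pipeline, a similar tally should result from replacing summarise(count = n()) with ungroup() followed by count(leg_activity).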

Answer 3

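library(stringr)
# Extract every "<word>, work" match, then strip the ", work" suffix
WantedStrings <- sub(", work", "", str_extract_all(df$leg_activity, "\\w+, work", simplify = TRUE))
# simplify = TRUE pads the result matrix with "", so drop the empty cells
WantedStrings <- WantedStrings[WantedStrings != ""]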

table(WantedStrings)


WantedStrings
 car   pt walk 
   5    2    2
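
One caveat with the pattern "\\w+, work": it also matches when the next word merely starts with "work". That does not occur in the data shown here, but a word boundary (\\b) guards against it. A small sketch, using base R regmatches to stay dependency-free; the "workshop" activity and the extract_before helper are made up for illustration.

```r
s <- "home, car, workshop, pt, work"   # "workshop" is a made-up activity

# Hypothetical helper: return the word part of every "<word>, work" match
extract_before <- function(s, pattern) {
  m <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
  sub(", work.*$", "", m)
}

loose  <- extract_before(s, "\\w+, work")      # also picks up "car" (before "workshop")
strict <- extract_before(s, "\\w+, work\\b")   # only "pt"
```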

Answer 4

A base R one-liner that uses a lookahead to keep only the word directly before work:

   table(unlist(regmatches(df$leg_activity,
                           gregexpr("\\w+(?=, work\\b)", df$leg_activity, perl = TRUE))))
