I have a string that lists a person's terms in office, for example:
all_terms <- "2012 to 2024, 2007 to 2007, 2001 to 2003, 2000 to 2009, 2010 to 2011"
I want to know whether this person served continuously, meaning that the terms, taken together, leave no gap years.
So the example above would be identified as continuous, but this one, "1989 to 2008, 2020 to 2024", would not.
I have come up with this code, but it doesn't work:
library(tidyverse)  # provides str_split(), str_extract_all(), map(), map_df(), arrange(), %>%
all_terms <- "2012 to 2024, 2007 to 2007, 2001 to 2003, 2000 to 2009, 2010 to 2011"
# Process terms to extract years and create a data frame
terms_list <- str_split(all_terms, ",\\s*")[[1]]
years <- map(terms_list, ~str_extract_all(.x, "\\d{4}")[[1]])
years_df <- map_df(years, ~data.frame(start = as.numeric(.x[1]), end = as.numeric(.x[2])))
# Sort years by start date
years_df <- years_df %>% arrange(start)
# Adjust end year by adding one for continuity check
years_df$modified_end <- years_df$end + 1
# Check for continuity
is_continuous <- all(c(TRUE, tail(years_df$start, -1) <= head(years_df$modified_end, -1)))
# Results
list(
is_continuous = is_continuous,
start_years = min(years_df$start),
end_years = max(years_df$end)
)
This is a bit verbose, but it is a tidy approach:
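library(tidyverse)

all_terms <- "2012 to 2024, 2007 to 2007, 2001 to 2003, 2000 to 2009, 2010 to 2011"

data.frame(id = 1, all_terms) |>
  # One row per term, then split each term into its start and end year
  separate_longer_delim(all_terms, delim = ", ") |>
  separate_wider_delim(cols = all_terms, names = c("from", "to"), delim = " to ") |>
  mutate(row = row_number(), across(c(from, to), as.integer)) |>
  # Expand each term into every year it covers; reframe() allows several rows per group
  reframe(year = seq(from, to, 1), .by = c(id, row)) |>
  # Keep each year once per id, in ascending order
  distinct(id, year) |>
  arrange(id, year) |>
  # A new run of service starts wherever a year is more than 1 after the previous year
  summarize(terms = max(cumsum(year > lag(year, 1, 0) + 1)), .by = id)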
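This puts the string into a data frame, splits it into rows at each ", ", splits each row into from and to columns, then creates the sequence of years covering each range, keeps one copy of each year per id, and finally counts the separate runs of consecutive years for each id (one more than the number of gaps).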
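It reports one term for the original data and two for the second example, so a result of 1 means the service was continuous.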
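As a side note, the check in the question seems to fail because, after sorting by start year, each start is compared only with the end of the row directly before it. "2007 to 2007" follows "2001 to 2003", so it looks like a gap even though "2000 to 2009" already covers it. Below is a minimal base R sketch that compares each start against the running maximum of the earlier end years instead. The function name is_continuous_terms is made up for illustration, and the sketch assumes every term is written as exactly two four-digit years.

```r
# Hypothetical helper (not from the original post): TRUE when the terms,
# taken together, leave no gap year between them.
is_continuous_terms <- function(all_terms) {
  # Pull out every four-digit year; they arrive as (start, end) pairs
  yrs <- as.numeric(regmatches(all_terms, gregexpr("\\d{4}", all_terms))[[1]])
  terms <- data.frame(start = yrs[c(TRUE, FALSE)], end = yrs[c(FALSE, TRUE)])
  terms <- terms[order(terms$start), ]
  # Latest end year reached so far, so a short term nested inside a longer
  # one is not mistaken for a gap
  reach <- cummax(terms$end)
  # Each later term must start no later than one year after that reach
  all(tail(terms$start, -1) <= head(reach, -1) + 1)
}

is_continuous_terms("2012 to 2024, 2007 to 2007, 2001 to 2003, 2000 to 2009, 2010 to 2011")
# TRUE
is_continuous_terms("1989 to 2008, 2020 to 2024")
# FALSE
```

Unlike the year-by-year expansion above, this works on the intervals directly, so it does not need to build a row for every year served.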