structure(list(year = c("Mar-10", "2014", "May-August",
"2009/2010", "2015", NA_character_), date = c("August 31st, 2010", "March 13th, 2015",
"May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 Mar-10 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 May-August May 31st, 2010
# 4 2009/2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 NA April 7th, 2010
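My goal is to retain as much of the column as possible before I start deleting erroneous entries in column 1, ideally by reducing each entry to a plain year value, as in row 2 of this sample.
For NA values, I don't want to delete the row; instead I want to fill in the value from the next column.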
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 2010 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 2010 May 31st, 2010
# 4 2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 2010 April 7th, 2010
For the structure I gave above, the final result should be:
structure(list(year = c("2010", "2014", "2010", "2010", "2015", "2010"), date = c("August 31st, 2010", "March 13th, 2015", "May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
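In plain English: if the field contains an acceptable value such as "2014", leave it as is. If it contains a value from which the year can still be determined, such as "Mar-10", use 2010. If the year cannot be determined, as with "May-August", "2009/2010", or NA, use the year from the date column instead.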
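The three-part rule above can be sketched directly with case_when. This is only a sketch under stated assumptions: the data frame is named df here, and a two-digit year such as the "10" in "Mar-10" is assumed to mean 20YY, which the question implies but does not state for other values.

```r
library(dplyr)
library(stringr)

# Sample data from the question (the name `df` is an assumption)
df <- tibble(
  year = c("Mar-10", "2014", "May-August", "2009/2010", "2015", NA_character_),
  date = c("August 31st, 2010", "March 13th, 2015", "May 31st, 2010",
           "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")
)

result <- df %>%
  mutate(year = case_when(
    # 1. Already a plain four-digit year: keep it as is
    str_detect(year, "^\\d{4}$") ~ year,
    # 2. "Mon-YY" form: expand the two-digit year (ASSUMPTION: always 20YY)
    str_detect(year, "^[A-Za-z]{3}-\\d{2}$") ~ str_c("20", str_sub(year, -2)),
    # 3. Anything else, including NA: take the four-digit year from `date`
    TRUE ~ str_extract(date, "\\d{4}")
  ))
```

Unlike a plain fallback to the date column, this derives 2010 for "Mar-10" from the year field itself, so it would still give the intended value if that row's date fell in a different year.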
You can use coalesce + str_extract:
library(dplyr)
library(stringr)
df %>%
  # keep year only when it is exactly four digits; otherwise take the first four-digit run from date
  mutate(year = coalesce(str_extract(year, "^\\d{4}$"), str_extract(date, "\\d{4}")))
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 2010 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 2010 May 31st, 2010
# 4 2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 2010 April 7th, 2010
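If we want to extract the first four-digit year that appears in the year column, and fall back to the text after the comma in date otherwise: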
library(dplyr)
library(stringr)
df1 %>%
  mutate(year = coalesce(str_extract(year, "\\d{4}"),
                         str_remove(date, ".*,\\s+")))
-output
# A tibble: 6 × 2
year date
<chr> <chr>
1 2010 August 31st, 2010
2 2014 March 13th, 2015
3 2010 May 31st, 2010
4 2009 June 16th, 2010
5 2015 May 18th, 2010
6 2010 April 7th, 2010
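Row 4 comes out as 2009 rather than the 2010 the question asks for because the pattern "\\d{4}" is unanchored, so it picks up the first four digits of "2009/2010". A small comparison of the unanchored and anchored patterns, using three of the sample values (the variable names here are purely illustrative):

```r
library(stringr)

year <- c("Mar-10", "2014", "2009/2010")

# Unanchored: returns the first run of four digits anywhere in the string
unanchored <- str_extract(year, "\\d{4}")
# NA "2014" "2009"

# Anchored: matches only when the whole string is exactly four digits
anchored <- str_extract(year, "^\\d{4}$")
# NA "2014" NA
```

With the anchored pattern, "2009/2010" yields NA, so coalesce moves on to the value taken from the date column.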
Or with case_when:
df1 %>%
  mutate(year = case_when(str_detect(year, "^\\d{4}$") ~ year,
                          TRUE ~ str_remove(date, ".*,\\s+")))
-output
# A tibble: 6 × 2
year date
<chr> <chr>
1 2010 August 31st, 2010
2 2014 March 13th, 2015
3 2010 May 31st, 2010
4 2010 June 16th, 2010
5 2015 May 18th, 2010
6 2010 April 7th, 2010
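It may be worth spelling out why the NA in row 6 ends up in the fallback branch: str_detect returns NA for a missing input, and case_when treats an NA condition as not matched, so the TRUE branch applies. A minimal sketch, with a placeholder string standing in for the value taken from date:

```r
library(dplyr)
library(stringr)

year <- c("2014", "Mar-10", NA_character_)

# str_detect propagates NA for missing input
matched <- str_detect(year, "^\\d{4}$")
# TRUE FALSE NA

# An NA condition counts as "no match", so the TRUE branch is used
# ("fallback" is a placeholder, not a value from the original data)
out <- case_when(matched ~ year, TRUE ~ "fallback")
# "2014" "fallback" "fallback"
```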