Mixed entries in a date column; the aim is to preserve the column before deleting anything. How best to clean a "date" column like this?

structure(list(year = c("Mar-10", "2014", "May-August", 
"2009/2010", "2015", NA_character_), date = c("August 31st, 2010", "March 13th, 2015", 
"May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

# # A tibble: 6 × 2
#   year       date             
#   <chr>      <chr>            
# 1 Mar-10     August 31st, 2010
# 2 2014       March 13th, 2015 
# 3 May-August May 31st, 2010   
# 4 2009/2010  June 16th, 2010  
# 5 2015       May 18th, 2010   
# 6 NA         April 7th, 2010 

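My goal is to preserve as much of the column as possible before I start deleting the bad entries in column 1, ideally by reducing each entry to a plain year value, as in row 2 of this sample.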

For NA values, rather than deleting them, I want to paste in the data from the next column.

Expected output:
# # A tibble: 6 × 2
#   year  date             
#   <chr> <chr>            
# 1 2010  August 31st, 2010
# 2 2014  March 13th, 2015 
# 3 2010  May 31st, 2010   
# 4 2010  June 16th, 2010  
# 5 2015  May 18th, 2010   
# 6 2010  April 7th, 2010

In terms of the structure I gave above, the final result should be:

structure(list(year = c("2010", "2014", "2010",  "2010", "2015", "2010"), date = c("August 31st, 2010", "March 13th, 2015",  "May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA,  -6L), class = c("tbl_df", "tbl", "data.frame"))

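In plain English: if the field contains an acceptable value such as "2014", leave it as is. If it contains a value from which the year can still be determined, such as "Mar-10", use 2010. If the year cannot be determined, as with "May-August", "2009/2010", or an NA value, use the year from the date column instead.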

r dplyr tidyverse lubridate
2 Answers

1 vote

You can use coalesce + str_extract:

library(dplyr)
library(stringr)

df %>%
  mutate(year = coalesce(str_extract(year, "^\\d{4}$"), str_extract(date, "\\d{4}")))

# # A tibble: 6 × 2
#   year  date             
#   <chr> <chr>            
# 1 2010  August 31st, 2010
# 2 2014  March 13th, 2015 
# 3 2010  May 31st, 2010   
# 4 2010  June 16th, 2010  
# 5 2015  May 18th, 2010   
# 6 2010  April 7th, 2010
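
Note that the pattern above sends "Mar-10" to the date column, which happens to give 2010 for this sample. If the two-digit year should instead be read from the value itself, as the question describes, a three-tier version could look like the sketch below. The helper column names (full_year, short_year, date_year) are made up for illustration, and prefixing "20" assumes every two-digit year belongs to the 2000s, which the question does not confirm.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df <- tibble::tibble(
  year = c("Mar-10", "2014", "May-August", "2009/2010", "2015", NA_character_),
  date = c("August 31st, 2010", "March 13th, 2015", "May 31st, 2010",
           "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")
)

result <- df %>%
  mutate(
    # Tier 1: the whole field is exactly four digits, e.g. "2014"
    full_year = str_extract(year, "^\\d{4}$"),
    # Tier 2: "Mon-YY", e.g. "Mar-10"; ASSUMPTION: the century is 20xx.
    # str_c() propagates NA, so non-matching rows stay NA.
    short_year = str_c("20", str_match(year, "^[A-Za-z]{3}-(\\d{2})$")[, 2]),
    # Tier 3: fall back to the four-digit year found in `date`
    date_year = str_extract(date, "\\d{4}"),
    # coalesce() takes the first non-NA value across the three tiers
    year = coalesce(full_year, short_year, date_year)
  ) %>%
  select(year, date)

print(result)
```

On the sample data this gives the same year column as the expected output; it would only differ from the two-argument coalesce when a "Mon-YY" value disagrees with the year in date.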

1 vote

If we want to extract the year:

library(dplyr)
library(stringr)
df1 %>%
   mutate(year =  coalesce(str_extract(year, "\\d{4}"), 
                           str_remove(date, ".*,\\s+")))

-output

# A tibble: 6 × 2
  year  date             
  <chr> <chr>            
1 2010  August 31st, 2010
2 2014  March 13th, 2015 
3 2010  May 31st, 2010   
4 2009  June 16th, 2010  
5 2015  May 18th, 2010   
6 2010  April 7th, 2010  
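
Row 4 in this output is 2009, not the 2010 the question asks for. The likely cause is that the unanchored pattern "\\d{4}" matches the first four digits of "2009/2010", so coalesce never reaches the date column. A small comparison of the two patterns, using a made-up vector x:

```r
library(stringr)

# Hypothetical sample values taken from the question's year column
x <- c("2014", "2009/2010", "Mar-10", NA)

# Unanchored: matches the first run of four digits anywhere in the string
unanchored <- str_extract(x, "\\d{4}")

# Anchored: matches only when the entire string is four digits
anchored <- str_extract(x, "^\\d{4}$")

print(unanchored)  # "2014" "2009" NA NA
print(anchored)    # "2014" NA     NA NA
```

With the anchored form, "2009/2010" becomes NA and the fallback to the date column applies.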

Or with case_when:

df1 %>%
  mutate(year = case_when(str_detect(year, "^\\d{4}$") ~ year,
    TRUE ~ str_remove(date, ".*,\\s+")))

-output

# A tibble: 6 × 2
  year  date             
  <chr> <chr>            
1 2010  August 31st, 2010
2 2014  March 13th, 2015 
3 2010  May 31st, 2010   
4 2010  June 16th, 2010  
5 2015  May 18th, 2010   
6 2010  April 7th, 2010  
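
The NA row works here because str_detect returns NA for a missing input, and case_when treats an NA condition as not matched, so the row falls through to the TRUE branch. A minimal illustration on plain vectors (the names year, date, is_year and out are just for this sketch):

```r
library(dplyr)
library(stringr)

year <- c("2014", "May-August", NA)
date <- c("March 13th, 2015", "May 31st, 2010", "April 7th, 2010")

# TRUE for a clean year, FALSE for other text, NA for a missing value
is_year <- str_detect(year, "^\\d{4}$")

# Rows where the condition is FALSE or NA fall through to the TRUE branch,
# which strips everything up to the comma and following whitespace from `date`
out <- case_when(
  is_year ~ year,
  TRUE ~ str_remove(date, ".*,\\s+")
)

print(is_year)  # TRUE FALSE NA
print(out)      # "2014" "2010" "2010"
```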