How to extract all numbers and symbols (punctuation) up to a space from a data frame

Question (votes: 0, answers: 2)

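I have a data frame of character columns. The first column (V1) contains IDs, followed by several columns containing strings with digits, letters, and symbols. I want to extract every run of digits and symbols up to the next space in the string. Ideally, I would like to write all of these values from column V2 into a new column, separated by ";".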

   df2 <- structure(list(V1 = c("00094", "00001", "00002", "00003", "00004", 
"00005", "00006", "00007", "00008", "00009"), V2 = c("4-6-2021 (aw), vaccinatie naam en data aangepast 19-8-2021 kv: ic nog niet ontvangen nav eerdere", 
NA, "23-7 mf: t2-vragenlijst omgezet naar t3, verzoek bij alienke om t2 af te keuren 6-12 mf: corona", 
NA, NA, "13-12 mf: 3 maanden na 2e vaccinatie corona", "20-7 mf: vaccinatiedatum blijkt enige vaccinatie 6-12 mf: corona", 
NA, NA, "15-7-2021 kv: corona gehad in maand05 2021, dus één vaccinatie. 6-12 mf: corona"
), V3 = c("eerdere brief. mf/sp telefonisch contact laten opnemen. 19-8 mf: gg, herinneringsmail gestuurd, komt niet a", 
NA, "corona gehad, 1 vaccinatie, per mail", NA, NA, "corona gekregen, per mail", 
"corona gehad, 1 vaccinatie, per mail", NA, NA, NA)), row.names = c(NA, 
10L), class = "data.frame")

This would be the desired output (the column names are not important):

df2_new <- structure(list(V1 = c("00094", "00001", "00002", "00003", "00004", 
"00005", "00006", "00007", "00008", "00009"), V2 = c("4-6-2021 (aw), vaccinatie naam en data aangepast 19-8-2021 kv: ic nog niet ontvangen nav eerdere", 
NA, "23-7 mf: t2-vragenlijst omgezet naar t3, verzoek bij alienke om t2 af te keuren 6-12 mf: corona", 
NA, NA, "13-12 mf: 3 maanden na 2e vaccinatie corona", "20-7 mf: vaccinatiedatum blijkt enige vaccinatie 6-12 mf: corona", 
NA, NA, "15-7-2021 kv: corona gehad in maand05 2021, dus één vaccinatie. 6-12 mf: corona"
), V3 = c("eerdere brief. mf/sp telefonisch contact laten opnemen. 19-8 mf: gg, herinneringsmail gestuurd, komt niet a", 
NA, "corona gehad, 1 vaccinatie, per mail", NA, NA, "corona gekregen, per mail", 
"corona gehad, 1 vaccinatie, per mail", NA, NA, NA), `dates V2` = c("4-6-2021;19-8-2021", 
NA, "23-7;6-12", NA, NA, "13-12", "20-7;6-12", NA, NA, "15-7-2021;6-12"
), `dates V3` = c("19-8", NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-10L), class = "data.frame")

Thanks a lot!

r string dataframe character
2 Answers
0
votes

This will produce the desired output:

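library(tidyverse)  # attaches stringr and dplyr (%>%, rowwise(), mutate())

# One or more "digits + punctuation" groups followed by digits, e.g. "4-6-2021"
regex <- "(\\d+[[:punct:]])+\\d+"

# str_extract_all() returns a list with one character vector per row
df2$V4 <- str_extract_all(df2$V2, regex)
df2$V5 <- str_extract_all(df2$V3, regex)

# Collapse each row's matches into a single ";"-separated string.
# Note: paste() turns NA rows into the string "NA" and rows without matches into "".
df2 <- df2 %>% rowwise() %>% mutate(V4 = paste(V4, collapse = ";"),
                                    V5 = paste(V5, collapse = ";"))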
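For comparison, the same regex idea can be sketched in base R so that rows without any match stay NA, as in the desired output. This is only a sketch: the helper name extract_dates and the shortened sample strings are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: pull date-like tokens ("digits + punctuation" groups
# followed by digits) out of each string and join them with `sep`.
extract_dates <- function(s, sep = ";") {
    pattern <- "([0-9]+[[:punct:]])+[0-9]+"

    # Replace NA with "" so that every element can be searched safely
    s[is.na(s)] <- ""

    # gregexpr() finds all matches; regmatches() extracts them as a list
    m <- regmatches(s, gregexpr(pattern, s))

    # Collapse each element's matches into one string
    out <- vapply(m, paste, character(1), collapse = sep, USE.NAMES = FALSE)

    # Elements without any match (including former NA values) become NA
    out[out == ""] <- NA_character_
    out
}

# Shortened sample strings in the style of the question's data
x <- c("4-6-2021 (aw), aangepast 19-8-2021 kv", NA, "corona gehad, 1 vaccinatie")
result <- extract_dates(x)
print(result)
```

Applied to a data frame, the helper could be called once per column, for example df2$V4 <- extract_dates(df2$V2).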

0
votes

Here is a base R approach. First, define a function that does this:

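get_all_numeric_symbols <- function(s) {

    # Split each string on spaces, keep the tokens made up only of digits
    # and hyphens, and join them with ";"
    out <- strsplit(s, " ") |>
        vapply(
            \(x) paste(
                grep("^[0-9-]+[0-9-]$", x, value = TRUE),
                collapse = ";"
            ),
            character(1)
        )

    # Strings without any matching token (including NA input) become NA
    out[out == ""] <- NA_character_

    out

}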

Then apply it to the desired columns:

cols  <- c("V2", "V3")
new_cols  <- paste0("dates_", cols)

df2[new_cols]  <- lapply(cols, \(col) get_all_numeric_symbols(df2[[col]]))

# Check it matches expected output
identical(df2$dates_V2, df2_new$`dates V2`) # TRUE
identical(df2$dates_V3, df2_new$`dates V3`) # TRUE

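Note: I did not use "[:punct:]" here to check for all punctuation, because it can return strings that do not look like dates, such as those ending in a comma. I used a simpler regular expression, "^[0-9-]+[0-9-]$". It matches tokens that consist only of digits and hyphens and are at least two characters long: "[0-9-]+" requires one or more digits or hyphens, and the final "[0-9-]" requires one more. This also means that a single digit on its own is never matched.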
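To make the behaviour of that pattern concrete, here is a small check on a few tokens. The token list is made up for illustration (some tokens are taken from the question's strings), so treat it as a sketch rather than a full test.

```r
# Illustrative tokens, as they would appear after splitting on spaces
tokens <- c("4-6-2021", "19-8", "6", "2021,", "(aw),", "t2-vragenlijst", "--")

# TRUE where the whole token consists of digits/hyphens and has >= 2 characters
matches <- grepl("^[0-9-]+[0-9-]$", tokens)

print(setNames(matches, tokens))
```

The first two tokens match; "6" fails because it is a single character, and "2021," fails because of the trailing comma. The last token shows a side effect: a token made only of hyphens, such as "--", also matches, so the pattern does not guarantee that a match is a real date.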

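If there are other characters you want to include, you can extend the pattern. For example, to also allow forward slashes, add them to both sets of square brackets: "^[0-9/-]+[0-9/-]$". I am generally cautious about using "[:punct:]" because it matches a lot of characters.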
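The difference between the slash-extended pattern and a punctuation-based one can be sketched as follows. The second pattern and the sample tokens are assumptions added for comparison only; they do not appear in the answer.

```r
# Illustrative tokens
tokens <- c("4/6/2021", "19-8", "2021,", "a+b", "12:30")

# Pattern from the answer, extended with forward slashes
strict <- grepl("^[0-9/-]+[0-9/-]$", tokens)

# Hypothetical looser pattern that allows any punctuation character
loose <- grepl("^[0-9[:punct:]]+$", tokens)

print(data.frame(token = tokens, strict = strict, loose = loose))
```

The strict pattern accepts only "4/6/2021" and "19-8", while the looser one also accepts "2021," and "12:30", which illustrates why a hand-picked set of characters can be the safer choice when only dates are wanted.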
