Need to split a string into multiple variables based on digits and non-digits


I have a data frame with a single variable. It looks like this:

df <- data.frame(c("25 Edgemont 52 Sioux County", "57 Burke 88 Papillion-LaVista South"))

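For more context, each observation/row is a basketball game score. I want to split it into four data frame columns that separate the numbers from the team names. For example, the first row should end up with "25" in the first column, "Edgemont" in the second column, "52" in the third column, and "Sioux County" in the fourth column.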

I have tried the following, along with various other suggestions, but I cannot get the expected result:

df2 <- strsplit(gsub("([0-9]*)([a-z]*)([0-9]*)([a-z]*)", "\\1 \\2 \\3 \\4", df), " ")

r gsub strsplit

2 Answers


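1) One option is extract from tidyr. Here we capture one or more digits ((\\d+)) at the start of the string (^) as the first group, followed by a space, then one or more characters that are letters or spaces as the second group, then a space, then one or more digits as the third group, then a space and the remaining characters as the fourth column.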

library(stringr)
library(dplyr)
library(tidyr)
df %>% 
  extract(col1, into = str_c('col', 1:4),
           '^(\\d+) ([A-Za-z ]+) (\\d+) (.*)', convert  = TRUE)
#  col1     col2 col3                    col4
#1   25 Edgemont   52            Sioux County
#2   57    Burke   88 Papillion-LaVista South
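To see what each capture group in the pattern above picks up, the same regex can be tried in base R with regexec and regmatches. This is only an illustrative sketch: the variable names and the extra string y are hypothetical and not part of the original data.

```r
# Same pattern as in option 1; base R only, no packages needed.
pattern <- "^(\\d+) ([A-Za-z ]+) (\\d+) (.*)"
x <- c("25 Edgemont 52 Sioux County",
       "57 Burke 88 Papillion-LaVista South")

# regexec() returns the full match plus each capture group;
# dropping the first element keeps only the four groups.
m <- regmatches(x, regexec(pattern, x))
groups <- lapply(m, function(g) g[-1])
print(groups)

# Hypothetical row where the FIRST team name contains a hyphen:
# [A-Za-z ]+ does not allow "-", so the pattern finds no match.
y <- "57 Papillion-LaVista South 88 Burke"
no_match <- regmatches(y, regexec(pattern, y))[[1]]
print(length(no_match))
```

If first-team names in the real data can contain hyphens, periods, or apostrophes, the second group's character class would need to be widened accordingly.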
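2) Or use separate from tidyr, where we specify a regex lookaround so that the split happens only at a space that is preceded or followed by a digit.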

df %>% 
   separate(col1, into = str_c('col', 1:4), sep = '(?<=\\d) | (?=\\d)')
#  col1     col2 col3                    col4
#1   25 Edgemont   52            Sioux County
#2   57    Burke   88 Papillion-LaVista South
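The effect of the lookaround can be checked on a single string with base strsplit. The sketch below contrasts splitting on every space with splitting only on spaces adjacent to a digit; the variable names are illustrative.

```r
x <- "25 Edgemont 52 Sioux County"

# Splitting on every space breaks multi-word team names apart.
every_space <- strsplit(x, " ")[[1]]
print(every_space)

# (?<=\\d) matches a space preceded by a digit,
# (?=\\d) matches a space followed by a digit.
# Lookarounds require perl = TRUE in base R.
by_lookaround <- strsplit(x, "(?<=\\d) | (?=\\d)", perl = TRUE)[[1]]
print(by_lookaround)
```

Note that this relies on digits appearing only in the scores. A team name that itself starts or ends with a digit would produce extra splits.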
3) Or use tstrsplit from data.table

library(data.table)
setDT(df)[, tstrsplit(col1, "(?<=\\d) | (?=\\d)", perl = TRUE)]
#   V1       V2 V3                      V4
#1: 25 Edgemont 52            Sioux County
#2: 57    Burke 88 Papillion-LaVista South
4) Or use read.csv from base R (no packages used)

read.csv(text = gsub("(?<=\\d) | (?=\\d)", ",", df$col1, 
          perl = TRUE), header = FALSE)
#  V1       V2 V3                      V4
#1 25 Edgemont 52            Sioux County
#2 57    Burke 88 Papillion-LaVista South
5) Or use strsplit from base R (no packages used)

type.convert(as.data.frame(do.call(rbind, 
   strsplit(as.character(df$col1), "(?<=\\d) | (?=\\d)",
           perl = TRUE))), as.is = TRUE)
#  V1       V2 V3                      V4
#1 25 Edgemont 52            Sioux County
#2 57    Burke 88 Papillion-LaVista South
Data

df <- data.frame(col1 = c("25 Edgemont 52 Sioux County", 
             "57 Burke 88 Papillion-LaVista South"))

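1) dplyr/tidyr

Replace each number with a semicolon, the number itself, and another semicolon, then separate on the semicolons together with any optional surrounding whitespace. Because the string starts with a number, the first piece is empty, so it is named NA and dropped.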

library(dplyr)
library(tidyr)

# input
df <- data.frame(V1 = c("25 Edgemont 52 Sioux County", 
                        "57 Burke 88 Papillion-LaVista South"))

df %>%
  mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
  separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *")
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South
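To make the two steps in (1) visible, the intermediate string and the resulting pieces can be inspected on one row using base R only. This is a small sketch with illustrative variable names; it shows why the first piece is empty and is therefore discarded with NA in separate.

```r
x <- "25 Edgemont 52 Sioux County"

# Step 1: wrap every run of digits in semicolons.
marked <- gsub("(\\d+)", ";\\1;", x)
print(marked)

# Step 2: split on a semicolon plus any spaces around it.
# The string starts with ";", so the first piece is an empty string.
pieces <- strsplit(marked, " *; *")[[1]]
print(pieces)
```

This approach assumes the data never contains a literal semicolon. If it might, another delimiter that cannot occur in the team names would be a safer choice.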
1a) read.table
We can use the same gsub as in (1) and then split the fields with read.table. No packages are used.

read.table(text = gsub("(\\d+)", ";\\1;", df$V1), sep = ";", as.is = TRUE,
  strip.white = TRUE, col.names = c(NA, "No1", "Let1", "No2", "Let2"))[-1]
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South
2) strcapture
We can use strcapture from base R:

proto <- list(No1 = integer(0), Let1 = character(0),
              No2 = integer(0), Let2 = character(0))
strcapture("(\\d+) (.*) (\\d+) (.*)", df$V1, proto)
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South
2a) read.pattern
We can use read.pattern from gsubfn with the same pattern as in (2):

library(gsubfn)

read.pattern(text = format(df$V1), pattern = "(\\d+) (.*) (\\d+) (.*)", 
  col.names = c("No1", "Let1", "No2", "Let2"), as.is = TRUE, strip.white = TRUE)
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South
