Select columns in R using a variable range

Question (votes: 0, answers: 3)

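I have a dataset that contains several columns with similar names. Each variable was measured multiple times, which produces multiple columns. They take the form mt1_oranges_vol, mt2_oranges_vol, mt3_oranges_vol; mt1_pears_vol, mt2_pears_vol, mt3_pears_vol; and so on.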

dataset <- structure(
  list(
    Participant.Id = 1:5,
    
    x1 = c(10L, 20L, 30L, 40L, 50L),
    x2 = c(15L, 25L, 35L, 45L, 55L),
    x3 = c(20L, 25L, NA, 45L, NA),
    x4 = c(25L, 30L, NA, 50L, NA),
    x5 = c(NA, 35L, NA, 55L, NA),
    x6 = c(NA, 35L, NA, NA, NA),
    
    y1 = c(10L, 20L, 30L, 40L, 50L),
    y2 = c(15L, 25L, 35L, 45L, 55L),
    y3 = c(20L, 25L, NA, 45L, NA),
    y4 = c(25L, 30L, NA, 50L, NA),
    y5 = c(NA, 35L, NA, 55L, NA),
    y6 = c(NA, 35L, NA, NA, NA),
    
    z1 = c(10L, 20L, 30L, 40L, 50L),
    z2 = c(15L, 25L, 35L, 45L, 55L),
    z3 = c(20L, 25L, NA, 45L, NA),
    z4 = c(25L, 30L, NA, 50L, NA),
    z5 = c(NA, 35L, NA, 55L, NA),
    z6 = c(NA, 35L, NA, NA, NA),
    
    mt1_oranges_vol = c(100L, 200L, 300L, 400L, 500L),
    mt2_oranges_vol = c(110L, 210L, 310L, 410L, 510L),
    mt3_oranges_vol = c(120L, 220L, NA, 420L, 520L),
    mt4_oranges_vol = c(130L, 230L, NA, 430L, NA),
    mt5_oranges_vol = c(NA, 240L, NA, NA, NA),
    mt6_oranges_vol = c(NA, NA, NA, NA, NA),
     
    mt1_pears_vol = c(101L, 201L, 301L, 401L, 501L),
    mt2_pears_vol = c(111L, 211L, 311L, 411L, 511L),
    mt3_pears_vol = c(121L, 221L, NA, 421L, 521L),
    mt4_pears_vol = c(131L, 231L, NA, 431L, NA),
    mt5_pears_vol = c(NA, 241L, NA, NA, NA),
    mt6_pears_vol = c(NA, NA, NA, NA, NA),

    mt1_apples_vol = c(102L, 202L, 302L, 402L, 502L),
    mt2_apples_vol = c(112L, 212L, 312L, 412L, 512L),
    mt3_apples_vol = c(122L, 222L, NA, 422L, 522L),
    mt4_apples_vol = c(132L, 232L, NA, 432L, NA),
    mt5_apples_vol = c(NA, 242L, NA, NA, NA),
    mt6_apples_vol = c(NA, NA, NA, NA, NA)),


  class = "data.frame", 
  row.names = c(NA, -5L)
)

Right now I select the variables simply with:

   library(dplyr)  # provides %>%, select() and matches()

   dataset <- dataset %>%
     select(Participant.Id,
            matches("_vol$"))

However, I would like to be able to include only specific columns, based on a variable:

mt_range <- x:y #x:y is a predefined range of numbers

I have tried the following:

dataset <- dataset %>%
     select(Participant.Id,
            matches(paste0("mt", mt_range, "_vol")))

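However, this does not give the expected result. My understanding is that it pastes 1:7 in as a whole, because I am not iterating over the numbers, only adding the variable. I then tried: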

  dataset <- dataset %>%
       select(Participant.Id,
              for (i in mt_range) {
                  matches(paste0("mt", i, "_vol"))
              })

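However, as far as I understand, you cannot loop inside a function call. I also think this would give mt1:7_vol, whereas I need it to take the different variable names into account as well.

So, my question is:

How can I use mt_range to get only the variables I am interested in?

If anything is missing or my question is not posed correctly, please let me know and I will change it.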

r data-analysis data-manipulation
3 Answers

1 vote

You can collapse the range and turn it into a string to match against. For example:

mt_range <- 3:5

# Use this regex
sprintf("^mt[%s].+_vol$", paste(mt_range, collapse = ""))
# [1] "^mt[345].+_vol$"

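This matches every string that starts with "mt", followed by any single character from the set "[345]", followed by one or more arbitrary characters (".+"), provided the string ends with "_vol".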
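To see what the pattern does, it can be checked directly against a few column names with base R's grepl(). The sketch below uses a hand-typed subset of the question's column names (the vector `nms` is illustrative only). It also shows why the paste0() attempt from the question selects nothing: paste0() is vectorised, so it yields one string per number, and none of those strings occurs in a column name because the fruit part sits in between.

```r
# Illustrative subset of the column names from the question
nms <- c("Participant.Id", "x3", "mt1_oranges_vol", "mt3_oranges_vol",
         "mt4_pears_vol", "mt5_apples_vol", "mt6_apples_vol")

mt_range <- 3:5

# The question's attempt: one literal string per number, none of which
# appears in any name ("mt3_vol" is not a substring of "mt3_oranges_vol")
attempt <- paste0("mt", mt_range, "_vol")
attempt_hits <- nms[Reduce(`|`, lapply(attempt, grepl, x = nms, fixed = TRUE))]

# The collapsed character-class pattern from this answer
pattern <- sprintf("^mt[%s].+_vol$", paste(mt_range, collapse = ""))
pattern_hits <- nms[grepl(pattern, nms)]

print(attempt)       # the three strings built by paste0()
print(attempt_hits)  # names matched by the attempt
print(pattern_hits)  # names matched by the regex
```

Only the names whose measurement index is 3, 4 or 5 survive the second filter, which is the behaviour that matches() relies on inside select().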

You can turn this into a one-line function:

mt_range_builder <- \(x) sprintf("^mt[%s].+_vol$", paste(x, collapse = ""))
dataset |>
    select(
        Participant.Id,
        matches(
            mt_range_builder(mt_range)
        )
    )

#   Participant.Id mt3_oranges_vol mt4_oranges_vol mt5_oranges_vol mt3_pears_vol mt4_pears_vol mt5_pears_vol mt3_apples_vol mt4_apples_vol mt5_apples_vol
# 1              1             120             130              NA           121           131            NA            122            132             NA
# 2              2             220             230             240           221           231           241            222            232            242
# 3              3              NA              NA              NA            NA            NA            NA             NA             NA             NA
# 4              4             420             430              NA           421           431            NA            422            432             NA
# 5              5             520              NA              NA           521            NA            NA            522             NA             NA

A more general function might be:

range_builder <- function(prefix, range, suffix) {
    sprintf(
        "^%s[%s].+_%s$",
        prefix,
        paste(range, collapse = ""),
        suffix
    )
}

range_builder("mt", 3:5, "vol") == mt_range_builder(3:5) # TRUE
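One caveat, offered as an aside: a character class matches single characters, so collapsing a range such as 9:11 gives `[91011]`, which matches the digits 9, 1 and 0 rather than the numbers 9, 10 and 11. If the measurement index can exceed 9, an alternation group is one possible alternative. The helper name `range_builder_alt` below is hypothetical and not part of the answer above, and the sketch assumes the index is followed directly by an underscore.

```r
# Hypothetical variant of range_builder(): joins the numbers with "|"
# so that multi-digit indices are matched as whole numbers
range_builder_alt <- function(prefix, range, suffix) {
    sprintf(
        "^%s(%s)_.+_%s$",
        prefix,
        paste(range, collapse = "|"),
        suffix
    )
}

# Made-up names with one- and two-digit indices
nms <- c("mt1_oranges_vol", "mt9_oranges_vol",
         "mt10_oranges_vol", "mt11_pears_vol", "mt19_pears_vol")

# Character-class version: treats "91011" as a set of single digits
class_pattern <- sprintf("^mt[%s].+_vol$", paste(9:11, collapse = ""))
class_hits <- nms[grepl(class_pattern, nms)]

# Alternation version: matches 9, 10 or 11 as whole numbers
alt_pattern <- range_builder_alt("mt", 9:11, "vol")
alt_hits <- nms[grepl(alt_pattern, nms)]

print(class_hits)
print(alt_hits)
```

For single-digit indices, as in the question's data, both versions select the same columns.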

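1 vote

In this case it is much easier to reshape to long format than to work with the current wide format.

First, you should definitely get into the habit of putting the time point at the end of the name, as a common suffix rather than a prefix. This is easy to do by using grep to identify the relevant mt column names. We strsplit at the underscores, rearrange the pieces, and paste them back together.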

mt <- grep('^mt', names(dataset))
names(dataset)[mt] <- strsplit(names(dataset)[mt], '_') |> sapply(\(x) paste(x[c(2, 3, 1)], collapse='_'))

The names now look like this: oranges_vol_mt1, oranges_vol_mt2, ...
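As a small check of the renaming step, the same split, rearrange and paste logic can be run on a plain character vector. This is only a sketch: `old_names` is a hand-typed sample of the question's column names, and vapply() with an ordinary function stands in for the sapply() lambda used above.

```r
# Hand-typed sample of the original column names
old_names <- c("mt1_oranges_vol", "mt2_oranges_vol", "mt1_pears_vol")

# Split each name at the underscores:
# "mt1_oranges_vol" becomes c("mt1", "oranges", "vol")
parts <- strsplit(old_names, "_", fixed = TRUE)

# Move the first piece (the time point) to the end and glue the pieces together
new_names <- vapply(
    parts,
    function(x) paste(x[c(2, 3, 1)], collapse = "_"),
    character(1)
)

print(new_names)
```

With the number at the very end of each name, reshape() can later separate the variable name from the time point on its own.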

Next, reshape:

dataset_lng <- reshape(dataset, idvar=1, varying=-1, direction='long', sep='')

and subset to get the desired mt_range:

    Participant.Id time  x  y  z oranges_vol_mt pears_vol_mt apples_vol_mt
1.2              1    2 15 15 15            110          111           112
2.2              2    2 25 25 25            210          211           212
3.2              3    2 35 35 35            310          311           312
4.2              4    2 45 45 45            410          411           412
5.2              5    2 55 55 55            510          511           512
1.3              1    3 20 20 20            120          121           122
2.3              2    3 25 25 25            220          221           222
3.3              3    3 NA NA NA             NA           NA            NA
4.3              4    3 45 45 45            420          421           422
5.3              5    3 NA NA NA            520          521           522
1.4              1    4 25 25 25            130          131           132
2.4              2    4 30 30 30            230          231           232
3.4              3    4 NA NA NA             NA           NA            NA
4.4              4    4 50 50 50            430          431           432
5.4              5    4 NA NA NA             NA           NA            NA
1.5              1    5 NA NA NA             NA           NA            NA
2.5              2    5 35 35 35            240          241           242
3.5              3    5 NA NA NA             NA           NA            NA
4.5              4    5 55 55 55             NA           NA            NA
5.5              5    5 NA NA NA             NA           NA            NA
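The subsetting step is not shown as code above. A minimal sketch of how it could look follows, run on a small made-up data frame (`toy`, with values copied from the first rows of the question's data) so that it works on its own. The printed output above is consistent with a range of 2:5, but that is an inference; the sketch uses 2:3 because the toy data has only three time points.

```r
# Toy data in the renamed layout (time point as numeric suffix)
toy <- data.frame(
    Participant.Id  = 1:3,
    oranges_vol_mt1 = c(100L, 200L, 300L),
    oranges_vol_mt2 = c(110L, 210L, 310L),
    oranges_vol_mt3 = c(120L, 220L, NA),
    pears_vol_mt1   = c(101L, 201L, 301L),
    pears_vol_mt2   = c(111L, 211L, 311L),
    pears_vol_mt3   = c(121L, 221L, NA)
)

# Wide to long; sep = "" lets reshape() split names where a letter meets a digit
toy_long <- reshape(toy, idvar = "Participant.Id", varying = -1,
                    direction = "long", sep = "")

mt_range <- 2:3  # assumed range for this illustration

# Keep only the rows whose time point falls inside the range
toy_sub <- subset(toy_long, time %in% mt_range)

print(toy_sub)
```

Because the time point is now an ordinary column, any range can be selected with a single `%in%` test instead of a regular expression.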

Data:

> dput(dataset)
structure(list(Participant.Id = 1:5, x1 = c(10L, 20L, 30L, 40L, 
50L), x2 = c(15L, 25L, 35L, 45L, 55L), x3 = c(20L, 25L, NA, 45L, 
NA), x4 = c(25L, 30L, NA, 50L, NA), x5 = c(NA, 35L, NA, 55L, 
NA), x6 = c(NA, 35L, NA, NA, NA), y1 = c(10L, 20L, 30L, 40L, 
50L), y2 = c(15L, 25L, 35L, 45L, 55L), y3 = c(20L, 25L, NA, 45L, 
NA), y4 = c(25L, 30L, NA, 50L, NA), y5 = c(NA, 35L, NA, 55L, 
NA), y6 = c(NA, 35L, NA, NA, NA), z1 = c(10L, 20L, 30L, 40L, 
50L), z2 = c(15L, 25L, 35L, 45L, 55L), z3 = c(20L, 25L, NA, 45L, 
NA), z4 = c(25L, 30L, NA, 50L, NA), z5 = c(NA, 35L, NA, 55L, 
NA), z6 = c(NA, 35L, NA, NA, NA), mt1_oranges_vol = c(100L, 200L, 
300L, 400L, 500L), mt2_oranges_vol = c(110L, 210L, 310L, 410L, 
510L), mt3_oranges_vol = c(120L, 220L, NA, 420L, 520L), mt4_oranges_vol = c(130L, 
230L, NA, 430L, NA), mt5_oranges_vol = c(NA, 240L, NA, NA, NA
), mt6_oranges_vol = c(NA, NA, NA, NA, NA), mt1_pears_vol = c(101L, 
201L, 301L, 401L, 501L), mt2_pears_vol = c(111L, 211L, 311L, 
411L, 511L), mt3_pears_vol = c(121L, 221L, NA, 421L, 521L), mt4_pears_vol = c(131L, 
231L, NA, 431L, NA), mt5_pears_vol = c(NA, 241L, NA, NA, NA), 
    mt6_pears_vol = c(NA, NA, NA, NA, NA), mt1_apples_vol = c(102L, 
    202L, 302L, 402L, 502L), mt2_apples_vol = c(112L, 212L, 312L, 
    412L, 512L), mt3_apples_vol = c(122L, 222L, NA, 422L, 522L
    ), mt4_apples_vol = c(132L, 232L, NA, 432L, NA), mt5_apples_vol = c(NA, 
    242L, NA, NA, NA), mt6_apples_vol = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-5L))

0 votes

An example of the pivot-longer approach:

library(tidyverse) # you could load individual packages as a better practice
dataset <- dataset |> pivot_longer(cols = ends_with("vol"),
                        names_to = c("mt_vol", "fruit", NA),
                        names_sep = "_") |> 
  mutate(mt_vol = as.numeric(str_extract(mt_vol,"\\d+")))

which gives:

# A tibble: 90 × 22
   Participant.Id    x1    x2    x3    x4    x5    x6    y1    y2    y3    y4    y5    y6    z1    z2    z3    z4    z5    z6 mt_vol fruit   value
            <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <dbl> <chr>   <int>
 1              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      1 oranges   100
 2              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      2 oranges   110
 3              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      3 oranges   120
 4              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      4 oranges   130
 5              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      5 oranges    NA
 6              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      6 oranges    NA
 7              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      1 pears     101
 8              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      2 pears     111
 9              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      3 pears     121
10              1    10    15    20    25    NA    NA    10    15    20    25    NA    NA    10    15    20    25    NA    NA      4 pears     131
# … with 80 more rows
# ℹ Use `print(n = ...)` to see more rows

Filtering and selecting then become very simple.

mt_range <- 3:5

dataset |>
  filter(mt_vol %in% mt_range) |>
  select(Participant.Id, mt_vol, fruit, value)

which gives:

# A tibble: 45 × 4
   Participant.Id mt_vol fruit   value
            <int>  <dbl> <chr>   <int>
 1              1      3 oranges   120
 2              1      4 oranges   130
 3              1      5 oranges    NA
 4              1      3 pears     121
 5              1      4 pears     131
 6              1      5 pears      NA
 7              1      3 apples    122
 8              1      4 apples    132
 9              1      5 apples     NA
10              2      3 oranges   220
# … with 35 more rows
# ℹ Use `print(n = ...)` to see more rows

If needed, you can pivot the result back to wide format.
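Pivoting back could look roughly like the sketch below. It rebuilds a tiny long table by hand (`long_tbl`, a few rows in the same shape as the filtered result above) and assumes the tidyr package is installed. The `names_glue` template is just one way to recreate the original mt<n>_<fruit>_vol naming.

```r
library(tidyr)

# Hand-built long table in the same shape as the filtered result
long_tbl <- data.frame(
    Participant.Id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
    mt_vol = c(3, 4, 3, 4, 3, 4, 3, 4),
    fruit  = c("oranges", "oranges", "pears", "pears",
               "oranges", "oranges", "pears", "pears"),
    value  = c(120L, 130L, 121L, 131L, 220L, 230L, 221L, 231L)
)

# One row per participant, one column per (time point, fruit) pair
wide_tbl <- pivot_wider(
    long_tbl,
    names_from  = c(mt_vol, fruit),
    values_from = value,
    names_glue  = "mt{mt_vol}_{fruit}_vol"
)

print(wide_tbl)
```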
