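I have a dataset that contains several columns with similar names. Each variable was measured multiple times, which produces multiple columns. They have the following form: mt1_oranges_vol, mt2_oranges_vol, mt3_oranges_vol; mt1_pears_vol, mt2_pears_vol, mt3_pears_vol, and so on.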
dataset <- structure(
list(
Participant.Id = 1:5,
x1 = c(10L, 20L, 30L, 40L, 50L),
x2 = c(15L, 25L, 35L, 45L, 55L),
x3 = c(20L, 25L, NA, 45L, NA),
x4 = c(25L, 30L, NA, 50L, NA),
x5 = c(NA, 35L, NA, 55L, NA),
x6 = c(NA, 35L, NA, NA, NA),
y1 = c(10L, 20L, 30L, 40L, 50L),
y2 = c(15L, 25L, 35L, 45L, 55L),
y3 = c(20L, 25L, NA, 45L, NA),
y4 = c(25L, 30L, NA, 50L, NA),
y5 = c(NA, 35L, NA, 55L, NA),
y6 = c(NA, 35L, NA, NA, NA),
z1 = c(10L, 20L, 30L, 40L, 50L),
z2 = c(15L, 25L, 35L, 45L, 55L),
z3 = c(20L, 25L, NA, 45L, NA),
z4 = c(25L, 30L, NA, 50L, NA),
z5 = c(NA, 35L, NA, 55L, NA),
z6 = c(NA, 35L, NA, NA, NA),
mt1_oranges_vol = c(100L, 200L, 300L, 400L, 500L),
mt2_oranges_vol = c(110L, 210L, 310L, 410L, 510L),
mt3_oranges_vol = c(120L, 220L, NA, 420L, 520L),
mt4_oranges_vol = c(130L, 230L, NA, 430L, NA),
mt5_oranges_vol = c(NA, 240L, NA, NA, NA),
mt6_oranges_vol = c(NA, NA, NA, NA, NA),
mt1_pears_vol = c(101L, 201L, 301L, 401L, 501L),
mt2_pears_vol = c(111L, 211L, 311L, 411L, 511L),
mt3_pears_vol = c(121L, 221L, NA, 421L, 521L),
mt4_pears_vol = c(131L, 231L, NA, 431L, NA),
mt5_pears_vol = c(NA, 241L, NA, NA, NA),
mt6_pears_vol = c(NA, NA, NA, NA, NA),
mt1_apples_vol = c(102L, 202L, 302L, 402L, 502L),
mt2_apples_vol = c(112L, 212L, 312L, 412L, 512L),
mt3_apples_vol = c(122L, 222L, NA, 422L, 522L),
mt4_apples_vol = c(132L, 232L, NA, 432L, NA),
mt5_apples_vol = c(NA, 242L, NA, NA, NA),
mt6_apples_vol = c(NA, NA, NA, NA, NA)),
class = "data.frame",
row.names = c(NA, -5L)
)
Right now, I simply select the variables using
dataset <- dataset %>%
  select(Participant.Id,
         matches("_vol$"))
However, I would like to be able to include only specific columns, based on a variable:
mt_range <- x:y #x:y is a predefined range of numbers
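I have tried the following:
dataset <- dataset %>%
  select(Participant.Id,
         matches(paste0("mt", mt_range, "_vol")))
However, this does not give the expected result. My understanding is that it pastes 1:7 in as a whole, because I am not iterating over the numbers, just adding the variable. I then tried: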
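dataset <- dataset %>%
  select(Participant.Id,
         for (i in mt_range) {
           matches(paste0("mt", i, "_vol"))
         })
However, as far as I understand, you cannot loop inside a function call. Besides, I think this would give mt1:7_vol, whereas I need it to take the different variable names into account as well.
So, my question is:
How can I use mt_range to get only the variables I am interested in?
If anything is missing or my question is poorly posed, please let me know and I will change it.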
You can collapse the range and turn it into a string to match. For example:
mt_range <- 3:5
# Use this regex
sprintf("^mt[%s].+_vol$", paste(mt_range, collapse = ""))
# [1] "^mt[345].+_vol$"
This matches every string that starts with "mt", is followed by any single character in the class "[345]", then by one or more of any character (".+"), and ends with "_vol".
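To see what the pattern does without dplyr, here is a minimal base R sketch. The vector cols is only an illustrative subset of the question's column names. For comparison, it also prints what the paste0() attempt from the question produces: one string per number, none of which contains the fruit name, so none can match a real column.

```r
# Illustrative subset of the column names from the question's dataset
cols <- c("Participant.Id", "x3", "mt2_oranges_vol", "mt3_oranges_vol",
          "mt4_pears_vol", "mt5_apples_vol", "mt6_apples_vol")

mt_range <- 3:5

# paste0() is vectorised: this yields "mt3_vol" "mt4_vol" "mt5_vol",
# which never occur in the column names because the fruit is missing
attempt <- paste0("mt", mt_range, "_vol")
print(attempt)

# Collapsing the range into a character class gives a single regex
pattern <- sprintf("^mt[%s].+_vol$", paste(mt_range, collapse = ""))

# grepl() flags every name the pattern matches
matched <- cols[grepl(pattern, cols)]
print(matched)
```

Note that a character class matches exactly one character, so this form assumes every number in the range is a single digit.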
You can turn this into a one-line function:
mt_range_builder <- \(x) sprintf("^mt[%s].+_vol$", paste(x, collapse = ""))
dataset |>
select(
Participant.Id,
matches(
mt_range_builder(mt_range)
)
)
# Participant.Id mt3_oranges_vol mt4_oranges_vol mt5_oranges_vol mt3_pears_vol mt4_pears_vol mt5_pears_vol mt3_apples_vol mt4_apples_vol mt5_apples_vol
# 1 1 120 130 NA 121 131 NA 122 132 NA
# 2 2 220 230 240 221 231 241 222 232 242
# 3 3 NA NA NA NA NA NA NA NA NA
# 4 4 420 430 NA 421 431 NA 422 432 NA
# 5 5 520 NA NA 521 NA NA 522 NA NA
A more general function could be:
range_builder <- function(prefix, range, suffix) {
sprintf(
"^%s[%s].+_%s$",
prefix,
paste(range, collapse = ""),
suffix
)
}
range_builder("mt", 3:5, "vol") == mt_range_builder(3:5) # TRUE
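One caveat with the character-class form: each digit is treated separately, so a range such as 9:11 would collapse to "[91011]" and match the digits 0, 1 and 9 rather than the numbers 9, 10 and 11. If multi-digit measurement numbers are possible, a variant based on alternation might be safer. The function name range_builder_alt is hypothetical, and the sketch assumes the names always follow the prefix-number, underscore, label, underscore, suffix layout seen in the question.

```r
# Hypothetical variant: join the numbers with "|" so each one is matched
# as a whole, and require "_" straight after the number
range_builder_alt <- function(prefix, range, suffix) {
  sprintf("^%s(%s)_.+_%s$", prefix, paste(range, collapse = "|"), suffix)
}

pattern <- range_builder_alt("mt", 9:11, "vol")
print(pattern)

# Made-up column names that include two-digit measurement numbers
cols <- c("mt1_oranges_vol", "mt9_oranges_vol", "mt10_pears_vol",
          "mt11_apples_vol", "mt12_apples_vol")
matched <- cols[grepl(pattern, cols)]
print(matched)
```

The resulting string can be passed to matches() in the same way as the output of range_builder().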
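In this case, it is much easier to reshape to long format than to work with the current wide format.
First, you should definitely get into the habit of putting the time index in a common suffix rather than a prefix. This is easy to do by using grep to identify the relevant mt column names. We strsplit at the underscores, rearrange the pieces, and paste them back together.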
mt <- grep('^mt', names(dataset))
names(dataset)[mt] <- strsplit(names(dataset)[mt], '_') |> sapply(\(x) paste(x[c(2, 3, 1)], collapse='_'))
The names now look like this: oranges_vol_mt1, oranges_vol_mt2, ...
Next, reshape to long format,
dataset_lng <- reshape(dataset, idvar=1, varying=-1, direction='long', sep='')
and subset to obtain the desired mt_range.
Participant.Id time x y z oranges_vol_mt pears_vol_mt apples_vol_mt
1.2 1 2 15 15 15 110 111 112
2.2 2 2 25 25 25 210 211 212
3.2 3 2 35 35 35 310 311 312
4.2 4 2 45 45 45 410 411 412
5.2 5 2 55 55 55 510 511 512
1.3 1 3 20 20 20 120 121 122
2.3 2 3 25 25 25 220 221 222
3.3 3 3 NA NA NA NA NA NA
4.3 4 3 45 45 45 420 421 422
5.3 5 3 NA NA NA 520 521 522
1.4 1 4 25 25 25 130 131 132
2.4 2 4 30 30 30 230 231 232
3.4 3 4 NA NA NA NA NA NA
4.4 4 4 50 50 50 430 431 432
5.4 5 4 NA NA NA NA NA NA
1.5 1 5 NA NA NA NA NA NA
2.5 2 5 35 35 35 240 241 242
3.5 3 5 NA NA NA NA NA NA
4.5 4 5 55 55 55 NA NA NA
5.5 5 5 NA NA NA NA NA NA
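The subset call itself is not shown above. A minimal sketch of what it might look like follows, using a small stand-in data frame (wide, with made-up dimensions) so that it runs on its own. It assumes the names have already been rearranged so that the number comes last, and it relies on the time column that reshape() creates by default.

```r
# Small stand-in for the renamed wide data: two participants, three time points
wide <- data.frame(
  Participant.Id  = 1:2,
  x1 = c(10L, 20L), x2 = c(15L, 25L), x3 = c(20L, 25L),
  oranges_vol_mt1 = c(100L, 200L),
  oranges_vol_mt2 = c(110L, 210L),
  oranges_vol_mt3 = c(120L, 220L)
)

# sep = "" lets reshape() split each name between its trailing letter and number
long <- reshape(wide, idvar = "Participant.Id", varying = -1,
                direction = "long", sep = "")

# Keep only the measurement numbers of interest
mt_range <- 2:3
result <- subset(long, time %in% mt_range)
print(result)
```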
Data:
> dput(dataset)
structure(list(Participant.Id = 1:5, x1 = c(10L, 20L, 30L, 40L,
50L), x2 = c(15L, 25L, 35L, 45L, 55L), x3 = c(20L, 25L, NA, 45L,
NA), x4 = c(25L, 30L, NA, 50L, NA), x5 = c(NA, 35L, NA, 55L,
NA), x6 = c(NA, 35L, NA, NA, NA), y1 = c(10L, 20L, 30L, 40L,
50L), y2 = c(15L, 25L, 35L, 45L, 55L), y3 = c(20L, 25L, NA, 45L,
NA), y4 = c(25L, 30L, NA, 50L, NA), y5 = c(NA, 35L, NA, 55L,
NA), y6 = c(NA, 35L, NA, NA, NA), z1 = c(10L, 20L, 30L, 40L,
50L), z2 = c(15L, 25L, 35L, 45L, 55L), z3 = c(20L, 25L, NA, 45L,
NA), z4 = c(25L, 30L, NA, 50L, NA), z5 = c(NA, 35L, NA, 55L,
NA), z6 = c(NA, 35L, NA, NA, NA), mt1_oranges_vol = c(100L, 200L,
300L, 400L, 500L), mt2_oranges_vol = c(110L, 210L, 310L, 410L,
510L), mt3_oranges_vol = c(120L, 220L, NA, 420L, 520L), mt4_oranges_vol = c(130L,
230L, NA, 430L, NA), mt5_oranges_vol = c(NA, 240L, NA, NA, NA
), mt6_oranges_vol = c(NA, NA, NA, NA, NA), mt1_pears_vol = c(101L,
201L, 301L, 401L, 501L), mt2_pears_vol = c(111L, 211L, 311L,
411L, 511L), mt3_pears_vol = c(121L, 221L, NA, 421L, 521L), mt4_pears_vol = c(131L,
231L, NA, 431L, NA), mt5_pears_vol = c(NA, 241L, NA, NA, NA),
mt6_pears_vol = c(NA, NA, NA, NA, NA), mt1_apples_vol = c(102L,
202L, 302L, 402L, 502L), mt2_apples_vol = c(112L, 212L, 312L,
412L, 512L), mt3_apples_vol = c(122L, 222L, NA, 422L, 522L
), mt4_apples_vol = c(132L, 232L, NA, 432L, NA), mt5_apples_vol = c(NA,
242L, NA, NA, NA), mt6_apples_vol = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
An example of the pivot-longer approach:
library(tidyverse) # you could load individual packages as a better practice
dataset <- dataset |> pivot_longer(cols = ends_with("vol"),
names_to = c("mt_vol", "fruit", NA),
names_sep = "_") |>
mutate(mt_vol = as.numeric(str_extract(mt_vol,"\\d+")))
This gives:
# A tibble: 90 × 22
Participant.Id x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 y6 z1 z2 z3 z4 z5 z6 mt_vol fruit value
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <chr> <int>
1 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 1 oranges 100
2 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 2 oranges 110
3 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 3 oranges 120
4 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 4 oranges 130
5 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 5 oranges NA
6 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 6 oranges NA
7 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 1 pears 101
8 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 2 pears 111
9 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 3 pears 121
10 1 10 15 20 25 NA NA 10 15 20 25 NA NA 10 15 20 25 NA NA 4 pears 131
# … with 80 more rows
# ℹ Use `print(n = ...)` to see more rows
Filtering and selecting then become very simple.
mt_range <- 3:5
dataset |>
filter(mt_vol %in% mt_range) |>
select(Participant.Id, mt_vol, fruit, value)
This gives:
# A tibble: 45 × 4
Participant.Id mt_vol fruit value
<int> <dbl> <chr> <int>
1 1 3 oranges 120
2 1 4 oranges 130
3 1 5 oranges NA
4 1 3 pears 121
5 1 4 pears 131
6 1 5 pears NA
7 1 3 apples 122
8 1 4 apples 132
9 1 5 apples NA
10 2 3 oranges 220
# … with 35 more rows
# ℹ Use `print(n = ...)` to see more rows
If needed, you can pivot the result back to wide format.
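Pivoting back could be sketched as follows with tidyr::pivot_wider(). The data frame long is a small stand-in for the filtered result, and the names_glue template is an assumption chosen to rebuild names in the original mtN_fruit_vol style.

```r
library(tidyr)

# Small stand-in for the filtered long data (one participant, two fruits)
long <- data.frame(
  Participant.Id = c(1L, 1L, 1L, 1L),
  mt_vol         = c(3, 4, 3, 4),
  fruit          = c("oranges", "oranges", "pears", "pears"),
  value          = c(120L, 130L, 121L, 131L)
)

# One column per (mt_vol, fruit) pair, named like the original wide columns
wide <- pivot_wider(
  long,
  names_from  = c(mt_vol, fruit),
  values_from = value,
  names_glue  = "mt{mt_vol}_{fruit}_vol"
)
print(wide)
```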