Conditionally select observations to sum across rows

Question

Suppose I have a dataset that looks like this:

set.seed(123)
data <- data.frame(var1_zscore = rnorm(100),
             var2_zscore = rnorm(100),
             var3_zscore = rnorm(100),
             var4_zscore = rnorm(100),
             var5_zcore = rnorm(100))

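I want to sum across each row, subject to a specific condition for each variable. For example, if var1_zscore is <= -1, add the absolute value of var1_zscore to the row sum. If var2_zscore is <= -1 | >= 1, add the absolute value of var2_zscore to the row sum. If var3_zscore is >= 1, add the absolute value of var3_zscore to the row sum. If var4_zscore is >= 1, add the absolute value of var4_zscore to the row sum. If var5_zscore is <= -1 | >= 1, add the absolute value of var5_zscore to the row sum.

The output I want is a column named row_sum that holds, for each row, the sum of the absolute values of those columns among var1_zscore:var5_zscore that meet their condition. For example, the first row of the data looks like this: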

var1_zscore	var2_zscore	var3_zscore	var4_zscore	var5_zscore
-0.560475647	-0.71040656	2.19881035	-0.71524219	-0.07355602

So data$row_sum for the first row would be 2.19881035.

I tried this:

data$row_sum<- rowSums(abs(data[,c((which(data$var1_zscore <= -1)),
                                    (which(data$var2_zscore <= -1 | data$var2_zscore >= 1)),
                                    (which(data$var3_zscore >= 1)),
                                    (which(data$var4_zscore >= 1)),
                                    (which(data$var5_zscore <= -1 | data$var5_zscore >= 1))
                                    )], na.rm = TRUE))

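But I get the error: "Can't subset columns past the end." (It then tells me which locations don't exist in the data.)

I think the problem is that I shouldn't be using the which function, but I'm not sure what else I could use here. Any help is greatly appreciated. Thank you very much!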

Answer 1

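We can create a logical index for the values to be considered in rowSums, replacing the values to be ignored with 0 beforehand. First, create a list of conditions. Then use mapply or purrr::map2 to loop over the columns and the list of conditions in pairs to create the index. Finally, inside a call to mutate, run rowSums on the data.frame with the ignored values replaced by 0.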

library(purrr)
library(dplyr)  # provides mutate()

#list of conditions:
conditions  <- list(\(x) x <= -1,
                    \(x) x <= -1 | x >= 1,
                    \(x) x >= 1,
                    \(x) x >= 1,
                    \(x) x <= -1 | x >= 1)

#create logical index:
index <- map2(data, conditions, ~.y(.x))

#create column with rowsums of logically indexed cells:
data |>
    mutate(sums = 
               map2_dfc(abs(data),
                        index,
                        ~ifelse(.y,
                                .x,
                                0
                                )
                        ) |>
               rowSums()
           ) |>
    head(10)

   var1_zscore var2_zscore var3_zscore var4_zscore  var5_zcore     sums
1  -0.56047565 -0.71040656  2.19881035  -0.7152422 -0.07355602 2.198810
2  -0.23017749  0.25688371  1.31241298  -0.7526890 -1.16865142 2.481064
3   1.55870831 -0.24669188 -0.26514506  -0.9385387 -0.63474826 0.000000
4   0.07050839 -0.34754260  0.54319406  -1.0525133 -0.02884155 0.000000
5   0.12928774 -0.95161857 -0.41433995  -0.4371595  0.67069597 0.000000
6   1.71506499 -0.04502772 -0.47624689   0.3311792 -1.65054654 1.650547
7   0.46091621 -0.78490447 -0.78860284  -2.0142105 -0.34975424 0.000000
8  -1.26506123 -1.66794194 -0.59461727   0.2119804  0.75640644 2.933003
9  -0.68685285 -0.38022652  1.65090747   1.2366750 -0.53880916 2.887583
10 -0.44566197  0.91899661 -0.05402813   2.0375740  0.22729192 2.037574
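The text above mentions mapply as an alternative, but only the purrr version is shown. A minimal base R sketch of that variant might look like the following. It assumes the same five-column example data as in the question, and the column name row_sum_base is a hypothetical name chosen only for this illustration.

```r
# Same example data as in the question (note the var5_zcore spelling).
set.seed(123)
data <- data.frame(var1_zscore = rnorm(100),
                   var2_zscore = rnorm(100),
                   var3_zscore = rnorm(100),
                   var4_zscore = rnorm(100),
                   var5_zcore  = rnorm(100))

# One condition per column, in column order.
conditions <- list(function(x) x <= -1,
                   function(x) x <= -1 | x >= 1,
                   function(x) x >= 1,
                   function(x) x >= 1,
                   function(x) x <= -1 | x >= 1)

# mapply() pairs each column with its condition. Every call returns a
# vector of length nrow(data), so the result simplifies to a 100 x 5
# matrix holding abs(x) where the condition holds and 0 elsewhere.
kept <- mapply(function(col, cond) ifelse(cond(col), abs(col), 0),
               data, conditions)

# row_sum_base is a hypothetical column name used for this sketch.
data$row_sum_base <- rowSums(kept)
head(data, 3)
```

Because the conditions are matched to the columns by position, this only works if the list is in the same order as the columns and the data frame contains nothing but the score columns.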

Answer 2

Here is a solution that uses across inside rowSums:

library(dplyr)

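# Select the five score columns by prefix: the last column is spelled
# var5_zcore in the data, so endsWith(names(data), "zscore") would skip it.
name <- grep("^var[0-9]+_", names(data), value = TRUE)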
lf <- list(
  \(x) ifelse({x} <= -1, abs({x}), 0),
  \(x) ifelse({x} <= -1 | {x} >= 1, abs({x}), 0),
  \(x) ifelse({x} >= 1, abs({x}), 0),
  \(x) ifelse({x} >= 1, abs({x}), 0),
  \(x) ifelse({x} <= -1 | {x} >= 1, abs({x}), 0)
) %>% 
  setNames(name)

data %>% 
  mutate(row_sums = rowSums(across(all_of(name), ~ lf[[cur_column()]](.))))

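across applies a different function to each column: the functions are stored in lf under the column names, and cur_column() is used to look up the one that belongs to the column currently being processed. The output of across is a data frame of those columns with the functions applied. We then simply take the row sums.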
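To see the per-column lookup in isolation, here is a small sketch on toy data whose values can be checked by hand. The names toy, rules, result and total are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Toy data with hand-checkable values.
toy <- data.frame(a = c(-2, 0.5, 3),
                  b = c(1.5, -0.2, -4))

# One function per column, keyed by column name.
rules <- list(
  a = function(x) ifelse(x <= -1, abs(x), 0),           # keep only the low tail
  b = function(x) ifelse(x <= -1 | x >= 1, abs(x), 0)   # keep both tails
)

# cur_column() returns the name of the column across() is currently
# processing, which is used here to pick the matching function.
result <- toy %>%
  mutate(total = rowSums(across(all_of(names(rules)),
                                ~ rules[[cur_column()]](.x))))
result
# total is 3.5, 0, 4
```

In the first row a = -2 contributes 2 and b = 1.5 contributes 1.5. In the third row a = 3 is ignored because only the low tail counts for a, while b = -4 contributes 4.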

Output

     var1_zscore var2_zscore var3_zscore var4_zscore  var5_zcore row_sums
1   -0.560475647 -0.71040656  2.19881035 -0.71524219 -0.07355602 2.198810
2   -0.230177489  0.25688371  1.31241298 -0.75268897 -1.16865142 2.481064
3    1.558708314 -0.24669188 -0.26514506 -0.93853870 -0.63474826 0.000000
4    0.070508391 -0.34754260  0.54319406 -1.05251328 -0.02884155 0.000000
5    0.129287735 -0.95161857 -0.41433995 -0.43715953  0.67069597 0.000000
6    1.715064987 -0.04502772 -0.47624689  0.33117917 -1.65054654 1.650547
7    0.460916206 -0.78490447 -0.78860284 -2.01421050 -0.34975424 0.000000
8   -1.265061235 -1.66794194 -0.59461727  0.21198043  0.75640644 2.933003
9   -0.686852852 -0.38022652  1.65090747  1.23667505 -0.53880916 2.887583
10  -0.445661970  0.91899661 -0.05402813  2.03757402  0.22729192 2.037574

Data

set.seed(123)
data <- data.frame(
  other_variable = sample(letters, 100, replace = TRUE),
  var1_zscore = rnorm(100),
  var2_zscore = rnorm(100),
  var3_zscore = rnorm(100),
  var4_zscore = rnorm(100),
  var5_zcore = rnorm(100))
