I have a data frame of individuals with scores on several criteria. Here is an example data frame with 10 individuals, each with 7 scores (A to G):
set.seed(123)
df <- data.frame(
A = rnorm(10, mean = 0, sd = 1),
B = rnorm(10, mean = 0, sd = 1),
C = rnorm(10, mean = 0, sd = 1),
D = rnorm(10, mean = 0, sd = 1),
E = rnorm(10, mean = 0, sd = 1),
F = rnorm(10, mean = 0, sd = 1),
G = rnorm(10, mean = 0, sd = 1)
)
# OUTPUT
A B C D E F G
1 -0.56047565 1.2240818 -1.0678237 0.42646422 -0.69470698 0.25331851 0.37963948
2 -0.23017749 0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345
3 1.55870831 0.4007715 -1.0260044 0.89512566 -1.26539635 -0.04287046 -0.33320738
4 0.07050839 0.1106827 -0.7288912 0.87813349 2.16895597 1.36860228 -1.01857538
5 0.12928774 -0.5558411 -0.6250393 0.82158108 1.20796200 -0.22577099 -1.07179123
6 1.71506499 1.7869131 -1.6866933 0.68864025 -1.12310858 1.51647060 0.30352864
7 0.46091621 0.4978505 0.8377870 0.55391765 -0.40288484 -1.54875280 0.44820978
8 -1.26506123 -1.9666172 0.1533731 -0.06191171 -0.46665535 0.58461375 0.05300423
9 -0.68685285 0.7013559 -1.1381369 -0.30596266 0.77996512 0.12385424 0.92226747
10 -0.44566197 -0.4727914 1.2538149 -0.38047100 -0.08336907 0.21594157 2.05008469
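I also have two user-selected lists of variable names that split the scores into two groups (they do not have to include every column, and they can change dynamically):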
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")
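Now I need to run t.test() between the two groups of scores for each individual and report the p-value in a new column, pval. I have managed to do this with a simple for loop, but it is slow, and I am sure there is a more elegant way: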
df$pval <- vector(mode = "list", length = dim(df)[1])
for (i in 1:dim(df)[1]) {
df[i, "pval"] <- unlist(t.test(df[i, list1], df[i, list2])$p.value)
}
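I would appreciate help understanding how to do this with, for example, map() or the apply() family of functions.
ChatGPT suggested using map2_dbl() inside mutate(), but I could not get it to work.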
You can certainly use a parallel loop to speed up the computation. For example:
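library(tidyverse)
library(foreach)
library(doSNOW)
library(parallel)
# detectCores() must be called as a function; leave two cores free, but use at least one
cl <- makeCluster(max(1, detectCores() - 2))
registerDoSNOW(cl)
# Each iteration returns one p-value. Assigning into df inside %dopar%
# would only change the worker's copy, so collect the results instead.
result <- foreach(i = seq_len(nrow(df)), .combine = c) %dopar% {
  t.test(unlist(df[i, list1]), unlist(df[i, list2]))$p.value
}
stopCluster(cl)
df$pval <- result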
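Since the question asks specifically about the apply() family, here is a minimal base-R sketch of the same row-wise test without an explicit for loop. It reuses the example data and the names list1 and list2 from the question; the intermediate names m1, m2, pval_vapply, and pval_apply are made up for illustration. This mainly tidies the loop rather than speeding it up much, because t.test() is still called once per row.

```r
# Same example data as in the question
set.seed(123)
df <- data.frame(
  A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10),
  E = rnorm(10), F = rnorm(10), G = rnorm(10)
)
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

# Convert each group to a numeric matrix once, so row extraction is cheap
m1 <- as.matrix(df[list1])
m2 <- as.matrix(df[list2])

# vapply() checks that every row yields exactly one numeric p-value
pval_vapply <- vapply(
  seq_len(nrow(df)),
  function(i) t.test(m1[i, ], m2[i, ])$p.value,
  numeric(1)
)

# Equivalent apply() form: each row arrives as a named numeric vector,
# so the two groups can be picked out by column name
pval_apply <- apply(
  as.matrix(df[c(list1, list2)]), 1,
  function(r) t.test(r[list1], r[list2])$p.value
)

df$pval <- pval_vapply
```

With purrr loaded, the vapply() call could presumably be swapped for map_dbl() over the same row indices; the base-R version is shown here only to avoid an extra dependency.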
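For anyone who runs into the same problem, I found a very efficient way to do this almost instantly with the Rfast package. Here is the original for loop run on 10,000 samples: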
set.seed(123)
df <- data.frame(
A = rnorm(10000, mean = 0, sd = 1),
B = rnorm(10000, mean = 0, sd = 1),
C = rnorm(10000, mean = 0, sd = 1),
D = rnorm(10000, mean = 0, sd = 1),
E = rnorm(10000, mean = 0, sd = 1),
F = rnorm(10000, mean = 0, sd = 1),
G = rnorm(10000, mean = 0, sd = 1)
)
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")
df$pval <- vector(mode = "list", length = dim(df)[1])
system.time(for (i in 1:dim(df)[1]) {
df[i, "pval"] <- unlist(t.test(df[i, list1], df[i, list2])$p.value)
})
#RUNTIME
user system elapsed
6.078 0.367 6.455
Now using the Rfast::ttests() function:
df1 <- df[, list1] %>% as.matrix() %>% t()
df2 <- df[, list2] %>% as.matrix() %>% t()
system.time(Rfast::ttests(df1, df2))
#RUNTIME
user system elapsed
0.004 0.000 0.004
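The loop runs in C++ rather than in R, which makes it far more efficient.
Another approach is to write a vectorized function. Note that you could use Rcpp to make it even faster: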
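compute_pvals <- function(df, groups){
  # Map each selected column name to the index of its group (1 or 2)
  nms <- setNames(rep(seq_along(groups), lengths(groups)), unlist(groups))
  d <- df[names(nms)]
  # For every (row, group) cell: mean, degrees of freedom (n - 1),
  # and squared standard error of the mean (variance / n)
  a <- tapply(unlist(d), list(row(d), nms[col(d)]),
    \(x)c(mean = mean(x), n = length(x)-1, v = var(x)/length(x)))
  ls1 <- do.call(rbind, a[,1])
  ls2 <- do.call(rbind, a[,2])
  vx <- ls1[,'v']
  vy <- ls2[,'v']
  stderr <- sqrt(vx + vy)
  # Welch-Satterthwaite degrees of freedom
  dof <- stderr^4/(vx^2/ls1[,'n'] + vy^2/ls2[,'n'])
  tstat <- (ls1[,'mean'] - ls2[,'mean'])/stderr
  # Two-sided p-value
  unname(2 * pt(-abs(tstat), dof))
}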
Comparing the two on the 10-row example:
compute_pvals(df, list(list1, list2))
[1] 0.5647411 0.3371633 0.9241260 0.7795392 0.8109423 0.9797499 0.2433040 0.9377242 0.4726551
[10] 0.7187993
for (i in 1:dim(df)[1]) {
df[i, "pval"] <- t.test(df[i, list1], df[i, list2])$p.value
}
df$pval
[1] 0.5647411 0.3371633 0.9241260 0.7795392 0.8109423 0.9797499 0.2433040 0.9377242 0.4726551
[10] 0.7187993
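The Welch computation behind the vectorized answer can also be sketched with plain matrix operations, which may make the formula easier to follow. The helper welch_rowwise() below is hypothetical (it is not part of any answer above or of any package), and it assumes that there are no missing values and that each group has at least two columns. The last lines compare it against t.test() called row by row with all.equal().

```r
# Hypothetical helper: row-wise Welch t-test p-values for two sets of
# columns, computed with matrix operations only (assumes no NA values)
welch_rowwise <- function(df, cols1, cols2) {
  x <- as.matrix(df[cols1])
  y <- as.matrix(df[cols2])
  nx <- ncol(x)
  ny <- ncol(y)
  mx <- rowMeans(x)
  my <- rowMeans(y)
  # Per-row sample variance divided by group size (squared standard error);
  # x - mx recycles mx down the columns, so row i has mx[i] subtracted
  vx <- rowSums((x - mx)^2) / (nx - 1) / nx
  vy <- rowSums((y - my)^2) / (ny - 1) / ny
  se <- sqrt(vx + vy)
  # Welch-Satterthwaite degrees of freedom
  dof <- se^4 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
  tstat <- (mx - my) / se
  # Two-sided p-value from the t distribution
  unname(2 * pt(-abs(tstat), dof))
}

# Example data and groups as in the question
set.seed(123)
df <- data.frame(
  A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10),
  E = rnorm(10), F = rnorm(10), G = rnorm(10)
)
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

p_fast <- welch_rowwise(df, list1, list2)

# Reference values: t.test() (Welch by default) called once per row
p_ref <- vapply(
  seq_len(nrow(df)),
  function(i) t.test(unlist(df[i, list1]), unlist(df[i, list2]))$p.value,
  numeric(1)
)

all.equal(p_fast, p_ref)
```

Because everything is done on whole matrices, the cost grows with the number of cells rather than with the number of t.test() calls, which is the same reason the Rfast approach is fast.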