Computing row-wise Student's t-tests between selected variables in a data frame - R

Question (votes: 0, answers: 3)

I have a data frame of individuals, each with scores on several criteria. Here is an example data frame with 10 individuals and 7 scores each (A to G):

set.seed(123)
df <- data.frame(
  A = rnorm(10, mean = 0, sd = 1),
  B = rnorm(10, mean = 0, sd = 1),
  C = rnorm(10, mean = 0, sd = 1),
  D = rnorm(10, mean = 0, sd = 1),
  E = rnorm(10, mean = 0, sd = 1),
  F = rnorm(10, mean = 0, sd = 1),
  G = rnorm(10, mean = 0, sd = 1)
)

# OUTPUT
             A          B          C           D           E           F           G
1  -0.56047565  1.2240818 -1.0678237  0.42646422 -0.69470698  0.25331851  0.37963948
2  -0.23017749  0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345
3   1.55870831  0.4007715 -1.0260044  0.89512566 -1.26539635 -0.04287046 -0.33320738
4   0.07050839  0.1106827 -0.7288912  0.87813349  2.16895597  1.36860228 -1.01857538
5   0.12928774 -0.5558411 -0.6250393  0.82158108  1.20796200 -0.22577099 -1.07179123
6   1.71506499  1.7869131 -1.6866933  0.68864025 -1.12310858  1.51647060  0.30352864
7   0.46091621  0.4978505  0.8377870  0.55391765 -0.40288484 -1.54875280  0.44820978
8  -1.26506123 -1.9666172  0.1533731 -0.06191171 -0.46665535  0.58461375  0.05300423
9  -0.68685285  0.7013559 -1.1381369 -0.30596266  0.77996512  0.12385424  0.92226747
10 -0.44566197 -0.4727914  1.2538149 -0.38047100 -0.08336907  0.21594157  2.05008469

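I also have two lists of user-selected variable names that split the scores into two groups (they need not cover every column and can change dynamically):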

list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

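Now I need to run t.test() between the two groups of scores for each individual and report the p-value in a new column, pval. I have managed to do this with a simple for loop, but it is slow, and I am sure there is a more elegant way: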

df$pval <- vector(mode = "list", length = dim(df)[1])

for (i in 1:dim(df)[1]) {
  df[i, "pval"] <- unlist(t.test(df[i, list1], df[i, list2])$p.value)
}

I would appreciate help understanding how to do this with, for example, the map() or apply() families of functions.

ChatGPT suggested using map2_dbl() inside mutate(), but I could not get it to work.

r dplyr statistics apply purrr
3 Answers
0 votes

You can certainly use a parallel loop to speed up the computation. For example:

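library(foreach)
library(doSNOW)
library(parallel)

# Start a cluster, leaving two cores free (but always at least one worker)
cl <- makeCluster(max(1, detectCores() - 2))
registerDoSNOW(cl)

# Workers cannot modify df in place, so each iteration returns one p-value
# and .combine = c collects them into a numeric vector
df$pval <- foreach(i = seq_len(nrow(df)), .combine = c) %dopar% {
  t.test(unlist(df[i, list1]), unlist(df[i, list2]))$p.value
}

stopCluster(cl)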
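For small or medium data frames, the per-row test can also be written with the apply() family the question asks about, without a cluster. The following is only a minimal base R sketch: it rebuilds the example data from the question, and the intermediate matrix `m` is a name introduced here for illustration.

```r
# Rebuild the example data from the question
set.seed(123)
df <- data.frame(
  A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10),
  E = rnorm(10), F = rnorm(10), G = rnorm(10)
)
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

# Work on a numeric matrix so that each row reaches the function
# as a numeric vector named by column
m <- as.matrix(df[, c(list1, list2)])

# One t.test() per row; apply() returns a plain numeric vector of p-values
df$pval <- apply(m, 1, function(r) t.test(r[list1], r[list2])$p.value)
```

This still calls t.test() once per row, so it is tidier than the for loop rather than dramatically faster; pval ends up as an ordinary numeric column instead of a list column.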

0 votes

For anyone who runs into the same problem, I found a very efficient approach: the Rfast package does this almost instantly. First, the original for loop on 10,000 samples:

set.seed(123)
df <- data.frame(
  A = rnorm(10000, mean = 0, sd = 1),
  B = rnorm(10000, mean = 0, sd = 1),
  C = rnorm(10000, mean = 0, sd = 1),
  D = rnorm(10000, mean = 0, sd = 1),
  E = rnorm(10000, mean = 0, sd = 1),
  F = rnorm(10000, mean = 0, sd = 1),
  G = rnorm(10000, mean = 0, sd = 1)
)

list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

df$pval <- vector(mode = "list", length = dim(df)[1])

system.time(for (i in 1:dim(df)[1]) {
  df[i, "pval"] <- unlist(t.test(df[i, list1], df[i, list2])$p.value)
})

#RUNTIME
   user  system elapsed 
  6.078   0.367   6.455 

Now using the Rfast::ttests() function:

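# Rfast::ttests() tests column by column, so transpose to put one individual per column
# (base R calls are used here so that no pipe package has to be loaded)
df1 <- t(as.matrix(df[, list1]))
df2 <- t(as.matrix(df[, list2]))

system.time(Rfast::ttests(df1, df2))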

#RUNTIME
   user  system elapsed 
  0.004   0.000   0.004 

The loop runs in C++ rather than in R, which makes it far more efficient.


0 votes

Another approach is to write a vectorized function. Note that you could make it even faster with Rcpp:

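compute_pvals <- function(df, groups) {
  # Map each selected column name to its group index (1 or 2)
  nms <- setNames(rep(seq_along(groups), lengths(groups)), unlist(groups))
  d <- df[names(nms)]
  # For every (row, group) cell: mean, degrees of freedom (n - 1), and var / n
  a <- tapply(unlist(d), list(row(d), nms[col(d)]),
              \(x) c(mean = mean(x), n = length(x) - 1, v = var(x) / length(x)))

  ls1 <- do.call(rbind, a[, 1])
  ls2 <- do.call(rbind, a[, 2])
  vx <- ls1[, "v"]
  vy <- ls2[, "v"]
  # Standard error of the difference in means
  se <- sqrt(vx + vy)
  # Welch-Satterthwaite degrees of freedom
  dof <- se^4 / (vx^2 / ls1[, "n"] + vy^2 / ls2[, "n"])
  tstat <- (ls1[, "mean"] - ls2[, "mean"]) / se
  # Two-sided p-value
  unname(2 * pt(-abs(tstat), dof))
}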
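For readers matching the code to the statistics: the function above appears to compute Welch's unequal-variance t-test, which is also the default of t.test(). The formulas below restate what it calculates for each row i. The symbols are introduced here for illustration only: n_x and n_y are the group sizes, s² is the sample variance, and F_ν is the CDF of the t distribution with ν degrees of freedom (pt() in R). Note that the code stores n − 1 in its `n` column.

```latex
v_{x,i} = \frac{s_{x,i}^{2}}{n_x}, \qquad
v_{y,i} = \frac{s_{y,i}^{2}}{n_y}

t_i = \frac{\bar{x}_i - \bar{y}_i}{\sqrt{v_{x,i} + v_{y,i}}}

\nu_i = \frac{\left(v_{x,i} + v_{y,i}\right)^{2}}
             {\dfrac{v_{x,i}^{2}}{n_x - 1} + \dfrac{v_{y,i}^{2}}{n_y - 1}}

p_i = 2\,F_{\nu_i}\!\left(-\lvert t_i \rvert\right)
```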

Comparing the two:

compute_pvals(df, list(list1, list2))
[1] 0.5647411 0.3371633 0.9241260 0.7795392 0.8109423 0.9797499 0.2433040 0.9377242 0.4726551
[10] 0.7187993

for (i in 1:dim(df)[1]) {
    df[i, "pval"] <- t.test(df[i, list1], df[i, list2])$p.value
 }
df$pval
 [1] 0.5647411 0.3371633 0.9241260 0.7795392 0.8109423 0.9797499 0.2433040 0.9377242 0.4726551
[10] 0.7187993
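If Rcpp feels like too much, the same Welch computation can probably be sped up further in base R by replacing tapply() with matrix operations. The sketch below is an assumption-laden illustration, not part of the original answers: `rowwise_welch_p` is a hypothetical helper name, and it assumes no missing values and at least two columns per group.

```r
# Hypothetical helper: row-wise Welch t-test p-values using only matrix operations
rowwise_welch_p <- function(df, cols1, cols2) {
  x <- as.matrix(df[, cols1, drop = FALSE])
  y <- as.matrix(df[, cols2, drop = FALSE])
  nx <- ncol(x)
  ny <- ncol(y)
  mx <- rowMeans(x)
  my <- rowMeans(y)
  # Per-row sample variance divided by group size
  # (x - mx recycles down the columns, so each row loses its own mean)
  vx <- rowSums((x - mx)^2) / (nx - 1) / nx
  vy <- rowSums((y - my)^2) / (ny - 1) / ny
  tstat <- (mx - my) / sqrt(vx + vy)
  # Welch-Satterthwaite degrees of freedom
  dof <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
  unname(2 * pt(-abs(tstat), dof))
}

# Check against t.test() on the example data from the question
set.seed(123)
df <- data.frame(
  A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10),
  E = rnorm(10), F = rnorm(10), G = rnorm(10)
)
list1 <- c("B", "C", "D")
list2 <- c("A", "E", "G")

p_fast <- rowwise_welch_p(df, list1, list2)
p_loop <- vapply(
  seq_len(nrow(df)),
  function(i) t.test(unlist(df[i, list1]), unlist(df[i, list2]))$p.value,
  numeric(1)
)
all.equal(p_fast, p_loop)
```

Because every step is a whole-matrix operation, the work happens in compiled code inside rowMeans(), rowSums(), and pt() rather than in an R-level loop.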