The following code generates the data (source):
# Set the seed for reproducibility
set.seed(123)
# Generate random data
n <- 490
PTSD <- sample(c(1, 2, NA), n, replace = TRUE) #class(PTSD) = "numeric"
ANX <- sample(c(1, 2, NA), n, replace = TRUE) #class(ANX) = "numeric"
DEP <- sample(c(1, 2, NA), n, replace = TRUE) #class(DEP) = "numeric"
# Create the data frame
df <- data.frame(PTSD, ANX, DEP) #class(df) = "data.frame"
# Label the values: 1 = Low, 2 = High
expss::val_lab(df$PTSD) = expss::num_lab("1 Low
2 High")
expss::val_lab(df$ANX) = expss::num_lab("1 Low
2 High")
expss::val_lab(df$DEP) = expss::num_lab("1 Low
2 High")
# Create a list of tables for each variable to count 1s, 2s, and NAs
count_results <- list(
PTSD = table(df$PTSD, useNA = "ifany"),
ANX = table(df$ANX, useNA = "ifany"),
DEP = table(df$DEP, useNA = "ifany")
)
This part of the code does some frequency counts and summarizes the data:
# Combine the count tables into a single table
count_table <- do.call(rbind, count_results)
# Initialize empty vectors to store results
variable_names <- character()
sample_sizes <- numeric()
# Loop through the count tables and record each variable's total sample size
for (variable_name in names(count_results)) {
sample_sizes <- c(sample_sizes, sum(count_results[[variable_name]]))
variable_names <- c(variable_names, variable_name)
}
# Create summary data frame
summary_df <- data.frame(
Variable = variable_names,
N = sample_sizes
)
# Combine the count table and the summary table by columns
final_result <- cbind(count_table, summary_df)
# Remove Variable column in the middle of the table
final_result <- subset(final_result, select = -c(Variable))
This part of the code performs what I call the "combination analysis" (it is one of the answers from the SO thread mentioned above):
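# Packages assumed by the calls below (map, flatten, parse_exprs, mutate, enframe, ...)
library(rlang)
library(purrr)
library(dplyr)
library(tibble)
t3 <- c("PTSD","ANX","DEP")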
combs <- map(seq_along(t3),\(n)combn(t3,n,simplify = FALSE)) |> flatten()
filts <- parse_exprs(map_chr(combs,\(x)paste0(x ,'== 2',collapse=' & ')))
filtsnames <- parse_exprs(map_chr(combs,\(x)paste0(x ,collapse=' + ')))
names(filts) <- filtsnames
out2 <- map_int(filts,\(x){
df %>%
mutate(id = row_number())%>%
filter(!!(x))%>%
summarise(
n = n())
} |> pull(n)
)
enframe(out2)
The last command produces this (which is what the author of that question asked for):
# A tibble: 7 × 2
name value
<chr> <int>
1 PTSD 167
2 ANX 156
3 DEP 156
4 PTSD + ANX 56
5 PTSD + DEP 52
6 ANX + DEP 51
7 PTSD + ANX + DEP 23
However, looking at it, there are more combinations than this. The corrected table below was produced in MS Excel with the function COUNTIFS(criteria_range1, criteria1, ...):
Combination n
---------------------------------
PTSD High 167
PTSD High, ANX Low 61
PTSD High, DEP Low 58
PTSD High, ANX High 56
PTSD High, DEP High 52
PTSD High, ANX Low, DEP Low 24
PTSD High, ANX High, DEP Low 14
PTSD High, ANX Low, DEP High 16
PTSD High, ANX High, DEP High 23
ANX High 156
ANX High, PTSD Low 50
ANX High, DEP Low 46
ANX High, DEP High 51
ANX High, PTSD Low, DEP Low 19
ANX High, PTSD Low, DEP High 14
DEP High 156
DEP High, PTSD Low 57
DEP High, ANX Low 52
DEP High, PTSD Low, ANX Low 17
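For readers more at home in R than in Excel, each COUNTIFS cell above is a row count under a conjunction of conditions. The following is a minimal base R sketch of two such cells; the `toy` data frame is hypothetical and much smaller than the data above, so its counts will not match the table.

```r
# Hypothetical toy data: 1 = Low, 2 = High, NA = missing
toy <- data.frame(
  PTSD = c(2, 2, 1, 2, NA),
  ANX  = c(1, 2, 2, NA, 1),
  DEP  = c(1, 1, 2, 2, 2)
)

# Counterpart of COUNTIFS(PTSD, 2, ANX, 1): rows with PTSD High and ANX Low.
# Comparisons involving NA yield NA (or FALSE); na.rm = TRUE leaves the NAs uncounted.
n_ptsd_high_anx_low <- sum(toy$PTSD == 2 & toy$ANX == 1, na.rm = TRUE)

# Counterpart of COUNTIFS(PTSD, 2): rows with PTSD High, whatever ANX and DEP are
n_ptsd_high <- sum(toy$PTSD == 2, na.rm = TRUE)
```

Here only the first row has PTSD High together with ANX Low, while rows 1, 2 and 4 have PTSD High.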
My question:
This is an interesting problem. I won't bother with expss:
set.seed(123)
# Generate random data
n <- 490
PTSD <- sample(c(1, 2, NA), n, replace = TRUE) #class(PTSD) = "numeric"
ANX <- sample(c(1, 2, NA), n, replace = TRUE) #class(ANX) = "numeric"
DEP <- sample(c(1, 2, NA), n, replace = TRUE) #class(DEP) = "numeric"
# Create the data frame
df <- data.frame(PTSD, ANX, DEP) #class(df) = "data.frame"
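Here is a general solution. First, create a list of vectors, one vector per column. Each vector runs from 0 to the number of levels in that column. Then use a subset of the expand.grid output as the set of combinations to count.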
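library(matrixStats) # assumed source of rowMaxs()
lvls <- rep(list(0:2), 3) # 2 = High, 1 = Low, 0 = Any value
mCombos <- as.matrix(expand.grid(lvls))
# Keep only the combinations in which at least one variable is High
mCombos <- mCombos[rowMaxs(mCombos) == 2,]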
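To see what the grid-and-filter step keeps, here is a small sketch with two variables instead of three. The column names `A` and `B` are hypothetical, and `apply(..., 1, max)` is used as a base R stand-in for `rowMaxs()` so that the snippet needs no extra package.

```r
# Two variables, so the full grid has 3^2 = 9 rows
lvls2 <- rep(list(0:2), 2)  # 2 = High, 1 = Low, 0 = Any value
grid2 <- as.matrix(expand.grid(lvls2))
colnames(grid2) <- c("A", "B")  # hypothetical variable names

# Keep the rows that contain at least one 2 (at least one variable High)
keep2 <- grid2[apply(grid2, 1, max) == 2, , drop = FALSE]
print(keep2)
```

Five of the nine rows survive the filter. With three variables the same filter keeps 27 - 8 = 19 combinations, which is the number of rows in the result table of this answer.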
Get df as a matrix and replace the NA values with 0.
m <- as.matrix(df)
m[is.na(m)] <- 0L
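For a combination represented by a row of mCombos, a row of m is counted if it matches the combination in every non-zero column, that is, if the squared distance between the two rows over those columns is zero. For example, for the combination c(2, 0, 1) (PTSD high, DEP low), we would count row 11 of m (c(2, 2, 1)) because (2 - 2)^2 + (1 - 1)^2 = 0. Comparing the sum of the products of the two rows with the sum of the squared elements of the combination is not enough on its own: for c(2, 1, 1), a row c(2, 2, 0) also satisfies 2*2 + 1*2 + 1*0 = 2^2 + 1^2 + 1^2 even though it does not match. Expanding the squared distance lets tcrossprod do the full set of comparisons:
mask <- (mCombos != 0) * 1 # 1 where the combination constrains a column
# Squared distance between each combination and each row of m over the constrained columns
d2 <- tcrossprod(mask, m^2) - 2 * tcrossprod(mCombos, m) + rowSums(mCombos^2)
setNames(
cbind(
as.data.frame(mCombos),
rowSums(d2 == 0)
),
c(names(df), "n")
)
#> PTSD ANX DEP n
#> 1 2 0 0 167
#> 2 2 1 0 61
#> 3 0 2 0 156
#> 4 1 2 0 50
#> 5 2 2 0 56
#> 6 2 0 1 58
#> 7 2 1 1 24
#> 8 0 2 1 46
#> 9 1 2 1 19
#> 10 2 2 1 14
#> 11 0 0 2 156
#> 12 1 0 2 57
#> 13 2 0 2 52
#> 14 0 1 2 52
#> 15 1 1 2 17
#> 16 2 1 2 16
#> 17 0 2 2 51
#> 18 1 2 2 14
#> 19 2 2 2 23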
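As a cross-check on counts like these, a slower but more explicit sketch can count, for one combination at a time, the rows that equal the combination in every non-zero column. The helper `count_combo` and the `toy` data are hypothetical and only illustrate the matching rule; they are not part of the original answer.

```r
# Hypothetical toy data: 1 = Low, 2 = High, NA = missing
toy <- data.frame(
  PTSD = c(2, 2, 1, 2, NA),
  ANX  = c(1, 2, 2, NA, 1),
  DEP  = c(1, 1, 2, 2, 2)
)
m_toy <- as.matrix(toy)
m_toy[is.na(m_toy)] <- 0  # same NA -> 0 recoding as in the answer

# Count the rows of m that agree with combo wherever combo is non-zero
# (0 in combo means "any value", so those columns are ignored)
count_combo <- function(combo, m) {
  active <- combo != 0
  sum(apply(m[, active, drop = FALSE], 1, function(r) all(r == combo[active])))
}

count_combo(c(2, 0, 1), m_toy)  # PTSD High, DEP Low
count_combo(c(2, 1, 1), m_toy)  # PTSD High, ANX Low, DEP Low
```

Running `count_combo` over every row of the combination matrix, for example with `apply(mCombos, 1, count_combo, m = m)`, should reproduce the `n` column, at the cost of an explicit loop over combinations.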