Frequency counts and all-combinations analysis in R?

The following code generates the data (source):

# Set the seed for reproducibility
set.seed(123)

# Generate random data
n <- 490
PTSD <- sample(c(1, 2, NA), n, replace = TRUE) #class(PTSD) = "numeric"
ANX <- sample(c(1, 2, NA), n, replace = TRUE) #class(ANX) = "numeric"
DEP <- sample(c(1, 2, NA), n, replace = TRUE) #class(DEP) = "numeric"

# Create the data frame
df <- data.frame(PTSD, ANX, DEP) #class(df) = "data.frame"

# Label the values: 1 = Low, 2 = High
expss::val_lab(df$PTSD) = expss::num_lab("1 Low
                                        2 High")
expss::val_lab(df$ANX) = expss::num_lab("1 Low
                                        2 High")
expss::val_lab(df$DEP) = expss::num_lab("1 Low
                                        2 High")

# Create a list of tables for each variable to count 1s, 2s, and NAs
count_results <- list(
  PTSD = table(df$PTSD, useNA = "ifany"),
  ANX = table(df$ANX, useNA = "ifany"),
  DEP = table(df$DEP, useNA = "ifany")
)

This part of the code performs some frequency counts and summarizes the data:

# Combine the count tables into a single table
count_table <- do.call(rbind, count_results)

# Initialize empty vectors to store results
variable_names <- character()
sample_sizes <- numeric()

# Loop through the count tables and collect each variable's total (NAs included)
for (variable_name in names(count_results)) {
  sample_sizes <- c(sample_sizes, sum(count_results[[variable_name]]))
  variable_names <- c(variable_names, variable_name)
}

# Create summary data frame
summary_df <- data.frame(
  Variable = variable_names,
  N = sample_sizes
)

# Combine the count table and the summary table by columns
final_result <- cbind(count_table, summary_df)

# Remove Variable column in the middle of the table
final_result <- subset(final_result, select = -c(Variable))
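
The loop and the cbind/subset steps above can be condensed. The following is a minimal base R sketch, not the original author's code; the column names Low, High and NA are assumptions chosen here for readability, and the data generation is repeated so that the block runs on its own.

```r
set.seed(123)
n <- 490
df <- data.frame(
  PTSD = sample(c(1, 2, NA), n, replace = TRUE),
  ANX  = sample(c(1, 2, NA), n, replace = TRUE),
  DEP  = sample(c(1, 2, NA), n, replace = TRUE)
)

# One row per variable: counts of 1, 2 and NA.
# Fixing the factor levels and using useNA = "always" guarantees
# three columns even if a value never occurs in a variable.
counts <- t(sapply(df, function(x) {
  table(factor(x, levels = c(1, 2)), useNA = "always")
}))
colnames(counts) <- c("Low", "High", "NA")

# Append the total number of observations per variable
final_result <- cbind(as.data.frame(counts), N = rowSums(counts))
final_result
```

Because the NA count is included in each row, N equals the number of rows in df for every variable.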

This part of the code performs what I call the "combination analysis" (it is one of the answers from the SO thread mentioned above):

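library(rlang)  # parse_exprs()
library(dplyr)  # mutate(), filter(), summarise(), pull(), %>%
library(purrr)  # map(), map_chr(), map_int(), flatten()
library(tibble) # enframe()

t3 <- c("PTSD","ANX","DEP")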

combs <- map(seq_along(t3),\(n)combn(t3,n,simplify = FALSE)) |> flatten()

filts <- parse_exprs(map_chr(combs,\(x)paste0(x ,'== 2',collapse=' & ')))
filtsnames <- parse_exprs(map_chr(combs,\(x)paste0(x ,collapse=' + ')))
names(filts) <- filtsnames

out2 <- map_int(filts,\(x){
  df %>%
    mutate(id = row_number())%>%
    filter(!!(x))%>%
    summarise(
      n = n())
} |> pull(n)
)

enframe(out2)

The last command produces this (which is what the author of that question asked for):

# A tibble: 7 × 2
  name             value
  <chr>            <int>
1 PTSD               167
2 ANX                156
3 DEP                156
4 PTSD + ANX          56
5 PTSD + DEP          52
6 ANX + DEP           51
7 PTSD + ANX + DEP    23

However, on closer inspection there are more combinations than this. The corrected table below was generated in MS Excel with the function COUNTIFS(criteria_range1, criteria1, ...):

Combination                     n
---------------------------------
PTSD High                     167
PTSD High, ANX Low             61
PTSD High, DEP Low             58
PTSD High, ANX High            56
PTSD High, DEP High            52
PTSD High, ANX Low, DEP Low    24
PTSD High, ANX High, DEP Low   14
PTSD High, ANX Low, DEP High   16
PTSD High, ANX High, DEP High  23
    
ANX High                      156
ANX High, PTSD Low             50
ANX High, DEP Low              46
ANX High, DEP High             51
ANX High, PTSD Low, DEP Low    19
ANX High, PTSD Low, DEP High   14
    
DEP High                      156
DEP High, PTSD Low             57
DEP High, ANX Low              52
DEP High, PTSD Low, ANX Low    17
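
Each cell of the table above is a count of rows that satisfy several conditions at once, which is what COUNTIFS computes. A minimal base R sketch of one such count follows; count_if_all is a hypothetical helper name introduced here for illustration, not a function from any package, and the toy data frame is made up.

```r
# count_if_all() is a hypothetical helper, not part of any package.
# data:       a data frame
# conditions: a named numeric vector, e.g. c(PTSD = 2, ANX = 1)
count_if_all <- function(data, conditions) {
  hits <- rep(TRUE, nrow(data))
  for (v in names(conditions)) {
    # A missing value never satisfies a condition
    hits <- hits & !is.na(data[[v]]) & data[[v]] == conditions[[v]]
  }
  sum(hits)
}

# Toy data (made up) to show the behaviour with NAs
toy <- data.frame(A = c(2, 2, NA, 1), B = c(1, NA, 1, 1))
count_if_all(toy, c(A = 2))         # rows 1 and 2 qualify
count_if_all(toy, c(A = 2, B = 1))  # only row 1 qualifies
```

Applied to the question's data, a call such as count_if_all(df, c(PTSD = 2, ANX = 1)) would correspond to the "PTSD High, ANX Low" row.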

My question:

  • Assuming no combination is missing from the table above, what R code would produce all combinations and their frequencies?
r combinations frequency
1 Answer

This is an interesting problem. I won't bother with expss:

set.seed(123)

# Generate random data
n <- 490
PTSD <- sample(c(1, 2, NA), n, replace = TRUE) #class(PTSD) = "numeric"
ANX <- sample(c(1, 2, NA), n, replace = TRUE) #class(ANX) = "numeric"
DEP <- sample(c(1, 2, NA), n, replace = TRUE) #class(DEP) = "numeric"

# Create the data frame
df <- data.frame(PTSD, ANX, DEP) #class(df) = "data.frame"

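Here is a general solution. First, create a list of vectors, one per column. Each vector runs from 0 to the number of levels in that column. Then use a subset of the expand.grid result as the set of combinations to count.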

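library(matrixStats) # assumed source of rowMaxs(); the answer does not name the package

lvls <- rep(list(0:2), 3) # 2 = High, 1 = Low, 0 = Any value
mCombos <- as.matrix(expand.grid(lvls))
mCombos <- mCombos[rowMaxs(mCombos) == 2,] # keep combinations with at least one High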
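
To see what this filtering step yields without an extra package, here is a small base R sketch that uses apply() in place of rowMaxs(). The name combos and the column names are assumptions made for this illustration.

```r
# All 3^3 = 27 codings of three variables with values 0 (any), 1 (Low), 2 (High)
lvls <- rep(list(0:2), 3)
combos <- as.matrix(expand.grid(lvls))
colnames(combos) <- c("PTSD", "ANX", "DEP")

# Keep only the codings that contain at least one High (2)
keep <- apply(combos, 1, max) == 2
combos <- combos[keep, ]

nrow(combos)  # 27 codings minus the 2^3 = 8 without a 2 leaves 19
```

The 19 remaining rows match the 19 lines of the Excel table in the question.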

Get df as a matrix and replace the NA values with 0:

m <- as.matrix(df)
m[is.na(m)] <- 0L

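For a combination represented by a row of mCombos, a row of m should be counted if it equals the combination in every column where the combination is nonzero. Testing whether the sum of the products of the two rows equals the combination's sum of squares is not sufficient: for c(2, 1, 1) the rows c(2, 2, 0), c(2, 0, 2), and c(1, 2, 2) also give 6, which inflates the count of every combination with two Low values. Counting the matching columns level by level avoids this. For example, for the combination c(2, 0, 1) (PTSD High, DEP Low), row 11 of m (c(2, 2, 1)) is counted because it matches in both nonzero columns. We can use tcrossprod to make the full set of comparisons:

# number of columns in which each row of m matches each combination
matches <- tcrossprod(+(mCombos == 1), +(m == 1)) +
  tcrossprod(+(mCombos == 2), +(m == 2))

setNames(
  cbind(
    as.data.frame(mCombos),
    # a row of m counts when it matches in every nonzero column
    rowSums(matches == rowSums(mCombos != 0))
  ),
  c(names(df), "n")
)
#>    PTSD ANX DEP   n
#> 1     2   0   0 167
#> 2     2   1   0  61
#> 3     0   2   0 156
#> 4     1   2   0  50
#> 5     2   2   0  56
#> 6     2   0   1  58
#> 7     2   1   1  24
#> 8     0   2   1  46
#> 9     1   2   1  19
#> 10    2   2   1  14
#> 11    0   0   2 156
#> 12    1   0   2  57
#> 13    2   0   2  52
#> 14    0   1   2  52
#> 15    1   1   2  17
#> 16    2   1   2  16
#> 17    0   2   2  51
#> 18    1   2   2  14
#> 19    2   2   2  23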
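
The coded rows can be turned into the readable labels used in the question's table. A minimal sketch, assuming the coding 0 = any value, 1 = Low, 2 = High; label_combo is a hypothetical helper name introduced here, not part of the answer.

```r
# label_combo() is a hypothetical helper for illustration.
# codes: one coded combination, e.g. c(2, 0, 1)
# vars:  the variable names, in the same order as the codes
label_combo <- function(codes, vars) {
  labs <- c("Low", "High")   # label for code 1 and code 2
  keep <- codes != 0         # 0 means the variable is unconstrained
  paste(vars[keep], labs[codes[keep]], collapse = ", ")
}

label_combo(c(2, 0, 1), c("PTSD", "ANX", "DEP"))
#> [1] "PTSD High, DEP Low"
```

Applying it to every row of the combination matrix, for example with apply(mCombos, 1, label_combo, vars = names(df)), would give one label per count.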