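Suppose I work in psychology and want to know how many risk factors each patient has. After that, I want to list all of a patient's risks and then find the most prevalent risk (the mode). I am thinking of using mutate and then paste0 to get the column name wherever the row's value is "risk". However, I am struggling with this.
Any help is appreciated.
The code is below: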
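library(tidyverse)
df = data.frame(
  patient = 1:60,
  # each two-element vector is recycled to fill all 60 rows
  cancer = c("risk","ok"),
  blood_pres = c("risk", "ok"),
  low_education = c("risk","ok")
)
# count, per row, how many columns equal "risk"
df = df %>% mutate(how_many_risks = rowSums(. == "risk"))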
Let's come up with some more interesting data.
set.seed(43)
df <- data.frame(
  patient = 1:10,
  cancer = sample(c("risk","ok"), size=10, replace=TRUE),
  blood_pres = sample(c("risk","ok"), size=10, replace=TRUE),
  low_education = sample(c("risk","ok"), size=10, replace=TRUE)
)
df
# patient cancer blood_pres low_education
# 1 1 ok risk risk
# 2 2 ok risk risk
# 3 3 ok ok ok
# 4 4 risk risk risk
# 5 5 ok ok risk
# 6 6 risk risk ok
# 7 7 ok ok ok
# 8 8 ok risk ok
# 9 9 ok ok ok
# 10 10 risk risk risk
From here, we pivot to long format, summarize, and then join back onto the original data.
library(dplyr)
library(tidyr) # pivot_*
df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(risks = toString(name), .by = patient) %>%
  left_join(df, ., by = "patient")
# patient cancer blood_pres low_education risks
# 1 1 ok risk risk blood_pres, low_education
# 2 2 ok risk risk blood_pres, low_education
# 3 3 ok ok ok <NA>
# 4 4 risk risk risk cancer, blood_pres, low_education
# 5 5 ok ok risk low_education
# 6 6 risk risk ok cancer, blood_pres
# 7 7 ok ok ok <NA>
# 8 8 ok risk ok blood_pres
# 9 9 ok ok ok <NA>
# 10 10 risk risk risk cancer, blood_pres, low_education
(Note that .by= requires dplyr_1.1.0 or later. If you have an older dplyr and will not update, add group_by(patient) before summarize instead of passing .by=patient.)
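For readers on an older dplyr, the group_by(patient) variant mentioned in the note might look like the sketch below. The four-row data frame is hand-written for illustration (hypothetical values, not the seeded data above), and the intermediate name risks_by_patient is likewise an assumption.

```r
library(dplyr)
library(tidyr)

# Small hand-written data set (hypothetical values, for illustration only)
df <- data.frame(
  patient       = 1:4,
  cancer        = c("ok", "risk", "ok", "risk"),
  blood_pres    = c("risk", "risk", "ok", "ok"),
  low_education = c("risk", "ok", "ok", "risk")
)

risks_by_patient <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  group_by(patient) %>%                    # takes the place of .by = patient
  summarize(risks = toString(name)) %>%
  ungroup()                                # make sure no grouping lingers

# Patients without any risk are absent from risks_by_patient,
# so the left join leaves NA in their risks column.
res <- left_join(df, risks_by_patient, by = "patient")
res$risks
```

Patient 3 has no risks in this toy data, so its entry in res$risks is NA, matching the <NA> rows in the output above.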
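Something you may want to consider: unless this is only for a presentation table, it is sometimes more advantageous to keep risks as a list-column rather than a comma-delimited string. To do that, just replace toString with list. While it may render the same on the console, a list-column allows things like set operations on it (though normal column/vector operations may not work as you expect):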
out <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(risks = list(name), .by = patient) %>%
  left_join(df, ., by = "patient")
out
# patient cancer blood_pres low_education risks
# 1 1 ok risk risk blood_pres, low_education
# 2 2 ok risk risk blood_pres, low_education
# 3 3 ok ok ok NULL
# 4 4 risk risk risk cancer, blood_pres, low_education
# 5 5 ok ok risk low_education
# 6 6 risk risk ok cancer, blood_pres
# 7 7 ok ok ok NULL
# 8 8 ok risk ok blood_pres
# 9 9 ok ok ok NULL
# 10 10 risk risk risk cancer, blood_pres, low_education
If this data were a tibble (tbl_df), the same data would render as:
tibble(out)
# # A tibble: 10 × 5
# patient cancer blood_pres low_education risks
# <int> <chr> <chr> <chr> <list>
# 1 1 ok risk risk <chr [2]>
# 2 2 ok risk risk <chr [2]>
# 3 3 ok ok ok <NULL>
# 4 4 risk risk risk <chr [3]>
# 5 5 ok ok risk <chr [1]>
# 6 6 risk risk ok <chr [2]>
# 7 7 ok ok ok <NULL>
# 8 8 ok risk ok <chr [1]>
# 9 9 ok ok ok <NULL>
# 10 10 risk risk risk <chr [3]>
We can directly do things like check the length of each row's entry in that column, or do a quick check for exact set membership:
lengths(out$risks)
# [1] 2 2 0 3 1 2 0 1 0 3
sapply(out$risks, `%in%`, x = "cancer")
# [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
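The question also asks for the most common risk (the mode), and a list-column makes that fairly direct. The following is only a sketch: the risks list is hand-written to mimic the shape of out$risks (including a NULL entry for a patient with no risks), and the names risk_counts, most_common and has_both are made up for illustration.

```r
# Hypothetical list-column of risks, one element per patient
risks <- list(
  c("blood_pres", "low_education"),
  NULL,                              # a patient with no risks
  c("cancer", "blood_pres"),
  "blood_pres"
)

# Flatten the list (NULL entries vanish) and tabulate each risk
risk_counts <- sort(table(unlist(risks)), decreasing = TRUE)
risk_counts

# The mode: every risk tied for the highest count
most_common <- names(risk_counts)[risk_counts == max(risk_counts)]
most_common

# Set-style query: which patients have both cancer and blood_pres?
has_both <- sapply(risks, function(r) all(c("cancer", "blood_pres") %in% r))
has_both
```

With the real data, replacing risks by out$risks should give the same kind of result. Keeping ties in most_common is a design choice; take the first element if a single answer is needed.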
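Both of these could of course be done with regular expressions on the comma-separated string, but regex adds some overhead, and it needs extra care if the names are ambiguous (for instance, when one risk name is contained in another).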
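To make the ambiguity concern concrete, here is a small sketch with a hypothetical risk name, skin_cancer, that does not appear in the data above. A plain substring pattern matches it when searching for cancer, whereas exact membership in a list-column element does not.

```r
# Hypothetical risk names in which one name is contained in another
risk_string <- "skin_cancer, blood_pres"        # comma-separated form
risk_vector <- c("skin_cancer", "blood_pres")   # one list-column element

# Substring match: a false positive, because "cancer" occurs inside "skin_cancer"
naive_match <- grepl("cancer", risk_string)

# Exact membership on the vector: no false positive
exact_match <- "cancer" %in% risk_vector

# An anchored pattern also avoids the false positive, at the cost of a
# more involved regex that depends on the ", " separator
anchored      <- "(^|, )cancer(,|$)"
anchored_miss <- grepl(anchored, risk_string)
anchored_hit  <- grepl(anchored, "cancer, blood_pres")

c(naive_match, exact_match, anchored_miss, anchored_hit)
```

The anchored pattern works only as long as the separator stays exactly ", ", which is part of the overhead the list-column avoids.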
The c_across() function is what you are missing. Using your example data:
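library(dplyr)
risk_factors <- c('cancer', 'blood_pres', 'low_education')
df <- df %>%
  rowwise() %>%   # evaluate the mutate() below one row at a time
  mutate(
    # number of risk columns in this row equal to "risk"
    how_many_risks = sum(c_across(all_of(risk_factors)) == "risk"),
    # names of those columns, joined with ";"
    what_risks = paste0(risk_factors[which(c_across(all_of(risk_factors)) == "risk")], collapse = ";")
  ) %>%
  ungroup()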
You can add an extra line of logic to report empty cases as "none" (as in your example):
df2 <- df %>%
  mutate(what_risks = if_else(what_risks == "", "none", what_risks))
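I think a single mutate call is enough to do this (data taken from @r2evans).
Here I do not use rowwise; instead I use sapply to iterate over the rows and find the values that match "risk".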
library(dplyr)
set.seed(43)
df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))
df %>%
  mutate(
    how_many_risks = rowSums(. == "risk"),
    which_risks = ifelse(
      how_many_risks == 0,
      "no risk",
      # for each row, join the names of the columns (patient excluded) that equal "risk"
      sapply(1:nrow(df), \(x) paste(colnames(df[x, -1])[df[x, -1] == "risk"], collapse = ", "))
    )
  )
# patient cancer blood_pres low_education how_many_risks which_risks
# 1 1 ok risk risk 2 blood_pres, low_education
# 2 2 ok risk risk 2 blood_pres, low_education
# 3 3 ok ok ok 0 no risk
# 4 4 risk risk risk 3 cancer, blood_pres, low_education
# 5 5 ok ok risk 1 low_education
# 6 6 risk risk ok 2 cancer, blood_pres
# 7 7 ok ok ok 0 no risk
# 8 8 ok risk ok 1 blood_pres
# 9 9 ok ok ok 0 no risk
# 10 10 risk risk risk 3 cancer, blood_pres, low_education