I want to vectorize this operation:
for (row1 in 1:nrow(full_df)) {
  for (row2 in 1:nrow(icd10_codes)) {
    if (any(full_df[row1, "coding_19"] %in% icd10_codes[row2, "present_icd10"])) {
      full_df[row1, "code_count"] <- full_df[row1, "code_count"] + 1
    }
  }
}
full_df looks like this:
coding_19 code_count
<list> <dbl>
1 H353 0
2 <chr [8]> 0
3 <chr [2]> 0
4 E780 0
and
> head(full_df$coding_19)
[[1]]
[1] "H353"
[[2]]
[1] "B20" "B21" "B22" "B23" "B24" "Z21" "F024" "O987"
[[3]]
[1] "G30" "F00"
[[4]]
[1] "E780"
icd10_codes looks like this. eid is the person's ID, and present_icd10 holds the codes associated with that person.
eid present_icd10
1 1 G30
2 2 E781
3 3 E780
4 4 H401, H409
5 5 H353
6 6 E780
Note that present_icd10 and coding_19 are n-dimensional vectors.
For each row of full_df, I want to count how many people have at least one element of full_df$coding_19 (row-wise) present in their present_icd10.
I tried using this:
full_df <- full_df %>%
rowwise() %>%
mutate(code_count = code_count + as.integer(any(coding_19 %in% icd10_codes$present_icd10)))
But I think this only behaves as if I had a single loop rather than a nested loop.
Based on the limited sample:
library(tidyverse)
full_df %>%
mutate(code_count = map_dbl(coding_19, ~ sum(.x %in% unlist(str_split(pull(icd10_codes, present_icd10), ", ")))))
# A tibble: 4 x 2
coding_19 code_count
<list> <dbl>
1 <chr [1]> 1
2 <chr [8]> 0
3 <chr [2]> 1
4 <chr [1]> 1
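Note that the call above counts how many elements of coding_19 occur anywhere in present_icd10, so a code shared by two people (E780 in the sample) still counts once. If the goal is instead the number of people with at least one matching code, which is what the nested loop seems to aim for, a minimal sketch could look like the following. It assumes present_icd10 is a comma-separated string as in the sample; person_codes and n_people are illustrative names that do not appear in the question.

```r
library(dplyr)
library(purrr)
library(stringr)

# Sample data from the question
full_df <- tibble(
  coding_19 = list(
    "H353",
    c("B20", "B21", "B22", "B23", "B24", "Z21", "F024", "O987"),
    c("G30", "F00"),
    "E780"
  ),
  code_count = c(0, 0, 0, 0)
)
icd10_codes <- tibble(
  eid = c(1, 2, 3, 4, 5, 6),
  present_icd10 = c("G30", "E781", "E780", "H401, H409", "H353", "E780")
)

# Split each person's codes once, outside the per-row step
# (assumption: codes are separated by a comma and optional whitespace)
person_codes <- str_split(icd10_codes$present_icd10, ",\\s*")

# For each row of full_df, count the people who share at least one code
result <- full_df %>%
  mutate(n_people = map_dbl(
    coding_19,
    function(codes) sum(map_lgl(person_codes, function(p) any(codes %in% p)))
  ))

result$n_people
```

On the sample data this should give 1, 0, 1, 2, since E780 is recorded for two different eid values. The inner map_lgl still visits every person for every row, so it removes the explicit loops but not the row-by-person comparison itself.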
Data:
full_df <- structure(list(coding_19 = list("H353", c("B20", "B21", "B22",
"B23", "B24", "Z21", "F024", "O987"), c("G30", "F00"), "E780"),
code_count = c(0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))
icd10_codes <- structure(list(eid = c(1, 2, 3, 4, 5, 6), present_icd10 = c("G30",
"E781", "E780", "H401, H409", "H353", "E780")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
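For readers who prefer to avoid the tidyverse, the same per-row lookup can be sketched in base R. The block rebuilds the sample data as plain data frames so it runs on its own; all_codes is an illustrative name, and splitting on commas is an assumption based on the sample.

```r
# Sample data rebuilt as plain data frames
full_df <- data.frame(code_count = c(0, 0, 0, 0))
full_df$coding_19 <- list(
  "H353",
  c("B20", "B21", "B22", "B23", "B24", "Z21", "F024", "O987"),
  c("G30", "F00"),
  "E780"
)
icd10_codes <- data.frame(
  eid = 1:6,
  present_icd10 = c("G30", "E781", "E780", "H401, H409", "H353", "E780")
)

# Flatten every person's codes into one character vector
# (assumption: codes within a cell are comma-separated)
all_codes <- trimws(unlist(strsplit(icd10_codes$present_icd10, ",")))

# For each row, count how many of its codes appear anywhere in all_codes
full_df$code_count <- vapply(
  full_df$coding_19,
  function(codes) sum(codes %in% all_codes),
  integer(1)
)

full_df$code_count
```

On the sample this should return 1, 0, 1, 1, matching the tibble output shown earlier.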