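As shown in the example below, I have a data.table (call it "current") containing multiple rows for 3 different people, where each row has a date and a categorical event. We can assume I have already sorted it chronologically, as shown below, though that may not matter.
For each row, I want a separate column for each categorical event type, holding the number of times that person has experienced that event up to that point in time (call this output data.table "desired").
I wrote some code that does this with a loop, but given the size of my data (millions of rows), it is clearly not going well... What is the right data manipulation approach that I can apply to the data.table "current" to transform it into the data.table "desired"?
Edit: I should have mentioned that my original dataset contains columns that need to carry over into the final result. I apologize for not including this initially.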
library(data.table)
current <- data.table(
person.id = c(1,2,3,1,2,3),
event = factor(c("categoryA", "categoryA", "categoryD",
"categoryA", "categoryC", "categoryB")),
date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
"2020-12-30", "2021-06-03", "2022-03-22"))
)
desired <- current |>
someManipulation(...)
print(current)
print(desired)
(Output)
person.id event date
1: 1 categoryA 2020-01-01
2: 2 categoryA 2020-03-23
3: 3 categoryD 2020-09-09
4: 1 categoryA 2020-12-30
5: 2 categoryC 2021-06-03
6: 3 categoryB 2022-03-22
person.id event date categoryA categoryB categoryC categoryD
1: 1 categoryA 2020-01-01 1 0 0 0
2: 2 categoryA 2020-03-23 1 0 0 0
3: 3 categoryD 2020-09-09 0 0 0 1
4: 1 categoryA 2020-12-30 2 0 0 0
5: 2 categoryC 2021-06-03 1 0 1 0
6: 3 categoryB 2022-03-22 0 1 0 1
library(data.table)
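# Event levels, sorted so the new columns come out in alphabetical order
categories <- sort(as.character(unique(current$event)))
# For each level, build a 0/1 indicator and take its running sum within each person.
# Note that := adds the columns to `current` by reference.
current[, (categories) := lapply(
  lapply(categories, function(x) +(event == x)),
  cumsum), .(person.id)][]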
#> person.id event date categoryA categoryB categoryC categoryD
#> 1: 1 categoryA 2020-01-01 1 0 0 0
#> 2: 2 categoryA 2020-03-23 1 0 0 0
#> 3: 3 categoryD 2020-09-09 0 0 0 1
#> 4: 1 categoryA 2020-12-30 2 0 0 0
#> 5: 2 categoryC 2021-06-03 1 0 1 0
#> 6: 3 categoryB 2022-03-22 0 1 0 1
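The nested lapply above can also be written as a plain loop over the event levels, which may be easier to read. This is only a sketch of the same idea, not a different method: `lvl` is a hypothetical loop variable, `desired` is the name used in the question, and `copy()` is added on the assumption that you want `current` left untouched. All other columns of the input are carried over as-is.

```r
library(data.table)

current <- data.table(
  person.id = c(1, 2, 3, 1, 2, 3),
  event = factor(c("categoryA", "categoryA", "categoryD",
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

# Work on a copy so that := does not modify `current` by reference
desired <- copy(current)

# One new column per event level, in alphabetical order
categories <- sort(as.character(unique(desired$event)))

for (lvl in categories) {
  # event == lvl is TRUE/FALSE per row; cumsum turns it into a running
  # count, restarted for each person because of by = person.id
  desired[, (lvl) := cumsum(event == lvl), by = person.id]
}

print(desired)
```

The running count follows the row order within each person, so the input is assumed to be sorted by date, as in the question.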
If you prefer dcast:
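# Add a row id so that every input row stays a separate row after casting
result <- dcast(current[, rid := .I],
                rid + person.id + date + event ~ event,
                fun.aggregate = length)[, rid := NULL][]
# The columns created by dcast are those not already present in `current`
cols <- setdiff(names(result), names(current))
# Turn the 0/1 indicators into running counts within each person
result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols][]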
#> person.id date event categoryA categoryB categoryC categoryD
#> 1: 1 2020-01-01 categoryA 1 0 0 0
#> 2: 2 2020-03-23 categoryA 1 0 0 0
#> 3: 3 2020-09-09 categoryD 0 0 0 1
#> 4: 1 2020-12-30 categoryA 2 0 0 0
#> 5: 2 2021-06-03 categoryC 1 0 1 0
#> 6: 3 2022-03-22 categoryB 0 1 0 1
Created on 2024-01-29 with reprex v2.0.2
You can use dcast to one-hot encode the event column, then take the cumsum by person id to get the cumulative count of each event.
result = dcast(current, formula = person.id + date ~ event, fun.aggregate = length)
cols = names(result)[names(result) %like% "category"]
result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]
result
# person.id date categoryA categoryB categoryC categoryD
# 1: 1 2020-01-01 1 0 0 0
# 2: 1 2020-12-30 2 0 0 0
# 3: 2 2020-03-23 1 0 0 0
# 4: 2 2021-06-03 1 0 1 0
# 5: 3 2020-09-09 0 0 0 1
# 6: 3 2022-03-22 0 1 0 1
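The result above drops the event column and reorders the rows by person. Since the question's edit asks for the original columns to be carried over, one possible extension is to join the counts back onto `current`. This is a sketch under the assumption that each (person.id, date) pair occurs only once; `counts` is a hypothetical name, and `value.var = "event"` is added here only to avoid dcast guessing the value column.

```r
library(data.table)

current <- data.table(
  person.id = c(1, 2, 3, 1, 2, 3),
  event = factor(c("categoryA", "categoryA", "categoryD",
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

# One-hot encode the events: one row per person and date, one column per level
counts <- dcast(current, person.id + date ~ event,
                fun.aggregate = length, value.var = "event")

# dcast returns rows sorted by person.id and date, so the running sum
# within each person is chronological
cols <- setdiff(names(counts), c("person.id", "date"))
counts[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]

# Join back onto `current`: the result has one row per row of `current`,
# in the original order, and keeps every column of `current`
desired <- counts[current, on = .(person.id, date)]

# Put the original columns first, followed by the count columns
setcolorder(desired, names(current))
print(desired)
```

If a person can have several events on the same date, the row id approach from the other answer is the safer choice, because it keeps every input row separate.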