Convert a column into separate dummy variables holding cumulative counts by group

Question (votes: 0, answers: 2)

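As the example below shows, I have a data table (call it "current") containing multiple rows for 3 different people; each row has a date and a categorical event. Assume it is already sorted chronologically as shown, although that may not matter.

For every row, I want a separate column for each categorical event type, holding the number of times that person has experienced that event up to and including that point in time (call this output data table "desired").

I wrote some code that does this with a loop, but given the size of my data (millions of rows) it is clearly too slow. What is the right data-manipulation approach to transform the data table "current" into the data table "desired"?

Edit: I should have mentioned that my original dataset contains additional columns that need to be carried through to the final result. Apologies for not including this initially.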

library(data.table)

current <- data.table(
  person.id = c(1,2,3,1,2,3),
  event = factor(c("categoryA", "categoryA", "categoryD", 
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09", 
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

# someManipulation() is a placeholder for the transformation being asked about
desired <- current |>
  someManipulation(...)

print(current)
print(desired)

(Output)

   person.id     event       date
1:         1 categoryA 2020-01-01
2:         2 categoryA 2020-03-23
3:         3 categoryD 2020-09-09
4:         1 categoryA 2020-12-30
5:         2 categoryC 2021-06-03
6:         3 categoryB 2022-03-22

   person.id     event       date categoryA categoryB categoryC categoryD
1:         1 categoryA 2020-01-01         1         0         0         0
2:         2 categoryA 2020-03-23         1         0         0         0
3:         3 categoryD 2020-09-09         0         0         0         1
4:         1 categoryA 2020-12-30         2         0         0         0
5:         2 categoryC 2021-06-03         1         0         1         0
6:         3 categoryB 2022-03-22         0         1         0         1
r dataframe data.table
2 Answers
2 votes
library(data.table)

categories <- sort(as.character(unique(current$event)))

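# +(event == x) is a 0/1 indicator per category; cumsum within each person gives the running count
current[, (categories) := lapply(
  lapply(categories, function(x) +(event == x)),
  cumsum
), by = .(person.id)][]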

#>    person.id     event       date categoryA categoryB categoryC categoryD
#> 1:         1 categoryA 2020-01-01         1         0         0         0
#> 2:         2 categoryA 2020-03-23         1         0         0         0
#> 3:         3 categoryD 2020-09-09         0         0         0         1
#> 4:         1 categoryA 2020-12-30         2         0         0         0
#> 5:         2 categoryC 2021-06-03         1         0         1         0
#> 6:         3 categoryB 2022-03-22         0         1         0         1
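
Because `:=` adds columns in place, any other columns already in the table are carried through automatically, which is what the edit to the question asks for. The following is a minimal sketch of that idea, not the answerer's exact code. It assumes a made-up extra column `note` (not in the question) and works on a `copy()` so that `current` itself is not modified. `cumsum(event == x)` is used as an equivalent shorthand for the indicator-then-cumsum step.

```r
library(data.table)

current <- data.table(
  person.id = c(1, 2, 3, 1, 2, 3),
  event = factor(c("categoryA", "categoryA", "categoryD",
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
                   "2020-12-30", "2021-06-03", "2022-03-22")),
  note = letters[1:6]  # hypothetical extra column, for illustration only
)

# Work on a copy: := would otherwise add the new columns to `current` itself
desired <- copy(current)

categories <- sort(as.character(unique(desired$event)))

# One running count per category, computed separately for each person
desired[, (categories) := lapply(categories, function(x) cumsum(event == x)),
        by = person.id]

print(desired)
```

The rows stay in their original order, so the running counts are only meaningful if the table is already sorted by date within each person, as the question assumes.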

If you prefer dcast:

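# rid keeps one output row per input row; note that := adds rid to `current` by reference
result <- dcast(current[, rid := .I],
                rid + person.id + date + event ~ event,
                fun.aggregate = length)[, rid := NULL][]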

cols <- setdiff(names(result), names(current))

result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols][]

#>    person.id       date     event categoryA categoryB categoryC categoryD
#> 1:         1 2020-01-01 categoryA         1         0         0         0
#> 2:         2 2020-03-23 categoryA         1         0         0         0
#> 3:         3 2020-09-09 categoryD         0         0         0         1
#> 4:         1 2020-12-30 categoryA         2         0         0         0
#> 5:         2 2021-06-03 categoryC         1         0         1         0
#> 6:         3 2022-03-22 categoryB         0         1         0         1
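
One thing to watch with the `setdiff(names(result), names(current))` step: if `current` already contains the category columns (for example because the `:=` snippet was run on it first), the difference comes back empty. The sketch below is one possible variation, not the answerer's code. It takes the column names from the factor levels of `event` instead, and casts a `copy()` so that `rid` is never added to `current`.

```r
library(data.table)

current <- data.table(
  person.id = c(1, 2, 3, 1, 2, 3),
  event = factor(c("categoryA", "categoryA", "categoryD",
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

# Cast a copy so the helper row id does not leak into `current`
wide <- dcast(copy(current)[, rid := .I],
              rid + person.id + date + event ~ event,
              fun.aggregate = length)

# Category columns are named after the levels of `event`
cols <- levels(current$event)

# Turn the 0/1 indicators into running counts per person, then drop the helper id
wide[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]
wide[, rid := NULL]

print(wide)
```

This relies on every level of `event` actually occurring in the data, which holds for the example table.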

Created on 2024-01-29 with reprex v2.0.2


2 votes

You can use dcast to one-hot encode the event column, then apply cumsum by person id to get the cumulative count of each event.

result = dcast(current, formula = person.id + date ~ event, fun.aggregate = length)

cols = names(result)[names(result) %like% "category"]
result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]
result
#    person.id       date categoryA categoryB categoryC categoryD
# 1:         1 2020-01-01         1         0         0         0
# 2:         1 2020-12-30         2         0         0         0
# 3:         2 2020-03-23         1         0         0         0
# 4:         2 2021-06-03         1         0         1         0
# 5:         3 2020-09-09         0         0         0         1
# 6:         3 2022-03-22         0         1         0         1
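
Note that this result is ordered by person and date and no longer contains `event` or any other original column. One way to bring those back, sketched here under the assumption that each (`person.id`, `date`) pair occurs only once (otherwise `dcast` collapses same-day events into a single row), is to join the counts back onto `current`. The names `counts` and `desired` are illustrative.

```r
library(data.table)

current <- data.table(
  person.id = c(1, 2, 3, 1, 2, 3),
  event = factor(c("categoryA", "categoryA", "categoryD",
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09",
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

# One-hot encode, then accumulate within each person (as in the answer above)
counts <- dcast(current, person.id + date ~ event, fun.aggregate = length)
cols <- levels(current$event)
counts[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]

# Join back onto `current`: rows follow the order of `current`,
# and its remaining columns (here `event`) are appended
desired <- counts[current, on = .(person.id, date)]

print(desired)
```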