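I have a data frame with tens of thousands of rows.
I want to flag every row where the Transactions column meets or exceeds a threshold (say 100), plus every subsequent row that occurs within 20 hours of that row and has the same UniqueID. This has to happen each time the threshold is reached, for each UniqueID. Rows that meet the condition are tagged FR; all other rows are tagged NR.
Essentially, I have 3 relevant columns and want to add a fourth column of categorical data named Flagged.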
library(lubridate)
UniqueID <- c(214123, 214123, 214123, 214123, 987556, 987556, 987556, 987556, 987556)
datetime <- ymd_hms("2021-12-5 21:16:00", "2021-12-6 10:16:00", "2021-12-8 08:16:00", "2021-12-30 01:26:00", "2021-12-5 10:33:00", "2021-12-6 08:16:00", "2021-12-6 13:26:00", "2022-01-6 13:26:00", "2022-01-6 13:26:00")
Transactions <- c(100, 30, 20, 110, 30, 105, 50, 20, 140)
df <- data.frame(UniqueID, datetime, Transactions)
df
UniqueID: a unique identifier for each user
datetime: when the transaction occurred
Transactions: the transaction amount
In the example above, rows 1, 2, 4, 6, 7, and 9 should be flagged FR and the remaining rows NR. The final result should look like:
UniqueID <- c(214123, 214123, 214123, 214123, 987556, 987556, 987556, 987556, 987556)
datetime <- ymd_hms("2021-12-5 21:16:00", "2021-12-6 10:16:00", "2021-12-8 08:16:00", "2021-12-30 01:26:00", "2021-12-5 10:33:00", "2021-12-6 08:16:00", "2021-12-6 13:26:00", "2022-01-6 13:26:00", "2022-01-6 13:26:00")
Transactions <- c(100, 30, 20, 110, 30, 105, 50, 20, 140)
Flagged <- c("FR", "FR", "NR", "FR", "NR", "FR", "FR", "NR", "FR")
df <- data.frame(UniqueID, datetime, Transactions, Flagged)
df
library(dplyr)
library(tidyr) # fill
df %>%
  group_by(UniqueID) %>%
  # record the time of each row that reaches the threshold, NA otherwise
  mutate(last100 = if_else(Transactions >= 100, datetime, datetime[NA])) %>%
  # carry the most recent threshold time down within each UniqueID
  fill(last100) %>%
  # rows with no earlier threshold row compare as NA and fall back to "NR"
  mutate(Flagged = coalesce(if_else(difftime(datetime, last100, units = "hours") <= 20, "FR", "NR"), "NR")) %>%
  ungroup() %>%
  select(-last100)
# # A tibble: 9 × 4
#   UniqueID datetime            Transactions Flagged
#      <dbl> <dttm>                     <dbl> <chr>
# 1   214123 2021-12-05 21:16:00          100 FR
# 2   214123 2021-12-06 10:16:00           30 FR
# 3   214123 2021-12-08 08:16:00           20 NR
# 4   214123 2021-12-30 01:26:00          110 FR
# 5   987556 2021-12-05 10:33:00           30 NR
# 6   987556 2021-12-06 08:16:00          105 FR
# 7   987556 2021-12-06 13:26:00           50 FR
# 8   987556 2022-01-06 13:26:00           20 NR
# 9   987556 2022-01-06 13:26:00          140 FR
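The dplyr pipeline above relies on the rows being in chronological order within each UniqueID, because fill() only carries the most recent threshold time downward. As a minimal sketch, assuming the input may arrive unsorted, the same steps can be wrapped in a function that sorts first. The names flag_rows, threshold, window_hours, and last_hit are illustrative and do not come from the original post.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: the function name, argument names, and defaults
# are illustrative only.
flag_rows <- function(data, threshold = 100, window_hours = 20) {
  data %>%
    # fill() depends on chronological order within each UniqueID
    arrange(UniqueID, datetime) %>%
    group_by(UniqueID) %>%
    # time of each threshold row, NA for every other row
    mutate(last_hit = if_else(Transactions >= threshold, datetime, datetime[NA])) %>%
    # carry the most recent threshold time forward within the group
    fill(last_hit) %>%
    # rows with no earlier threshold row compare as NA and become "NR"
    mutate(Flagged = coalesce(
      if_else(difftime(datetime, last_hit, units = "hours") <= window_hours,
              "FR", "NR"),
      "NR"
    )) %>%
    ungroup() %>%
    select(-last_hit)
}
```

Calling flag_rows(df) on the example data should reproduce the Flagged column requested in the question. Note that arrange() returns the rows in sorted order, so the output order can differ from the input order when the input was not already sorted.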
library(data.table)
DT <- as.data.table(df)
# Same idea as the dplyr version: record threshold times, carry the last one
# forward within each UniqueID, then compare each row against it.
DT[Transactions >= 100, last100 := datetime
  ][, last100 := nafill(last100, type = "locf"), by = .(UniqueID)
  ][, Flagged := fcoalesce(
      fifelse(difftime(datetime, last100, units = "hours") <= 20,
              "FR", "NR"),
      "NR")
  ][, last100 := NULL]
#    UniqueID            datetime Transactions Flagged
#       <num>              <POSc>        <num>  <char>
# 1:   214123 2021-12-05 21:16:00          100      FR
# 2:   214123 2021-12-06 10:16:00           30      FR
# 3:   214123 2021-12-08 08:16:00           20      NR
# 4:   214123 2021-12-30 01:26:00          110      FR
# 5:   987556 2021-12-05 10:33:00           30      NR
# 6:   987556 2021-12-06 08:16:00          105      FR
# 7:   987556 2021-12-06 13:26:00           50      FR
# 8:   987556 2022-01-06 13:26:00           20      NR
# 9:   987556 2022-01-06 13:26:00          140      FR
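Both answers use the same technique: find the most recent threshold row per UniqueID and compare each row's time against it. As a rough dependency-free sketch of that idea, the running "last threshold time" can also be computed in base R with ave() and cummax() on the numeric timestamps. This assumes the rows are already in chronological order within each UniqueID. The UTC time zone and the variable names threshold, window_secs, secs, hit_time, and last_hit are assumptions made for illustration.

```r
# Example data from the question, rebuilt with base R only
# (the UTC time zone is an assumption).
df <- data.frame(
  UniqueID = c(214123, 214123, 214123, 214123, 987556, 987556, 987556, 987556, 987556),
  datetime = as.POSIXct(c("2021-12-05 21:16:00", "2021-12-06 10:16:00", "2021-12-08 08:16:00",
                          "2021-12-30 01:26:00", "2021-12-05 10:33:00", "2021-12-06 08:16:00",
                          "2021-12-06 13:26:00", "2022-01-06 13:26:00", "2022-01-06 13:26:00"),
                        tz = "UTC"),
  Transactions = c(100, 30, 20, 110, 30, 105, 50, 20, 140)
)

threshold   <- 100        # illustrative name
window_secs <- 20 * 3600  # 20 hours expressed in seconds

secs <- as.numeric(df$datetime)  # seconds since the epoch

# Timestamp for rows that reach the threshold, -Inf for all other rows
hit_time <- ifelse(df$Transactions >= threshold, secs, -Inf)

# Running maximum per UniqueID = time of the most recent threshold row so far;
# it stays -Inf until the first threshold row for that ID appears
last_hit <- ave(hit_time, df$UniqueID, FUN = cummax)

# Rows before any threshold row have an infinite gap and fall through to "NR"
df$Flagged <- ifelse(secs - last_hit <= window_secs, "FR", "NR")
```

Using -Inf as the placeholder avoids a separate NA-handling step, since the gap to a nonexistent threshold row is infinite and can never fall inside the window.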