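I have a large data file containing many references, each with different dates and quantities. Each row is a transaction with a date and a quantity. I need to find out whether transactions below a threshold are preceded by a larger transaction (in terms of quantity) for the same reference. I have achieved this, but I can't think of a less convoluted way, which I'm sure exists. I would appreciate any hints. Below is a fully reproducible example: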
# load required package
require(data.table)
# make it fully reproducible
set.seed(1)
a <- data.table(
  ref   = sample(LETTERS[1:10], 300, TRUE),
  dates = sample(seq(as.Date("2017-08-01"), as.Date("2017-12-01"), "day"), 300, TRUE),
  qty   = sample(1:500, 300, TRUE)
)
# Compute some intermediate tables
# First one has all records below the threshold (20) with their dates
temp1 <- a[, .(dates, qLess = qty < 20, qty), by = ref][qLess == TRUE,]
# Second one has all records above threshold with minimum dates
temp2 <- a[, .(qGeq = qty >= 20, dates), by = ref][qGeq == TRUE,][, min(dates), by = ref]
# Join both tables on ref, filter those below the threshold and filter the ones that are actually preceded (prec) by a larger order. THIS IS THE EXPECTED RESULT
temp1[temp2, on = "ref"][, prec := V1 < dates][qLess == TRUE,][prec == TRUE,]
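The expected result should contain at least the reference and whether or not it is preceded by a larger transaction, but preferably also the quantity and date (of the below-threshold transaction) and the preceding date (as in the example provided).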
Another approach, using only data.table's non-equi join capability:
setorder(a, ref, dates)
a[qty < 20][a[qty >= 20]
, on = .(ref, dates > dates)
, prev.big.date := i.dates, by = .EACHI][]
which gives:
    ref      dates qty prev.big.date
 1:   A 2017-09-16   5    2017-09-12
 2:   A 2017-09-27  16    2017-09-19
 3:   B 2017-09-17  19    2017-09-16
 4:   B 2017-09-30  19    2017-09-28
 5:   B 2017-10-04   6    2017-10-01
 6:   C 2017-08-14   6    2017-08-12
 7:   C 2017-10-08   1    2017-10-01
 8:   C 2017-10-24  18    2017-10-22
 9:   D 2017-10-20   7    2017-10-18
10:   F 2017-10-20  11    2017-10-11
11:   F 2017-11-23  18    2017-11-22
12:   G 2017-11-15  15    2017-11-12
13:   H 2017-09-30  14    2017-09-28
14:   H 2017-10-05  16    2017-09-28
15:   H 2017-10-29  18    2017-10-26
16:   I 2017-10-27   9    2017-10-19
17:   J 2017-09-23   3    2017-09-17
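To make the mechanics of the join easier to follow, here is a minimal sketch on a tiny table. The table `toy` and its values are made up for illustration and are not taken from the question's data. Because the big orders are sorted by date and `by = .EACHI` processes them one at a time, a later big order overwrites an earlier one, so each small order ends up with the most recent big order that strictly precedes it.

```r
library(data.table)

# Hypothetical toy data: ref A has two big orders, ref B has one
toy <- data.table(
  ref   = c("A", "A", "A", "A", "A", "B", "B"),
  dates = as.Date(c("2017-08-01", "2017-08-05", "2017-08-09", "2017-08-10",
                    "2017-08-12", "2017-08-02", "2017-08-07")),
  qty   = c(5L, 100L, 7L, 200L, 3L, 10L, 50L)
)
setorder(toy, ref, dates)

small <- toy[qty < 20]   # orders below the threshold
big   <- toy[qty >= 20]  # orders at or above the threshold

# For each big order, tag every small order of the same ref with a later date.
# Big orders are visited in date order, so the last assignment wins.
small[big, on = .(ref, dates > dates), prev.big.date := i.dates, by = .EACHI]

print(small)
# prev.big.date is NA, 2017-08-05, 2017-08-10 for ref A and NA for ref B
```

Small orders with no earlier big order keep `NA`, so filtering on `!is.na(prev.big.date)` would leave only the orders that are actually preceded by a larger one.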
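This is fairly simple. We set the key so the table is sorted by ref and dates, flag the "big" orders with a 1, set the preceding big date to NA for small orders and to the order's own date for big orders, and then fill the big-order dates forward. The result contains the most recent big order for each order, or a missing value if there is no previous big order.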
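# sort by ref and dates
setkey(a, ref, dates)
# flag big orders with 1, small orders with 0
a[, is_big := (qty >= 20) + 0L]
# big orders carry their own date; small orders are left as NA
a[is_big == 1, preceding_big_date := dates]
# fill forward within each ref; na.rm = FALSE keeps leading NAs so the length is unchanged
a[, preceding_big_date := zoo::na.locf(preceding_big_date, na.rm = FALSE), by = ref]
# keep only the small orders
new_result <- a[is_big == 0, ]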
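As a quick check of the forward-fill idea, here is a minimal sketch on a tiny table. The table `toy` and its values are made up for illustration, and the example assumes the zoo package is installed. Passing `na.rm = FALSE` to `zoo::na.locf` matters here: by default it drops leading NAs, which would shorten the vector for any ref whose first order is a small one.

```r
library(data.table)

# Hypothetical toy data: ref A has two big orders, ref B has one
toy <- data.table(
  ref   = c("A", "A", "A", "A", "A", "B", "B"),
  dates = as.Date(c("2017-08-01", "2017-08-05", "2017-08-09", "2017-08-10",
                    "2017-08-12", "2017-08-02", "2017-08-07")),
  qty   = c(5L, 100L, 7L, 200L, 3L, 10L, 50L)
)
setkey(toy, ref, dates)

# Start with NA everywhere, then let big orders carry their own date
toy[, big_date := as.Date(NA)]
toy[qty >= 20, big_date := dates]

# Carry the last big-order date forward within each ref, keeping leading NAs
toy[, big_date := zoo::na.locf(big_date, na.rm = FALSE), by = ref]

small_orders <- toy[qty < 20]
print(small_orders)
# big_date is NA, 2017-08-05, 2017-08-10 for ref A and NA for ref B
```

Note that a small order placed on the same date as a big order may or may not pick up that date, depending on how the tie is ordered within the key.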