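Given a large (> 100 MB) data frame of events, each with a location and timestamps, how can I remove, with reasonable performance, all events that occur simultaneously at every location (i.e. presumed noise) in R, MATLAB, or Python?
A minimal specification of the problem in R: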
pixel <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
start <- c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7)
end <- c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)
events <- data.frame(cbind(pixel, start, end))
# there was an event between 1 and 2s detected everywhere;
# this event would therefore be removed in the desired output:
#
# pixel start end
# 1 3 4
# 1 6 7
# 1 8 9
# 2 3 4
# 2 5 6
# 2 7 8
# 2 8 9
# 3 4 5
# 3 7 9
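I have tried to solve this with loops, but that is very slow. (Experts sometimes recommend "vectorizing" the computation, but I could not find a way to get rid of the loops.)
I also found a related post for Python: Pandas Data Frame - Remove Overlapping Intervals.
It seems to me that this should be a common problem that some package has already solved, but I could not find one.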
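I think your expected output is incomplete and has more rows than it should. That is, all three `pixel` values have an event covering 1, 2 and another covering 8, 9, so we should remove two rows from each `pixel`.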
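Here is a `data.table` solution. Note that since we want the comparison to be right-open (i.e. 1, 3 does not overlap with 3, 4), I temporarily reduce `end` by a tiny amount, set the key (required by `foverlaps`), check for overlaps, and then add back the iota I subtracted.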
library(data.table)
events <- data.table(pixel, start, end)
# subtract an iota from `end`, needed for right-side-open
iota <- 1e-9
events[, end := end - iota]
setkey(events, start, end)
events[, overlaps := foverlaps(.SD, events)[, uniqueN(pixel), by = c("i.start", "i.end")]$V1, by = .(pixel)]
# Key: <start, end>
# pixel start end overlaps
# <num> <num> <num> <int>
# 1: 1 1 2 3
# 2: 2 1 2 3
# 3: 3 1 3 3
# 4: 1 3 4 2
# 5: 2 3 4 2
# 6: 3 4 5 1
# 7: 2 5 6 1
# 8: 1 6 7 1
# 9: 2 7 8 2
# 10: 3 7 9 3
# 11: 1 8 9 3
# 12: 2 8 9 3
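As a side note on the `iota` adjustment, here is a minimal sketch of why it matters. It assumes that `foverlaps` treats intervals as closed on both ends, so two intervals that only touch at a shared endpoint count as overlapping. The toy tables `x`, `y`, `x2` and `y2` are illustrative and not part of the data above.

```r
library(data.table)

# Two intervals that merely touch at t = 3
x <- data.table(start = 1, end = 3)
y <- data.table(start = 3, end = 4)
setkey(x, start, end)
setkey(y, start, end)

# Closed-interval comparison: the shared endpoint counts as an overlap
closed <- foverlaps(x, y, type = "any", nomatch = NULL)
nrow(closed)

# Nudging `end` down by a tiny amount makes the comparison act right-open
iota <- 1e-9
x2 <- copy(x)[, end := end - iota]
y2 <- copy(y)[, end := end - iota]
setkey(x2, start, end)
setkey(y2, start, end)
half_open <- foverlaps(x2, y2, type = "any", nomatch = NULL)
nrow(half_open)
```

The value of `iota` has to be smaller than the time resolution of the data; with timestamps recorded more finely than 1e-9 a smaller value (or integer timestamps with `end - 1L`) would be needed.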
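The `overlaps` column now holds the number of unique `pixel` values (including "self") found among the time ranges that overlap each row. When this number equals the total number of unique `pixel` values, the row overlaps with all other groups.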
out <- events[uniqueN(pixel) > overlaps, ][, end := end + iota]
setorder(out, pixel, start, end)
out
# pixel start end overlaps
# <num> <num> <num> <int>
# 1: 1 3 4 2
# 2: 1 6 7 1
# 3: 2 3 4 2
# 4: 2 5 6 1
# 5: 2 7 8 2
# 6: 3 4 5 1
Follow-up proof, row by row:
pixel start end
<num> <num> <num>
1: 1 1 2 # overlaps row 5, 10 ALL3
2: 1 3 4 # overlaps row 6
3: 1 6 7 # no overlaps
4: 1 8 9 # overlaps row 9, 12 ALL3
5: 2 1 2 # overlaps row 1, 10 ALL3
6: 2 3 4 # overlaps row 2
7: 2 5 6 # no overlaps
8: 2 7 8 # overlaps row 12
9: 2 8 9 # overlaps row 4, 12 ALL3
10: 3 1 3 # overlaps row 1, 5 ALL3
11: 3 4 5 # no overlaps
12: 3 7 9 # overlaps row 4, 8, 9 ALL3
The ALL3 rows should be removed (according to my interpretation of your logic).
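The row-by-row check can also be reproduced programmatically. The following is a brute-force sketch in base R, meant only as a cross-check on small inputs: it builds an n x n matrix, so it will not scale to the > 100 MB case. The names `ov`, `n_pixels_hit`, `keep` and `check` are illustrative, and intervals are assumed to be right-open, as above.

```r
pixel <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
start <- c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7)
end   <- c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)

# n x n logical matrix: TRUE where event i and event j overlap,
# treating each interval as right-open, [start, end)
ov <- outer(start, end, "<") & outer(end, start, ">")

# For each event, count the distinct pixels among the events it overlaps
# (every event overlaps itself, so its own pixel is always counted)
n_pixels_hit <- apply(ov, 1, function(hit) length(unique(pixel[hit])))

# Keep only events that do NOT overlap an event in every pixel
keep <- n_pixels_hit < length(unique(pixel))
check <- data.frame(pixel, start, end)[keep, ]
check
```

On the example data this keeps the same six rows as the `data.table` result, which is some reassurance that the `foverlaps` logic does what the interpretation above describes.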