Removing overlapping events from a data table by interval

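Given a large (> 100 MB) data frame of events with locations and timestamps, how can I remove the events that occur simultaneously at all locations (i.e., presumed noise) in R/MATLAB/Python, with reasonable performance?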

A minimal specification of the problem in R is:

pixel <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
start <- c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7)
end   <- c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)

events <- data.frame(cbind(pixel, start, end))

# there was an event between 1 and 2s detected everywhere;
# this event would therefore be removed in the desired output:
#
#  pixel start end
#      1     3   4
#      1     6   7
#      1     8   9
#      2     3   4
#      2     5   6
#      2     7   8
#      2     8   9
#      3     4   5
#      3     7   9

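I have tried to solve this with loops, but that is very slow. (Experts sometimes recommend "vectorizing" the computation, but I found no way to get rid of the loops.)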

In addition, I found a related post about this problem in Python: Pandas Data Frame - Remove Overlapping Intervals.

It seems to me that this should be a common problem that has probably already been solved by a package, but I could not find one.

python r matlab vectorization intervals
1 answer

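I think your expected output is incomplete and has more rows than it should. That is, all three `pixel`s have an event between `1, 2` and between `8, 9`, so we should remove two rows from each `pixel`.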

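Here is a `data.table` solution. Note that since we want the comparison to be right-side open (i.e., `1, 3` does not overlap with `3, 4`), I temporarily reduce `end` by a tiny amount, set the key (required by `foverlaps`), check for overlaps, and then add back the iota I subtracted.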

library(data.table)
events <- data.table(pixel, start, end)

# subtract an iota from `end`, needed for right-side-open
iota <- 1e-9
events[, end := end - iota]
setkey(events, start, end)
events[, overlaps := foverlaps(.SD, events)[, uniqueN(pixel), by = c("i.start", "i.end")]$V1, by = .(pixel)]
# Key: <start, end>
#     pixel start   end overlaps
#     <num> <num> <num>    <int>
#  1:     1     1     2        3
#  2:     2     1     2        3
#  3:     3     1     3        3
#  4:     1     3     4        2
#  5:     2     3     4        2
#  6:     3     4     5        1
#  7:     2     5     6        1
#  8:     1     6     7        1
#  9:     2     7     8        2
# 10:     3     7     9        3
# 11:     1     8     9        3
# 12:     2     8     9        3

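The `overlaps` column now holds the count of unique `pixel` values (including "self") found among the rows whose time ranges overlap the current row. When this number equals the total number of unique `pixel` values, the row overlaps with an event in every other group.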

out <- events[uniqueN(pixel) > overlaps, ][, end := end + iota]
setorder(out, pixel, start, end)
out
#    pixel start   end overlaps
#    <num> <num> <num>    <int>
# 1:     1     3     4        2
# 2:     1     6     7        1
# 3:     2     3     4        2
# 4:     2     5     6        1
# 5:     2     7     8        2
# 6:     3     4     5        1
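
As a cross-check outside of `data.table`, here is a minimal Python sketch of the two ideas used above: the right-open overlap test and the per-row count of distinct `pixel` values. The function names are hypothetical, and the sketch assumes the overlap join treats interval endpoints as inclusive, which is what the iota step implies. It is a brute-force illustration, not a fast implementation.

```python
IOTA = 1e-9  # same tiny offset as in the R code above


def overlaps_closed(s1, e1, s2, e2):
    """True if [s1, e1] and [s2, e2] share a point (endpoints included)."""
    return s1 <= e2 and s2 <= e1


def overlaps_half_open(s1, e1, s2, e2):
    """True if [s1, e1) and [s2, e2) share a point (right side open)."""
    return s1 < e2 and s2 < e1


pixel = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
start = [1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7]
end = [2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9]
events = list(zip(pixel, start, end))


def pixels_overlapping(row, events):
    """Number of distinct pixels (including the row's own) with an overlapping event."""
    _, s, e = row
    return len({p for p, s2, e2 in events if overlaps_half_open(s, e, s2, e2)})


# Touching intervals overlap under a closed test, but not under a right-open one;
# shrinking each `end` by IOTA makes the closed test agree with the right-open one.
print(overlaps_closed(1, 3, 3, 4))                # True
print(overlaps_half_open(1, 3, 3, 4))             # False
print(overlaps_closed(1, 3 - IOTA, 3, 4 - IOTA))  # False

# Counts in the original row order of the question's data.
counts = [pixels_overlapping(row, events) for row in events]
print(counts)  # [3, 2, 1, 3, 3, 2, 1, 2, 3, 3, 1, 3]
```

The counts agree with the `overlaps` column printed above once the rows are matched by `pixel`, `start`, and `end`.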

A follow-up proof, row by row:

    pixel start   end
    <num> <num> <num>
 1:     1     1     2  # overlaps row 5, 10      ALL3
 2:     1     3     4  # overlaps row 6
 3:     1     6     7  # no overlaps
 4:     1     8     9  # overlaps row 9, 12      ALL3
 5:     2     1     2  # overlaps row 1, 10      ALL3
 6:     2     3     4  # overlaps row 2
 7:     2     5     6  # no overlaps
 8:     2     7     8  # overlaps row 12
 9:     2     8     9  # overlaps row 4, 12      ALL3
10:     3     1     3  # overlaps row 1, 5       ALL3
11:     3     4     5  # no overlaps
12:     3     7     9  # overlaps row 4, 8, 9    ALL3

The `ALL3` rows should be removed (according to my interpretation of your logic).
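
Since the question also asks about Python and performance, the same rule (drop a row when it overlaps at least one event in every pixel) can be sketched with the standard library alone. This is not the `data.table` method above but a swapped-in technique: per pixel, the starts are sorted and a running maximum of `end` is kept, so each "does this pixel have an overlapping event?" query becomes one binary search. The function name `drop_events_seen_everywhere` is hypothetical, and the sketch assumes right-open intervals with `start < end`.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def drop_events_seen_everywhere(events):
    """events: iterable of (pixel, start, end), intervals read as [start, end).

    Returns, sorted, the rows that do NOT overlap an event in every pixel.
    Assumes start < end for every row.
    """
    events = list(events)
    by_pixel = defaultdict(list)
    for pixel, start, end in events:
        by_pixel[pixel].append((start, end))

    # Per pixel: sorted starts plus the running maximum of `end` in that order.
    index = {}
    for pixel, spans in by_pixel.items():
        spans.sort()
        starts = [s for s, _ in spans]
        max_end = list(accumulate((e for _, e in spans), max))
        index[pixel] = (starts, max_end)

    def pixel_overlaps(pixel, start, end):
        starts, max_end = index[pixel]
        k = bisect_left(starts, end)  # the first k spans begin before `end`
        # One of them overlaps if the largest end among them lies after `start`.
        return k > 0 and max_end[k - 1] > start

    kept = [
        (pixel, start, end)
        for pixel, start, end in events
        if not all(pixel_overlaps(p, start, end) for p in index)
    ]
    return sorted(kept)


pixel = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
start = [1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7]
end = [2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9]

kept = drop_events_seen_everywhere(zip(pixel, start, end))
print(kept)
# [(1, 3, 4), (1, 6, 7), (2, 3, 4), (2, 5, 6), (2, 7, 8), (3, 4, 5)]
```

On the example data this keeps the same six rows as `out` above. Each row costs one binary search per pixel, so the sketch is likely to suit a modest number of pixels better than a very large one.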
