How do I find the most-overlapped interval in pandas?


Suppose I have four intervals.

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

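The most common overlap is between 1 and 1.1, because that range is contained in a, b, and c (three of the four intervals above).

How can I find it for a given list of N intervals? Ideally, I would like a function that takes the list of 4 intervals above and returns pd.Interval(1, 1.1) as output.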

python pandas intervals intersection overlap
1 Answer

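I think what you want is the interval that is contained in most of the other intervals, because the interval that overlaps the most intervals is (-1,1.1] rather than (1,1.1]. I may be wrong, but this is what I got using pd.IntervalIndex.overlaps: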

import pandas as pd
from itertools import product

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
ls=[a,b,c,d]
indexinter = pd.IntervalIndex(ls)
print(indexinter.overlaps(pd.Interval(1, 1.1)))
print('Interval (1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(1, 1.1))),' of 4')
print('\n')
print(indexinter.overlaps(pd.Interval(-1, 1.1)))
print('Interval (-1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(-1, 1.1))),' of 4')

Output:

[ True  True  True False]
Interval (1,1.1] overlaps  3  of 4


[ True  True  True  True]
Interval (-1,1.1] overlaps  4  of 4
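
To make the difference between "overlaps" and "is contained in" concrete, here is a minimal sketch that counts how many intervals fully contain a candidate. The helper name count_containing is hypothetical (it is not a pandas function), and the sketch assumes every interval is right-closed, which is the pandas default.

```python
import pandas as pd

def count_containing(intervals, candidate):
    """Count how many right-closed intervals fully contain `candidate` (hypothetical helper)."""
    idx = pd.IntervalIndex(intervals)
    # (c_left, c_right] lies inside (left, right] when left <= c_left and c_right <= right.
    contains = (idx.left <= candidate.left) & (candidate.right <= idx.right)
    return int(contains.sum())

ls = [pd.Interval(0, 2), pd.Interval(1, 2), pd.Interval(0, 1.1), pd.Interval(-1, -0.5)]
narrow = count_containing(ls, pd.Interval(1, 1.1))   # contained in a, b and c
wide = count_containing(ls, pd.Interval(-1, 1.1))    # overlaps all four, contained in none
print(narrow, wide)
```

So (-1,1.1] touches every interval, but none of them contains it, whereas (1,1.1] sits inside three of the four.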

Now, to get the interval that is contained in most of the other intervals, you can try one of the following.

First option

import pandas as pd
import portion as P

d = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
a = pd.Interval(-1,-0.5)

def getmaxoverlapwrap(ls, inter):
    # Narrow `inter` by intersecting it with each interval, ignoring empty intersections.
    intervals = [P.openclosed(i.left, i.right) for i in ls]
    for i in intervals[1:]:
        newinter = inter & i
        if not newinter.empty:
            inter = newinter
    return inter

def getmaxoverlap(ls):
    intervals = [P.openclosed(i.left, i.right) for i in ls]
    # Build one candidate per starting interval.
    ls1 = [getmaxoverlapwrap(ls, i) for i in intervals]
    # Count how many original intervals contain each candidate and keep the first best one.
    counts = [sum([(x in s) for s in ls]) for x in ls1]
    return ls1[counts.index(max(counts))]

ls = [a, b, c, d]
print(getmaxoverlap(ls))

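You need to install the portion library (formerly distributed as python-intervals). It provides data structures and operations for intervals in Python 3.5+, for example: support for intervals of any (comparable) objects; closed or open, finite or (semi-)infinite intervals; interval sets (unions of atomic intervals); automatic simplification of intervals; and so on.

Second option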

import pandas as pd
from itertools import product

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

def getmostcontained(ls):
    # Treat every interval as closed so that both endpoints count as contained.
    indexinter = pd.IntervalIndex([pd.Interval(inter.left, inter.right, closed='both') for inter in ls])
    lefts = [i.left for i in ls]
    rights = [i.right for i in ls]
    # Candidates: every (left, right) pair of endpoints with left <= right.
    allcomb = [pd.Interval(l, r) for l, r in set(product(lefts, rights)) if l <= r]
    best = allcomb[0]
    maxi = sum(indexinter.contains(best.left)) + sum(indexinter.contains(best.right))
    for i in allcomb:
        newmaxi = sum(indexinter.contains(i.left)) + sum(indexinter.contains(i.right))
        if newmaxi > maxi:
            # Update the best score too; otherwise a later, worse candidate could win.
            maxi = newmaxi
            best = i

    return best

ls=[a,b,c,d]
print(getmostcontained(ls))

Third option

import pandas as pd

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

EPS = 1e-9  # small offset so an open left endpoint is tested just inside the interval

def getmaxoverlap(ls):
    indexinter = pd.IntervalIndex(ls)
    start = ls[0].left
    end = ls[0].right
    checkstart = sum(indexinter.contains(start + EPS))
    checkend = sum(indexinter.contains(end))

    for i in ls:
        # Keep the left endpoint that lies inside the most intervals.
        nstart = sum(indexinter.contains(i.left + EPS))
        if nstart > checkstart:
            checkstart = nstart
            start = i.left
        # Keep the right endpoint that lies inside the most intervals.
        nend = sum(indexinter.contains(i.right))
        if nend > checkend:
            checkend = nend
            end = i.right

    return pd.Interval(start, end)

ls = [a, b, c, d]
print(getmaxoverlap(ls))

Output of all options:

(1,1.1]
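
For comparison, here is a minimal sketch of a different technique from the three options above: scan the elementary segments between consecutive sorted endpoints and keep the one covered by the most intervals. The function name max_depth_interval is hypothetical, and the sketch assumes all intervals are right-closed (the pandas default). It is quadratic in the number of intervals, which should be fine for short lists.

```python
import pandas as pd

def max_depth_interval(intervals):
    """Return (segment, depth) for the endpoint-to-endpoint segment covered by the most intervals."""
    # Assumption: every interval is right-closed, i.e. (left, right].
    points = sorted({p for iv in intervals for p in (iv.left, iv.right)})
    best, best_depth = None, 0
    for lo, hi in zip(points, points[1:]):
        # (lo, hi] lies inside (left, right] exactly when left <= lo and hi <= right.
        depth = sum(iv.left <= lo and hi <= iv.right for iv in intervals)
        if depth > best_depth:
            best, best_depth = pd.Interval(lo, hi), depth
    return best, best_depth

ls = [pd.Interval(0, 2), pd.Interval(1, 2), pd.Interval(0, 1.1), pd.Interval(-1, -0.5)]
best, depth = max_depth_interval(ls)
print(best, depth)
```

On ties this returns the leftmost segment, and adjacent segments with the same depth are not merged, so treat it as a starting point rather than a finished solution.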

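The first option took 0.025967300000047544 seconds to run the script 100 times and 27.19270740000138 seconds to run it 100000 times. The second option took 0.15769210000144085 seconds for 100 runs and 115.13360900000043 seconds for 100000 runs. The third option took 0.08970480000061798 seconds for 100 runs and 107.52801619999991 seconds for 100000 runs. The difference is hardly noticeable, at least between the second and third options, but these are the timings I got, just so you know and can pick one.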
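
One way timings like these could be measured is with the standard library's timeit module. This is only a sketch: overlap_counts is a hypothetical stand-in workload, so swap in whichever of the functions above you want to time, and expect different numbers on your machine.

```python
import timeit
import pandas as pd

ls = [pd.Interval(0, 2), pd.Interval(1, 2), pd.Interval(0, 1.1), pd.Interval(-1, -0.5)]

def overlap_counts(intervals):
    # Stand-in workload (an assumption): replace with getmaxoverlap or getmostcontained.
    idx = pd.IntervalIndex(intervals)
    return [int(idx.overlaps(iv).sum()) for iv in intervals]

# Total wall-clock time for 100 calls; raise `number` for longer runs.
elapsed = timeit.timeit(lambda: overlap_counts(ls), number=100)
print(f"100 runs: {elapsed:.4f} s")
```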
