Finding overlapping regions in a large Pandas DataFrame

Question (votes: 0, answers: 1)

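I have a pandas DataFrame of genomic regions, each described by its chromosome, start position, and stop position. I am trying to identify overlapping regions within the same chromosome and compile them together with their corresponding labels. I'm not sure whether my approach is correct. I also want an efficient method, because my df is very large (3 million rows), so a for loop is not ideal.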

Here is a sample df and the expected output df:

import pandas as pd

# Sample DataFrame
data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)

Output:

     overlapping_regions              overlapping_labels
0    (chr1:10-20, chr1:15-25)        (label1, label2)
1    (chr1:10-20, chr1:35-55)        (label1, label2)
2    (chr1:15-25, chr1:35-55)        (label2, label2)
3    (chr1:35-55, chr1:45-56)        (label2, label3)
4    (chr1:45-56, chr1:55-60)        (label3, label1)
python pandas dataframe bioinformatics
1 Answer (votes: 0)

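I think the output you posted in the question is wrong. Simply take a look at an interval tree and the start and stop values. If you work through this exercise, you will see that the output you posted does not match: for example, chr1:10-20 and chr1:35-55 do not overlap. I suggest you do the following.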

import pandas as pd
from intervaltree import Interval, IntervalTree

data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)

def find_overlaps(df):
    results = []
    for chromosome, group in df.groupby('chromosome'):
        tree = IntervalTree()
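        # Build one interval tree per chromosome. intervaltree treats
        # intervals as half-open: [start, stop).
        for _, row in group.iterrows():
            tree[row['start']:row['stop']] = (row['hg_38_locs'], row['main_category'])

        # Query the tree with each interval; a hit count of 1 is just the interval itself.
        for interval in tree:
            overlaps = tree.overlap(interval.begin, interval.end)
            if len(overlaps) > 1:
                overlapping_regions = tuple(ov.data[0] for ov in overlaps)
                overlapping_labels = tuple(ov.data[1] for ov in overlaps)
                # Skip groups that were already recorded.
                if (overlapping_regions, overlapping_labels) not in results:
                    results.append((overlapping_regions, overlapping_labels))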

    return pd.DataFrame(results, columns=['overlapping_regions', 'overlapping_labels'])

output_df = find_overlaps(df)
print(output_df)

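This gives the following (row order and the order inside each tuple may vary, because tree.overlap returns a set):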

                    overlapping_regions        overlapping_labels
0              (chr1:35-55, chr1:45-56)          (label2, label3)
1              (chr1:15-25, chr1:10-20)          (label2, label1)
2  (chr1:45-56, chr1:35-55, chr1:55-60)  (label3, label2, label1)
3              (chr1:45-56, chr1:55-60)          (label3, label1)
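The question's expected output lists overlaps as pairs, while the code above groups every interval that overlaps a given query. If pairs are what you want, here is a minimal sketch that avoids the per-row Python loop by sorting each chromosome by start and using numpy.searchsorted. The function name overlapping_pairs is hypothetical, and the sketch assumes half-open intervals [start, stop) with stop > start, the same convention intervaltree uses.

```python
import numpy as np
import pandas as pd


def overlapping_pairs(df):
    """Return one row per pair of overlapping regions on the same chromosome.

    Assumption: intervals are half-open, [start, stop), with stop > start.
    """
    columns = ["overlapping_regions", "overlapping_labels"]
    parts = []
    for _, group in df.groupby("chromosome", sort=False):
        g = group.sort_values("start", kind="stable")
        starts = g["start"].to_numpy()
        stops = g["stop"].to_numpy()
        locs = g["hg_38_locs"].to_numpy()
        labels = g["main_category"].to_numpy()
        n = len(g)

        # After sorting by start, region i overlaps exactly the later
        # regions j whose start is strictly below stop_i.
        last = np.searchsorted(starts, stops, side="left")
        counts = np.maximum(last - np.arange(n) - 1, 0)

        # Expand (i, count_i) into explicit index pairs (i, i+1 .. last_i-1).
        left = np.repeat(np.arange(n), counts)
        first_slot = np.repeat(np.cumsum(counts) - counts, counts)
        right = left + 1 + (np.arange(counts.sum()) - first_slot)

        parts.append(pd.DataFrame({
            columns[0]: list(zip(locs[left], locs[right])),
            columns[1]: list(zip(labels[left], labels[right])),
        }))

    if not parts:
        return pd.DataFrame(columns=columns)
    return pd.concat(parts, ignore_index=True)
```

On the sample data this produces three pairs: (chr1:10-20, chr1:15-25), (chr1:35-55, chr1:45-56) and (chr1:45-56, chr1:55-60). Keep in mind that the number of pairs can grow quadratically when many regions pile up on the same locus, so memory use depends on how dense the overlaps are.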

This should work even for large DataFrames. If you still find it slow, you can use ProcessPoolExecutor from concurrent.futures.
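One way this could look, as a rough sketch rather than a tuned implementation: since overlaps are only sought within a chromosome, each chromosome's rows can be handed to a separate worker. The names pairs_for_group and find_overlaps_parallel are hypothetical, the worker reports pairs of overlapping regions under the half-open [start, stop) convention, and the executor_cls parameter is an assumption added so the same code can be tried with threads.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import pandas as pd


def pairs_for_group(group):
    """Hypothetical worker: overlapping pairs within one chromosome's rows."""
    g = group.sort_values("start", kind="stable")
    cols = ["start", "stop", "hg_38_locs", "main_category"]
    rows = list(g[cols].itertuples(index=False, name=None))
    out = []
    for i in range(len(rows)):
        _, stop_i, loc_i, label_i = rows[i]
        # Rows are sorted by start, so stop scanning at the first
        # region that begins at or after stop_i.
        for j in range(i + 1, len(rows)):
            start_j, _, loc_j, label_j = rows[j]
            if start_j >= stop_i:
                break
            out.append(((loc_i, loc_j), (label_i, label_j)))
    return out


def find_overlaps_parallel(df, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run pairs_for_group on each chromosome in a pool and collect the rows."""
    groups = [g for _, g in df.groupby("chromosome", sort=False)]
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input groups.
        results = [pair for chunk in pool.map(pairs_for_group, groups)
                   for pair in chunk]
    return pd.DataFrame(results,
                        columns=["overlapping_regions", "overlapping_labels"])
```

With the default ProcessPoolExecutor, call find_overlaps_parallel(df) from a script guarded by if __name__ == "__main__": so worker processes can import the module safely. Passing executor_cls=ThreadPoolExecutor is handy for a quick in-process check. Each group is pickled and sent to a worker, so whether this actually beats a single process depends on how the data is spread across chromosomes.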
