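I have a pandas DataFrame of genomic regions, each described by its chromosome, start position, and stop position. I am trying to identify overlapping regions within the same chromosome and compile them together with their corresponding labels. I'm not sure my current approach is correct, and I also need something efficient: the DataFrame is very large (3 million rows), so a for loop is not ideal.
Here is a sample DataFrame and the expected output: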
import pandas as pd

# Sample DataFrame
data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)
Output:
   overlapping_regions         overlapping_labels
0  (chr1:10-20, chr1:15-25)    (label1, label2)
1  (chr1:10-20, chr1:35-55)    (label1, label2)
2  (chr1:15-25, chr1:35-55)    (label2, label2)
3  (chr1:35-55, chr1:45-56)    (label2, label3)
4  (chr1:45-56, chr1:55-60)    (label3, label1)
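I think the output you posted in the question is wrong. Just look at the start and stop values, or build an interval tree: if you work through it, you will find that the output you posted does not match. For example, chr1:10-20 and chr1:35-55 do not overlap at all. I suggest you do the following.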
import pandas as pd
from intervaltree import Interval, IntervalTree

data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)

def find_overlaps(df):
    results = []
    # Regions can only overlap within the same chromosome
    for chromosome, group in df.groupby('chromosome'):
        # One tree per chromosome; each interval carries (region name, label)
        tree = IntervalTree()
        for _, row in group.iterrows():
            tree[row['start']:row['stop']] = (row['hg_38_locs'], row['main_category'])
        for interval in tree:
            # The query always returns the interval itself, so more than
            # one hit means at least one other region overlaps it
            overlaps = tree.overlap(interval.begin, interval.end)
            if len(overlaps) > 1:
                overlapping_regions = tuple(ov.data[0] for ov in overlaps)
                overlapping_labels = tuple(ov.data[1] for ov in overlaps)
                # Skip groups that were already recorded
                if (overlapping_regions, overlapping_labels) not in results:
                    results.append((overlapping_regions, overlapping_labels))
    return pd.DataFrame(results, columns=['overlapping_regions', 'overlapping_labels'])

output_df = find_overlaps(df)
print(output_df)
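This gives the following (the order of rows and of the entries inside each tuple may differ between runs, because tree.overlap returns a set):
   overlapping_regions                     overlapping_labels
0  (chr1:35-55, chr1:45-56)                (label2, label3)
1  (chr1:15-25, chr1:10-20)                (label2, label1)
2  (chr1:45-56, chr1:35-55, chr1:55-60)    (label3, label2, label1)
3  (chr1:45-56, chr1:55-60)                (label3, label1)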
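As a sanity check on the sample, the overlapping pairs can also be listed by brute force. This is only a minimal sketch for small inputs, not something to run on 3 million rows. It assumes half-open intervals [start, stop), which is the convention intervaltree uses, so regions that merely touch (such as chr1:35-55 and chr1:55-60) are not counted. The names regions, overlaps, and pairs are made up for this illustration.

```python
from itertools import combinations

# Sample regions from the question: (name, start, stop, label)
regions = [
    ("chr1:10-20", 10, 20, "label1"),
    ("chr1:15-25", 15, 25, "label2"),
    ("chr1:35-55", 35, 55, "label2"),
    ("chr1:45-56", 45, 56, "label3"),
    ("chr1:55-60", 55, 60, "label1"),
]

def overlaps(a, b):
    # Assumption: half-open intervals [start, stop), so touching
    # endpoints do not count as an overlap
    return a[1] < b[2] and b[1] < a[2]

# Compare every pair once; fine for a handful of rows, O(n^2) in general
pairs = [(a[0], b[0]) for a, b in combinations(regions, 2) if overlaps(a, b)]
for pair in pairs:
    print(pair)
```

Only three pairs come out of this check: (chr1:10-20, chr1:15-25), (chr1:35-55, chr1:45-56), and (chr1:45-56, chr1:55-60). The pairs with chr1:35-55 in rows 1 and 2 of the question's expected output are not among them.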
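The interval-tree approach should also work for large DataFrames. If you still find it slow, you can use ProcessPoolExecutor from concurrent.futures, for example to process each chromosome in a separate process.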
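One way the ProcessPoolExecutor idea could look is sketched below, with one task per chromosome. To keep the sketch free of third-party dependencies, the per-chromosome worker uses a plain sorted sweep rather than intervaltree, so it is a swapped-in technique and not the code above. The function names overlaps_in_chromosome and find_overlaps_parallel, and the (chromosome, start, stop, name) record layout, are assumptions made for this example. Intervals are again treated as half-open and non-empty.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def overlaps_in_chromosome(rows):
    """Return overlapping (name, name) pairs for one chromosome.

    rows: list of (start, stop, name) tuples, treated as half-open
    intervals [start, stop) with start < stop (assumption).
    """
    rows = sorted(rows)  # sort by start, then stop
    found = []
    for i in range(len(rows)):
        stop_i = rows[i][1]
        j = i + 1
        # Rows are sorted by start, so once a later row starts at or
        # after stop_i, no further row can overlap row i
        while j < len(rows) and rows[j][0] < stop_i:
            found.append((rows[i][2], rows[j][2]))
            j += 1
    return found

def find_overlaps_parallel(records, max_workers=None):
    """Group (chromosome, start, stop, name) records by chromosome
    and process each group in a separate worker process."""
    by_chrom = defaultdict(list)
    for chrom, start, stop, name in records:
        by_chrom[chrom].append((start, stop, name))
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys
        results = pool.map(overlaps_in_chromosome, by_chrom.values())
        return {chrom: pairs for chrom, pairs in zip(by_chrom, results)}
```

Call find_overlaps_parallel(records) from under an if __name__ == "__main__": guard, since worker processes may re-import the main module on some platforms. Splitting by chromosome only helps when the work is spread over several chromosomes, and the cost of sending rows to the workers can outweigh the gain on small inputs.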