How to find region overlaps between two dfs


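I have one df with chr and position, and another with chr, start, and end. I want to find all regions in df1 that match on chr and contain a position from df0. See the example below for clarity. FYI, my dfs are very large.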

df0:

chr     position 
1         33
1         100
1         400 


df1:  
chr      start        end      label
1         30          40       dog
1         90          110      dog
1         85          200      cat

This is what I want:

final_df:  

chr     position   matched_start    matched_end    label
1         33             30          40            dog
1         100            90          110           dog
1         100            85          200           cat

python pandas dataframe bioinformatics
1 Answer

You can first merge the dfs on 'chr', then apply the matching logic to the merged dataframe. You can try the following:

import pandas as pd

# Example dataframes
df0 = pd.DataFrame({'chr': [1, 1, 1], 'position': [33, 100, 400]})
df1 = pd.DataFrame({'chr': [1, 1, 1], 'start': [30, 90, 85], 'end': [40, 110, 200], 'label': ['dog', 'dog', 'cat']})

# Helper: return the region bounds if pos lies inside [start, end] (inclusive),
# otherwise a pair of missing values
def find_overlap(pos, start, end):
    if start <= pos <= end:
        return start, end
    return pd.NA, pd.NA

# Pair every position with every region on the same 'chr'
merged_df = pd.merge(df0, df1, on='chr', how='inner')

# Apply the helper function to each row and expand the result into two new columns
merged_df[['matched_start', 'matched_end']] = merged_df.apply(lambda row: find_overlap(row['position'], row['start'], row['end']), axis=1, result_type='expand')

# Drop rows where the position fell outside the region
final_df = merged_df.dropna(subset=['matched_start', 'matched_end'])

# Select and order the output columns
final_df = final_df[['chr', 'position', 'matched_start', 'matched_end', 'label']]

print(final_df)

The output of the above code is as follows:

   chr  position matched_start matched_end label
0    1        33            30          40   dog
4    1       100            90         110   dog
5    1       100            85         200   cat
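
Since the question mentions large dfs, the row-by-row apply may become slow. The following is a minimal sketch of the same merge-then-filter idea using a vectorized boolean mask instead. The helper name match_positions is hypothetical, and the sketch assumes inclusive bounds as in the code above. Note that the merge still creates every position/region pair per chr, so memory use may remain a concern for very large inputs.

```python
import pandas as pd

# Example dataframes from the question
df0 = pd.DataFrame({"chr": [1, 1, 1], "position": [33, 100, 400]})
df1 = pd.DataFrame({
    "chr": [1, 1, 1],
    "start": [30, 90, 85],
    "end": [40, 110, 200],
    "label": ["dog", "dog", "cat"],
})


def match_positions(positions, regions):
    """Hypothetical helper: keep (position, region) pairs where the
    position lies inside [start, end] on the same chr."""
    # Pair every position with every region on the same chr
    merged = positions.merge(regions, on="chr", how="inner")
    # Vectorized inclusive range check, evaluated for all rows at once
    inside = merged["position"].between(merged["start"], merged["end"])
    return (
        merged.loc[inside]
        .rename(columns={"start": "matched_start", "end": "matched_end"})
        [["chr", "position", "matched_start", "matched_end", "label"]]
        .reset_index(drop=True)
    )


final_df = match_positions(df0, df1)
print(final_df)
```

Because no missing values are introduced, matched_start and matched_end keep their integer dtype here, and the position 400 row is simply filtered out.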