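I have one df with chr and position, and another with chr, start, and end. I want to find all regions in df1 whose chr matches and whose range overlaps a position in df0. See the example below for clarity. FYI, my dfs are very large.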
df0:
chr position
1 33
1 100
1 400
df1:
chr start end label
1 30 40 dog
1 90 110 dog
1 85 200 cat
This is what I want:
final_df:
chr position matched_start matched_end label
1 33 30 40 dog
1 100 90 110 dog
1 100 85 200 cat
You can first merge the dfs on chr and then apply the matching logic to the merged dataframe. You could try the following:
import pandas as pd
# Example dataframes
df0 = pd.DataFrame({'chr': [1, 1, 1], 'position': [33, 100, 400]})
df1 = pd.DataFrame({'chr': [1, 1, 1], 'start': [30, 90, 85], 'end': [40, 110, 200], 'label': ['dog', 'dog', 'cat']})
# Helper: return the region bounds if the position lies inside the region (inclusive), otherwise missing values
def find_overlap(pos, start, end):
    if start <= pos <= end:
        return start, end
    return pd.NA, pd.NA
# Merge the dataframes on 'chr' (pairs every position with every region on the same chr)
merged_df = pd.merge(df0, df1, on='chr', how='outer')
# Apply the helper function to each row and expand the returned pair into two new columns
merged_df[['matched_start', 'matched_end']] = merged_df.apply(lambda row: find_overlap(row['position'], row['start'], row['end']), axis=1, result_type='expand')
# Drop rows where no overlap was found
final_df = merged_df.dropna(subset=['matched_start', 'matched_end'])
# Keep only the columns of interest, in the desired order
final_df = final_df[['chr', 'position', 'matched_start', 'matched_end', 'label']]
print(final_df)
The output of the code above is:
chr position matched_start matched_end label
0 1 33 30 40 dog
4 1 100 90 110 dog
5 1 100 85 200 cat
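Since the question mentions that the dfs are large, the row-wise apply above may be slow. Below is a minimal sketch of the same merge-then-filter idea using a vectorized boolean mask, processed one chr at a time so the intermediate merge stays smaller. The helper name match_positions is hypothetical (it is not from the question or from pandas), and the sketch assumes inclusive bounds as in find_overlap. It still pairs every position with every region on the same chr, so it may not be enough for very dense data.

```python
import pandas as pd

# Example dataframes from the question
df0 = pd.DataFrame({'chr': [1, 1, 1], 'position': [33, 100, 400]})
df1 = pd.DataFrame({'chr': [1, 1, 1], 'start': [30, 90, 85], 'end': [40, 110, 200], 'label': ['dog', 'dog', 'cat']})

def match_positions(positions, regions):
    """Hypothetical helper: return every (position, region) pair on the same
    chr where start <= position <= end."""
    columns = ['chr', 'position', 'matched_start', 'matched_end', 'label']
    pieces = []
    # Handle one chr at a time to limit the size of each intermediate merge
    for chrom, pos_group in positions.groupby('chr'):
        reg_group = regions[regions['chr'] == chrom]
        if reg_group.empty:
            continue
        merged = pos_group.merge(reg_group, on='chr', how='inner')
        # Vectorized inclusive overlap test instead of a row-wise apply
        mask = (merged['position'] >= merged['start']) & (merged['position'] <= merged['end'])
        pieces.append(merged[mask])
    if not pieces:
        return pd.DataFrame(columns=columns)
    result = pd.concat(pieces, ignore_index=True)
    result = result.rename(columns={'start': 'matched_start', 'end': 'matched_end'})
    return result[columns]

final_df = match_positions(df0, df1)
print(final_df)
```

On the example data this should select the same three position/region pairs as the apply-based version, with a fresh index because of ignore_index=True.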