Merging two pandas DataFrames while preserving rows


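I am working with two different DataFrames, AlphaDF and BetaDF, each containing unique columns. My goal is to merge the data from AlphaDF and BetaDF, keeping every row of AlphaDF and adding the corresponding BetaDF data wherever there is a match.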

This is the script I wrote initially:

combined_rows = []
for index, alpha_row in AlphaDF.iterrows():
    match = False
    for b_index, beta_row in BetaDF.iterrows():
        if alpha_row["alpha_id"] == beta_row["beta_id"] and alpha_row['start_point'] >= beta_row['range_start'] and alpha_row['end_point'] <= beta_row['range_end']:
            match = True
            combined_rows.append(alpha_row.tolist() + beta_row.tolist())
            break

    if not match:
        combined_rows.append(alpha_row.tolist() + [np.nan] * len(BetaDF.columns))

merged_dataframe = pd.DataFrame(combined_rows, columns=AlphaDF.columns.tolist() + BetaDF.columns.tolist())

To optimize this, I tried vectorized operations in pandas:

BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')

condition = (merged_dataframe['start_point'] >= merged_dataframe['range_start']) & (merged_dataframe['end_point'] <= merged_dataframe['range_end'])
merged_dataframe = merged_dataframe[condition]

merged_dataframe.fillna(np.nan, inplace=True)

However, this approach does not keep every row of AlphaDF. Here is a sample of the DataFrames:

AlphaDF:

alpha_id, start_point, end_point, reference, score, strand, feature_count, label, value1, value2, value3
A123, 1000, 1050, ref:A123:1000-1050, 0.75, +, 2, primary, NaN, NaN, NaN
A124, 1070, 1100, ref:A124:1070-1100, 0.80, -, 1, secondary, NaN, NaN, NaN
...

BetaDF:

gene_id, alpha_id, range_start, range_end, orientation, is_partial, gene_type, data_version, data_source, gene_function, significance
101, A123, 950, 1075, forward, 0, coding, v1, SourceA, FunctionX, 0.05
102, A124, 1060, 1120, reverse, 1, non-coding, v1, SourceB, FunctionY, 0.03
...

Any suggestions for a more efficient and accurate way to perform this merge with pandas operations?

pandas vectorization
1 Answer

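This happens because merged_dataframe[condition] keeps only the rows where the condition is True, so it no longer contains every AlphaDF row. Rows whose range check fails, and rows that found no BetaDF match at all (comparisons against NaN evaluate to False), are dropped.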
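To make the row loss concrete, here is a minimal sketch on a made-up two-row frame (the values are hypothetical, not the asker's data). The second row stands for an AlphaDF row with no BetaDF match after a left merge, so its range_start is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical result of a left merge: the second row had no BetaDF match.
merged = pd.DataFrame({
    "start_point": [1000, 2000],
    "range_start": [950.0, np.nan],
})

# Any comparison against NaN is False, so the unmatched row fails the test.
condition = merged["start_point"] >= merged["range_start"]

# Boolean indexing keeps only the True rows, so the unmatched row disappears.
filtered = merged[condition]
print(len(merged), len(filtered))
```

Filtering therefore turns the left merge back into something closer to an inner merge.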

You can first merge AlphaDF and BetaDF, then use the negated condition to set all BetaDF columns to null on the rows that do not match.

import pandas as pd
import numpy as np

pd.set_option('display.max_rows', None)  
pd.options.display.expand_frame_repr = False

BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')


condition = ((merged_dataframe['start_point'] >= merged_dataframe['range_start']) &
             (merged_dataframe['end_point'] <= merged_dataframe['range_end']))

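# Blank out only the BetaDF-derived columns (everything except the join key).
# Slicing from BetaDF.columns[0] would also wipe AlphaDF columns whenever
# the join key happens to be BetaDF's first column.
beta_cols = [c for c in BetaDF.columns if c != 'alpha_id']
merged_dataframe.loc[~condition, beta_cols] = np.nan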

print(merged_dataframe)
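One caveat with masking after a left merge: if an alpha_id appears several times in BetaDF, the merge produces one row per pair, whereas the original loop stops at the first match. The sketch below is one possible alternative, not the answerer's method. It assumes the two frames share no column names, and the toy data, the helper name left_range_merge and the temporary row_id column are all hypothetical. It filters an inner merge, keeps one match per AlphaDF row, and then re-attaches the result to AlphaDF.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, loosely modelled on the question's sample rows.
AlphaDF = pd.DataFrame({
    "alpha_id": ["A123", "A124", "A125"],
    "start_point": [1000, 1070, 2000],
    "end_point": [1050, 1100, 2050],
})
BetaDF = pd.DataFrame({
    "gene_id": [101, 102, 103, 104],
    "beta_id": ["A123", "A124", "A124", "A125"],
    "range_start": [950, 0, 1060, 3000],
    "range_end": [1075, 10, 1120, 3100],
})


def left_range_merge(alpha, beta):
    """Keep every row of alpha; attach one beta row whose range contains it."""
    # Tag each alpha row so it can be re-attached after filtering.
    alpha = alpha.assign(row_id=np.arange(len(alpha)))

    # All (alpha, beta) pairs that share an id.
    pairs = alpha.merge(beta, how="inner", left_on="alpha_id", right_on="beta_id")

    # Keep the pairs whose alpha interval lies inside the beta range.
    inside = ((pairs["start_point"] >= pairs["range_start"]) &
              (pairs["end_point"] <= pairs["range_end"]))

    # At most one surviving match per alpha row, similar to the loop's `break`.
    matched = pairs.loc[inside].drop_duplicates("row_id")

    # Left-merge back so unmatched alpha rows survive with NaN beta columns.
    beta_part = matched[["row_id"] + list(beta.columns)]
    return alpha.merge(beta_part, how="left", on="row_id").drop(columns="row_id")


result = left_range_merge(AlphaDF, BetaDF)
print(result)
```

Because no NaN is assigned into existing columns, the missing values come only from the final left merge. The intermediate inner merge can still grow large when ids repeat often.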

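Alternatively, if both DataFrames have the same number of rows and are aligned row by row, you can create a "bl" column running from 0 to len(AlphaDF) - 1. Set BetaDF['bl'] to null wherever the condition does not hold, then merge on that column. Note that this compares row i of AlphaDF only with row i of BetaDF and does not check the ids.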

AlphaDF['bl'] = np.arange(len(AlphaDF), dtype=float)
condition = ((AlphaDF['start_point'] >= BetaDF['range_start']) &
             (AlphaDF['end_point'] <= BetaDF['range_end']))

BetaDF['bl'] = np.arange(len(BetaDF), dtype=float)
BetaDF.loc[~condition, 'bl'] = np.nan

merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='bl')
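A small sketch of this row-aligned variant on made-up data (the frames below are hypothetical). It assumes the two frames really are in matching order. Comparing two Series with different index labels raises a ValueError, so the sketch resets both indexes first, and it works on copies rather than adding "bl" to the original frames.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same length but different index labels.
AlphaDF = pd.DataFrame(
    {"alpha_id": ["A123", "A124"],
     "start_point": [1000, 1070],
     "end_point": [1050, 1100]},
    index=[10, 11],
)
BetaDF = pd.DataFrame(
    {"gene_id": [101, 102],
     "range_start": [950, 1080],
     "range_end": [1075, 1120]},
)

# Drop the labels so that row i of one frame lines up with row i of the other.
alpha = AlphaDF.reset_index(drop=True)
beta = BetaDF.reset_index(drop=True)

condition = ((alpha["start_point"] >= beta["range_start"]) &
             (alpha["end_point"] <= beta["range_end"]))

# Positional key; beta rows that fail the range check get NaN and never match.
alpha["bl"] = np.arange(len(alpha), dtype=float)
beta["bl"] = np.where(condition, alpha["bl"], np.nan)

aligned = alpha.merge(beta, how="left", on="bl").drop(columns="bl")
print(aligned)
```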