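I am working with two different data frames, AlphaDF and BetaDF, each containing unique columns. My goal is to merge the data in AlphaDF with BetaDF, keeping every row of AlphaDF and adding the corresponding BetaDF data wherever there is a match.
This is the script I originally wrote: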
combined_rows = []
for index, alpha_row in AlphaDF.iterrows():
    match = False
    for b_index, beta_row in BetaDF.iterrows():
        if (alpha_row['alpha_id'] == beta_row['beta_id']
                and alpha_row['start_point'] >= beta_row['range_start']
                and alpha_row['end_point'] <= beta_row['range_end']):
            match = True
            combined_rows.append(alpha_row.tolist() + beta_row.tolist())
            break
    if not match:
        combined_rows.append(alpha_row.tolist() + [np.nan] * len(BetaDF.columns))
merged_dataframe = pd.DataFrame(combined_rows, columns=AlphaDF.columns.tolist() + BetaDF.columns.tolist())
To optimize this, I tried vectorized operations in pandas:
BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')
condition = (merged_dataframe['start_point'] >= merged_dataframe['range_start']) & (merged_dataframe['end_point'] <= merged_dataframe['range_end'])
merged_dataframe = merged_dataframe[condition]
merged_dataframe.fillna(np.nan, inplace=True)
However, this approach does not keep every row of AlphaDF. Here is a sample of the data frames:
AlphaDF:
alpha_id, start_point, end_point, reference, score, strand, feature_count, label, value1, value2, value3
A123, 1000, 1050, ref:A123:1000-1050, 0.75, +, 2, primary, NaN, NaN, NaN
A124, 1070, 1100, ref:A124:1070-1100, 0.80, -, 1, secondary, NaN, NaN, NaN
...
BetaDF:
gene_id, alpha_id, range_start, range_end, orientation, is_partial, gene_type, data_version, data_source, gene_function, significance
101, A123, 950, 1075, forward, 0, coding, v1, SourceA, FunctionX, 0.05
102, A124, 1060, 1120, reverse, 1, non-coding, v1, SourceB, FunctionY, 0.03
...
Any suggestions for a more efficient and more accurate way to do this merge with pandas operations?
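This happens because merged_dataframe[condition] keeps only the rows that satisfy the condition, so any AlphaDF row whose merged row fails the range check, or has no BetaDF match at all, is dropped.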
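To see the row loss concretely, here is a small sketch with partly made-up numbers: A125 is a hypothetical row with no BetaDF match, and the range for A124 is altered so that its check fails. Comparisons against NaN evaluate to False, so both of those rows disappear under boolean filtering.

```python
import pandas as pd

# Trimmed columns; A125 and the A124 range are illustrative, not from the question.
alpha = pd.DataFrame({
    "alpha_id": ["A123", "A124", "A125"],
    "start_point": [1000, 1070, 2000],
    "end_point": [1050, 1100, 2050],
})
beta = pd.DataFrame({
    "alpha_id": ["A123", "A124"],
    "range_start": [950, 1080],
    "range_end": [1075, 1120],
})

# The left merge itself keeps all three alpha rows.
merged = alpha.merge(beta, how="left", on="alpha_id")

# A123 passes, A124 fails the range check, A125 compares against NaN.
condition = ((merged["start_point"] >= merged["range_start"]) &
             (merged["end_point"] <= merged["range_end"]))

# Boolean filtering keeps only the rows where condition is True.
filtered = merged[condition]
print(len(alpha), len(merged), len(filtered))  # prints: 3 3 1
```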
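Instead, you can first merge AlphaDF and BetaDF, then use the inverted condition to set all BetaDF columns to null in the rows that fail the check.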
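import pandas as pd
import numpy as np

# Show complete frames when printing
pd.set_option('display.max_rows', None)
pd.options.display.expand_frame_repr = False

# Align the key name, then left-merge so every AlphaDF row is kept
BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')

# True where the AlphaDF interval lies inside the BetaDF range
condition = ((merged_dataframe['start_point'] >= merged_dataframe['range_start']) &
             (merged_dataframe['end_point'] <= merged_dataframe['range_end']))

# Blank the BetaDF columns (all except the shared key) where the check fails
beta_cols = [col for col in BetaDF.columns if col != 'alpha_id']
merged_dataframe.loc[~condition, beta_cols] = np.nan
print(merged_dataframe)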
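One thing to watch with this approach: if an alpha_id appears more than once in BetaDF, the left merge produces several rows for that AlphaDF row. Below is a minimal sketch of one way to get exactly one output row per AlphaDF row, taking the first BetaDF candidate that passes the check, much like the `break` in the original loop. It assumes the key is already named alpha_id in both frames and is the only column name they share. The helper name `left_merge_within_range`, the temporary `_row` column, the A125 row, and gene 103 are all made up for illustration.

```python
import pandas as pd


def left_merge_within_range(alpha, beta, key="alpha_id"):
    """Hypothetical helper: one output row per row of `alpha`.

    Beta columns are filled only when alpha's [start_point, end_point]
    lies inside beta's [range_start, range_end] for the same key.
    Assumes `key` is the only column name the two frames share.
    """
    # Tag every alpha row with its position so it can be found after merging.
    left = alpha.reset_index(drop=True).rename_axis("_row").reset_index()
    # All (alpha row, beta row) pairs that share the key.
    candidates = left.merge(beta, how="inner", on=key)
    inside = ((candidates["start_point"] >= candidates["range_start"]) &
              (candidates["end_point"] <= candidates["range_end"]))
    # Keep pairs that pass the range check, at most one per alpha row.
    hits = candidates[inside].drop_duplicates(subset="_row")
    beta_cols = [col for col in beta.columns if col != key]
    # Left-merge the hits back; alpha rows without a hit get NaN.
    out = left.merge(hits[["_row"] + beta_cols], how="left", on="_row")
    return out.drop(columns="_row")


# Trimmed sample rows from the question, plus illustrative extras:
# A125 has no BetaDF entry, and gene 103 is a second candidate for A124
# whose range does not contain the AlphaDF interval.
AlphaDF = pd.DataFrame({
    "alpha_id": ["A123", "A124", "A125"],
    "start_point": [1000, 1070, 2000],
    "end_point": [1050, 1100, 2050],
})
BetaDF = pd.DataFrame({
    "gene_id": [101, 103, 102],
    "alpha_id": ["A123", "A124", "A124"],
    "range_start": [950, 1080, 1060],
    "range_end": [1075, 1090, 1120],
})

result = left_merge_within_range(AlphaDF, BetaDF)
print(result)
```

Here A123 picks up gene 101, A124 picks up gene 102 (gene 103 fails the range check), and A125 keeps NaN in every BetaDF column.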
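Also, if both data frames have the same number of rows and row i of AlphaDF corresponds to row i of BetaDF, you can create a "bl" column running from 0 to len(AlphaDF) - 1 in each frame. In BetaDF['bl'], set the value to null wherever the condition does not match, then merge on that column.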
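# Position key 0..len(AlphaDF)-1; float so it matches the dtype of the NaN-bearing BetaDF key
AlphaDF['bl'] = np.arange(len(AlphaDF), dtype=float)
# Row-by-row range check (both frames must share the same index)
condition = ((AlphaDF['start_point'] >= BetaDF['range_start']) &
             (AlphaDF['end_point'] <= BetaDF['range_end']))
BetaDF['bl'] = np.arange(len(BetaDF), dtype=float)
# Rows that fail the check get a null key, so the left merge finds no partner for them
BetaDF.loc[~condition, 'bl'] = np.nan
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='bl')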
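For this same-length, row-aligned case, the helper column is not strictly needed. A sketch of an equivalent, under the same assumption that both frames share the same index: select the BetaDF rows that pass the check and join them back onto AlphaDF by index. The end_point of A124 is changed to 1130 here (a made-up value) so that one row fails the check.

```python
import pandas as pd

# Trimmed columns; the A124 end_point is illustrative, not from the question.
alpha = pd.DataFrame({
    "alpha_id": ["A123", "A124"],
    "start_point": [1000, 1070],
    "end_point": [1050, 1130],
})
beta = pd.DataFrame({
    "gene_id": [101, 102],
    "alpha_id": ["A123", "A124"],
    "range_start": [950, 1060],
    "range_end": [1075, 1120],
})

# Row-by-row check; valid only because both frames share the same index.
condition = ((alpha["start_point"] >= beta["range_start"]) &
             (alpha["end_point"] <= beta["range_end"]))

# Keep the beta rows that pass, drop the duplicated key column,
# then left-join on the index so every alpha row survives.
passing = beta.loc[condition].drop(columns="alpha_id")
merged = alpha.join(passing)
print(merged)
```

The second row keeps its AlphaDF values and gets NaN in gene_id, range_start, and range_end.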