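I have a DataFrame containing pairs of rows (Source and Target), and I want to determine whether each pair matches. I need to add a new column that indicates whether the pair passes or fails the match condition.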
Obs | Dataset | Col1 | Col2 | Col3
----------------------------------
1 | Source | A | 10 | X
2 | Target | A | 10 | X
3 | Source | B | 20 | Y
4 | Target | B | 20 | Y
5 | Source | C | 30 | Z
6 | Target | D | 30 | Z
Desired output:
Obs | Dataset | Result | Col1 | Col2 | Col3
--------------------------------------------
1 | Source | Pass | A | 10 | X
2 | Target | | A | 10 | X
3 | Source | Pass | B | 20 | Y
4 | Target | | B | 20 | Y
5 | Source | Fail | C | 30 | Z
6 | Target | | D | 30 | Z
Code:
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill


class ExcelHighlighter:
    def __init__(self, file_path, sheet_name):
        self.file_path = file_path
        self.sheet_name = sheet_name
        self.light_green_fill = PatternFill(start_color='00FF00', end_color='00FF00', fill_type='solid')
        self.light_coral_fill = PatternFill(start_color='FF8080', end_color='FF8080', fill_type='solid')

    def highlight_and_save(self, output_path='output.xlsx'):
        df = pd.read_excel(self.file_path, sheet_name=self.sheet_name)
        with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
            df.to_excel(writer, index=False, sheet_name=self.sheet_name)
            sheet = writer.sheets[self.sheet_name]
            # Rows come in (Source, Target) pairs. Sheet row 1 holds the header,
            # so DataFrame row i is written to sheet row i + 2.
            for row in range(2, df.shape[0] + 1, 2):
                for col in range(3, df.shape[1]):
                    cell_value_source = df.iloc[row - 2, col]
                    cell_value_target = df.iloc[row - 1, col]
                    if cell_value_source == cell_value_target:
                        fill = self.light_green_fill
                    else:
                        fill = self.light_coral_fill
                    # Colour the same column in both rows of the pair
                    sheet.cell(row=row, column=col + 1).fill = fill
                    sheet.cell(row=row + 1, column=col + 1).fill = fill
            # The workbook is written to output_path when the with-block exits


# Usage, assuming the class above is saved as Highlighter.py
from Highlighter import ExcelHighlighter
highlighter = ExcelHighlighter('input.xlsx', 'Sheet1')
highlighter.highlight_and_save()
How can I add "Result" (as the third column) as shown in the desired output?
The solution is to identify all the "Pass" records (i.e. those present in both "Source" and "Target") and to define all the others as "Fail".
# split the input into 2 separate dataframes
source_df = df[df['Dataset'] == 'Source']
target_df = df[df['Dataset'] == 'Target']
# apply pd.merge with an inner join, which keeps only the records that are present in both
pass_df = pd.merge(source_df, target_df, how='inner', on=['Col1', 'Col2', 'Col3'])
All the other records are "Fail".
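
To get from that idea to the actual "Result" column, something along these lines might work. This is only a sketch: the sample frame is rebuilt from the table in the question, and `key_cols`, `merged`, `is_pass` and `result` are names I made up for illustration. It uses a left merge with `indicator=True` instead of the inner merge, so that every Source row keeps a match flag, and then `DataFrame.insert` to place the column third.

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    "Obs": [1, 2, 3, 4, 5, 6],
    "Dataset": ["Source", "Target"] * 3,
    "Col1": ["A", "A", "B", "B", "C", "D"],
    "Col2": [10, 10, 20, 20, 30, 30],
    "Col3": ["X", "X", "Y", "Y", "Z", "Z"],
})

key_cols = ["Col1", "Col2", "Col3"]  # assumed to be the columns that must match

source_df = df[df["Dataset"] == "Source"]
target_df = df[df["Dataset"] == "Target"]

# Left merge keeps one row per Source row (Target keys are de-duplicated first);
# the _merge indicator is "both" when the same key values also exist in Target.
merged = source_df.merge(
    target_df[key_cols].drop_duplicates(),
    how="left",
    on=key_cols,
    indicator=True,
)
is_pass = (merged["_merge"] == "both").to_numpy()

# Blank for Target rows, Pass/Fail for Source rows
result = pd.Series("", index=df.index)
result.loc[source_df.index] = ["Pass" if ok else "Fail" for ok in is_pass]

# Position 2 makes "Result" the third column
df.insert(2, "Result", result)
print(df)
```

Note that this marks a Source row as "Pass" if its values appear in any Target row, not only in the Target row directly below it. If the pairing by position matters, the comparison would have to be done row against next row instead.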