What is the cleanest way to group and conditionally transform data in a Pandas DataFrame?


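I am using pandas (0.25.3) and Python (3.7.4). I am working with a DataFrame like df1 below. Based on the value of the "Pay Code" field in the same DataFrame, I need to conditionally transform the "Hours" and "Wages" fields into the "Gross Hours", "Regular Wages", and "Overtime Wages" fields. I also need to group by "Check Date".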

    import pandas as pd

    df1 = pd.DataFrame(
        {
            "Pay Code": ["1", "4", "OCH", "3", "3"],
            "Check Date": ["2019-01-04", "2019-01-04", "2019-01-04", "2019-01-04", "2019-01-18"],
            "Pay Start Date": ["2018-12-15", "2018-12-15", "2018-12-15", "2018-12-15", "2018-12-29"],
            "Pay End Date": ["2018-12-28", "2018-12-28", "2018-12-28", "2018-12-28", "2019-01-11"],
            "Pay Code Description": ["REGULAR PAY", "HOLIDAY PAY", "ON CALL HOURLY", "VACATION PAY", "VACATION PAY"],
            "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
            "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
            "Gross Hours": ["NaN", "NaN", "NaN", "NaN", "NaN"],
            "Regular Wages": ["NaN", "NaN", "NaN", "NaN", "NaN"],
            "Overtime Wages": ["NaN", "NaN", "NaN", "NaN", "NaN"],
        }
    )

Suppose I have static lists to use as a reference for deciding which column each value should be transformed into.

    GrossHours = ['1', '2', '3']
    RegularWages = ['1', '3', '4']
    OvertimeWages = ['2', 'OCH']

The desired result would be this DataFrame:

    df_result = pd.DataFrame(
        {
            "Check Date": ["2019-01-04", "2019-01-18"],
            "Pay Start Date": ["2018-12-15", "2018-12-29"],
            "Pay End Date": ["2018-12-28", "2019-01-11"],
            "Hours": [232, 19.5],
            "Wages": [2196.53, 530.38],
            "Gross Hours": [64.0, 19.5],
            "Regular Wages": [2133.2, 530.38],
            "Overtime Wages": [63.33, "NaN"],
        }
    )
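What have I tried? I have tried applying a number of lambda functions to df1 to produce the results I need, but I am not sure how to cleanly get those result objects back into the original DataFrame df1. Is the only option to build a bunch of intermediate DataFrames, join or merge them back to the original, and then group by again?

    g1 = df1.groupby(["Check Date"])
    g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())

    Check Date
    2019-01-04    64.0
    2019-01-18    19.5
    dtype: float64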
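For illustration, one way to avoid merging intermediate DataFrames back by hand is to collect the per-category sums as Series indexed by "Check Date" and let `pd.concat` align them. This is only a minimal sketch: the helper `category_sum` is a hypothetical name, not something from the original post, and the sample frame is trimmed to the columns the sketch needs.

```python
import pandas as pd

# Trimmed copy of the sample data (only the columns this sketch needs)
df1 = pd.DataFrame(
    {
        "Pay Code": ["1", "4", "OCH", "3", "3"],
        "Check Date": ["2019-01-04"] * 4 + ["2019-01-18"],
        "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
        "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
    }
)

GrossHours = ["1", "2", "3"]
RegularWages = ["1", "3", "4"]
OvertimeWages = ["2", "OCH"]


def category_sum(df, codes, value_col):
    """Hypothetical helper: sum value_col per Check Date for rows whose Pay Code is in codes."""
    return df.loc[df["Pay Code"].isin(codes)].groupby("Check Date")[value_col].sum()


# The dict keys become column names; the Series are aligned on their
# Check Date index, so a date missing from one category yields NaN there.
combined = pd.concat(
    {
        "Gross Hours": category_sum(df1, GrossHours, "Hours"),
        "Regular Wages": category_sum(df1, RegularWages, "Wages"),
        "Overtime Wages": category_sum(df1, OvertimeWages, "Wages"),
    },
    axis=1,
)
print(combined)
```

Because the alignment happens on the index, no explicit join keys are needed; the result can then be joined to the grouped "Hours" and "Wages" totals if those are wanted as well.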
python pandas dataframe
1 Answer
First, I built a list of tuples to iterate over.

    transformations = [
        ('Gross_Hours', ['1', '2', '3']),
        ('Regular_Wages', ['1', '3', '4']),
        ('Overtime_Wages', ['2', 'OCH']),
    ]

I also defined the structure of the output DataFrame I expect.

    result_dataframe_fields = ['Check Date', 'Pay Start Date', 'Pay End Date', 'Gross Hours', 'Regular Wages', 'Overtime Wages']

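By applying @Datanovice's suggestion to a path similar to the one I was already on, I ended up with something about as clean and readable as I could make it.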

    # Instantiate result dataframe
    df_result = df1.groupby(result_dataframe_fields).sum().reset_index()

    for t_ix, t_list in transformations:
        # Create aggregated set to populate result dataframe
        if t_ix == 'Gross_Hours':
            g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Hours'].agg(temp_col_name='sum')
            g2 = g1.reset_index()
            g2.columns = ['Check Date', t_ix]
        else:
            g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Wages'].agg(temp_col_name='sum')
            g2 = g1.reset_index()
            g2.columns = ['Check Date', t_ix]

        # Handle the .agg() column naming limitation (no spaces on list agg)
        colsg2 = g2.columns
        colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, (str)) else x)
        g2.columns = colsg2

        # Dataframe copy that will update result dataframe
        # Note: update() aligns on the row index, not on 'Check Date'
        update_df = g2.copy()
        df_result.update(update_df)

[Image: result from Jupyter Lab]

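I am still hoping this is not the best answer, because my actual application is much larger than this, and the approach looks rather scary when scaled up to the size of my real code.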
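One possibly more compact variation, offered only as a sketch, is to create each target column by masking its source column with `Series.where` and then run a single `groupby(...).sum(min_count=1)`. The dict-shaped `transformations` mapping and the `work` frame are assumptions made for this illustration and differ from the tuple list used above; the sample data is repeated so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Pay Code": ["1", "4", "OCH", "3", "3"],
        "Check Date": ["2019-01-04"] * 4 + ["2019-01-18"],
        "Pay Start Date": ["2018-12-15"] * 4 + ["2018-12-29"],
        "Pay End Date": ["2018-12-28"] * 4 + ["2019-01-11"],
        "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
        "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
    }
)

# Assumed layout: target column -> (source column, pay codes that feed it)
transformations = {
    "Gross Hours": ("Hours", ["1", "2", "3"]),
    "Regular Wages": ("Wages", ["1", "3", "4"]),
    "Overtime Wages": ("Wages", ["2", "OCH"]),
}

work = df1.copy()
for target, (source, codes) in transformations.items():
    # Keep the source value where the pay code matches, NaN elsewhere
    work[target] = work[source].where(work["Pay Code"].isin(codes))

group_keys = ["Check Date", "Pay Start Date", "Pay End Date"]
value_cols = ["Hours", "Wages"] + list(transformations)

# min_count=1 leaves NaN (rather than 0) for groups with no matching rows
df_result = work.groupby(group_keys)[value_cols].sum(min_count=1).reset_index()
print(df_result)
```

With this layout, adding another derived column only means adding one entry to the mapping, which may scale better than one branch per column; whether it suits a larger real-world frame would still need checking.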
