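I'm using pandas (0.25.3) with Python (3.7.4). I'm working with a DataFrame similar to df1 below. Based on the value of the "Pay Code" field in the same DataFrame, I need to conditionally transform the "Hours" and "Wages" fields into the "Gross Hours", "Regular Wages", and "Overtime Wages" fields. I also need to group by "Check Date".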
import pandas as pd

df1 = pd.DataFrame( {
"Pay Code" : ["1","4","OCH","3","3"],
"Check Date" : ["2019-01-04","2019-01-04","2019-01-04","2019-01-04","2019-01-18"],
"Pay Start Date" : ["2018-12-15","2018-12-15","2018-12-15","2018-12-15","2018-12-29"],
"Pay End Date" : ["2018-12-28","2018-12-28","2018-12-28","2018-12-28","2019-01-11"],
"Pay Code Description" : ["REGULAR PAY","HOLIDAY PAY","ON CALL HOURLY","VACATION PAY","VACATION PAY"],
"Hours" : [46.0,16.0,152.0,18.0,19.5],
"Wages" : [1226.58,426.64,63.33,479.98,530.38],
"Gross Hours" : ["NaN","NaN","NaN","NaN","NaN"],
"Regular Wages" : ["NaN","NaN","NaN","NaN","NaN"],
"Overtime Wages" : ["NaN","NaN","NaN","NaN","NaN"]
} )
Assume I have static lists to use as a reference for deciding which column each value should be transformed into.
GrossHours = ['1','2','3']
RegularWages = ['1','3','4']
OvertimeWages = ['2','OCH']
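To make the role of these lists concrete, here is a small sketch of how they map each pay code to the target columns it feeds. The `reference` dict and the `membership` frame are names I'm introducing only for illustration; they are not part of my actual code.

```python
import pandas as pd

# Pay codes as they appear in df1
pay_codes = pd.Series(["1", "4", "OCH", "3", "3"], name="Pay Code")

# Hypothetical dict form of the three static lists
reference = {
    "Gross Hours": ["1", "2", "3"],
    "Regular Wages": ["1", "3", "4"],
    "Overtime Wages": ["2", "OCH"],
}

# One boolean column per target: True where the row's pay code feeds that target
membership = pd.DataFrame(
    {target: pay_codes.isin(codes) for target, codes in reference.items()}
)
membership.insert(0, "Pay Code", pay_codes)
print(membership)
```

Note that a single pay code can feed more than one target: code '1' counts toward both "Gross Hours" and "Regular Wages", while code '4' counts only toward "Regular Wages".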
The desired result would be this DataFrame:
df_result = pd.DataFrame( {
"Check Date" : ["2019-01-04","2019-01-18"],
"Pay Start Date" : ["2018-12-15","2018-12-29"],
"Pay End Date" : ["2018-12-28","2019-01-11"],
"Hours" : [232,19.5],
"Wages" : [2196.53,530.38],
"Gross Hours" : [64.0,19.5],
"Regular Wages" : [2133.2,530.38],
"Overtime Wages" : [63.33,"NaN"]
} )
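What have I tried? I've applied a number of lambda functions to a groupby of df1, and they give the results I need, but I'm not sure how to get those result objects cleanly back into the original DataFrame. Is the only option to build a bunch of intermediate DataFrames, join or merge them back onto the original, and then group by again?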
g1 = df1.groupby(["Check Date"])
g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())
Check Date
2019-01-04 64.0
2019-01-18 19.5
dtype: float64
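For what it's worth, the per-group Series produced this way can probably be collected into one frame without a chain of merges. The sketch below assumes `pd.concat` with a dict of Series along `axis=1`, which aligns them on their shared "Check Date" index; `filtered_sum` and `summary` are hypothetical names, and df1 is trimmed to the columns the example needs.

```python
import pandas as pd

# Trimmed copy of df1 with only the columns this sketch uses
df1 = pd.DataFrame({
    "Pay Code": ["1", "4", "OCH", "3", "3"],
    "Check Date": ["2019-01-04", "2019-01-04", "2019-01-04", "2019-01-04", "2019-01-18"],
    "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
    "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
})

GrossHours = ["1", "2", "3"]
RegularWages = ["1", "3", "4"]
OvertimeWages = ["2", "OCH"]

def filtered_sum(df, codes, value_col):
    """Sum value_col per Check Date, using only rows whose Pay Code is in codes."""
    return df.loc[df["Pay Code"].isin(codes)].groupby("Check Date")[value_col].sum()

# The dict keys become column names; the Series are aligned on the Check Date index,
# so a date missing from one Series simply gets NaN in that column
summary = (
    pd.concat(
        {
            "Gross Hours": filtered_sum(df1, GrossHours, "Hours"),
            "Regular Wages": filtered_sum(df1, RegularWages, "Wages"),
            "Overtime Wages": filtered_sum(df1, OvertimeWages, "Wages"),
        },
        axis=1,
    )
    .rename_axis("Check Date")
    .reset_index()
)
print(summary)
```

Because alignment happens on the index rather than on row position, a check date with no overtime rows ends up as NaN instead of shifting other values.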
transformations = [('Gross_Hours', ['1','2','3']), ('Regular_Wages', ['1','3','4']), ('Overtime_Wages', ['2','OCH'])]
I also defined the structure of the output dataframe I expect.
result_dataframe_fields = ['Check Date', 'Pay Start Date','Pay End Date','Gross Hours', 'Regular Wages', 'Overtime Wages']
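By applying @Datanovice's suggestion to a path similar to the one I was already on, I ended up with the following, which is as clear and readable as I could make it.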
# Instantiate result dataframe
df_result = df1.groupby(result_dataframe_fields).sum().reset_index()

for t_ix, t_list in transformations:
    # Create aggregated set to populate result dataframe
    if t_ix == 'Gross_Hours':
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Hours'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]
    else:
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Wages'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]

    # Handle the .agg() column naming limitation (no spaces on list agg)
    colsg2 = g2.columns
    colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, (str)) else x)
    g2.columns = colsg2

    # Dataframe copy that will update result dataframe
    update_df = g2.copy()
    df_result.update(update_df)
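[Image: result from Jupyter Lab]

I'm still hoping this isn't the best answer, because my actual application is much larger than this, and the approach looks pretty awful once it is scaled up to my real code.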
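One alternative that might scale better is to mask the source values first and then do a single groupby, rather than aggregating per transformation and patching the result with `update`. This is only a sketch under a few assumptions: the three-element `transformations` tuples (target column, source column, pay codes) are a hypothetical variant of my list above, the placeholder "NaN" columns are left out of df1, and `min_count=1` is used so that a group with no matching rows stays NaN instead of becoming 0.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Pay Code": ["1", "4", "OCH", "3", "3"],
    "Check Date": ["2019-01-04", "2019-01-04", "2019-01-04", "2019-01-04", "2019-01-18"],
    "Pay Start Date": ["2018-12-15", "2018-12-15", "2018-12-15", "2018-12-15", "2018-12-29"],
    "Pay End Date": ["2018-12-28", "2018-12-28", "2018-12-28", "2018-12-28", "2019-01-11"],
    "Hours": [46.0, 16.0, 152.0, 18.0, 19.5],
    "Wages": [1226.58, 426.64, 63.33, 479.98, 530.38],
})

# Hypothetical extended form: (target column, source column, pay codes)
transformations = [
    ("Gross Hours", "Hours", ["1", "2", "3"]),
    ("Regular Wages", "Wages", ["1", "3", "4"]),
    ("Overtime Wages", "Wages", ["2", "OCH"]),
]

key_cols = ["Check Date", "Pay Start Date", "Pay End Date"]
masked = df1[key_cols + ["Hours", "Wages"]].copy()

# Keep the source value where the pay code matches; everything else becomes NaN
for target, source, codes in transformations:
    masked[target] = df1[source].where(df1["Pay Code"].isin(codes))

# One groupby over all columns; min_count=1 leaves all-NaN groups as NaN
df_result = masked.groupby(key_cols, as_index=False).sum(min_count=1)
print(df_result)
```

Adding another derived column would then only mean adding a tuple to the list, and nothing depends on the row order of intermediate frames.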