Can I make my Pandas DataFrame joins more concise?

Question

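I'm trying to create a function that lets a user input a list of over/unders and then joins that list to a series of projection systems to see how they compare (i.e., a player's hits over/under would be compared against the projected hits from six different projection systems). The problem I'm running into is that I don't want my code to repeat the same steps for each over/under type the user selects; I want it to be more concise and versatile.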

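My current logic for the function is that the user can choose from three different inputs (hits, home runs, RBIs). For each input there is an if statement (I've included one in the example below). Within each if statement I need to subset the over/under DataFrame by the selected parameter (hits in the example below), join it to all six projection DataFrames (while selecting only the relevant fields), and then calculate the average of all projection systems for the selected parameter.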

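Is there a way to have my code perform the above steps based on the initial parameter selection (rather than writing everything below two more times), knowing that the fields I join from each projection DataFrame will vary with the parameter (i.e., if the RBI parameter is selected, it would be rbi_pecota_50 instead of h_pecota_50)? I feel like some sort of for loop might work, but I'm not sure how to structure it, or whether that strategy makes the most sense. Sorry for the long-windedness; any suggestions are appreciated.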

if investment_type_sub.lower() == 'hits':
        
        #Select columns (IDs) that will need to join to projections (BPID, IDFANGRAPHS, DAVENPORTID)
        ou_hits_all_projections = ou_hitting_ids[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU']]

        ou_hits_all_projections = ou_hits_all_projections.query('Hits_OU > 0')

        #Join to BP 50th percentile
        #Merge PECOTA percentile datasets to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, pecota_hitting_50, how = 'left', left_on='BPID', right_on='bpid_PECOTA_50')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU','h_PECOTA_50']]

        #Join to BP 99th percentile
        #Merge PECOTA percentile datasets to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, pecota_hitting_99, how = 'left', left_on='BPID', right_on='bpid_PECOTA_99')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU','h_PECOTA_50', 'h_PECOTA_99']]

        #Merge ZIPs dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, zips_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_ZIPs')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs']]

        #Merge Steamer dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, steamer_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_steamer')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer']]

        #Merge the_bat dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, the_bat_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_the_bat')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer', 'H_the_bat']]

        #Merge Davenport dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, davenport_hitting, how = 'left', left_on='DAVENPORTID', right_on='HOWEID_davenport')

        #Keep only the columns that you need for each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer', 'H_the_bat', 'H_davenport']]

        #Calculates the average of all projection systems for the relevant O/U field; TODO: derive the divisor automatically instead of subtracting a hard-coded column count
        ou_hits_all_projections['hit_avg'] = np.where((ou_hits_all_projections.Hits_OU > 0),
                                             ((ou_hits_all_projections.h_PECOTA_50+
                                               ou_hits_all_projections.h_PECOTA_99+
                                               ou_hits_all_projections.H_ZIPs
                                              +ou_hits_all_projections.H_steamer + ou_hits_all_projections.H_the_bat +
                                               ou_hits_all_projections.H_davenport))/
                                          ((len(ou_hits_all_projections.columns)-5)), 0)
        
        #Adds in field taking the % difference between the average and the O/U
        ou_hits_all_projections['hit_avg_diff'] = (ou_hits_all_projections['Hits_OU']-ou_hits_all_projections['hit_avg'])/(ou_hits_all_projections['Hits_OU'])
        
        return ou_hits_all_projections.sort_values('hit_avg_diff', ascending=False).reset_index(drop=True)

Answer 1

Given that any back-and-forth here will be limited, I'm going to make some assumptions.

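Rather than treating the DataFrame (df) as something conditional, look at it from the perspective of how an ETL solution works. Extract the data (your initial df), then work on creating transformations of the original df that satisfy each condition. Depending on which condition is met, you access the appropriate transformation. In short, remove the influence of user input from the processing steps, so that the condition only selects which transformation of the original df produces the result.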
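One way to read this advice is as a config-driven loop: the user's choice only selects an entry in a lookup table, and a single code path performs the merges. The following is a minimal sketch under stated assumptions, not a definitive implementation. `STAT_CONFIG`, `compare_to_projections`, the `systems` list, the `col_template` convention, and the output names `proj_avg` and `avg_diff` are all hypothetical; only the hits-related column names come from the question, and the home run and RBI names are guesses that would need to match the real data.

```python
import pandas as pd

# Hypothetical lookup from the user's choice to the O/U column and the stat
# prefix used in projection column names. Only the "hits" entry is taken from
# the question; the other names are assumed.
STAT_CONFIG = {
    "hits": {"ou_col": "Hits_OU", "prefix": "h"},
    "home runs": {"ou_col": "HR_OU", "prefix": "hr"},
    "rbi": {"ou_col": "RBI_OU", "prefix": "rbi"},
}

ID_COLS = ["BPID", "IDFANGRAPHS", "DAVENPORTID", "Player"]


def compare_to_projections(ou_df, stat, systems):
    """Join one O/U column to every projection system and average them."""
    cfg = STAT_CONFIG[stat.lower()]
    ou_col = cfg["ou_col"]

    # Keep the ID columns plus the selected O/U column, positive lines only.
    out = ou_df.loc[ou_df[ou_col] > 0, ID_COLS + [ou_col]].copy()

    proj_cols = []
    for system in systems:
        # The template handles the h_... vs H_... casing seen in the question.
        col = system["col_template"].format(
            stat=cfg["prefix"], STAT=cfg["prefix"].upper()
        )
        # Select only the join key and the one projection column before merging.
        right = system["df"][[system["right_on"], col]]
        out = out.merge(
            right, how="left",
            left_on=system["left_on"], right_on=system["right_on"],
        )
        # Drop the duplicate key unless it shares its name with the left key.
        if system["right_on"] != system["left_on"]:
            out = out.drop(columns=[system["right_on"]])
        proj_cols.append(col)

    # skipna=False mirrors the original sum: a missing projection gives NaN.
    # Remove it to average over whichever systems are available instead.
    out["proj_avg"] = out[proj_cols].mean(axis=1, skipna=False)
    out["avg_diff"] = (out[ou_col] - out["proj_avg"]) / out[ou_col]
    return out.sort_values("avg_diff", ascending=False).reset_index(drop=True)


# Example wiring with the DataFrame and key names from the question:
# systems = [
#     {"df": pecota_hitting_50, "left_on": "BPID",
#      "right_on": "bpid_PECOTA_50", "col_template": "{stat}_PECOTA_50"},
#     {"df": zips_hitting, "left_on": "IDFANGRAPHS",
#      "right_on": "PlayerId_ZIPs", "col_template": "{STAT}_ZIPs"},
#     # ... one entry per remaining projection system
# ]
# result = compare_to_projections(ou_hitting_ids, investment_type_sub, systems)
```

Because the average is taken over the list of merged projection columns, the divisor no longer depends on a hard-coded column count, and adding a seventh projection system only requires another entry in `systems`.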
