Can I make my Pandas DataFrame joins more concise?


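I'm trying to build a function that lets a user supply a list of over/unders (O/U) and then joins that list to a series of projection systems to see how they compare (i.e., a player's hits over/under would be compared against the projected hits from six different projection systems). The problem is that I don't want my code to repeat the same steps for each over/under type the user selects; I'd like it to be more concise and versatile.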

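My current logic is that the user can pick one of three inputs (hits, home runs, RBIs). Each input has its own if statement (the one for hits is included in the example below). Inside each if statement I need to subset the over/under dataframe by the chosen parameter (hits, in the example below), join it to all six projection dataframes (selecting only the relevant fields), and then compute the average of all projection systems for the chosen parameter.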

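Is there a way to have my code perform the steps above based on the initial parameter selection (rather than writing everything below out two more times), given that the fields I join from each projection dataframe will differ by parameter (i.e., if the RBI parameter is selected, it would be rbi_PECOTA_50 instead of h_PECOTA_50)? I suspect some kind of for loop could work, but I'm not sure how to structure it or whether that strategy makes the most sense. Sorry for the long-winded question; any suggestions would be appreciated.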

if investment_type_sub.lower() == 'hits':
        
        #Select columns (IDs) that will need to join to projections (BPID, IDFANGRAPHS, DAVENPORTID)
        ou_hits_all_projections = ou_hitting_ids[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU']]

        ou_hits_all_projections = ou_hits_all_projections.query('Hits_OU > 0')

        #Join to BP 50th percentile
        #Merge PECOTA percentile datasets to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, pecota_hitting_50, how = 'left', left_on='BPID', right_on='bpid_PECOTA_50')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU','h_PECOTA_50']]

        #Join to BP 99th percentile
        #Merge PECOTA percentile datasets to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, pecota_hitting_99, how = 'left', left_on='BPID', right_on='bpid_PECOTA_99')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU','h_PECOTA_50', 'h_PECOTA_99']]

        #Merge ZIPs dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, zips_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_ZIPs')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs']]

        #Merge Steamer dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, steamer_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_steamer')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer']]

        #Merge the_bat dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, the_bat_hitting, how = 'left', left_on='IDFANGRAPHS', right_on='PlayerId_the_bat')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer', 'H_the_bat']]

        #Merge Davenport dataset to the OU hits dataframe
        ou_hits_all_projections = pd.merge(ou_hits_all_projections, davenport_hitting, how = 'left', left_on='DAVENPORTID', right_on='HOWEID_davenport')

        #Keep only the columns you need from each projection system
        ou_hits_all_projections = ou_hits_all_projections[['BPID','IDFANGRAPHS', 'DAVENPORTID', 'Player', 'Hits_OU', 'h_PECOTA_50','h_PECOTA_99', 'H_ZIPs', 'H_steamer', 'H_the_bat', 'H_davenport']]

        #Calculate the average across all projection systems for the relevant O/U field; TODO: derive the divisor automatically instead of relying on the column count minus 5
        ou_hits_all_projections['hit_avg'] = np.where((ou_hits_all_projections.Hits_OU > 0),
                                             ((ou_hits_all_projections.h_PECOTA_50+
                                               ou_hits_all_projections.h_PECOTA_99+
                                               ou_hits_all_projections.H_ZIPs
                                              +ou_hits_all_projections.H_steamer + ou_hits_all_projections.H_the_bat +
                                               ou_hits_all_projections.H_davenport))/
                                          ((len(ou_hits_all_projections.columns)-5)), 0)
        
        #Adds in field taking the % difference between the average and the O/U
        ou_hits_all_projections['hit_avg_diff'] = (ou_hits_all_projections['Hits_OU']-ou_hits_all_projections['hit_avg'])/(ou_hits_all_projections['Hits_OU'])
        
        return ou_hits_all_projections.sort_values('hit_avg_diff', ascending=False).reset_index(drop=True)
    
python pandas function join
1 Answer

Given that any back-and-forth here will be limited, I'll make some assumptions.

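Rather than treating the dataframe (df) as something conditional, look at the problem the way an ETL solution would work. Extract the data (your initial df), then build transformations of that original df that satisfy each condition. Depending on which condition is met, you access the appropriate transformation. In short, take the user input out of the processing steps, so that the condition only selects which transformation of the original df produces the result.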
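One way to read this advice, as a minimal sketch and not the answerer's own code: describe each stat and each projection system as data, then run a single loop over the systems. The function name `compare_to_projections`, the `STATS` lookup, the `HR_OU` and `RBI_OU` columns, the non-hits prefixes, and the `proj_avg` output columns are all assumptions made for illustration; only `Hits_OU`, the ID columns, and the hits column names come from the question. The toy frames at the end are made up and use two systems only to show the call shape.

```python
import pandas as pd

# Assumed mapping from the user's choice to the O/U column and the stat prefix
# used in projection column names. Only Hits_OU and the "h"/"H" prefix appear
# in the question; the other entries are guesses.
STATS = {
    "hits": ("Hits_OU", "h"),
    "home runs": ("HR_OU", "hr"),
    "rbi": ("RBI_OU", "rbi"),
}

ID_COLS = ["BPID", "IDFANGRAPHS", "DAVENPORTID", "Player"]


def compare_to_projections(ou_ids, systems, stat):
    """Join one O/U column to every projection system and average them.

    systems: list of (projection_df, left_key, right_key, column_template);
    the template is filled with the stat prefix, e.g. "{lower}_PECOTA_50".
    """
    ou_col, prefix = STATS[stat.lower()]
    result = ou_ids.loc[ou_ids[ou_col] > 0, ID_COLS + [ou_col]]

    proj_cols = []
    for proj_df, left_key, right_key, template in systems:
        col = template.format(lower=prefix.lower(), upper=prefix.upper())
        # Keep only the join key and the one stat column, renaming the key
        # so the merge adds no extra ID column.
        slim = proj_df[[right_key, col]].rename(columns={right_key: left_key})
        result = result.merge(slim, how="left", on=left_key)
        proj_cols.append(col)

    # skipna=False mirrors the original sum: one missing projection gives NaN.
    result["proj_avg"] = result[proj_cols].mean(axis=1, skipna=False)
    result["proj_avg_diff"] = (result[ou_col] - result["proj_avg"]) / result[ou_col]
    return result.sort_values("proj_avg_diff", ascending=False).reset_index(drop=True)


# Toy data (made up) with two systems, only to show the call shape.
ou_hitting_ids = pd.DataFrame({
    "BPID": [1, 2], "IDFANGRAPHS": [10, 20], "DAVENPORTID": ["a", "b"],
    "Player": ["Player A", "Player B"], "Hits_OU": [150.5, 0.0],
})
pecota_hitting_50 = pd.DataFrame({"bpid_PECOTA_50": [1, 2], "h_PECOTA_50": [140.0, 90.0]})
zips_hitting = pd.DataFrame({"PlayerId_ZIPs": [10, 20], "H_ZIPs": [160.0, 95.0]})

systems = [
    (pecota_hitting_50, "BPID", "bpid_PECOTA_50", "{lower}_PECOTA_50"),
    (zips_hitting, "IDFANGRAPHS", "PlayerId_ZIPs", "{upper}_ZIPs"),
]
hits_comparison = compare_to_projections(ou_hitting_ids, systems, "hits")
```

Because the average is taken over the list of joined columns, the divisor follows the number of systems automatically. Supporting another stat would then mean adding one entry to `STATS`, provided the projection columns follow the same naming pattern, which is an assumption worth checking against the real data.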
