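How do I interpolate values column by column for empty rows within the same row name (country), and then copy the interpolated data back to the original DataFrame?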

Question (votes: 1, answers: 1)

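I have a spreadsheet with the statistics from the 2019 World Happiness Report, which will later be used for visualization and linear regression problems (this is a group project; my part is to clean the data so that it has as few null values as possible).

I am only interested in 2010 and later years. For some countries, the data for particular years is missing entirely (for example, Ethiopia has no rows for 2010 and 2011). I would like to estimate the missing values for those countries (Life Ladder and GDP per capita) by interpolation.

The file can be found here: https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls

What I have done so far is create a new DataFrame for each country and try to interpolate within that country (code below). Note that dropdata is the DataFrame I created by dropping countries with too little available information, such as Oman.

In addition, I manually inserted rows into the original spreadsheet containing the country and year (for example, Ethiopia, 2011) with blank data values.

But the interpolation does not work at all. I keep seeing NaN values, and when I print the DataFrame, the new rows I inserted do not show up at all.

Here is a sample output.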

Country name  Year  Life Ladder  Log GDP per capita  Social support  \
     Ethiopia  2012     4.561169            7.115237        0.658794   
     Ethiopia  2013     4.444827            7.189737        0.602482   
     Ethiopia  2014     4.506647            7.261595        0.640452   
     Ethiopia  2015     4.573155            7.335052        0.625597   
     Ethiopia  2016     4.297849            7.382929        0.718719   
     Ethiopia  2017     4.180315            7.455834        0.733540   
     Ethiopia  2018     4.379262            7.524517        0.740155   

     Healthy life expectancy at birth  Freedom to make life choices  \
                         55.200001                      0.776308   
                         55.799999                      0.706796   
                         56.400002                      0.693559   
                         57.000000                      0.802643   
                         57.500000                      0.744308   
                         58.000000                      0.717101   
                         58.500000                      0.740343   

     Generosity  Perceptions of corruption  
   -0.036612                        NaN  
   -0.000997                   0.750478  
    0.086612                   0.701800  
    0.118702                   0.567027  
    0.045363                   0.702881  
    0.007519                   0.756899  
    0.043274                   0.799466  

The code I used:

country_list = dropdata['Country name']
for country in country_list:
    countryDF = dropdata.loc[dropdata['Country name'] == country, :]  # Creates a DataFrame for each country.
    countryDF2 = countryDF.iloc[0:20, 0:9]  # We are interested only in the first 9 columns.
    countryDF2.interpolate(method='values', axis=0, limit_direction='both', limit=3)

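Even though I interpolate in both directions, NaN values remain. More importantly, I have to copy the interpolated values from each country's DataFrame back into the original DataFrame (dropdata), for all rows. Where do I start?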

Tags: python pandas dataframe interpolation

1 answer (1 vote)

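Use GroupBy.apply with a custom function that interpolates only the values selected by position, but first add the missing rows with DataFrame.reindex and an index built by MultiIndex.from_product: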

import numpy as np
import pandas as pd

df = pd.read_excel('Chapter2OnlineData.xls')

mux = pd.MultiIndex.from_product([df['Country name'].unique(), 
                                  np.arange(df['Year'].min(), df['Year'].max() + 1)],
                                  names=['Country name','Year'])
df = df.set_index(['Country name','Year']).reindex(mux).reset_index()

print (df[df['Country name'] == 'Algeria'].iloc[0:20, 0:9])
  Country name  Year  Life Ladder  Log GDP per capita  Social support  \
28      Algeria  2005          NaN                 NaN             NaN   
29      Algeria  2006          NaN                 NaN             NaN   
30      Algeria  2007          NaN                 NaN             NaN   
31      Algeria  2008          NaN                 NaN             NaN   
32      Algeria  2009          NaN                 NaN             NaN   
33      Algeria  2010     5.463567            9.462701             NaN   
34      Algeria  2011     5.317194            9.471962        0.810234   
35      Algeria  2012     5.604596            9.485086        0.839397   
36      Algeria  2013          NaN                 NaN             NaN   
37      Algeria  2014     6.354898            9.509210        0.818189   
38      Algeria  2015          NaN                 NaN             NaN   
39      Algeria  2016     5.340854            9.541166        0.748588   
40      Algeria  2017     5.248912            9.540639        0.806754   
41      Algeria  2018     5.043086            9.557952        0.798651   

    Healthy life expectancy at birth  Freedom to make life choices  \
28                               NaN                           NaN   
29                               NaN                           NaN   
30                               NaN                           NaN   
31                               NaN                           NaN   
32                               NaN                           NaN   
33                         64.500000                      0.592696   
34                         64.660004                      0.529561   
35                         64.820000                      0.586663   
36                               NaN                           NaN   
37                         65.139999                           NaN   
38                               NaN                           NaN   
39                         65.500000                           NaN   
40                         65.699997                      0.436670   
41                         65.900002                      0.583381   

    Generosity  Perceptions of corruption  
28         NaN                        NaN  
29         NaN                        NaN  
30         NaN                        NaN  
31         NaN                        NaN  
32         NaN                        NaN  
33   -0.229078                   0.618038  
34   -0.204406                   0.637982  
35   -0.195859                   0.690116  
36         NaN                        NaN  
37         NaN                        NaN  
38         NaN                        NaN  
39         NaN                        NaN  
40   -0.191522                   0.699774  
41   -0.172413                   0.758704  
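
To see what the reindex step does in isolation, here is a minimal sketch on made-up data. The `toy` frame, its country labels "A" and "B", and its values are hypothetical and not taken from the report; only the reindex pattern mirrors the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country "A" has no row for 2011, "B" has none for 2010.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "Year": [2010, 2012, 2011, 2012],
    "Life Ladder": [5.0, 6.0, 4.0, 4.5],
})

# Every (country, year) combination across the observed year range.
full_index = pd.MultiIndex.from_product(
    [toy["Country name"].unique(),
     np.arange(toy["Year"].min(), toy["Year"].max() + 1)],
    names=["Country name", "Year"],
)

# reindex adds the missing combinations as rows filled with NaN.
toy_full = (toy.set_index(["Country name", "Year"])
               .reindex(full_index)
               .reset_index())
print(toy_full)
```

After this step each country has one row per year, so the gaps exist as NaN rows that interpolate can fill, without editing the spreadsheet by hand.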

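def f(x):
    # Interpolate the first 20 rows and first 9 columns of each country's block
    # and assign the result back, because interpolate returns a new object.
    x.iloc[0:20, 0:9] = x.iloc[0:20, 0:9].interpolate(method='values',
                                                      axis=0,
                                                      limit_direction='both',
                                                      limit=3)
    return x

# group_keys=False keeps the original row index (relevant on pandas >= 2.0)
df = df.groupby('Country name', group_keys=False).apply(f)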
print (df[df['Country name'] == 'Algeria'].iloc[0:20, 0:9])

   Country name  Year  Life Ladder  Log GDP per capita  Social support  \
28      Algeria  2005          NaN                 NaN             NaN   
29      Algeria  2006          NaN                 NaN             NaN   
30      Algeria  2007     5.463567            9.462701             NaN   
31      Algeria  2008     5.463567            9.462701        0.810234   
32      Algeria  2009     5.463567            9.462701        0.810234   
33      Algeria  2010     5.463567            9.462701        0.810234   
34      Algeria  2011     5.317194            9.471962        0.810234   
35      Algeria  2012     5.604596            9.485086        0.839397   
36      Algeria  2013     5.979747            9.497148        0.828793   
37      Algeria  2014     6.354898            9.509210        0.818189   
38      Algeria  2015     5.847876            9.525188        0.783389   
39      Algeria  2016     5.340854            9.541166        0.748588   
40      Algeria  2017     5.248912            9.540639        0.806754   
41      Algeria  2018     5.043086            9.557952        0.798651   

    Healthy life expectancy at birth  Freedom to make life choices  \
28                               NaN                           NaN   
29                               NaN                           NaN   
30                         64.500000                      0.592696   
31                         64.500000                      0.592696   
32                         64.500000                      0.592696   
33                         64.500000                      0.592696   
34                         64.660004                      0.529561   
35                         64.820000                      0.586663   
36                         64.980000                      0.556665   
37                         65.139999                      0.526666   
38                         65.320000                      0.496668   
39                         65.500000                      0.466669   
40                         65.699997                      0.436670   
41                         65.900002                      0.583381   

    Generosity  Perceptions of corruption  
28         NaN                        NaN  
29         NaN                        NaN  
30   -0.229078                   0.618038  
31   -0.229078                   0.618038  
32   -0.229078                   0.618038  
33   -0.229078                   0.618038  
34   -0.204406                   0.637982  
35   -0.195859                   0.690116  
36   -0.194991                   0.692048  
37   -0.194124                   0.693979  
38   -0.193257                   0.695911  
39   -0.192389                   0.697843  
40   -0.191522                   0.699774  
41   -0.172413                   0.758704  
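
Two details of this output are worth spelling out. By default, interpolate returns a new object rather than modifying its input, so the result has to be assigned back for it to take effect. And limit=3 fills at most three consecutive NaN values in each direction, which is why Algeria's 2005 and 2006 rows remain NaN. As a possible alternative, not part of the answer above, the same per-country fill could be sketched with GroupBy.transform, which returns values aligned with the original rows, so nothing has to be copied back by hand. The `toy` frame and the `value_cols` list below are hypothetical stand-ins for the real data and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the missing (country, year) rows already exist as NaN.
toy = pd.DataFrame({
    "Country name": ["A"] * 4 + ["B"] * 4,
    "Year": [2010, 2011, 2012, 2013] * 2,
    "Life Ladder": [np.nan, 5.0, np.nan, 6.0, 4.0, np.nan, np.nan, 4.6],
})

# Assumption: the numeric columns that should be filled.
value_cols = ["Life Ladder"]

# Interpolate within each country; transform keeps the original row index,
# so the result can be assigned straight back into the full DataFrame.
toy[value_cols] = (
    toy.groupby("Country name")[value_cols]
       .transform(lambda s: s.interpolate(method="linear",
                                          limit_direction="both",
                                          limit=3))
)
print(toy)
```

Restricting the operation to named numeric columns avoids running interpolate over the text column 'Country name'. If only 2010 and later matter, a filter such as `df[df['Year'] >= 2010]` before interpolating would keep earlier years from being filled.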