如果子数据帧包含超过40%的NaN熊猫,如何删除子数据帧

问题描述 投票:0回答:3

大家好,我有这样的问题:

我有400.000个对象的面板数据,并且如果它包含超过40%的NaN,我想删除对象

例如:

               inn  time_reg   revenue1    balans1    equity1  opprofit1  \
0        0101000021      2006   457000.0   115000.0    28000.0    29000.0   
1        0101000021      2007  1943000.0   186000.0   104000.0    99000.0   
2        0101000021      2008  2812000.0   318000.0   223000.0   127000.0   
3        0101000021      2009  2673000.0   370000.0   242000.0    39000.0   
4        0101000021      2010  3240000.0   435000.0    45000.0        NaN   
...             ...       ...        ...        ...        ...        ...   
4081810  9909403758      2003  6943000.0  2185000.0  2136000.0   -97000.0   
4081811  9909403758      2004  6504000.0  2245000.0  2196000.0   -34000.0   
4081812  9909403758      2005        NaN        NaN        NaN        NaN   
4081813  9909403758      2006        NaN        NaN        NaN        NaN   
4081814  9909403758      2007        NaN        NaN        NaN        NaN   

        grossprofit1  netprofit1  currentassets1  stliabilities1  
0             92000.0     18000.0        105000.0         87000.0  
1            189000.0     76000.0        176000.0         82000.0  
2            472000.0    119000.0        308000.0         95000.0  
3            483000.0     29000.0        360000.0        128000.0  
4                 NaN     35000.0             NaN             NaN  
...               ...         ...             ...             ...  
4081810     2365000.0    -59000.0        253000.0         49000.0  
4081811     2278000.0     60000.0        425000.0         49000.0  
4081812           NaN         NaN             NaN             NaN  
4081813           NaN         NaN             NaN             NaN  
4081814           NaN         NaN             NaN             NaN  

我有这样的数据框,并且对于按(inn,time_reg)分组的每个子数据框,如果列中的总nans(revenue1 balans1 Equity1 opprofit1 grossprofit1 netprofit1 currentassets1 stlibility1)大于40%,我需要将其删除。

我有一个循环执行此操作的想法,但这需要很多时间例如:

                inn  time_reg   revenue1    balans1    equity1  opprofit1  \
4081809  9909403758      2002  6078000.0  2270000.0  2195000.0   -32000.0   
4081810  9909403758      2003  6943000.0  2185000.0  2136000.0   -97000.0   
4081811  9909403758      2004  6504000.0  2245000.0  2196000.0   -34000.0   
4081812  9909403758      2005        NaN        NaN        NaN        NaN   
4081813  9909403758      2006        NaN        NaN        NaN        NaN   
4081814  9909403758      2007        NaN        NaN        NaN        NaN   

         grossprofit1  netprofit1  currentassets1  stliabilities1  
4081809     1324000.0         NaN        234000.0         75000.0  
4081810     2365000.0    -59000.0        253000.0         49000.0  
4081811     2278000.0     60000.0        425000.0         49000.0  
4081812           NaN         NaN             NaN             NaN  
4081813           NaN         NaN             NaN             NaN  
4081814           NaN         NaN             NaN             NaN  

应该删除此子数据框,因为它包含超过40%的nans

      inn  time_reg    revenue1    balans1   equity1  opprofit1  \
0   0101000021      2006    457000.0   115000.0   28000.0    29000.0   
1   0101000021      2007   1943000.0   186000.0  104000.0    99000.0   
2   0101000021      2008   2812000.0   318000.0  223000.0   127000.0   
3   0101000021      2009   2673000.0   370000.0  242000.0    39000.0   
4   0101000021      2010   3240000.0   435000.0   45000.0        NaN   
5   0101000021      2011   3480000.0   610000.0   71000.0        NaN   
6   0101000021      2012   4820000.0   710000.0  139000.0   149000.0   
7   0101000021      2013   5200000.0   790000.0  148000.0   170000.0   
8   0101000021      2014   5450000.0   830000.0  155000.0   180000.0   
9   0101000021      2015   5620000.0   860000.0  164000.0   189000.0   
10  0101000021      2016   5860000.0   885000.0  175000.0   200000.0   
11  0101000021      2017  15112000.0  1275000.0  298000.0   323000.0   

    grossprofit1  netprofit1  currentassets1  stliabilities1  
0        92000.0     18000.0        105000.0         87000.0  
1       189000.0     76000.0        176000.0         82000.0  
2       472000.0    119000.0        308000.0         95000.0  
3       483000.0     29000.0        360000.0        128000.0  
4            NaN     35000.0             NaN             NaN  
5            NaN     61000.0             NaN             NaN  
6       869000.0    129000.0        700000.0        571000.0  
7      1040000.0    138000.0        780000.0        642000.0  
8      1090000.0    145000.0        820000.0        675000.0  
9      1124000.0    154000.0        850000.0        696000.0  
10     1172000.0    165000.0        875000.0        710000.0  
11     3023000.0    288000.0       1265000.0        977000.0  

此子数据框包含的nans少于40%,并且必须位于最终数据框中

python pandas dataframe nan
3个回答
0
投票

如果您使用numpy / pandas函数进行计数,循环会不会太慢?您可以使用someDataFrame.isnull().sum().sum()

可能比编写您自己的循环遍历数据帧中的所有值快很多,因为那些库往往具有此类功能的非常有效的实现。


0
投票

您可以使用filterpd.DataFrame.groupby方法。这使您可以传递一个函数,该函数指示是否应过滤子帧(在这种情况下,如果相关列中的NaN含量超过40%)。要获取该信息,您可以像在numpy中那样使用getNanFraction来计算nans:

def getNanFraction(df):
    nanCount = np.sum(np.isnan(df.drop("inn", axis=1).values))
    return nanCount/len(df)


df.groupby("inn").filter(lambda x: getNanFraction(x) < 0.4 )


0
投票

您可以设置阈值并基于df.count(axis=1)应用过滤器。我假设您的数据框名为df

thres=(len(df.columns)-2)*0.4
df=df[df.count(axis=1)-2>thres]

您需要-2,因为您不想计算索引列(在您的情况下为'inn'和'time_reg'。

如果将'inn'和'time_reg'设置为索引,则代码看起来会更优雅一些,因为您无需减去1:

df.set_index(['inn','time_reg'],inplace=True)
thres=(len(df.columns))*0.4
df=df[df.count(axis=1)>thres]
© www.soinside.com 2019 - 2024. All rights reserved.