Problem using (>= & <=) when trying to operate on/filter a dataframe created by a groupby operation


So I'm trying to remove outliers from a dataset. It's real-estate data, so I grouped it by zone/area (which actually appears as "Zona" in the code) using groupby and calculated the IQR of the prices per "Zona". Now I'm trying to use ">= & <=" to filter out the outliers, but I'm getting a TypeError.

Here is my code.

First, I created a new dataframe with only "Zona" and "Precio USD" and used box plots to check whether there are any outliers.

#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    grp.get_group(key).plot.box(title=key)

After noticing the outliers in some zones, I calculated the IQR and tried to filter out the outliers per "Zona"; this is the code.

Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

Everything ran smoothly until I tried to filter the dataframe based on that data with this code:

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
grp.loc[filter]

I get the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-117-09dffe5671dd> in <module>
      9 
     10 #Let's filter the dataset based on the IQR * +- 1.5
---> 11 filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
     12 grp.loc[filter]

TypeError: '<=' not supported between instances of 'float' and 'str'

I checked the dtype of all the "Precio USD" values under each "Zona" and they are all floats, and the same goes for the quantiles and the IQR.

I tried converting everything to int, but because of the groupby I couldn't do that either. So I'm a bit stuck here.

Any help would be greatly appreciated!

PS. Here is my full code (so far):

# Let's start by loading the dataset

# In[1]:


#Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')

# Read the CSV file into a DataFrame: df
gt_df = pd.read_csv('RE_Data_GT.csv')
gt_df.tail()


# Let's do some simple statistical analysis to understand the variables we have and their behaviour a little better.

# In[2]:


#Fill in NaN's in the "Banos" column with the mean of the column
gt_df['Banos'] = gt_df['Banos'].fillna(gt_df['Banos'].mean())
gt_df.info()


# In[3]:


gt_df.describe()


# From the table above we can see that a few of the columns have data that is very spread out (high standard deviation). This is not necessarily bad: since we know the dataset, we understand that this could be caused by the two types of listings ('Venta' and 'Alquiler'), and it makes sense to see this much variance if we look at rental and sale prices at the same time.
# 
# Now let's move on to one of the most exciting parts, which is some exploratory data analysis (EDA). But before we do that, I think that with the information above it would make sense to have two different dataframes, one for rentals and another for home sales.

# In[4]:


gt_alquiler = gt_df[gt_df['Tipo de listing'] == 'Alquiler']
gt_venta = gt_df[gt_df['Tipo de listing'] == 'Venta']
gt_alquiler.info()
gt_venta.info()


# Excellent, it seems like we have 2128 data points for 'Alquiler' (rental) and 3004 for 'Venta' (sales). Now that we have our two dataframes, we can actually start doing some EDA, beginning with home sales (Tipo de listing == 'Venta').

# In[5]:


_ = gt_venta['Precio USD'].plot.hist(title = 'Distribucion de Precios de Venta', colormap='Pastel2')
_ = plt.xlabel('Price in USD')


# In[6]:


#Declare a function to compute the ECDF of an array
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, len(data)+1) / n

    return x, y


# In[7]:


#Create Variable to pass to the ECDF function
gt_venta_precio = gt_venta['Precio USD']

#Compute ECDF for the sale prices
x, y = ecdf(gt_venta_precio)

# Generate plot
_ = plt.plot(x, y, marker='.', linestyle='none')

# Add title and label the axes
_ = plt.title('ECDF de Precio en USD')
_ = plt.xlabel('Precio en USD')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()


# Apparently there are a few outliers that require our attention. To better understand these points, it's better if we group the listings by "Zona" (zone/area) to see which listings have such high prices.
# 
# Let's start to understand the specific outliers by grouping the listings by "Zona" and then using a box plot for each group to review it in more detail.

# In[8]:


#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    grp.get_group(key).plot.box(title=key)


# In[14]:


Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= (Q1 - 1.5 * IQR)) & (grp['Precio USD'] <= (Q3 + 1.5 *IQR))
grp.loc[filter]

The dataset can be downloaded here: https://drive.google.com/file/d/1JXDm9iYem4DlMoIjx4f7yWBuwjaLRThe/view?usp=sharing

python pandas pandas-groupby
1 Answer
In case it's useful, I managed to achieve the goal of removing the outliers with the following function:

def remove_outliers(df, column):
    "This function takes in a dataframe and removes the outliers using the values of the specified column"
    #Use the describe() method to identify the statistics of interest
    describe = df[column].describe()
    #Create a dictionary with the position of each statistic returned by describe()
    describe_dict = {"count":0,"mean":1,"std":2,"min":3,"25":4,"50":5,"75":6,"max":7}
    #Extract quartiles (Q1, Q3)
    Q1 = describe[describe_dict['25']]
    Q3 = describe[describe_dict['75']]
    #Calculate IQR
    IQR = Q3-Q1
    #Define bounds
    lb = Q1-1.5*IQR
    ub = Q3+1.5*IQR
    print("(IQR = {}) A point outside of the following range can be considered an outlier: ({},{})".format(IQR,lb,ub))
    calc_df = df[(df[column] < lb) | (df[column] > ub)]
    print("The number of outliers that will be removed out of {} observations is {}.".format(df[column].size,len(calc_df[column])))
    #Remove the outliers from the dataframe
    no_outliers = df[~df[column].isin(calc_df[column])]
    return no_outliers
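
For instance, using the column name from your question (just a sketch, I haven't run this against your exact CSV, and gt_venta_no_outliers is a name I made up), it could be called like this:

#Remove outliers from the sales dataframe using the "Precio USD" column
gt_venta_no_outliers = remove_outliers(gt_venta, 'Precio USD')
#Quick visual check that the extreme points are gone
gt_venta_no_outliers['Precio USD'].plot.box(title='Precio USD sin outliers')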

You just need to pass it a dataframe and specify the column whose values you want to use to identify and remove the outliers. On my GitHub you can find a notebook with a quick tutorial:

https://github.com/omartinez182/web-apps/blob/master/Remove%20Outliers.ipynb
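
If you want to keep the per-"Zona" logic from your question, another option is groupby().transform (a different approach from the function above; treat this as a sketch that assumes the same 'Precio USD'/'Zona' columns, and gt_venta_sin_outliers is just a placeholder name). transform broadcasts each group's quantiles back onto every row, so the >= / <= comparison happens between row-aligned float Series instead of involving the GroupBy object itself:

#Per-"Zona" quartiles, broadcast back to each row of the original dataframe
grp_precio = gt_venta_precio_zona.groupby('Zona')['Precio USD']
q1 = grp_precio.transform(lambda s: s.quantile(0.25))
q3 = grp_precio.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
#Boolean mask aligned with the rows: keep listings within 1.5 * IQR of their own "Zona"
mask = (gt_venta_precio_zona['Precio USD'] >= q1 - 1.5 * iqr) & \
       (gt_venta_precio_zona['Precio USD'] <= q3 + 1.5 * iqr)
gt_venta_sin_outliers = gt_venta_precio_zona[mask]

This should also avoid the TypeError, since the filtering is done on the plain dataframe rather than on the grouped object.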
