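I have started using Python for a data analysis project. I have an employee attrition dataset with a categorical column named Attrition that takes two values, 0 and 1; the rest of the dataset consists of int and object data types. My problem is that when I group the dataset by this categorical column with .groupby(), I cannot apply .mean() to the result, as you can see in the capture.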
# import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
# read the dataset
df= pd.read_csv('C:/Users/LENOVO/Desktop/internship/HR-Employee.csv')
df.head()
# EDA (Exploratory Data Analysis): identify patterns through different data visualizations
df.shape # 1470 rows with 35 columns (factors)
nullValues = df.isnull().sum().sum()
nullValues # No null values in this dataset
duplicatedValues = df.duplicated().sum()
duplicatedValues # No duplicated values in this dataset
df = df.replace(to_replace = ['Yes','No'],value = ['1','0'])
df = df.replace(to_replace = ['Travel_Rarely',
'Travel_Frequently','Non-Travel'],value = ['2','1','0'])
df = df.replace(to_replace = ['Married','Single','Divorced'],value = ['2','1','0'])
df = df.replace(to_replace = ['Male','Female'],value = ['1','0'])
#---
df = df.replace(to_replace = ['Human Resources','Research & Development','Sales'],value = ['0','1','2'])
df = df.replace(to_replace = ['Human Resources','Life Sciences','Marketing','Medical','Technical Degree','Other'],value = ['0','1','2','3','4','5'])
df = df.replace(to_replace = ['Healthcare Representative','Human Resources','Laboratory Technician','Manager','Manufacturing Director','Research Director','Research Scientist','Sales Executive','Sales Representative'],value = [0,1,2,3,4,5,6,7,8])
# drop unnecessary columns
DF = df.drop(columns=['EmployeeCount','Over18','StandardHours'])
# Let's see the information of our updated dataset DF
DF.info()
''' This dataset has 1470 samples and 32 attributes
(24 integer + 8 object). No variables have null/
missing values '''
DF.describe()
left= DF.groupby('Attrition')
left.mean()
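The error you received, "unsupported operand type(s) for +: 'int' and 'str'", is enough to understand the problem. The field you are applying mean() to contains both int and str data.
Before applying mean(), try converting the data in the column to a single type.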
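One way to do that conversion is sketched below. It assumes the mixed values are all number-like (for example '0' next to 3), which is what the question's code appears to produce: most replace() calls write string codes while the last one writes integers. That reading is an inference from the posted script, and the small frame here is made up for illustration, not taken from the real HR file.

```python
import pandas as pd

# Hypothetical miniature of the HR data: 'JobRole' mixes str and int codes,
# as can happen when replace() is given '0' in one call and 0 in another.
df = pd.DataFrame({
    "Attrition": ["1", "0", "1", "0"],
    "JobRole": ["0", 3, 7, "0"],
    "Age": [41, 49, 37, 33],
})

# Coerce the encoded columns to a numeric dtype. errors="coerce" turns any
# value that cannot be parsed as a number into NaN instead of raising.
for col in ["Attrition", "JobRole"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# With every column numeric, the grouped mean no longer adds int to str.
means = df.groupby("Attrition").mean()
print(means)
```

Passing integers rather than strings as the replacement values in the first place would avoid the conversion step altogether.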
There are many ways to check for mixed data types in a column; for example, you can use:
df.<column_name>.apply(type).value_counts()
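As a small illustration of that check, the sketch below runs it on one column and then scans a whole frame for columns holding more than one Python type. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical column that mixes string codes and integer codes.
s = pd.Series(["0", 3, 7, "0", 2], name="JobRole")

# Count how many values of each Python type the column holds.
type_counts = s.apply(type).value_counts()
print(type_counts)

# Same idea across a frame: list the columns with more than one type.
df = pd.DataFrame({"JobRole": s, "Age": [41, 49, 37, 33, 27]})
mixed = [c for c in df.columns if df[c].apply(type).nunique() > 1]
print(mixed)
```

Any column that shows up in `mixed` is a candidate for the error raised by mean().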
To demonstrate this, I took the Titanic dataset from Kaggle, which looks like this:
Finally, I applied groupby() on the "Sex" column and then used mean() as:
Voila... it worked.
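The demonstration can be approximated with the sketch below. The rows are made up and only borrow column names from the Titanic dataset, so the numbers are not the real Kaggle values.

```python
import pandas as pd

# Made-up rows using Titanic-style column names; not the real Kaggle data.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Every non-key column is numeric, so the grouped mean works directly.
by_sex = titanic.groupby("Sex").mean()
print(by_sex)
```

If the frame also holds text columns such as a passenger name, `groupby("Sex").mean(numeric_only=True)` restricts the mean to the numeric columns.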