I want to automatically label the quantile ranges of "Fare", as shown below.
My data looks like this:
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
I did:
df['FareBin'] = pd.qcut(df['Fare'], 4)
df[['FareBin', 'Survived']].groupby(['FareBin'], as_index=False).mean().sort_values(by='FareBin', ascending=True)
FareBin Survived
0 (-0.001, 7.896] 0.197309
1 (7.896, 14.454] 0.303571
2 (14.454, 31.275] 0.441048
3 (31.275, 512.329] 0.600000
Now I would like to replace interval bands such as (-0.001, 7.896] with string labels in some smart way.
I tried:
df.loc[ df['Fare'] <= 7.91, 'Fare'] = 'Low'
df.loc[(df['Fare'] > 7.91) & (df['Fare'] <= 14.454), 'Fare'] = 'Mid low'
...
Is there a way to do this without listing every condition like this? Thanks.
You can use the labels parameter of the qcut() function:
pd.qcut(range(5), 3, labels=["good", "medium", "bad"])
Output:
[good, good, medium, bad, bad]
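Applied to the Fare column from the question, this might look like the sketch below. It assumes a tiny stand-in DataFrame: the first five rows reuse the Fare and Survived values from the df.head() output above, and the last three rows are made up for illustration. 'Low' and 'Mid low' come from the question; 'Mid high' and 'High' are assumed names for the two upper quartiles.

```python
import pandas as pd

# Stand-in for the Titanic data: the first five rows reuse values from the
# question's df.head(); the last three are invented for illustration only.
df = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
})

# One label per quantile bin, ordered from lowest to highest fare.
# 'Mid high' and 'High' are assumed names, not taken from the question.
fare_labels = ["Low", "Mid low", "Mid high", "High"]

# qcut computes the quartile edges and assigns the labels in a single step.
df["FareBin"] = pd.qcut(df["Fare"], 4, labels=fare_labels)

# Mean survival rate per labelled bin, in bin order.
survival_by_bin = (
    df.groupby("FareBin", observed=False)["Survived"].mean().reset_index()
)
print(survival_by_bin)
```

The number of labels has to match the number of bins (four here). Writing the result to a separate FareBin column, rather than overwriting Fare with strings as in the attempt above, keeps the numeric fares available for later use.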