Applying df.where to selected columns in pandas to remove outliers in a mixed-dtype dataset

Question

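I'm new to Python and pandas and am setting up a data-cleaning pipeline to prepare a df for machine learning. I want to identify outliers and replace them in place with, for example, the arithmetic mean.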

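The df has already been cleaned so that column #1, which holds strings ('Identifiers'), is set as the index (type = object), and the remaining columns are all purely numeric and set to float. An un-indexed version of the toy input df: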

Identifiers        foo  categorical   bar  score1  score2  score3
0         bob  10.723134          0.0   1.0    40.0     3.0    48.0
1       carol  14.115446          1.0   0.0    34.0     2.0    43.0
2       alice  12.573351          0.0  26.0    69.0    21.0    70.0
3         jim  10.793862          0.0  17.0    65.0     3.0    48.0
4      nathan   9.633013          0.0   2.0    44.0     9.0    60.0

The following code runs successfully:

for col in df_pheno:
    s = df_pheno.mean(axis=0)
    q = df_pheno.std(axis=0)
    r = s + (3 * q)
    if col == 'Identifiers':
        continue
    elif col != 'Identifiers':
        for i, row_value in df_pheno[col].iteritems():
            if row_value > r.loc[col]:
                row_value = df_pheno.replace(row_value, s.loc[col], inplace=True)
            elif row_value <= r.loc[col]:
                continue

Output (note: for the toy example the condition was changed from r to s; otherwise nothing differs):

df_pheno.head()
Out[209]: 
  Identifiers        foo  categorical   bar  score1  score2  score3
0         bob  10.723134          0.0  0.20    40.0     3.0    48.0
1       carol  11.567761          0.2  0.00    34.0     2.0    43.0
2       alice  11.567761          0.0  9.04    50.4     7.6    53.8
3         jim  10.793862          0.0  9.04    50.4     3.0    48.0
4      nathan   9.633013          0.0  2.00    44.0     7.6    53.8

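I wanted to see whether df.where would speed this up, but across various permutations I either (a) could not get it to ignore the 'Identifiers' column, or (b) hit an error about setting non-NaN values. Because of the next processing step, I would rather not insert NaN first and fill in non-NaN values afterwards, if that can be avoided. Examples of my attempts and the resulting problems: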

for col in df_pheno:
    s = df_pheno.mean(axis=0)
    q = df_pheno.std(axis=0)
    r = s + (3 * q)
    if col == 'Identifiers':
        continue
    elif col != 'Identifiers':
        df_pheno.where(df_pheno > r, s, inplace=True, axis=1)

TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

And:

for col in df_pheno:
    s = df_pheno.mean(axis=0)
    q = df_pheno.std(axis=0)
    r = s + (3 * q)
    if col == 'Identifiers':
        continue
    elif col != 'Identifiers':
        df_pheno[col].where(df_pheno[col] > r, s[col], inplace=True, axis=1)

ValueError: Can only compare identically-labeled Series objects

Any help is much appreciated.


Answer 1

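Your problem arises because the Series holding the mean (s), the standard deviation (q) and the threshold (r) have no entry for 'Identifiers', while the DataFrame still has that column, so the labels do not line up. That is why I use set_index('Identifiers') first and call reset_index() once I am done.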
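
To make the label mismatch concrete, here is a small sketch on a cut-down copy of the toy data. It assumes a reasonably recent pandas, where `numeric_only=True` has to be passed explicitly to skip the string column; the three-row frame is only for illustration.

```python
import pandas as pd

# Cut-down copy of the toy data, with 'Identifiers' still an ordinary column.
df_pheno = pd.DataFrame({
    "Identifiers": ["bob", "carol", "alice"],
    "foo": [10.723134, 14.115446, 12.573351],
    "bar": [1.0, 0.0, 26.0],
})

# The per-column mean only covers the numeric columns ...
s = df_pheno.mean(numeric_only=True)
print(list(df_pheno.columns))  # ['Identifiers', 'foo', 'bar']
print(list(s.index))           # ['foo', 'bar']

# ... so its labels differ from the DataFrame's columns until the string
# column is moved into the index.
indexed = df_pheno.set_index("Identifiers")
print(indexed.columns.equals(indexed.mean().index))  # True
```

Once the column labels and the Series labels match, comparisons such as `indexed > r` and `where(..., axis=1)` can align column by column.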

I think you only need:
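
The code that followed is not preserved in this copy. The sketch below is a reconstruction of the approach described above, not the answerer's original: the helper name `replace_outliers_with_mean` and its `n_std` parameter are hypothetical. Note that `where` keeps values where the condition is True and replaces the rest, so the condition has to be "not an outlier" (`<= r`) rather than `> r` as in the question.

```python
import pandas as pd

# Toy data from the question.
df_pheno = pd.DataFrame({
    "Identifiers": ["bob", "carol", "alice", "jim", "nathan"],
    "foo": [10.723134, 14.115446, 12.573351, 10.793862, 9.633013],
    "categorical": [0.0, 1.0, 0.0, 0.0, 0.0],
    "bar": [1.0, 0.0, 26.0, 17.0, 2.0],
    "score1": [40.0, 34.0, 69.0, 65.0, 44.0],
    "score2": [3.0, 2.0, 21.0, 3.0, 9.0],
    "score3": [48.0, 43.0, 70.0, 48.0, 60.0],
})


def replace_outliers_with_mean(df, id_col="Identifiers", n_std=3.0):
    """Replace values above mean + n_std * std with the column mean."""
    # Move the string column into the index so every remaining column is numeric.
    num = df.set_index(id_col)
    s = num.mean()       # per-column mean
    q = num.std()        # per-column standard deviation
    r = s + n_std * q    # per-column upper threshold
    # Keep values at or below the threshold; replace the others with the
    # column mean. axis=1 aligns the Series `s` with the columns.
    # Caveat: NaN <= r is False, so missing values would also become the mean.
    cleaned = num.where(num <= r, s, axis=1)
    return cleaned.reset_index()


# n_std=0.0 mirrors the question's toy run, where the threshold was the mean.
cleaned = replace_outliers_with_mean(df_pheno, n_std=0.0)
print(cleaned)
```

With this version bob's `bar` stays at 1.0, whereas the question's output shows 0.20 there. That difference appears to come from `df_pheno.replace(...)` in the loop, which replaces every matching value in the whole frame rather than a single cell.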
