How to remove duplicate columns generated by pd.get_dummies, using variance as the cutoff

Question

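I have a DataFrame generated with pd.get_dummies, like this:

df_target = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)

Here column is the column name, and df_column is the DataFrame from which each column is pulled so that some operations can be run on it.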

rev_grp_m2_> 225    rev_grp_m2_nan  rev_grp_m2_nan
0                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0
1                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0
0                       0                   0

Now I check the variance of every generated column and skip the ones with zero variance.

for target_column in list(df_target.columns):
    # If variance of the dummy created is zero : append it to a list and print to log file.
    if ((np.var(df_target_attribute[[target_column]])[0] != 0)==True):
        df_final[target_column] = df_target[target_column]

Because two columns have the same name here, I get a KeyError on the np.var line. The nan column has two variance values:

rev_grp_m2_nan    0.000819
rev_grp_m2_nan    0.000000

Ideally, I want to keep the one with non-zero variance and drop/skip the one with zero variance.

Can someone help me do this?

Answer 1

Use DataFrame.var:

print (df.var())
rev_grp_m2_> 225    0.083333
rev_grp_m2_nan      0.000000
rev_grp_m2_nan      0.000000
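
For context on the error in the question, here is a minimal sketch of what happens when a label is shared by two columns. The frame below is hypothetical (made-up values, same column names as in the question); it only illustrates that label-based selection returns every matching column, so the variance comes back as one value per column rather than a single number.

```python
import pandas as pd

# Hypothetical frame with a repeated column name, as in the question.
df = pd.DataFrame(
    [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]],
    columns=["rev_grp_m2_> 225", "rev_grp_m2_nan", "rev_grp_m2_nan"],
)

# Selecting by a duplicated label returns every matching column.
selected = df[["rev_grp_m2_nan"]]
print(selected.shape)

# So the variance is a Series with one entry per matching column,
# not a single number.
per_column_var = selected.var()
print(len(per_column_var))

# Positional access sidesteps the ambiguity: each position is one column.
second_var = df.iloc[:, 1].var()
third_var = df.iloc[:, 2].var()
```

Working by position, or on the whole frame at once as in this answer, avoids having to pick one of the same-named columns by label.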

Finally, filter with boolean indexing:

out = df.loc[:, df.var()!= 0]
print (out)
    rev_grp_m2_> 225
0                  0
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0
7                  1
8                  0
9                  0
10                 0
11                 0
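
The filter above can be reproduced end to end with a small sketch. The data is an assumption rebuilt from the table in the question (12 rows, a single 1 in row 7 of the first column), and converting the mask with to_numpy() is my own choice so that the selection is purely positional and does not rely on aligning duplicate labels.

```python
import numpy as np
import pandas as pd

# Assumed data mirroring the question: one informative dummy column
# and two all-zero columns that share the same name.
values = np.zeros((12, 3), dtype=int)
values[7, 0] = 1
df = pd.DataFrame(
    values,
    columns=["rev_grp_m2_> 225", "rev_grp_m2_nan", "rev_grp_m2_nan"],
)

# Column-wise variance: one entry per column, duplicate labels included.
variances = df.var()

# Plain boolean array, so columns are kept or dropped by position.
keep = (variances != 0).to_numpy()
out = df.loc[:, keep]

print(out.columns.tolist())
```

Only the first column survives here, because both same-named columns are constant in this sample.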

EDIT: You can get the positions of the columns with non-zero variance and then select them with iloc:

cols = [i for i in np.arange(len(df.columns)) if np.var(df.iloc[:, i]) != 0]
print (cols)
[0]

df = df.iloc[:, cols]
print (df)
    rev_grp_m2_> 225
0                  0
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0
7                  1
8                  0
9                  0
10                 0
11                 0

Another idea is to filter out the columns in which all values are 0:

cols = [i for i in np.arange(len(df.columns)) if (df.iloc[:, i] != 0).any()]
out = df.iloc[:, cols]
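
The same all-zero check can probably also be written as a single boolean mask instead of a list comprehension. This is a sketch on a hypothetical frame in which only one of the two same-named columns carries information, which is the case described in the question; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame: two columns share a name, only the first of them
# contains a non-zero value.
df = pd.DataFrame(
    [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]],
    columns=["rev_grp_m2_> 225", "rev_grp_m2_nan", "rev_grp_m2_nan"],
)

# True for every column that contains at least one non-zero value.
has_nonzero = (df != 0).any().to_numpy()

# Positional boolean selection keeps the informative duplicate
# and drops the all-zero one.
out = df.loc[:, has_nonzero]
print(out.columns.tolist())
```

Note that this check is not identical to the variance test: a column that is constant but non-zero (for example all 1s) has zero variance yet passes the non-zero filter, so the two only agree when constant columns are all 0.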
