Fastest way to exclude variables with zero variance in R

Question

I am running an evaluation on a very large .csv dataset, and I need to resolve the following error.

Warning in preProcess.default(data, method = c("center", "scale")) :
  These variables have zero variances: num_outbound_cmds, is_host_login
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

What is the fastest way to exclude the variables in my dataset whose variance is zero (0)?

Answer 1

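The R package caret has a function, nearZeroVar, that does a good job of identifying columns with zero or near-zero variance in a matrix or data frame. It returns the column indices as a vector, which you can use to drop those columns.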

> library(caret)
> df <- data.frame(a=1:5, b=sample(1:5), c=rep(1,5))
> df
  a b c
1 1 4 1
2 2 2 1
3 3 1 1
4 4 5 1
5 5 3 1
> nearZeroVar(df)
[1] 3
> df[,-nearZeroVar(df)]
  a b
1 1 4
2 2 2
3 3 1
4 4 5
5 5 3
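
One caveat with negative indexing: if no column is flagged, nearZeroVar returns an empty vector, and indexing with a negated empty vector selects zero columns instead of all of them. A minimal sketch of a safer variant, assuming caret is installed, uses the zeroVar column of the saveMetrics = TRUE output as a logical mask. This also restricts the removal to strictly zero-variance columns and leaves near-zero-variance ones in place. The name df_clean is only illustrative.

```r
library(caret)

# Same shape as the example above, with b fixed so the result is reproducible.
df <- data.frame(a = 1:5, b = c(4, 2, 1, 5, 3), c = rep(1, 5))

# saveMetrics = TRUE returns one row per column, including a logical
# zeroVar flag (single distinct value) and an nzv flag (near-zero variance).
metrics <- nearZeroVar(df, saveMetrics = TRUE)

# A logical mask is safe even when nothing is flagged;
# drop = FALSE keeps a data frame if only one column remains.
df_clean <- df[, !metrics$zeroVar, drop = FALSE]
names(df_clean)
```

Using the mask avoids the empty-index case entirely, which matters when the same cleaning step is applied to many datasets.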

Answer 2

Using @Dthal's example, a base R option is to use Filter:

Filter(var, df)
#  a b
#1 1 4
#2 2 2
#3 3 1
#4 4 5
#5 5 3

This works because a variance of 0 is coerced to FALSE and any other value to TRUE, and Filter keeps only the columns for which the function returns TRUE.
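
To make the coercion explicit, here is a rough base R equivalent of the Filter call, written out step by step. It is a sketch for illustration only; the intermediate names variances and keep are not part of the original answer.

```r
df <- data.frame(a = 1:5, b = c(4, 2, 1, 5, 3), c = rep(1, 5))

# Compute the variance of every column (one number per column).
variances <- vapply(df, var, numeric(1))

# Coerce to logical: 0 becomes FALSE, any other number becomes TRUE.
keep <- as.logical(variances)

# Keep only the columns whose flag is TRUE; which() also discards NA flags,
# so a column whose variance is NA is dropped as well.
result <- df[, which(keep), drop = FALSE]
names(result)
```

Because var() is meant for numeric data, this approach assumes every column in the data frame is numeric.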

Answer 3

If you don't want to bring in any additional packages, you already rely on the tidyverse, and you want to account for NA values, you can do something like this:

library(dplyr)
df <- data.frame(
  a = seq(5), 
  b = c(NA, rep(1, 4)), 
  c = c(1, 2, NA, 3, 4),
  d = rep(1, 5)
)
df
#   a  b  c d
# 1 1 NA  1 1
# 2 2  1  2 1
# 3 3  1 NA 1
# 4 4  1  3 1
# 5 5  1  4 1

Filter(
  function(x) case_when(
    all(is.na(x)) ~ FALSE,
    !all(is.na(x)) & var(x, na.rm = TRUE) == 0 ~ FALSE,
    TRUE ~ TRUE
  ), 
  df
)
#   a  c
# 1 1  1
# 2 2  2
# 3 3 NA
# 4 4  3
# 5 5  4

This computes the variance only after removing NA values.

If you also want to keep columns like b, i.e. treat NA as part of the variation, you can swap all for any in the second condition and do

Filter(
    function(x) case_when(
        all(is.na(x)) ~ FALSE,
        !any(is.na(x)) & var(x, na.rm = TRUE) == 0 ~ FALSE,
        TRUE ~ TRUE
    ), 
    df
)
#   a  b  c
# 1 1 NA  1
# 2 2  1  2
# 3 3  1 NA
# 4 4  1  3
# 5 5  1  4
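
The two NA policies above can also be sketched in base R by counting distinct values instead of calling var(). This is an alternative technique, not the one used in the answer; is_constant and its na_counts argument are hypothetical names introduced here. One side effect is that the check no longer depends on the column being numeric.

```r
# Hypothetical helper: TRUE when a column has at most one distinct value.
# With na_counts = FALSE, NA values are ignored before counting;
# with na_counts = TRUE, NA is treated as a value of its own.
is_constant <- function(x, na_counts = FALSE) {
  values <- if (na_counts) x else x[!is.na(x)]
  length(unique(values)) <= 1
}

df <- data.frame(
  a = seq(5),
  b = c(NA, rep(1, 4)),
  c = c(1, 2, NA, 3, 4),
  d = rep(1, 5)
)

# Ignore NA: b is constant apart from its NA, so b and d are dropped.
ignoring_na <- Filter(function(x) !is_constant(x), df)

# Count NA as variation: b is kept, only d is dropped.
na_as_variation <- Filter(function(x) !is_constant(x, na_counts = TRUE), df)

names(ignoring_na)
names(na_as_variation)
```

In both modes a column that is entirely NA counts as constant and is dropped, which matches the all(is.na(x)) ~ FALSE branch in the case_when versions.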
