删除R中的常量列

Question

收到此错误时我正在使用 prcomp 函数

Error in prcomp.default(x, ...) : 
cannot rescale a constant/zero column to unit variance

我知道我可以手动扫描数据，但是 R 中是否有任何函数或命令可以帮助我删除这些常量变量？我知道这是一个非常简单的任务，但我从未遇到过任何执行此操作的函数。

谢谢，

Answer 1

这里的问题是你的列方差等于零。您可以通过这种方式检查数据框的哪一列是恒定的，例如：

df <- data.frame(x=1:5, y=rep(1,5))
df
#   x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"

所以如果你想排除这些列，你可以使用：

df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]

编辑： 事实上，使用

apply

更简单。像这样的东西：

df[,apply(df, 2, var, na.rm=TRUE) != 0]

Answer 2

我猜这个问答是一个流行的谷歌搜索结果，但对于一个大矩阵来说，答案有点慢，而且我没有足够的声誉来评论第一个答案。因此我发布了这个问题的新答案。

对于大矩阵的每一列，检查最大值是否等于最小值就足够了。

df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]

这就是测试。比第一个答案减少了90%以上的时间。它也比问题的第二条评论的答案更快。

ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method

time1
#   user  system elapsed 
# 22.267   0.194  22.626 
time2
#   user  system elapsed 
#  2.073   0.077   2.155 
time3
#   user  system elapsed 
#  6.702   0.060   6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE

Answer 3

由于此问答是流行的 Google 搜索结果，但对于大型矩阵来说答案有点慢，并且 @raymkchow 版本对于 NA 来说很慢，我建议使用指数搜索和

data.table

功率的新版本。

这是我在 dataPreparation 包中实现的功能。

首先构建一个示例 data.table，行数多于列数（通常是这种情况）和 10% 的 NA

ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)

然后对所有方法进行基准测试：

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method

结果如下：

time1 # Variance approch
#   user  system elapsed 
#   2.55    1.45    4.07
time2 # Min = max approach
#   user  system elapsed 
#  2.72      1.5    4.22
time3 # length(unique()) approach
#   user  system elapsed 
#    6.7    2.75    9.53
time4 # Exponential search approach
#   user  system elapsed 
#   0.39    0.07    0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE

dataPreparation:which_are_constant

比其他方法快 10 倍。

加上行数越多，使用起来就越有趣。

Answer 4

janitor

库有注释

remove_constant

可以帮助删除常量列。让我们创建一个综合数据来进行说明：

library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat

这是测试数据

那么注释remove_constant可以帮助删除常量列

remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)

使用上面两条注释，我们将得到：

注意：使用参数

na.rm = TRUE

确保任何具有一个值和 NA 的列也将被删除。例如，

test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA

我们得到的test_dat_with_NA：

然后评论

remove_constant(test_dat_with_NA)

无法删除A列

评论的同时

remove_constant(test_dat_with_NA, na.rm= TRUE)

可以删除只有值 1 和 NA 的 A 列：

Answer 5

如果您正在寻找返回 df 中非常量变量的 dplyr 解决方案，我建议您使用以下方法。或者，如果需要列名称，您可以添加

%>% colnames()

：

library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
  select_if(function(v) var(v, na.rm=TRUE) != 0)  
var_df %>% colnames() # returns column names

Answer 6

Keith 评论的 tidyverse 版本：

df %>% purrr::keep(~length(unique(.x)) != 1)

Answer 7

这是我用于 data.frames 的函数

box::use(
  purrr[map_if],
)
#' Delete invariant columns
#'
#' @param df        data.frame  
#' @param exclude   string      do not remove these columns
#' 
#' @return
#' @export
#'
#' @examples
remove_invariant_columns <- function(
  df,
  exclude = NA
){
    
  cols <- map_if(
    .x = df,
    .p = is.numeric,
    .f = function(col) { 
      var(col, na.rm=TRUE) != 0 
    },      
    .else = function(col) { 
      length(unique(col)) > 1 
    }
  )
  
  cols <- names(cols[unlist(cols)])
  
  if(!is.na(exclude)){
    cols <- unique(c(exclude, cols))
  }
  
  return(df[, cols])

}

删除R中的常量列

问题描述投票：0回答：7

7个回答

最新问题

删除R中的常量列

问题描述 投票：0回答：7

7个回答

最新问题

问题描述投票：0回答：7