Drop a named column if and only if all of its values are NA

Question

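I am developing a package, and I am having trouble dropping one named all-NA column without also dropping other columns that are all NA.

Here is an example data frame. In this example there are two all-NA columns, which is expected and correct.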

library(tidyverse)

df <- tribble(
  ~a,       ~b,    ~c,         ~d, ~AR, ~BR,
  1L, "animal", "dog",         NA,  NA,  NA,
  2L, "animal", "cat",         NA,  NA,  NA,
  3L, "animal", "rat",         NA,  NA,  NA,
  4L,  "plant", "oak", "carvalho",  NA,  NA
) %>% 
  mutate_if(is.logical, as.character)

df
#> # A tibble: 4 x 6
#>       a b      c     d        AR    BR   
#>   <int> <chr>  <chr> <chr>    <chr> <chr>
#> 1     1 animal dog   <NA>     <NA>  <NA> 
#> 2     2 animal cat   <NA>     <NA>  <NA> 
#> 3     3 animal rat   <NA>     <NA>  <NA> 
#> 4     4 plant  oak   carvalho <NA>  <NA>

Created on 2020-02-06 by the reprex package (v0.3.0)

However, suppose I filter column b to show only animals. In that case there are three all-NA columns: d, AR and BR.

df %>% 
  filter(b == "animal")
#> # A tibble: 3 x 6
#>       a b      c     d     AR    BR   
#>   <int> <chr>  <chr> <chr> <chr> <chr>
#> 1     1 animal dog   <NA>  <NA>  <NA> 
#> 2     2 animal cat   <NA>  <NA>  <NA> 
#> 3     3 animal rat   <NA>  <NA>  <NA>

Created on 2020-02-06 by the reprex package (v0.3.0)

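In the function I am developing, I want column d to be dropped in the case above, when it is all NA, while any other all-NA column is left alone. So simply using select(-d) will not work, because it always drops column d, even when it has content.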

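I have tried tidyr::drop_na, purrr::discard and dplyr::select_if, combined with all(is.na()), but none of them dropped only column d. I am looking for an approach that preferably works with the pipe. The only way I have managed to do it is not pipe-friendly: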

if(all(is.na(df$d))) df$d <- NULL

Edit:

The result I expect is a function that, when run on the original df, returns exactly the same df:

df
#> # A tibble: 4 x 6
#>       a b      c     d        AR    BR   
#>   <int> <chr>  <chr> <chr>    <chr> <chr>
#> 1     1 animal dog   <NA>     <NA>  <NA> 
#> 2     2 animal cat   <NA>     <NA>  <NA> 
#> 3     3 animal rat   <NA>     <NA>  <NA> 
#> 4     4 plant  oak   carvalho <NA>  <NA>

But when column d is all NA, I expect the following result:

df
#> # A tibble: 3 x 5
#>       a b      c     AR    BR   
#>   <int> <chr>  <chr> <chr> <chr>
#> 1     1 animal dog   <NA>  <NA> 
#> 2     2 animal cat   <NA>  <NA> 
#> 3     3 animal rat   <NA>  <NA>

Answer 1

We can wrap select_if in select

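df %>% 
   filter(b == 'animal') %>% 
   # keep every column that is not all-NA, plus every column other than 'd';
   # 'd' therefore survives only when it has at least one non-NA value
   select(select_if(., ~ !all(is.na(.))) %>% names, setdiff(names(.), 'd'))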

Or, inspired by @H1's comment

library(purrr)
df %>% 
  filter(b == 'animal') %>%
  select_if(names(.) != 'd'| summarise_all(., ~ any(!is.na(.))) %>% flatten_lgl)
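
For use inside a package, the same test could also be wrapped in a small function that takes the data frame as its first argument, so that it fits in a pipe. This is only a sketch: drop_if_all_na and its cols argument are names made up here for illustration and are not part of dplyr or tidyr.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not from any package): drop each column named in
# `cols` only when every one of its values is NA; leave all others alone.
drop_if_all_na <- function(data, cols) {
  cols <- intersect(cols, names(data))  # ignore names that are not present
  all_na <- vapply(data[cols], function(x) all(is.na(x)), logical(1))
  data[setdiff(names(data), cols[all_na])]
}

# Same example data as in the question
df <- tribble(
  ~a,       ~b,    ~c,         ~d, ~AR, ~BR,
  1L, "animal", "dog",         NA,  NA,  NA,
  2L, "animal", "cat",         NA,  NA,  NA,
  3L, "animal", "rat",         NA,  NA,  NA,
  4L,  "plant", "oak", "carvalho",  NA,  NA
) %>%
  mutate_if(is.logical, as.character)

animals <- df %>%
  filter(b == "animal") %>%
  drop_if_all_na("d")

names(animals)                   # "a" "b" "c" "AR" "BR"
names(drop_if_all_na(df, "d"))   # "a" "b" "c" "d" "AR" "BR"
```

Because only the columns listed in cols are ever tested, AR and BR stay in the result even though they are all NA.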

Answer 2

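You can use select_if(), testing a condition on the column names with %in% (which also covers the case of several variables) and counting the non-NA values with colSums().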

df %>%
  filter(b == 'animal') %>%
  select_if(!names(.) %in% "d" | colSums(!is.na(.)) > 0)

# A tibble: 3 x 5
      a b      c     AR    BR   
  <int> <chr>  <chr> <chr> <chr>
1     1 animal dog   NA    NA   
2     2 animal cat   NA    NA   
3     3 animal rat   NA    NA  


df %>%
  select_if(!names(.) %in% "d" | colSums(!is.na(.)) > 0)

# A tibble: 4 x 6
      a b      c     d        AR    BR   
  <int> <chr>  <chr> <chr>    <chr> <chr>
1     1 animal dog   NA       NA    NA   
2     2 animal cat   NA       NA    NA   
3     3 animal rat   NA       NA    NA   
4     4 plant  oak   carvalho NA    NA
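
select_if() has been superseded since dplyr 1.0.0. Assuming a recent dplyr, the same condition can probably be written with where() and the set operators that select() understands. The sketch below is one way to do it; drop_cols is just an illustrative variable name.

```r
library(dplyr)
library(tibble)

# Same example data as in the question
df <- tribble(
  ~a,       ~b,    ~c,         ~d, ~AR, ~BR,
  1L, "animal", "dog",         NA,  NA,  NA,
  2L, "animal", "cat",         NA,  NA,  NA,
  3L, "animal", "rat",         NA,  NA,  NA,
  4L,  "plant", "oak", "carvalho",  NA,  NA
) %>%
  mutate(across(where(is.logical), as.character))

# Columns that may be dropped, but only when they are entirely NA
drop_cols <- c("d")

# Read as: keep everything except (named in drop_cols AND all NA)
animals <- df %>%
  filter(b == "animal") %>%
  select(!(all_of(drop_cols) & where(~ all(is.na(.x)))))

# Several names at once: on the unfiltered data only AR qualifies,
# because d still holds "carvalho"
several <- df %>%
  select(!(all_of(c("d", "AR")) & where(~ all(is.na(.x)))))

names(animals)
names(several)
```

Note that all_of() raises an error if a listed column is missing; any_of() could be swapped in if missing names should be tolerated.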

Answer 3

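A bit late, but since this question helped me solve another problem, I would like to add a general solution here.

Say that, in addition to the existing one, more than one column becomes completely empty after filtering: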

# library(tidyverse)

#---------------------
df <- tribble(
  ~a,       ~b,    ~c,         ~d,     ~e, ~AR, ~BR,
  1L, "animal", "dog",           NA,   NA,  NA,  NA,
  2L, "animal", "cat",           NA,   NA,  NA,  NA,
  3L, "animal", "rat",           NA,   NA,  NA,  NA,
  4L,  "plant", "oak",   "carvalho", "br",  NA,  NA,
  5L,  "plant", "apple", "macieira", "br",  NA,  NA) %>% 

  mutate_if(is.logical, as.character)

# "Raw" output
> filter(df, b == "animal")
# A tibble: 3 × 7
      a b      c     d     e     AR    BR   
  <int> <chr>  <chr> <chr> <chr> <chr> <chr>
1     1 animal dog   NA    NA    NA    NA   
2     2 animal cat   NA    NA    NA    NA   
3     3 animal rat   NA    NA    NA    NA

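A general solution should drop both "d" and "e", but the previous solutions only consider "d" specifically. With that in mind, here is my belated, pipe-friendly take: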

library(magrittr)  # provides %$%, which library(tidyverse) does not attach

new_df <- df %$%
  list(
    after = filter(., b == "animal") %$%
      list(
        df      = ., 
        na_cols = colnames(select(., where(\(x) all(is.na(x))))) ),
    
    before = list(
        df      = ., 
        na_cols = colnames(select(., where(\(x) all(is.na(x))))) )) %$%
  
  select(.$after$df, -setdiff(.$after$na_cols, .$before$na_cols))

Output:

> new_df
# A tibble: 3 × 5
      a b      c     AR    BR   
  <int> <chr>  <chr> <chr> <chr>
1     1 animal dog   NA    NA   
2     2 animal cat   NA    NA   
3     3 animal rat   NA    NA
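
The same before/after comparison can also be written as an ordinary function without %$%, which may be easier to reuse. This is a sketch under that assumption; all_na_cols and filter_drop_emptied are hypothetical helper names, not functions from any package.

```r
library(dplyr)
library(tibble)

# Names of the columns in `x` whose values are all NA
all_na_cols <- function(x) {
  names(x)[vapply(x, function(col) all(is.na(col)), logical(1))]
}

# Hypothetical helper: filter `data`, then drop only the columns that
# became all-NA because of the filter (columns that were already all-NA
# before filtering are kept).
filter_drop_emptied <- function(data, ...) {
  filtered <- filter(data, ...)
  emptied  <- setdiff(all_na_cols(filtered), all_na_cols(data))
  filtered[setdiff(names(filtered), emptied)]
}

df <- tribble(
  ~a,       ~b,      ~c,         ~d,   ~e, ~AR, ~BR,
  1L, "animal",   "dog",         NA,   NA,  NA,  NA,
  2L, "animal",   "cat",         NA,   NA,  NA,  NA,
  3L, "animal",   "rat",         NA,   NA,  NA,  NA,
  4L,  "plant",   "oak", "carvalho", "br",  NA,  NA,
  5L,  "plant", "apple", "macieira", "br",  NA,  NA
) %>%
  mutate(across(where(is.logical), as.character))

animals <- df %>% filter_drop_emptied(b == "animal")
plants  <- df %>% filter_drop_emptied(b == "plant")

names(animals)   # "a" "b" "c" "AR" "BR"
names(plants)    # "a" "b" "c" "d" "e" "AR" "BR"
```

One caveat: a filter that matches no rows makes every column count as all-NA, because all() of an empty vector is TRUE, so that case may need separate handling.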

I hope this helps someone the way this post helped me. Thanks!
