Performing two self-joins in sequence to filter a data frame on two columns

Question

Let's assume the following data frame:

library(dplyr)

dat <- tibble(
    Structure = c("A", "B", "X", "A-X", "B-X", "C-X", "A-Y"), 
    FirstComponent = c(NA, NA, NA, "A", "B", "C", "A"), 
    SecondComponent = c(NA, NA, NA, "X", "X", "X", "Y"),
    IsValid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE))

dat
# A tibble: 7 × 4
  Structure FirstComponent SecondComponent IsValid
  <chr>     <chr>          <chr>           <lgl>  
1 A         NA             NA              FALSE  
2 B         NA             NA              FALSE  
3 X         NA             NA              FALSE  
4 A-X       A              X               TRUE   
5 B-X       B              X               TRUE   
6 C-X       C              X               FALSE  
7 A-Y       A              Y               FALSE

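The objective is to build a `dplyr` pipeline that uses the columns `Structure`, `FirstComponent`, and `SecondComponent` to reduce this data frame so that only the rows with `IsValid == TRUE` are retained. Only those rows should be retained for which the contents of `FirstComponent` and `SecondComponent` can both be found in the `Structure` column anywhere in the data frame: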

Rows 1-3 are invalid because `FirstComponent` and `SecondComponent` are `NA`.
Rows 4 and 5 are valid because both values in `FirstComponent` and `SecondComponent` occur as values of `Structure`.
Row 6 is invalid because the value `C` in `FirstComponent` does not occur anywhere as a value of `Structure`.
Row 7 is invalid because the value `Y` in `SecondComponent` does not occur anywhere as a value of `Structure`.

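My current idea (inspired by the related question "How to join a data frame to itself within a dplyr chain?") is to perform a self-join using `Structure` and the information from `FirstComponent`, so that rows whose value in `FirstComponent` has no matching value in `Structure` are eliminated, and then to feed the result into a second self-join that does the same for `SecondComponent`.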

Working but impractical code

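As stated above, the goal is to do the filtering within a single `dplyr` pipeline. I found a solution that violates this constraint, because it assumes that `dat` exists as a persistent object that can be passed explicitly as the first argument to `dplyr` functions. In my actual use case for this problem, that is not a given. So while the following code does produce the desired result, it is not really a solution to the problem: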

inner_join(
    x=inner_join(
        x=dat,
        y=select(dat, FirstComponent=Structure)),
    y=select(dat, SecondComponent=Structure))

Joining, by = "FirstComponent"
Joining, by = "SecondComponent"
# A tibble: 2 × 4
  Structure FirstComponent SecondComponent IsValid
  <chr>     <chr>          <chr>           <lgl>  
1 A-X       A              X               TRUE   
2 B-X       B              X               TRUE

Failed solution 1

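Simply adding a pipe and replacing all references to `dat` with `.` yields an error, presumably because the `.` placeholder is out of scope inside the nested `inner_join()`: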

dat %>%
    inner_join(
        x=inner_join(
            x=.,
            y=select(., FirstComponent=Structure)),
        y=select(., SecondComponent=Structure))
Joining, by = "FirstComponent"
Error: Can't subset columns that don't exist.
✖ Columns `x` and `y` don't exist.
Run `rlang::last_error()` to see where the error occurred.
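
A note on the error above, offered as a hedged sketch rather than a definitive diagnosis: magrittr documents that when `.` appears only inside nested calls, the left-hand side is still inserted as the first argument of the outer call, and that wrapping the right-hand side in braces turns this off. Under that reading, the nested join from the working code might be piped as below. The explicit `by` arguments and the `result` name are additions for illustration and are not part of the original post.

```r
library(dplyr)

# Same example data as in the question
dat <- tibble(
  Structure = c("A", "B", "X", "A-X", "B-X", "C-X", "A-Y"),
  FirstComponent = c(NA, NA, NA, "A", "B", "C", "A"),
  SecondComponent = c(NA, NA, NA, "X", "X", "X", "Y"),
  IsValid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE))

# The braces stop %>% from also inserting `dat` as the first argument,
# so every `.` inside them refers to the unmodified input data frame.
result <- dat %>% {
  inner_join(
    x = inner_join(
      x = .,
      y = select(., FirstComponent = Structure),
      by = "FirstComponent"),
    y = select(., SecondComponent = Structure),
    by = "SecondComponent")
}

result
```

Because both lookups use the original `Structure` column, this form avoids the shrinking-input problem of a strictly sequential pair of joins.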

Failed solution 2

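I tried to eliminate the nested `inner_join()` by performing the two self-joins strictly sequentially along the pipe. This does not work either: by the time the second inner join is reached, the contents of `.` have already been reduced to exclude rows 1-3, so the second self-join, which matches `SecondComponent` against the remaining values of `Structure`, yields an empty data frame: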

dat %>%
    inner_join(
        x=.,
        y=select(., FirstComponent=Structure)) %>%
    inner_join(
        x=., 
        y=select(., SecondComponent=Structure))

# A tibble: 0 × 4
# … with 4 variables: Structure <chr>, FirstComponent <chr>, SecondComponent <chr>,
#   IsValid <lgl>

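I don't know how to proceed from here. I considered using the clever tee pipe provided by `magrittr`'s `%T>%` operator to work around the problem in solution 2, but so far without success.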

Any suggestions?

Answer 1

I guess you are looking for a `filter` that looks up values in columns of the same data frame (assuming `Structure` won't have any `NA` values):

library(dplyr)

dat %>% 
  filter(FirstComponent %in% Structure, SecondComponent %in% Structure)

#> # A tibble: 2 × 4
#>   Structure FirstComponent SecondComponent IsValid
#>   <chr>     <chr>          <chr>           <lgl>  
#> 1 A-X       A              X               TRUE   
#> 2 B-X       B              X               TRUE
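
The `NA` assumption matters because `NA %in% x` is `TRUE` whenever `x` itself contains `NA`, so rows with missing components would pass the filter. The sketch below illustrates this with a hypothetical variant of the data, `dat_na`, which is not part of the original question, and shows one possible way to guard against it with explicit `is.na()` checks.

```r
library(dplyr)

# Same example data as in the question
dat <- tibble(
  Structure = c("A", "B", "X", "A-X", "B-X", "C-X", "A-Y"),
  FirstComponent = c(NA, NA, NA, "A", "B", "C", "A"),
  SecondComponent = c(NA, NA, NA, "X", "X", "X", "Y"),
  IsValid = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE))

# Hypothetical variant: one extra row whose Structure is NA
# (bind_rows() fills the remaining columns with NA)
dat_na <- bind_rows(dat, tibble(Structure = NA_character_))

# Without guards, rows with NA components now pass the filter
unguarded <- dat_na %>%
  filter(FirstComponent %in% Structure, SecondComponent %in% Structure)

# Explicit NA checks restore the intended behavior
guarded <- dat_na %>%
  filter(
    !is.na(FirstComponent), !is.na(SecondComponent),
    FirstComponent %in% Structure, SecondComponent %in% Structure)

guarded
```

With the original `dat`, where `Structure` has no missing values, the guarded and unguarded filters return the same two rows.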
