Joining two tables by an index range, but the table length increased


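I am trying to follow this answer to join two tables by a range: https://stackoverflow.com/a/46341899/6636572

I want to join two tables: one contains ranges and the other contains numbers. I want to annotate each number by matching it against the ranges in the other data frame.

However, the table length unexpectedly increased. I expected it to stay at 2832 rows. What happened, and how can I fix it? The result does not change if I use fuzzy_left_join instead.
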
> head(gene_list_selected)
  chr start_pos  end_pos  gene_name
1   1  55013806 55100417     ACOT11
2   1  55074849 55089200    FAM151A
3   1  55107412 55175940      MROH7
4   1  55107412 55208328 MROH7-TTC4
5   1  55181494 55208328       TTC4
6   1  55222570 55230226      PARS2
> head(df)
            rsid      pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>% 
+     fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`)) 
> 
> nrow(df_with_gene_name)
[1] 3298
join dplyr tidyverse fuzzyjoin
1 Answer

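df_with_gene_name has more rows than df because the start_pos/end_pos ranges of the different gene_name values are not mutually exclusive. Any record in df whose pos falls inside more than one gene range is therefore returned once per matching gene. See the example below for an illustration. You have not said how you want to handle multiple matches, but pivot_wider() is useful in these cases and gives a single row per rsid: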

library(fuzzyjoin)
library(dplyr)
library(tidyr)

gene_list_selected <- read.table(text = "chr start_pos  end_pos  gene_name
  1  55013806 55100417     ACOT11
  1  55074849 55089200    FAM151A
  1  55107412 55175940      MROH7
  1  55107412 55208328 MROH7-TTC4
  1  55181494 55208328       TTC4
  1  55222570 55230226      PARS2", header = TRUE)

df <- read.table(text = "          rsid      pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

# Fuzzy join and pivot
df_with_gene_name <- df %>%
  fuzzy_inner_join(gene_list_selected, 
                   by = c("pos"="start_pos","pos"="end_pos"),
                   match_fun = list(`>=`, `<=`)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)

df_with_gene_name
# A tibble: 7 × 4
           rsid    ACOT11    MROH7 `MROH7-TTC4`
           <chr>    <int>    <int>        <int>
1 1:55013860:C:T 55013860       NA           NA
2 1:55013957:G:A 55013957       NA           NA
3 1:55014013:C:T 55014013       NA           NA
4 1:55014095:C:T 55014095       NA           NA
5 1:55014099:C:T 55014099       NA           NA
6 1:55014100:G:A 55014100       NA           NA
7 1:dupmatch:E.G       NA 55107413     55107413
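
If a wide table with one column per gene is not what you want, the following is a minimal sketch of an alternative, assuming it is acceptable to collapse all matching gene names into a single string per rsid. The names joined, multi_hits, one_row_per_rsid, n_genes and gene_names are made up for this illustration, and the data is a shortened copy of the toy data above. It first lists which rsid values hit more than one range (these account for the extra rows), then collapses them so the row count matches df again.

```r
library(fuzzyjoin)
library(dplyr)

# Toy data copied from the example above; the last rsid is the synthetic
# "dupmatch" row whose pos lies inside two overlapping gene ranges.
gene_list_selected <- read.table(text = "chr start_pos end_pos gene_name
1 55013806 55100417 ACOT11
1 55074849 55089200 FAM151A
1 55107412 55175940 MROH7
1 55107412 55208328 MROH7-TTC4
1 55181494 55208328 TTC4
1 55222570 55230226 PARS2", header = TRUE)

df <- read.table(text = "rsid pos
1:55013860:C:T 55013860
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

# Left join so that positions outside every range would still be kept.
joined <- df %>%
  fuzzy_left_join(gene_list_selected,
                  by = c("pos" = "start_pos", "pos" = "end_pos"),
                  match_fun = list(`>=`, `<=`))

# Diagnostic: rsids that matched more than one gene range.
multi_hits <- joined %>%
  count(rsid, name = "n_genes") %>%
  filter(n_genes > 1)

# Collapse to one row per rsid. sort() drops NA, so an rsid with no
# matching gene would end up with an empty string instead of "NA".
one_row_per_rsid <- joined %>%
  group_by(rsid, pos) %>%
  summarise(gene_names = paste(sort(gene_name), collapse = ";"),
            .groups = "drop")

multi_hits
one_row_per_rsid
```

Whether to collapse, pivot, or keep only one gene per position depends on what the annotation is used for downstream, so treat this as one option rather than the fix.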