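I am trying to follow this answer to join two tables on a range: https://stackoverflow.com/a/46341899/6636572
I want to join two tables, one containing ranges and the other containing numbers, and annotate each number with the range it falls into in the other data frame.
However, the number of rows in the result unexpectedly increased. I expected it to stay at 2832. What happened, and how can I fix it? Using fuzzy_left_join instead does not change the result.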
> head(gene_list_selected)
chr start_pos end_pos gene_name
1 1 55013806 55100417 ACOT11
2 1 55074849 55089200 FAM151A
3 1 55107412 55175940 MROH7
4 1 55107412 55208328 MROH7-TTC4
5 1 55181494 55208328 TTC4
6 1 55222570 55230226 PARS2
> head(df)
rsid pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>%
+ fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`))
>
> nrow(df_with_gene_name)
[1] 3298
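df_with_gene_name has more rows than df because the start_pos and end_pos ranges of the gene_name values are not mutually exclusive (for example, MROH7 and MROH7-TTC4 both start at 55107412). A record in df whose pos falls inside more than one gene's range therefore returns one row per matching gene. See the example below for an illustration. You haven't said how you want to handle multiple matches, but pivot_wider() is useful in these cases and gives a single row for each rsid: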
library(fuzzyjoin)
library(dplyr)
library(tidyr)
gene_list_selected <- read.table(text = "chr start_pos end_pos gene_name
1 55013806 55100417 ACOT11
1 55074849 55089200 FAM151A
1 55107412 55175940 MROH7
1 55107412 55208328 MROH7-TTC4
1 55181494 55208328 TTC4
1 55222570 55230226 PARS2", header = TRUE)
df <- read.table(text = " rsid pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)
# Fuzzy join and pivot
df_with_gene_name <- df %>%
  fuzzy_inner_join(gene_list_selected,
                   by = c("pos" = "start_pos", "pos" = "end_pos"),
                   match_fun = list(`>=`, `<=`)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)
df_with_gene_name
# A tibble: 7 × 4
rsid ACOT11 MROH7 `MROH7-TTC4`
<chr> <int> <int> <int>
1 1:55013860:C:T 55013860 NA NA
2 1:55013957:G:A 55013957 NA NA
3 1:55014013:C:T 55014013 NA NA
4 1:55014095:C:T 55014095 NA NA
5 1:55014099:C:T 55014099 NA NA
6 1:55014100:G:A 55014100 NA NA
7 1:dupmatch:E.G NA 55107413 55107413
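If a wide table with one column per gene is not what you want, another option is to keep one row per rsid and collapse all overlapping gene names into a single column. The following is only a minimal sketch of that idea: the gene_names column, the ";" separator, and the made-up 1:nomatch:E.G row are illustrative assumptions, not part of the original data. It uses fuzzy_left_join so that positions outside every range are kept with NA rather than dropped.

```r
library(fuzzyjoin)
library(dplyr)

# Small subset of the example data; "1:nomatch:E.G" is a hypothetical row
# added here to show what happens when a position falls in no range.
gene_list_selected <- read.table(text = "chr start_pos end_pos gene_name
1 55013806 55100417 ACOT11
1 55107412 55175940 MROH7
1 55107412 55208328 MROH7-TTC4", header = TRUE, stringsAsFactors = FALSE)

df <- read.table(text = "rsid pos
1:55013860:C:T 55013860
1:dupmatch:E.G 55107413
1:nomatch:E.G 55300000", header = TRUE, stringsAsFactors = FALSE)

# One row per (rsid, matching gene); unmatched rsids get NA gene columns.
matches <- df %>%
  fuzzy_left_join(gene_list_selected,
                  by = c("pos" = "start_pos", "pos" = "end_pos"),
                  match_fun = list(`>=`, `<=`))

# Diagnostic: which rsids fall inside more than one gene range?
multi_match <- matches %>%
  count(rsid, name = "n_genes") %>%
  filter(n_genes > 1)

# Collapse back to one row per rsid, joining gene names with ";".
annotated <- matches %>%
  group_by(rsid, pos) %>%
  summarise(
    gene_names = if (all(is.na(gene_name))) NA_character_
                 else paste(gene_name, collapse = ";"),
    .groups = "drop"
  )

annotated
```

With this approach nrow(annotated) equals nrow(df), and multi_match lists the rsids responsible for the extra rows in the plain join.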