How to semi-join two data frames by a string column when one of them is semicolon-separated

Question. Votes: 1, Answers: 2

I have two data frames, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

They look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

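dfa is my list of genes of interest: I want to keep the rows of dfb in which any of those genes appear, matching whole gene names including the digits (MUC1 is not MUC16). My new_df should look like this: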

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

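My problem is that the regular dplyr::semi_join() matches exactly, which does not take into account the fact that dfb$gene_name can contain several genes separated by "; ". This means that in this example, the row containing "MET" is not kept.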

I tried looking into fuzzyjoin::regex_semi_join, but could not get it to do what I want...

A tidyverse solution would be welcome. (Maybe with stringr?!)

r dplyr fuzzyjoin semi-join
2 Answers
3
votes

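You can use separate_rows() to split the data frame into one row per gene before joining. Note that if BLEP were also present in dfa, this would produce duplicate rows, which is why distinct() is used.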

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5),
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10),
  stringsAsFactors = FALSE
)


library(tidyverse)

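dfb %>%
  # keep the original string and split a copy into one row per gene
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  # keep only the rows whose individual gene appears in dfa
  semi_join(dfa, by = c("new_col" = "gene_name")) %>%
  # drop the helper column and collapse any duplicated dfb rows
  select(gene_name, id) %>%
  distinct()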
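To see why the final distinct() matters, here is a minimal sketch. It assumes a hypothetical dfa2, which is not part of the question's data and simply adds BLEP to the genes of interest. Both halves of "MET; BLEP" then match, so the same dfb row comes back twice until it is de-duplicated.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of dfa that also lists BLEP (assumption, not in the question)
dfa2 <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT", "BLEP"),
  id = 1:6,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# Same pipeline as above, but stopping before distinct()
matched <- dfb %>%
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  semi_join(dfa2, by = c("new_col" = "gene_name")) %>%
  select(gene_name, id)

nrow(matched)            # counts "MET; BLEP" once per matching gene
nrow(distinct(matched))  # counts each original dfb row once
```

Without distinct(), the number of rows depends on how many genes in a cell match, not on how many dfb rows match.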



0
votes

Here is a solution using stringr and purrr.

library(tidyverse)

dfb %>%
 mutate(gene_name_list = str_split(gene_name, "; ")) %>%
 mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
 filter(gene_of_interest == TRUE) %>%
 select(gene_name, id)
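For comparison, the same split-then-test idea can be sketched in base R without any packages. This is an illustrative equivalent rather than part of the answer above; the names gene_lists and keep are made up for the example.

```r
dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# Split each dfb entry on the literal "; " into a vector of single gene names
gene_lists <- strsplit(dfb$gene_name, "; ", fixed = TRUE)

# TRUE for every dfb row in which at least one gene is exactly in dfa$gene_name
keep <- vapply(
  gene_lists,
  function(genes) any(genes %in% dfa$gene_name),
  logical(1)
)

# Subset the rows; the original "MET; BLEP" string is kept untouched
new_df <- dfb[keep, , drop = FALSE]
new_df
```

Because %in% compares whole strings, MUC1 does not match MUC16 and FAT does not match FAT1, which is the exact-name behaviour the question asks for.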