Filtering similar DNA sequences using R


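I have two tables containing DNA sequences from an early stage (TIMEPOINT_1) and a late stage (TIMEPOINT_2). I want to filter the TIMEPOINT_2 table for sequences that match a sequence in the TIMEPOINT_1 table at a similarity threshold of 95%. I tried using the "stringdistmatrix" function to build a similarity matrix, but did not get the expected result. Is there a way to do this in R?

Here is an example of the table structure: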

# Creating df TIMEPOINT_1
sequences <- c(
  "ACCTTCAGGCAACCTTCAGGCA",
  "ACCTTCGAGCAGCCATCAGGCA",
  "ACCCGTCCTAGGATCGATCAGGCA",
  "TCGAAGTGCATGCATGCTTACGTA",
  "CGTGCAAAGCGTGACGTTAGCGT")
sequence_names <- c("time1_seq1", "time1_seq2", "time1_seq3", "time1_seq4", "time1_seq5")
TIMEPOINT_1 <- data.frame(name = sequence_names, sequence = sequences)

# Creating df TIMEPOINT_2
sequences <- c(
  "ACCTTCGGGCAACCTTCAGGCA",
  "ACCTTCGTGCGGGCCATCAGGCA",
  "ACCCGTCCTAGGATCGATCAGGCA",
  "TCGAAGTGCATGCATGCTTAAGTA",
  "CGTGCAAAGCGTGACTGCACGTGGT")
sequence_names <- c("time2_seq1", "time2_seq2", "time2_seq3", "time2_seq4", "time2_seq5")
TIMEPOINT_2 <- data.frame(name = sequence_names, sequence = sequences)

Expected result: the TIMEPOINT_2 table reduced to the sequences that match a sequence in the TIMEPOINT_1 table.

r sequence similarity
1 Answer

If I understand your goal correctly, I would do a simple inner merge:

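# Inner join on the sequence column: keeps only sequences that are
# identical in both tables (name columns become name.x and name.y)
df <- merge(TIMEPOINT_1, TIMEPOINT_2, by = "sequence", all = FALSE)
df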
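
The merge above only finds exact matches, so it does not cover the 95% threshold from the question. A minimal sketch of a similarity-based filter follows. It assumes that "similarity" means 1 minus the Levenshtein distance divided by the length of the longer sequence, which is one possible definition and not necessarily the one the asker had in mind. The helper name `filter_similar` is hypothetical; `adist()` is the edit-distance function from the `utils` package shipped with base R.

```r
# Example data from the question
TIMEPOINT_1 <- data.frame(
  name = paste0("time1_seq", 1:5),
  sequence = c(
    "ACCTTCAGGCAACCTTCAGGCA",
    "ACCTTCGAGCAGCCATCAGGCA",
    "ACCCGTCCTAGGATCGATCAGGCA",
    "TCGAAGTGCATGCATGCTTACGTA",
    "CGTGCAAAGCGTGACGTTAGCGT"),
  stringsAsFactors = FALSE
)
TIMEPOINT_2 <- data.frame(
  name = paste0("time2_seq", 1:5),
  sequence = c(
    "ACCTTCGGGCAACCTTCAGGCA",
    "ACCTTCGTGCGGGCCATCAGGCA",
    "ACCCGTCCTAGGATCGATCAGGCA",
    "TCGAAGTGCATGCATGCTTAAGTA",
    "CGTGCAAAGCGTGACTGCACGTGGT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answer):
# keep the rows of `query` whose sequence is at least `threshold`
# similar to any sequence in `reference`.
filter_similar <- function(query, reference, threshold = 0.95) {
  # Levenshtein distances: rows = query sequences, columns = reference sequences
  d <- utils::adist(query$sequence, reference$sequence)
  # Assumed normalisation: divide by the longer sequence of each pair
  max_len <- outer(nchar(query$sequence), nchar(reference$sequence), pmax)
  similarity <- 1 - d / max_len
  # A query row is kept if its best match reaches the threshold
  keep <- apply(similarity, 1, max) >= threshold
  query[keep, , drop = FALSE]
}

TIMEPOINT_2_filtered <- filter_similar(TIMEPOINT_2, TIMEPOINT_1)
print(TIMEPOINT_2_filtered)
```

With the example data, this should keep time2_seq1, time2_seq3 and time2_seq4: each is within one edit of a TIMEPOINT_1 sequence, while the other two differ by at least two edits. The distance matrix grows with the product of the two table sizes, so this approach is only practical for modest numbers of sequences.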