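I have 2 data frames containing short (length == 20) sequences that I want to compare using string distance analysis, returning highly similar sequences with a Hamming distance of no more than 3 (i.e. no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() does this job well, but it cannot handle the number of sequences I want to compare (tens to hundreds of thousands of sequences in each data frame) unless I chunk the query sequences. When my data frames are on the larger side, this strategy starts taking a full day to execute with the code below.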
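Is there a way to use the fuzzyjoin or stringdist packages together with data.table to speed this up and conserve memory? I keep trying different things, but they all end up running slower.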
library(tidyverse)
library(fuzzyjoin)
### simulate data ###
chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq),
                    seq = replicate(nq, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
### return seqs with 3 or less mismatches ###
# NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")
# chunk query values to preserve memory
query <- query %>%
  # consecutive groups of 100 rows: 1-100 -> 1, 101-200 -> 2, ...
  mutate(grp = (row_number() - 1) %/% 100 + 1)
# get a variable of all groups
var.grps <- unique(query$grp)
# run the join one chunk at a time and row-bind the results into one data frame
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})
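I figured it out: stringdist_join() uses stringdistmatrix() under the hood. Calling stringdistmatrix() directly and pulling the required information out of the result is much faster. To work around the memory problem, I process the query sequences in chunks and write each chunk into a preallocated matrix.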
# stringdistmatrix() lives in stringdist, which library(fuzzyjoin) does not attach
library(stringdist)
# make stringdist matrix
chunk_size <- 1000
num_rows <- nrow(query)
# Preallocate the full result matrix (nrow(query) x nrow(subject) doubles; it must fit in RAM)
sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))
# Loop through the rows in chunks
for (start_row in seq(1, num_rows, by = chunk_size)) {
  end_row <- min(start_row + chunk_size - 1, num_rows)
  # Subset the rows for the current chunk
  chunk_query <- query$seq[start_row:end_row]
  # Compute stringdist matrix for the current chunk
  chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
  # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
  sdm[start_row:end_row, ] <- chunk_sdm
}
rownames(sdm) <- query$name
colnames(sdm) <- subject$name
# find the indices where the values are 3 or less
indices <- which(sdm <= 3, arr.ind = TRUE)
# extract row names, col names, and values based on the indices
result <- data.frame(query = rownames(sdm)[indices[, 1]],
                     subject = colnames(sdm)[indices[, 2]],
                     mismatch = sdm[indices])
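One caveat with storing the whole matrix: at the simulated sizes above (50,051 queries by 54,277 subjects) it has about 2.7 billion cells, which is roughly 21.7 GB as doubles. A possible variation, sketched below, is to pull the hits out of each chunk right away and keep only those, so the full matrix never has to exist. The helper name `hamming_hits()` and its arguments are made up for illustration, and the toy data frames are not from the question; it assumes both inputs have `name` and `seq` columns as in the simulated data.

```r
library(stringdist)

# Hypothetical helper (not part of stringdist or fuzzyjoin): returns the
# query/subject pairs whose Hamming distance is at most `max_dist`,
# processing the query rows `chunk_size` at a time.
hamming_hits <- function(query, subject, max_dist = 3, chunk_size = 1000) {
  starts <- seq(1, nrow(query), by = chunk_size)
  hits <- lapply(starts, function(start_row) {
    end_row <- min(start_row + chunk_size - 1, nrow(query))
    rows <- start_row:end_row
    # Distance matrix for this chunk only: length(rows) x nrow(subject)
    d <- stringdistmatrix(query$seq[rows], subject$seq, method = "hamming")
    # Row/column positions of the cells within the threshold
    idx <- which(d <= max_dist, arr.ind = TRUE)
    data.frame(query = query$name[rows][idx[, 1]],
               subject = subject$name[idx[, 2]],
               mismatch = d[idx])
  })
  # Only the hits from each chunk are kept and stacked
  do.call(rbind, hits)
}

# Toy data (assumed for illustration): q1 differs from s1 at 2 positions,
# every other pair differs at all 20 positions.
query_toy <- data.frame(name = c("q1", "q2"),
                        seq = c(strrep("A", 20), strrep("C", 20)))
subject_toy <- data.frame(name = c("s1", "s2"),
                          seq = c(paste0(strrep("A", 18), "TT"), strrep("G", 20)))
toy_hits <- hamming_hits(query_toy, subject_toy, max_dist = 3, chunk_size = 1)
print(toy_hits)
```

With this layout peak memory is driven by chunk_size * nrow(subject) cells plus the accumulated hits, so chunk_size can be tuned to the available RAM. On the full data the call would be hamming_hits(query, subject), which should give the same pairs as the matrix approach above, though I have not benchmarked it.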