How to maximize R fuzzyjoin/stringdist speed and memory efficiency

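I have 2 data frames of short (length == 20) sequences that I want to compare using string-distance methods, returning highly similar sequences with a Hamming distance of at most 3 (i.e. no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() does this job well, but it cannot handle the number of sequences I want to compare (tens to hundreds of thousands in each data frame) unless I chunk the query sequences. When my data frames are on the larger side, this strategy takes a full day to run with the code below.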

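Is there a way to use the fuzzyjoin or stringdist packages together with data.table to speed this up and conserve memory? I keep trying different things, but they all end up running slower.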

library(tidyverse)
library(fuzzyjoin)

### simulate data ###

chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq), 
                    seq = replicate(nq, sample(chars, 20, replace = T) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = T) %>% paste0(collapse = "")))

### return seqs with 3 or less mismatches ###

# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query values to preserve memory
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100)/100)+1)

# get a variable of all groups
var.grps <- unique(query$grp)

# create an output list
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})
r data.table bioconductor stringdist fuzzyjoin
1 Answer

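I figured it out: stringdist_join() uses stringdistmatrix() under the hood. Calling stringdistmatrix() directly and collecting the needed information from its result is much faster. To work around the memory problem, I compute the distances for the query sequences in chunks and write each chunk into a pre-allocated matrix.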

library(stringdist)

# make stringdist matrix
chunk_size <- 1000
num_rows <- nrow(query)

# Initialize an empty matrix
sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))

# Loop through the rows in chunks
for (start_row in seq(1, num_rows, by = chunk_size)) {
  end_row <- min(start_row + chunk_size - 1, num_rows)
  
  # Subset the rows for the current chunk
  chunk_query <- query$seq[start_row:end_row]
  
  # Compute stringdist matrix for the current chunk
  chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
  
  # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
  sdm[start_row:end_row, ] <- chunk_sdm
}
rownames(sdm) <- query$name
colnames(sdm) <- subject$name

# find the indices where the values are 3 or less
indices <- which(sdm <= 3, arr.ind = TRUE)

# extract row names, col names, and values based on the indices
result <- data.frame(query = rownames(sdm)[indices[, 1]],
                     subject = colnames(sdm)[indices[, 2]],
                     mismatch = sdm[indices])
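
One caveat with the code above: the full sdm matrix stores nrow(query) x nrow(subject) doubles, which for the simulated data in the question (50051 x 54277) is roughly 20 GB. A possible variant, sketched below, filters each chunk right away and keeps only the pairs within the distance threshold, so the full matrix never has to exist. The helper name hamming_hits and the toy data frames are hypothetical and only for illustration; the sketch assumes both data frames have name and seq columns as in the question, and, like the code above, it returns matching pairs only (queries without a match are dropped, unlike mode = "left").

```r
library(stringdist)

# Hypothetical helper (not part of stringdist or fuzzyjoin): returns all
# query/subject pairs within `max_dist` Hamming mismatches, chunk by chunk.
# Assumes `query` and `subject` each have `name` and `seq` columns and
# that `query` has at least one row.
hamming_hits <- function(query, subject, max_dist = 3, chunk_size = 1000) {
  starts <- seq(1, nrow(query), by = chunk_size)
  hits <- lapply(starts, function(start_row) {
    end_row <- min(start_row + chunk_size - 1, nrow(query))
    # Distance matrix for this chunk only: (rows in chunk) x nrow(subject)
    d <- stringdistmatrix(query$seq[start_row:end_row], subject$seq,
                          method = "hamming")
    # Keep only the cells at or below the threshold
    idx <- which(d <= max_dist, arr.ind = TRUE)
    data.frame(query    = query$name[start_row + idx[, 1] - 1],
               subject  = subject$name[idx[, 2]],
               mismatch = d[idx])
  })
  do.call(rbind, hits)
}

# Toy data (made up for illustration)
toy_query   <- data.frame(name = c("q1", "q2"),
                          seq  = c("ACGTACGTAC", "TTTTTTTTTT"))
toy_subject <- data.frame(name = c("s1", "s2", "s3"),
                          seq  = c("ACGTACGTAC", "ACGTACGTAA", "TTTTAAAATT"))
toy_hits <- hamming_hits(toy_query, toy_subject, max_dist = 3, chunk_size = 1)
```

With this layout, peak memory is driven by chunk_size x nrow(subject) plus the accumulated hits, so chunk_size can be tuned to the available RAM.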