What is an efficient way to encode a string as numbers, using a score table, based on a comparison with another string?


Suppose we have a score table:

score_matrix = t(data.frame('A' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
                            'T' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
                            'G' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1),
                            'C' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1)))

> score_matrix
  A T G C
A 1 1 2 2
T 1 1 2 2
G 2 2 1 1
C 2 2 1 1

Now I want to compare multiple strings of equal length...

Query = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
Subject = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))

> Query
[1] "TTATACCAGTGTATGATGAGCCTCG"
> Subject
[1] "GTAGCTCACGAATATATGAACCTCA"

...and match the Subject string against the Query, converting it into a series of scores according to the matrix above, i.e.:

2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2

Code for the comparison above:

q = unlist(strsplit(Query,""))     # Query split into single characters
s = unlist(strsplit(Subject,""))   # Subject split into single characters
for (i in seq_along(q)) { cat(score_matrix[q[i],s[i]],"") }   # print the score at each position
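As a sketch of the same per-position lookup without a loop: a matrix with dimnames can be indexed by a two-column character matrix of (row name, column name) pairs. The names `bases`, `q`, `s`, and `scores` below are illustrative, and the score table is rebuilt with `matrix()` only to keep the snippet self-contained.

```r
# Score table from the question, rebuilt as a named numeric matrix
bases <- c("A", "T", "G", "C")
score_matrix <- matrix(c(1, 1, 2, 2,
                         1, 1, 2, 2,
                         2, 2, 1, 1,
                         2, 2, 1, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(bases, bases))

Query   <- "TTATACCAGTGTATGATGAGCCTCG"
Subject <- "GTAGCTCACGAATATATGAACCTCA"

# Split both strings into single characters
q <- strsplit(Query, "")[[1]]
s <- strsplit(Subject, "")[[1]]

# Each row of cbind(q, s) is one (row name, column name) pair,
# so all 25 lookups happen in a single indexing call
scores <- score_matrix[cbind(q, s)]
print(scores)
# 2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2
```

This returns a plain numeric vector rather than printing with `cat`, so the scores can be stored or processed further.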

My actual data set will be much larger, for example in matrix format:

data_matrix = matrix(unlist(strsplit(Query,"")),nrow = 1)
data_matrix = rbind(data_matrix,matrix(unlist(strsplit(Subject,"")),nrow = 1))
for(i in 1:23) {
   data_matrix = rbind(data_matrix,
         matrix(unlist(strsplit(rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T))),"")),
         nrow = 1)) }

> dim(data_matrix)
[1] 25 25

I can compare the letters one at a time in a nested loop, but that is inefficient. I tried something like this:

for (i in 2:nrow(data_matrix)) {
   for (j in 1:ncol(data_matrix)) {
      data_matrix[i,j] = score_matrix[data_matrix[i,j],data_matrix[1,j]] } }

But for real data, a matrix of roughly 5000 X 5000, this is very slow. For reference, here is a benchmark for the 25 X 25 matrix. My real data set will take far longer.

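temp2 = matrix(0, nrow(data_matrix), ncol(data_matrix))   # result matrix; must exist before the loop
microbenchmark(for (i in 2:nrow(data_matrix)) {
    for (j in 1:ncol(data_matrix)) {
        temp2[i,j] = score_matrix[data_matrix[i,j],data_matrix[1,j]]}})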

Unit: milliseconds
 expr
 <the command above>
      min       lq     mean   median       uq      max neval
 133.0899 159.8918 189.5858 173.7305 208.5522 348.1761   100

What would be a more efficient way to solve this?
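One possible direction, offered as an untimed sketch rather than a benchmarked answer, is to replace the nested loop with a single matrix-index lookup over all cells at once. It assumes the first row is the query and every other row is a subject, as in the loop above. The names `bases`, `ref`, `rest`, `idx`, and `score_table` are illustrative, and the random data is only a stand-in for the real matrix.

```r
set.seed(1)  # arbitrary seed, only so the stand-in data is reproducible

bases <- c("A", "T", "G", "C")
score_matrix <- matrix(c(1, 1, 2, 2,
                         1, 1, 2, 2,
                         2, 2, 1, 1,
                         2, 2, 1, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(bases, bases))

# Stand-in for the real data: 25 sequences of length 25, one per row
data_matrix <- matrix(sample(bases, 25 * 25, replace = TRUE), nrow = 25)

ref  <- data_matrix[1, ]                 # query row
rest <- data_matrix[-1, , drop = FALSE]  # subject rows

# Pair every subject cell with the query base from the same column.
# as.vector() walks the matrix column by column, and col(rest) gives
# the matching column index for each cell in that same order.
idx <- cbind(as.vector(rest), ref[as.vector(col(rest))])

# One vectorised lookup, reshaped back to the layout of the subject rows
score_table <- matrix(score_matrix[idx], nrow = nrow(rest))
dim(score_table)
```

The result is a numeric matrix with one row per subject, kept separate from the character matrix so that scores are not coerced to strings by being written back into `data_matrix`.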

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_UK.utf8  LC_CTYPE=English_UK.utf8    LC_MONETARY=English_UK.utf8 LC_NUMERIC=C                   LC_TIME=English_UK.utf8    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4.9    stringi_1.7.8      doParallel_1.0.17    iterators_1.0.14   foreach_1.5.2    fs_1.5.2 
 [7] S4Vectors_0.34.0        data.table_1.14.6  forcats_0.5.2        stringr_1.5.0      dplyr_1.0.10     purrr_0.3.5 
[13] readr_2.1.3             tidyr_1.2.1        tibble_3.1.8         ggplot2_3.4.0      tidyverse_1.3.2     
r sequence string-comparison