优化组合计算到列表中 - 大数据集

问题描述 投票:0回答:1

我想知道是否有人可以找到一种更快的方法来计算向量中元素的组合。我的方法有效,但速度很慢,向量中有大约 600 万个元素。

测试向量

test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")

我的方法

lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))

预期产量

[[1]]
[1] "335261344015" "335261537633" "344015537633"

[[2]]
[1] "22404132858"

[[3]]
[1] "254654355860" "254654488288" "355860488288"

[[4]]
[1] "219943373817"

[[5]]
[1] "331839404477"
r optimization combinations
1个回答
3
投票

这是一个比OP在大型测试用例上的解决方案快得多的答案。它不依赖于

paste
,而是我们利用数字和向量化运算的属性。我们还使用
comboGeneral
包(我是作者)中的
RcppAlgos
,它比链接答案中的
combn
combn_prim
快得多,用于生成向量组合。首先我们展示
comboGeneral
相对于其他函数的效率增益:

## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
options(digits = 4)
options(width = 90)
options(scipen = 99)

microbenchmark(
    gRBase = gRbase::combn_prim(300, 2),
    utils  = combn(300, 2),
    RcppAlgos = comboGeneral(300, 2),
    unit = "relative"
)
#> Warning in microbenchmark(gRBase = gRbase::combn_prim(300, 2), utils = combn(300, : less
#> accurate nanosecond times to avoid potential integer overflows
#> Unit: relative
#>       expr   min     lq  mean median     uq     max neval
#>     gRBase 104.6  93.79 122.1  91.82  88.26 1259.00   100
#>      utils 141.8 127.72 126.2 125.70 128.44   88.04   100
#>  RcppAlgos   1.0   1.00   1.0   1.00   1.00    1.00   100

现在,我们创建一个函数来创建一些随机的可重现数据,这些数据将传递给我们的测试函数:

makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1),
                     replace = withRep), collapse = " ")
    })
}

makeTestSet(5, 3)
#> [1] "280897 566346"        "74362 46208 964632"   "226905 326254"       
#> [4] "626546 926595 586329" "222181 946797"

看起来不错。现在,让我们看看设置

fixed = TRUE
是否会给我们带来任何收益(如@MichaelChirico 上面所建议的):

bigVec <- makeTestSet(10, 100000)

microbenchmark(standard = strsplit(bigVec, " "),
               withFixed = strsplit(bigVec, " ", fixed = TRUE),
               times = 15, unit = "relative")
#> Unit: relative
#>       expr  min    lq  mean median    uq   max neval
#>   standard 6.18 6.134 5.197  6.048 5.678 2.921    15
#>  withFixed 1.00 1.000 1.000  1.000 1.000 1.000    15

@MichaelChirico 说得很对。把它们放在一起我们得到:

combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[,1] * (10)^(as.integer(log10(combs[,2])) + 1L) + combs[,2])
    })
}

## test.vector defined above by OP
test.vector <- c(
    "335261 344015 537633", "22404 132858",
    "254654 355860 488288", "219943 373817",
    "331839 404477"
)

combPairFast(test.vector)
#> [[1]]
#> [1] 335261344015 335261537633 344015537633
#> 
#> [[2]]
#> [1] 22404132858
#> 
#> [[3]]
#> [1] 254654355860 254654488288 355860488288
#> 
#> [[4]]
#> [1] 219943373817
#> 
#> [[5]]
#> [1] 331839404477

## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) unique(
        apply(combn(x, 2), 2, function(y) paste0(y, collapse = "")))
    )
}

正如OP的评论中所述,最大数量小于一百万(确切地说是600000),这意味着我们将其中一个数字最多乘以10^6并将其添加到另一个6位数字(相当于简单地连接两个数字字符串),我们保证在基本 R 的数值精度内(即

2^53 - 1
)。这很好,因为数字上的算术运算比字符串运算要高效得多。

剩下的就是进行基准测试:

test.vector <- makeTestSet(100, 50)

microbenchmark(
    original = combPairOP(test.vector),
    new_impl = combPairFast(test.vector),
    times = 20,
    unit = "relative"
)
#> Unit: relative
#>      expr   min    lq  mean median    uq  max neval
#>  original 31.68 31.55 23.85  30.46 29.14 11.2    20
#>  new_impl  1.00  1.00  1.00   1.00  1.00  1.0    20

对于更大的向量:

bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)

## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
#> [1] TRUE

system.time(t1 <- combPairFast(bigTest.vector))
#>    user  system elapsed 
#>   0.081   0.004   0.086

system.time(t2 <- combPairOP(bigTest.vector))
#>    user  system elapsed 
#>   4.519   0.040   4.560

## results are the same
all.equal(t1, lapply(t2, as.numeric))
#> [1] TRUE
© www.soinside.com 2019 - 2024. All rights reserved.