R：提高R

Question

目的：

past

向量是在过去一段时间内找到的

o3.cpt

的集合。

current

向量是在当前周期内找到的

o3.cpt

的集合。每个

o3.cpt

都有一组与代码相关联的数字（

numbers

数据框中的 PC）。我正在尝试获取每对过去和当前

o3.cpt

的PC差异之和。在原始数据集中，

numbers

数据框有 700 多个 PC 列。

问题：

下面的代码是我目前正在使用的，但是运行速度太慢（我需要重复这个过程超过70万次）。有没有更有效的方法来计算这些

sum_diff

？

示例数据集：

  past = sample(1:40, 20, replace=F)
  current = sample(41:80, 20, replace=F)
  
  set.seed(100)
  
  numbers <- 
    data.frame(
      o3.cpt = c(past, current),
      PC1 = runif(length(c(past, current)), min = -3, max = 3),
      PC2 = runif(length(c(past, current)), min = 1, max = 10),
      PC3 = runif(length(c(past, current)), min = -4, max = 2),
      PC4 = runif(length(c(past, current)), min = -9, max = 8),
      PC5 = runif(length(c(past, current)), min =  4, max = 9)
      )
  
  pairs <- 
    expand.grid(
      past,
      current
    )

当前代码：

if(nrow(pairs) > 0){
    sum_diff <- 
      pairs %>% 
      ddply(c('Var1', 'Var2'), function(j){
        
        p <-  
          numbers %>% 
          filter(o3.cpt == j$Var1[1]) %>% 
          select(starts_with('PC')) %>% 
          t %>% 
          data.frame %>% 
          .$.
        
        c<-  
          numbers %>% 
          filter(o3.cpt == j$Var2[1]) %>% 
          select(starts_with('PC')) %>% 
          t %>% 
          data.frame %>% 
          .$.
        
        data.frame(diff = sum(abs(p - c)))
      
    })
  }

Answer 1

这是使用重塑数据和向量化函数进行的快速重写，对于此示例数据，速度提高了大约 60 倍（从 8 秒到 0.13 秒），并产生相同的

diff

列。

当我将

past

/

current

长度从 20 增加到 100 时，性能优势增加到 2000 倍，使用下面的

dtplyr

变化：0.09 秒而不是 185 秒。

library(tidyverse)
pairs_rows <- pairs %>% mutate(row = row_number())
pairs_rows %>%
  left_join(
    pairs_rows %>% 
      pivot_longer(-row) %>%
      left_join(numbers, by = c("value" = "o3.cpt")) %>%
      group_by(row) %>%
      summarize(across(starts_with("PC"), ~abs(diff(.)))) %>%
      mutate(diff = rowSums(across(starts_with("PC"))))
    ) %>%
  arrange(Var1, Var2) # to replicate example output order

将其包装在

dtplyr

中的一些额外收益，我认为对于更大的数据，这会大得多：

library(dtplyr)
pairs_rows <- pairs %>% mutate(row = row_number()) %>% lazy_dt()
# <same last block>

Answer 2

这是由

past

和

current

观察值组成的矩阵之间的成对曼哈顿距离。

Rfast::dista

是一个非常快速的实现。以下运行不到 0.0002 秒。

pairs$diff <- c(
  Rfast::dista(
    as.matrix(numbers[1:length(past), -1]),
    as.matrix(numbers[(length(past) + 1):nrow(numbers), -1]),
    type = "manhattan"
  )
)

时间：

library(Rfast)
microbenchmark::microbenchmark(
  dista = {pairs$diff <- c(
    dista(
      as.matrix(numbers[1:length(past), -1]),
      as.matrix(numbers[(length(past) + 1):nrow(numbers), -1]),
      type = "manhattan"
    )
  )}
)
#> Unit: microseconds
#>   expr     min      lq    mean  median       uq     max neval
#>  dista 174.101 177.252 186.202 179.201 181.7515 540.901   100

R：提高R

问题描述投票：0回答：2

目的：

问题：

示例数据集：

当前代码：

2个回答

最新问题

R：提高R

问题描述 投票：0回答：2

目的：

问题：

示例数据集：

当前代码：

2个回答

最新问题

问题描述投票：0回答：2