How can data.frames be faster than matrices?


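I am running some computationally demanding operations in R, so I am looking for the most efficient way to do them. My question is:

  1. Why does creating a data.frame appear to be faster than creating a matrix? My understanding is that the general consensus is that matrices are faster than data.frames when all the data is of the same type. That does not seem to be the case here.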
library(dplyr)
library(igraph)
library(bench)

set.seed(123)

edgelist <- data.frame(
  node1 = sample(1:2000, 11000, replace = TRUE),
  node2 = sample(1:2000, 11000, replace = TRUE),
  weight = runif(11000, min = 0, max = 5)
)

g <- graph_from_data_frame(edgelist, directed = FALSE)

# Data.frame: compute all shortest paths, then store the upper triangle as a data.frame
dat <- function() {

  dm <- distances(g, weights = E(g)$weight)

  UTIndex <- which(upper.tri(dm), arr.ind = TRUE)

  df1 <- data.frame(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}

# Matrix: same computation, but the result is assembled with cbind()
mat <- function() {

  dm <- distances(g, weights = E(g)$weight)

  UTIndex <- which(upper.tri(dm), arr.ind = TRUE)

  df1 <- cbind(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}
####

results <- bench::mark(
  dat = dat(),
  mat = mat(),
  check = FALSE
)

t1 <- system.time({
  df1 <- dat()
})

rm(df1)

t2 <- system.time({
  df1 <- mat()
})

rm(df1)

Here is the output of t1, t2, and results:

> results
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 dat           2.89s    2.89s     0.346     269MB     1.39     1     4      2.89s
2 mat           2.83s    2.83s     0.353     315MB     1.41     1     4      2.83s
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> t1
   user  system elapsed 
   2.78    0.04    2.81 
> t2
   user  system elapsed 
   3.12    0.08    3.21 
r dataframe performance microbenchmark edge-list
1 Answer

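Your benchmark does not sufficiently isolate the thing you are investigating, namely the difference between the data.frame() and cbind() calls. In both tests, most of the computation is everything other than data.frame() and cbind(). You also ran very few iterations (n_itr is 1 for each expression in your output), which means you may be mistaking random variation for a meaningful difference.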
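As a rough illustration of that point, here is a minimal base R sketch that times only the two constructors and repeats each measurement. The helper `median_elapsed()` and the small test vectors `x` and `y` are hypothetical names introduced for this example; they are not part of the original code, and the actual timings will vary by machine.

```r
# Hypothetical helper: run f() n times and return the median elapsed time in seconds.
median_elapsed <- function(f, n = 20L) {
  times <- vapply(
    seq_len(n),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  stats::median(times)
}

set.seed(1)
# Inputs are prepared once, outside the timed code.
x <- runif(1e5)
y <- runif(1e5)

# Only the constructor call is inside the timed function.
t_df  <- median_elapsed(function() data.frame(x = x, y = y))
t_mat <- median_elapsed(function() cbind(x = x, y = y))

c(data.frame = t_df, cbind = t_mat)
```

system.time() has coarse resolution, so for calls this fast a dedicated tool such as bench::mark() is the better choice.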

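Below I factor out the shared work that is irrelevant to the benchmark and keep only the relevant part. I then go a step further into a discussion of R's memory management.

Let's start with the rewritten code: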

library(dplyr)
library(igraph)
library(bench)

set.seed(123)

edgelist <- data.frame(
  node1 = sample(1:2000, 11000, replace = TRUE),
  node2 = sample(1:2000, 11000, replace = TRUE),
  weight = runif(11000, min = 0, max = 5)
)

g <- graph_from_data_frame(edgelist, directed = FALSE)

# Shared work, done once and kept outside the benchmark
dm <- distances(g, weights = E(g)$weight)

UTIndex <- which(upper.tri(dm), arr.ind = TRUE)
verticeA <- as.numeric(rownames(dm)[UTIndex[, 1]])
verticeB <- as.numeric(colnames(dm)[UTIndex[, 2]])
path_length <- as.numeric(dm[UTIndex])

#Data.frame
datpure <- function(verticeA,verticeB,path_length) {
  data.frame(
    verticeA=verticeA,
    verticeB=verticeB,
    path_length =path_length
  )
}
datpure(verticeA,verticeB,path_length)

#Matrix
matpure <- function(verticeA,verticeB,path_length) {
   cbind(
    verticeA=verticeA,
    verticeB=verticeB,
    path_length =path_length
  )
}
matpure(verticeA,verticeB,path_length)
####

adjust_to_1s <- function(x){
  x[,] <- 1
  x
}

# Benchmark construction alone (*_pure) and construction followed by a full overwrite (*_1)
(results <- bench::mark(
  dat_pure = datpure(verticeA, verticeB, path_length),
  mat_pure = matpure(verticeA, verticeB, path_length),
  dat_1 = adjust_to_1s(datpure(verticeA, verticeB, path_length)),
  mat_1 = adjust_to_1s(matpure(verticeA, verticeB, path_length)),
  check = FALSE,
  iterations = 100L,
  time_unit = 'ms'
))
# A tibble: 4 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time       gc      
  <bch:expr>   <dbl>   <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list> <list>              <list>     <list>  
1 dat_pure     0.169   0.178   5350.          0B   54.0      99     1       18.5 <NULL> <Rprofmem [11 × 3]> <bench_tm> <tibble>
2 mat_pure    10.7    12.4       72.9     45.8MB    0.736    99     1     1358.  <NULL> <Rprofmem [2 × 3]>  <bench_tm> <tibble>
3 dat_1      182.    254.         4.00   292.8MB    0.347    92     8    23023.  <NULL> <Rprofmem>          <bench_tm> <tibble>
4 mat_1       30.3    37.9       23.8     53.4MB    0.241    99     1     4153.  <NULL> <Rprofmem [2 × 3]>  <bench_tm> <tibble>

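From this we can see that the data.frame() call on its own is far faster than cbind(): the median is about 0.18 ms versus 12.4 ms, well over an order of magnitude. I think the reason is memory layout. For data.frame(), R only needs to build a list with the data.frame class and associate the existing vectors with names. R copies only on write and otherwise works by reference, so constructing a data.frame is essentially a trivial change to metadata. cbind(), by contrast, creates a matrix, which is essentially one single vector, so the data has to be copied and laid out contiguously. The mem_alloc column is consistent with this: 0B for dat_pure versus 45.8MB for mat_pure.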
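The structural difference can be inspected directly in base R. This is a small sketch with made-up vectors `a`, `b`, and `p` (assumed names, not from the benchmark). It only shows how the two objects are organized; it does not measure copying.

```r
# Three small numeric vectors standing in for verticeA, verticeB, path_length.
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
p <- c(0.1, 0.2, 0.3, 0.4, 0.5)

df <- data.frame(a = a, b = b, p = p)
m  <- cbind(a = a, b = b, p = p)

# A data.frame is a list with one element per column.
is_list_df <- is.list(df)
n_cols_df  <- length(unclass(df))

# A matrix is one atomic vector with a dim attribute.
m_type <- typeof(m)
m_len  <- length(m)
m_dim  <- dim(m)

# The matrix stores its columns back to back (column-major order),
# so its data is the three input vectors concatenated into a new vector.
column_major <- identical(as.vector(m), c(a, b, p))
```

Because the matrix needs all values in one contiguous vector, cbind() has to allocate that vector and fill it, whereas the list behind a data.frame can simply point at the existing column vectors.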

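I also added a variant in which, after the initial pure data.frame() and cbind() calls, the object is fundamentally changed (every entry is set to 1). Now R has to write to memory in both cases, and both become slower. The data.frame fares worse: a median of 254 ms versus 37.9 ms for the matrix.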
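The overwrite step can be tried on a toy example. The sketch below reuses adjust_to_1s() on small, made-up inputs and checks two things: the original object is left untouched (copy on modify), and the assignment takes a different route for each type. That the R-level `[<-.data.frame` method contributes to the slowdown is my assumption here; the benchmark above does not isolate it.

```r
adjust_to_1s <- function(x) {
  x[, ] <- 1
  x
}

df <- data.frame(a = c(2, 3), b = c(4, 5))
m  <- cbind(a = c(2, 3), b = c(4, 5))

df1 <- adjust_to_1s(df)
m1  <- adjust_to_1s(m)

# Both results keep their type and now hold only 1s.
all_ones_df <- all(unlist(df1) == 1)
all_ones_m  <- all(m1 == 1)

# The originals are unchanged: the write happened on a copy.
df_unchanged <- identical(df$a, c(2, 3))
m_unchanged  <- identical(unname(m[, "a"]), c(2, 3))

# For a matrix, `[<-` is handled by a primitive; for a data.frame,
# it dispatches to the method `[<-.data.frame`, which is written in R.
generic_is_primitive <- is.primitive(`[<-`)
method_is_primitive  <- is.primitive(`[<-.data.frame`)
```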
