Drawing samples from a random normal distribution in parallel - not faster?

Question

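I am using R to build a simulation that draws samples from a random normal distribution and, not surprisingly, it is rather slow. So I looked for ways to speed it up with Rcpp and came across the RcppZiggurat package for faster random normal draws and the RcppParallel package for multi-threaded computation. I thought: why not use the faster algorithm and draw the samples in parallel at the same time?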

So I started prototyping and ended up with three approaches to compare:

An example using both RcppParallel and RcppZiggurat
An example using only RcppZiggurat
An example using plain old rnorm

Below are my implementations using RcppParallel + RcppZiggurat (the parallelDraws function) and RcppZiggurat alone (the serialDraws function):

#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>

static Ziggurat::Ziggurat::Ziggurat zigg;

using namespace RcppParallel;

struct Norm : public Worker
{   
  int input;

  // saved draws
  RVector<double> draws;

  // constructors
  Norm(const int input, Rcpp::NumericVector draws)
    : input(input), draws(draws) {}

  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
      draws[i] = zigg.norm();
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {

  // allocate the output vector
  Rcpp::NumericVector draws(x);

  // declare the Norm instance 
  Norm norm(x, draws);

  // call parallelFor to start the work
  parallelFor(0, x, norm);

  // return the draws
  return draws;
};

// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {

  // allocate the output vector
  Rcpp::NumericVector draws(x);

  for (int i = 0; i < x; i++) {
    draws[i] = zigg.norm();
  }

  // return the draws
  return draws;
};

When I benchmarked them, I found some surprising results:

library(microbenchmark)
microbenchmark(parallelDraws(1e5), serialDraws(1e5), rnorm(1e5))

Unit: microseconds
                 expr      min       lq     mean    median       uq        max neval
 parallelDraws(1e+05) 3113.752 3539.686 3687.794 3599.1540 3943.282   5058.376   100
   serialDraws(1e+05)  695.501  734.593 2536.940  757.2325  806.135 175712.496   100
         rnorm(1e+05) 6072.043 6264.030 6655.835 6424.0195 6661.739  18578.669   100

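Using RcppZiggurat alone is about 8 times faster than rnorm, but using RcppParallel and RcppZiggurat together is slower than using RcppZiggurat alone! I tried playing with the grain size of the RcppParallel parallelFor function, but it did not bring any noticeable improvement.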

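My question is: what could be the reason that adding parallelism actually makes things worse? I know that the "overhead" in parallel computation can outweigh its benefits, depending on various factors. Is that what is happening here? Or am I completely misunderstanding how to use the RcppParallel package effectively?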

Answer 1

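As mentioned in the comments, overhead can be a problem, especially when the overall runtime is short. It is also better not to initialize the output vector with zeros and to use thread-local RNGs. Example implementation: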

#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>


using namespace RcppParallel;

struct Norm : public Worker
{   
  // saved draws
  RVector<double> draws;
  
  // constructors
  Norm(Rcpp::NumericVector draws)
    : draws(draws) {}
  
  void operator()(std::size_t begin, std::size_t end) {
    Ziggurat::Ziggurat::Ziggurat zigg(end);
    for (std::size_t i = begin; i < end; i++) {
      draws[i] = zigg.norm();
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
  // allocate the output vector
  Rcpp::NumericVector draws(Rcpp::no_init(x));
  Norm norm(draws);
  parallelFor(0, x, norm);
  return draws;
}

// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
  // allocate the output vector
  Rcpp::NumericVector draws(Rcpp::no_init(x));
  Ziggurat::Ziggurat::Ziggurat zigg(42);
  for (int i = 0; i < x; i++) {
    draws[i] = zigg.norm();
  }
  return draws;
}

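Note that I am using a "poor man's parallel RNG", i.e. distinct seeds for the different threads and hoping for the best. I use end as the seed, since begin might be zero and I am not sure whether the RNG in RcppZiggurat likes that. Since creating a Ziggurat object takes some time (and memory), I also use a local object in the serial computation to keep the comparison fair.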
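The same idea can be sketched without any R dependencies, which may make the pattern easier to see. This is only an illustration under stated assumptions: std::thread stands in for RcppParallel, std::mt19937_64 with std::normal_distribution stands in for the Ziggurat generator, and the names fill_chunk and parallel_draws are hypothetical. Each chunk owns its generator and seeds it with its end index, so no generator state is shared between threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Fill draws[begin, end) from a generator that is local to this chunk.
// Seeding with `end` mirrors the choice described above: it is never zero
// for a non-empty chunk. (Stand-in generator, not RcppZiggurat.)
void fill_chunk(std::vector<double>& draws, std::size_t begin, std::size_t end) {
  std::mt19937_64 rng(static_cast<std::uint64_t>(end));  // chunk-local engine
  std::normal_distribution<double> norm(0.0, 1.0);       // chunk-local distribution
  for (std::size_t i = begin; i < end; ++i) {
    draws[i] = norm(rng);
  }
}

// Split [0, n) into n_chunks contiguous ranges and fill each on its own thread.
std::vector<double> parallel_draws(std::size_t n, std::size_t n_chunks) {
  std::vector<double> draws(n);
  if (n == 0) return draws;
  if (n_chunks == 0) n_chunks = 1;
  if (n_chunks > n) n_chunks = n;

  std::vector<std::thread> threads;
  threads.reserve(n_chunks);
  for (std::size_t c = 0; c < n_chunks; ++c) {
    const std::size_t begin = n * c / n_chunks;
    const std::size_t end = n * (c + 1) / n_chunks;
    // Threads write to disjoint index ranges, so no locking is needed.
    threads.emplace_back(fill_chunk, std::ref(draws), begin, end);
  }
  for (auto& t : threads) t.join();
  return draws;
}
```

Because the chunk boundaries are fixed here, the output is reproducible for a given n and n_chunks. Distinct seeds still give no guarantee that the streams do not overlap, which is exactly the "hoping for the best" part.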

For 10^5 random draws, there is still no gain from using parallel computation:

> bench::mark(parallelDraws(1e5), serialDraws(1e5), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 parallelDraws(1e+05)   1.08ms   1.78ms      558.     784KB
2 serialDraws(1e+05)   624.16µs  758.6µs     1315.     784KB

But for 10^8 draws, I get a nice speed-up on my dual-core laptop:

> bench::mark(parallelDraws(1e8), serialDraws(1e8), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 parallelDraws(1e+08)    326ms    343ms      2.91     763MB
2 serialDraws(1e+08)      757ms    770ms      1.30     763MB

So whether parallel computation makes sense depends heavily on the number of random draws you need.
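One way to act on this is to go parallel only when every thread gets enough work to amortize its start-up cost. The helper below is a minimal sketch of that decision. The name choose_chunks and the parameter min_per_chunk are hypothetical, and a sensible value for min_per_chunk is an assumption that has to be measured on the target machine.

```cpp
#include <algorithm>
#include <cstddef>

// Number of chunks (threads) to use for n draws.
// min_per_chunk is a hypothetical tuning knob: the smallest amount of work
// considered worth handing to a separate thread.
std::size_t choose_chunks(std::size_t n, std::size_t min_per_chunk,
                          std::size_t max_threads) {
  if (min_per_chunk == 0) min_per_chunk = 1;
  if (max_threads == 0) max_threads = 1;
  // Chunks that would each hold at least min_per_chunk draws.
  const std::size_t by_work = n / min_per_chunk;
  // Never fewer than one chunk, never more than the available threads.
  return std::max<std::size_t>(1, std::min(by_work, max_threads));
}
```

With an assumed min_per_chunk of 10^6 and two threads, 10^5 draws stay serial (one chunk) while 10^8 draws use both threads, which matches the direction of the benchmarks above.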

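BTW, my dqrng package was mentioned in the comments. This package also uses the Ziggurat method for normal (and exponential) draws, combined with very fast 64-bit RNGs, which gives it a serial speed comparable to RcppZiggurat for normal draws. In addition, the RNGs it uses are meant for parallel computation, i.e. there is no need to hope for non-overlapping random streams by using different seeds.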
