Confusion between theoretical and actual performance of in-place vector modification


For a high-performance math library, I have been benchmarking various methods for performing in-place operations on Rust vecs (preferably by reference). The methods are:

  • using an explicit for loop
  • using an iterator with map(), collecting into a new vec, and overwriting the existing one
  • using an iterator with for_each()

Here is the benchmark code:

use std::time::{Instant, Duration};

const N_ITEMS: usize = 100000;
const N_BENCH_TIMES: i32 = 100;

fn bench_simple() {
    let mut a: Vec<i32> = vec![5; N_ITEMS];
    for i in a.iter_mut() {
        *i += 1;
    }
}

fn bench_iterator() {
    let mut a: Vec<i32> = vec![5; N_ITEMS];
    let a: Vec<i32> = a.iter_mut().map(|x| *x + 1).collect();
}

fn bench_foreach() {
    let mut a: Vec<i32> = vec![5; N_ITEMS];
    a.iter_mut().for_each(|x| *x += 1);
}

fn benchmark(name: &str, f: &dyn Fn()) {
    println!("Benchmark: {}", name);
    let mut time = Duration::new(0, 0);
    for _ in 0..N_BENCH_TIMES {
        let start_time = Instant::now();
        f();
        time += start_time.elapsed();
    }
    println!("Total: {:?} over {} instances", time, N_BENCH_TIMES);
    println!("Average: {:?}", time.div_f64(N_BENCH_TIMES as f64));
}

fn main() {
    benchmark("Simple", &bench_simple);
    benchmark("Iterator", &bench_iterator);
    benchmark("ForEach", &bench_foreach);
}

These are the results:

Benchmark: Simple
Total: 402.619146ms over 100 instances
Average: 4.026191ms
Benchmark: Iterator
Total: 587.769728ms over 100 instances
Average: 5.877697ms
Benchmark: ForEach
Total: 409.527297ms over 100 instances
Average: 4.095273ms

The Rust documentation says:

"In some cases, for_each may also be faster than a loop, because it will use internal iteration on adapters like Chain."
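For reference, the adapter case the docs describe looks something like this (a minimal sketch; any speedup would come from for_each iterating each chained half internally instead of checking which half it is in on every next() call):

```rust
fn main() {
    let a = vec![1i32, 2, 3];
    let b = vec![4i32, 5, 6];
    let mut sum = 0;
    // Chain is the adapter the docs mention: an external for loop must
    // re-check which half of the chain it is in on every iteration,
    // while for_each can drain each half in one internal pass.
    a.iter().chain(b.iter()).for_each(|x| sum += x);
    println!("{}", sum); // prints 21
}
```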

So I expected for_each() to be faster than the for loop, and both to be much faster than map(), which has to create a new vector and use it to overwrite the existing one. But the benchmark clearly shows otherwise? Is my understanding of how these iterators work incorrect?

rust vector benchmarking
1 Answer

I am quite confident that you are not building with cargo run --release, which makes your entire benchmark completely meaningless.
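For completeness, an optimized build is produced like this (the binary name is a placeholder that depends on your package):

```shell
# Build and run with optimizations enabled; debug builds skip most of
# the optimizer, so their timings say little about real performance.
cargo run --release

# Equivalent two-step form:
cargo build --release
./target/release/<your-binary-name>
```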

Here is the benchmark output on a fairly old laptop of mine, without --release:

Benchmark: Simple
Total: 545.9198ms over 100 instances
Average: 5.459198ms
Benchmark: Iterator
Total: 461.5787ms over 100 instances
Average: 4.615787ms
Benchmark: ForEach
Total: 334.7625ms over 100 instances
Average: 3.347625ms

And here it is with --release:

Benchmark: Simple
Total: 10.6028ms over 100 instances
Average: 106.028µs
Benchmark: Iterator
Total: 176.4µs over 100 instances
Average: 1.764µs
Benchmark: ForEach
Total: 5.8382ms over 100 instances
Average: 58.382µs

So what is going on here?

The generated assembly reveals the situation:

example::bench_iterator:
        mov     rax, qword ptr [rip + __rust_no_alloc_shim_is_unstable@GOTPCREL]
        movzx   ecx, byte ptr [rax]
        mov     ecx, 99968
        test    rcx, rcx
        je      .LBB1_3
.LBB1_2:
        add     rcx, -32
        test    rcx, rcx
        jne     .LBB1_2
.LBB1_3:
        movzx   eax, byte ptr [rax]
        ret

The compiler optimized the Iterator benchmark away almost entirely.

This is why black_box exists: it tells the compiler not to optimize something away.

Here is your code with appropriate black_boxes inserted in the right places:

use std::hint::black_box;
use std::time::{Duration, Instant};

const N_ITEMS: usize = 100000;
const N_BENCH_TIMES: i32 = 100;

pub fn bench_simple() {
    let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
    for i in a.iter_mut() {
        *i += 1;
    }
    black_box(a);
}

pub fn bench_iterator() {
    let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
    let a: Vec<i32> = a.iter_mut().map(|x| *x + 1).collect();
    black_box(a);
}

pub fn bench_foreach() {
    let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
    a.iter_mut().for_each(|x| *x += 1);
    black_box(a);
}

fn benchmark(name: &str, f: &dyn Fn()) {
    println!("Benchmark: {}", name);
    let mut time = Duration::new(0, 0);
    for _ in 0..N_BENCH_TIMES {
        let start_time = Instant::now();
        f();
        time += start_time.elapsed();
    }
    println!("Total: {:?} over {} instances", time, N_BENCH_TIMES);
    println!("Average: {:?}", time.div_f64(N_BENCH_TIMES as f64));
}

fn main() {
    benchmark("Simple", &bench_simple);
    benchmark("Iterator", &bench_iterator);
    benchmark("ForEach", &bench_foreach);
}

So what is the output now? Here are a couple of runs on my laptop:

Benchmark: Simple
Total: 13.9338ms over 100 instances
Average: 139.338µs
Benchmark: Iterator
Total: 4.9353ms over 100 instances
Average: 49.353µs
Benchmark: ForEach
Total: 7.554ms over 100 instances
Average: 75.54µs
Benchmark: Simple
Total: 6.1412ms over 100 instances
Average: 61.412µs
Benchmark: Iterator
Total: 12.5715ms over 100 instances
Average: 125.715µs
Benchmark: ForEach
Total: 14.4137ms over 100 instances
Average: 144.137µs

As you can see, the variance is huge.

To mitigate this, we need a proper benchmarking crate such as criterion, which performs warm-up and dynamically chooses an appropriate repetition count.

Here is your code rewritten with criterion:

Note that this does not go into main.rs, but into a source file in the benches/ directory. For more information, read the criterion user guide.
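A minimal project setup sketch (the file name vec_ops and the version number are assumptions; check the criterion user guide for the current recommended setup):

```toml
# Cargo.toml

[dev-dependencies]
criterion = "0.5"   # version is an assumption; check crates.io for the latest

[[bench]]
name = "vec_ops"    # expects the benchmark code in benches/vec_ops.rs
harness = false     # disable libtest's harness so criterion's own main runs
```

The benchmarks are then run with cargo bench.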

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

const N_ITEMS: usize = 100000;

pub fn run_simple(mut a: Vec<i32>) -> Vec<i32> {
    for i in a.iter_mut() {
        *i += 1;
    }
    a
}

pub fn run_iterator(mut a: Vec<i32>) -> Vec<i32> {
    a.iter_mut().map(|x| *x + 1).collect()
}

pub fn run_foreach(mut a: Vec<i32>) -> Vec<i32> {
    a.iter_mut().for_each(|x| *x += 1);
    a
}

pub fn bench_simple(c: &mut Criterion) {
    let mut a = Some(vec![5; N_ITEMS]);
    c.bench_function("Simple", |b| {
        b.iter(|| {
            let data = black_box(a.take().unwrap());
            let data = run_simple(data);
            a = Some(black_box(data));
        })
    });
}

pub fn bench_iterator(c: &mut Criterion) {
    let mut a = Some(vec![5; N_ITEMS]);
    c.bench_function("Iterator", |b| {
        b.iter(|| {
            let data = black_box(a.take().unwrap());
            let data = run_iterator(data);
            a = Some(black_box(data));
        })
    });
}

pub fn bench_foreach(c: &mut Criterion) {
    let mut a = Some(vec![5; N_ITEMS]);
    c.bench_function("Foreach", |b| {
        b.iter(|| {
            let data = black_box(a.take().unwrap());
            let data = run_foreach(data);
            a = Some(black_box(data));
        })
    });
}

criterion_group!(benches, bench_simple, bench_iterator, bench_foreach);
criterion_main!(benches);
Simple                  time:   [23.375 µs 23.508 µs 23.657 µs]
Iterator                time:   [48.229 µs 48.762 µs 49.368 µs]
Foreach                 time:   [23.271 µs 23.384 µs 23.528 µs]

Note that Simple and Foreach are nearly identical, while Iterator appears to be slower. Why is that?

The reason is that your Iterator implementation uses iter_mut, which keeps the original vector alive. That is not actually necessary, so we can use into_iter instead, like this:

pub fn run_iterator(a: Vec<i32>) -> Vec<i32> {
    a.into_iter().map(|x| x + 1).collect()
}

Re-running the benchmark with the new version yields:

Simple                  time:   [23.561 µs 24.038 µs 24.657 µs]
                        change: [-0.7221% +3.2799% +7.4864%] (p = 0.11 > 0.05)
                        No change in performance detected.

Iterator                time:   [23.180 µs 23.244 µs 23.318 µs]
                        change: [-52.691% -49.895% -46.210%] (p = 0.00 < 0.05)
                        Performance has improved.

Foreach                 time:   [23.217 µs 23.351 µs 23.515 µs]
                        change: [-3.0721% -0.0025% +3.1169%] (p = 1.00 > 0.05)
                        No change in performance detected.

Now you can see that all three are the same. This is expected, because Rust iterators, when used correctly, are zero-cost abstractions.
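As a sanity check separate from the timing question, the three variants are also semantically interchangeable; a minimal sketch:

```rust
fn main() {
    let v = vec![5i32; 8];

    // Explicit for loop over iter_mut()
    let mut simple = v.clone();
    for i in simple.iter_mut() {
        *i += 1;
    }

    // into_iter() + map() + collect() (the fixed Iterator version)
    let iterator: Vec<i32> = v.clone().into_iter().map(|x| x + 1).collect();

    // iter_mut() + for_each()
    let mut foreach = v.clone();
    foreach.iter_mut().for_each(|x| *x += 1);

    assert_eq!(simple, iterator);
    assert_eq!(simple, foreach);
    println!("all three: {:?}", simple); // eight 6s
}
```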
