为了满足高性能数学库的需求,我一直在对各种方法进行基准测试,以在 Rust vecs 上进行就地操作(最好通过引用)。这些方法是:
map()
,收集到新的 vec,并覆盖现有的 for_each()
这是基准测试代码:
use std::time::{Instant, Duration};
const N_ITEMS: usize = 100000;
const N_BENCH_TIMES: i32 = 100;
fn bench_simple() {
let mut a: Vec<i32> = vec![5; N_ITEMS];
for i in a.iter_mut() {
*i += 1;
}
}
fn bench_iterator() {
let mut a: Vec<i32> = vec![5; N_ITEMS];
let a: Vec<i32> = a.iter_mut().map(|x| *x + 1).collect();
}
fn bench_foreach() {
let mut a: Vec<i32> = vec![5; N_ITEMS];
a.iter_mut().for_each(|x| *x += 1);
}
fn benchmark(name: &str, f: &dyn Fn()) {
println!("Benchmark: {}", name);
let mut time = Duration::new(0, 0);
for _ in 0..N_BENCH_TIMES {
let start_time = Instant::now();
f();
time += start_time.elapsed();
}
println!("Total: {:?} over {} instances", time, N_BENCH_TIMES);
println!("Average: {:?}", time.div_f64(N_BENCH_TIMES as f64));
}
fn main() {
benchmark("Simple", &bench_simple);
benchmark("Iterator", &bench_iterator);
benchmark("ForEach", &bench_foreach);
}
这些是结果:
Benchmark: Simple
Total: 402.619146ms over 100 instances
Average: 4.026191ms
Benchmark: Iterator
Total: 587.769728ms over 100 instances
Average: 5.877697ms
Benchmark: ForEach
Total: 409.527297ms over 100 instances
Average: 4.095273ms
Rust 文档说:
“在某些情况下,
也可能比循环更快,因为它将在像for_each
这样的适配器上使用内部迭代”Chain
所以我预计
for_each()
比 for 循环快,并且都比 map()
快得多,后者需要创建一个新向量并使用它来覆盖现有向量。但在基准测试中显然并非如此?我对各个迭代器如何工作的理解是否不正确?
我非常有信心您没有使用
cargo run --release
进行构建,从而使您的整个基准测试完全毫无意义。
这是我的一台相当旧的笔记本电脑上的基准测试输出,没有
--release
:
Benchmark: Simple
Total: 545.9198ms over 100 instances
Average: 5.459198ms
Benchmark: Iterator
Total: 461.5787ms over 100 instances
Average: 4.615787ms
Benchmark: ForEach
Total: 334.7625ms over 100 instances
Average: 3.347625ms
这是与
--release
:
Benchmark: Simple
Total: 10.6028ms over 100 instances
Average: 106.028µs
Benchmark: Iterator
Total: 176.4µs over 100 instances
Average: 1.764µs
Benchmark: ForEach
Total: 5.8382ms over 100 instances
Average: 58.382µs
那么这是怎么回事?
最终的组装揭示了一些情况:
example::bench_iterator:
mov rax, qword ptr [rip + __rust_no_alloc_shim_is_unstable@GOTPCREL]
movzx ecx, byte ptr [rax]
mov ecx, 99968
test rcx, rcx
je .LBB1_3
.LBB1_2:
add rcx, -32
test rcx, rcx
jne .LBB1_2
.LBB1_3:
movzx eax, byte ptr [rax]
ret
编译器几乎完全优化了
Iterator
基准测试。
black_box
存在的原因。它告诉编译器不要优化掉某些东西。
这是您的代码,在适当的位置插入了适当的
black_box
es:
use std::hint::black_box;
use std::time::{Duration, Instant};
const N_ITEMS: usize = 100000;
const N_BENCH_TIMES: i32 = 100;
pub fn bench_simple() {
let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
for i in a.iter_mut() {
*i += 1;
}
black_box(a);
}
pub fn bench_iterator() {
let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
let a: Vec<i32> = a.iter_mut().map(|x| *x + 1).collect();
black_box(a);
}
pub fn bench_foreach() {
let mut a: Vec<i32> = black_box(vec![5; N_ITEMS]);
a.iter_mut().for_each(|x| *x += 1);
black_box(a);
}
fn benchmark(name: &str, f: &dyn Fn()) {
println!("Benchmark: {}", name);
let mut time = Duration::new(0, 0);
for _ in 0..N_BENCH_TIMES {
let start_time = Instant::now();
f();
time += start_time.elapsed();
}
println!("Total: {:?} over {} instances", time, N_BENCH_TIMES);
println!("Average: {:?}", time.div_f64(N_BENCH_TIMES as f64));
}
fn main() {
benchmark("Simple", &bench_simple);
benchmark("Iterator", &bench_iterator);
benchmark("ForEach", &bench_foreach);
}
那么现在的输出是什么?以下是我在笔记本电脑上的几次运行:
Benchmark: Simple
Total: 13.9338ms over 100 instances
Average: 139.338µs
Benchmark: Iterator
Total: 4.9353ms over 100 instances
Average: 49.353µs
Benchmark: ForEach
Total: 7.554ms over 100 instances
Average: 75.54µs
Benchmark: Simple
Total: 6.1412ms over 100 instances
Average: 61.412µs
Benchmark: Iterator
Total: 12.5715ms over 100 instances
Average: 125.715µs
Benchmark: ForEach
Total: 14.4137ms over 100 instances
Average: 144.137µs
如您所见,变化是巨大。
criterion
,它可以进行预热和适当的动态重复计数。
这是您用
criterion
重写的代码:
请注意,这不会进入
main.rs
,而是进入 benches/
目录中的源文件。有关更多信息,请阅读标准用户指南)。
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
const N_ITEMS: usize = 100000;
pub fn run_simple(mut a: Vec<i32>) -> Vec<i32> {
for i in a.iter_mut() {
*i += 1;
}
a
}
pub fn run_iterator(mut a: Vec<i32>) -> Vec<i32> {
a.iter_mut().map(|x| *x + 1).collect()
}
pub fn run_foreach(mut a: Vec<i32>) -> Vec<i32> {
a.iter_mut().for_each(|x| *x += 1);
a
}
pub fn bench_simple(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Simple", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_simple(data);
a = Some(black_box(data));
})
});
}
pub fn bench_iterator(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Iterator", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_iterator(data);
a = Some(black_box(data));
})
});
}
pub fn bench_foreach(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Foreach", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_foreach(data);
a = Some(black_box(data));
})
});
}
criterion_group!(benches, bench_simple, bench_iterator, bench_foreach);
criterion_main!(benches);
Simple time: [23.375 µs 23.508 µs 23.657 µs]
Iterator time: [48.229 µs 48.762 µs 49.368 µs]
Foreach time: [23.271 µs 23.384 µs 23.528 µs]
请注意,
Simple
和Foreach
几乎相同,而Iterator
似乎更慢。这是为什么?
原因是您的
Iterator
实现使用了iter_mut
,这使原始数组保持活动状态。然而,这并不是必需的,因此我们可以使用 into_iter
来代替,如下所示:
pub fn run_iterator(a: Vec<i32>) -> Vec<i32> {
a.into_iter().map(|x| x + 1).collect()
}
使用新版本重新运行基准测试会产生:
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
const N_ITEMS: usize = 100000;
pub fn run_simple(mut a: Vec<i32>) -> Vec<i32> {
for i in a.iter_mut() {
*i += 1;
}
a
}
pub fn run_iterator(a: Vec<i32>) -> Vec<i32> {
a.into_iter().map(|x| x + 1).collect()
}
pub fn run_foreach(mut a: Vec<i32>) -> Vec<i32> {
a.iter_mut().for_each(|x| *x += 1);
a
}
pub fn bench_simple(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Simple", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_simple(data);
a = Some(black_box(data));
})
});
}
pub fn bench_iterator(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Iterator", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_iterator(data);
a = Some(black_box(data));
})
});
}
pub fn bench_foreach(c: &mut Criterion) {
let mut a = Some(vec![5; N_ITEMS]);
c.bench_function("Foreach", |b| {
b.iter(|| {
let data = black_box(a.take().unwrap());
let data = run_foreach(data);
a = Some(black_box(data));
})
});
}
criterion_group!(benches, bench_simple, bench_iterator, bench_foreach);
criterion_main!(benches);
Simple time: [23.561 µs 24.038 µs 24.657 µs]
change: [-0.7221% +3.2799% +7.4864%] (p = 0.11 > 0.05)
No change in performance detected.
Iterator time: [23.180 µs 23.244 µs 23.318 µs]
change: [-52.691% -49.895% -46.210%] (p = 0.00 < 0.05)
Performance has improved.
Foreach time: [23.217 µs 23.351 µs 23.515 µs]
change: [-3.0721% -0.0025% +3.1169%] (p = 1.00 > 0.05)
No change in performance detected.
现在您可以看到这三个都是相同的。这是预期的,因为 Rust 迭代器(如果使用得当)是零成本抽象。