对于足够小的问题，简单的流循环显示出比 DRAM B/W 更高的有效 B/W

Question

我正在做一些“冷”微基准测试，即一个函数被调用50次，它的数据在每次运行中新分配，每个工作线程将分配给它的数据的每个内存页中的第一个数据归零，从而排除测量页面错误的可能性。

这是最小的重现器（针对 Linux 系统）：

/// \file This file contains a micro-benchmark program for the SAXPY loop.

#include <string> // std::stoul
#include <utility> // std::abort
#include <iostream> // std::cerr
#include <cstdint> // std::uint64_t
#include <algorithm> // std::min
#include <limits> // std::numeric_limits

#include <omp.h>

#include <stdlib.h> // posix_memalign
#include <unistd.h> // sysconf

void saxpy(const float& a, const float* vec_x, float* vec_y, const std::size_t& n)
{
    #pragma omp parallel for schedule(static)
    for (auto ii = std::size_t{}; ii < n; ++ii)
    {
        vec_y[ii] += vec_x[ii] * a; // fp_count: 2, traffic: 2+1
    }
}

int main(int argc, char** argv)
{
    // extract the problem size
    if (argc < 2)
    {
        std::cerr << "Please provide the problem size as command line argument." << std::endl;
        return 1;
    }
    const auto n = static_cast<std::size_t>(std::stoul(argv[1]));
    if (n < 1)
    {
        std::cerr << "Zero valued problem size provided. The program will now be aborted." << std::endl;
        return 1;
    }
    if (n * sizeof(float) / (1024 * 1024 * 1024) > 40) // let's assume there's only 64 GiB of RAM
    {
        std::cerr << "Problem size is too large. The program will now be aborted." << std::endl;
        return 1;
    }
    
    // report
    std::cout << "Starting runs with problem size n=" << n << ".\nThread count: " << omp_get_max_threads() << "."
        << std::endl;
    
    // details
    const auto page_size = sysconf(_SC_PAGESIZE);
    const auto page_size_float = page_size / sizeof(float);
    
    // experiment loop
    const auto experiment_count = 50;
    const auto warm_up_count = 10;
    const auto run_count = experiment_count + warm_up_count;
    auto durations = std::vector(experiment_count, std::numeric_limits<std::uint64_t>::min());
    const auto a = 10.f;
    float* vec_x = nullptr;
    float* vec_y = nullptr;
    for (auto run_index = std::size_t{}; run_index < run_count; ++run_index)
    {
        // allocate
        const auto alloc_status0 = posix_memalign(reinterpret_cast<void**>(&vec_x), page_size, n * sizeof(float));
        const auto alloc_status1 = posix_memalign(reinterpret_cast<void**>(&vec_y), page_size, n * sizeof(float));
        if (alloc_status0 != 0 || alloc_status1 != 0 || vec_x == nullptr || vec_y == nullptr)
        {
            std::cerr << "Fatal error, failed to allocate memory." << std::endl;
            std::abort();
        }
        
        // "first touch"
        #pragma omp parallel for schedule(static)
        for (auto ii = std::size_t{}; ii < n; ii += page_size_float)
        {
            vec_x[ii] = 0.f;
            vec_y[ii] = 0.f;
        }
        
        // run experiment
        const auto t1 = omp_get_wtime();
        saxpy(a, vec_x, vec_y, n);
        const auto t2 = omp_get_wtime();
        const auto duration_in_us = static_cast<std::int64_t>((t2 - t1) * 1E+6);
        if (duration_in_us <= 0)
        {
            std::cerr << "Fatal error, no time elapsed in the test function." << std::endl;
            std::abort();
        }
        if (run_index + 1 > warm_up_count)
        {
            durations[run_index - warm_up_count] = static_cast<std::uint64_t>(duration_in_us);
        }
        
        // deallocate
        std::free(vec_x);
        std::free(vec_y);
        vec_x = nullptr;
        vec_y = nullptr;
    }
    
    // statistics
    auto min = std::numeric_limits<std::uint64_t>::max();
    auto max = std::uint64_t{};
    auto mean = std::uint64_t{};
    for (const auto& duration : durations)
    {
        min = std::min(min, duration);
        max = std::max(max, duration);
        mean += duration;
    }
    mean /= experiment_count;
    
    // report duration
    std::cout << "Mean duration:      " << mean << " us\n"
        << "Min. duration:      " << min << " us\n"
        << "Max. duration:      " << max << " us.\n";
    
    // compute effective B/W
    const auto traffic = 3 * n * sizeof(float);
    constexpr auto inv_gigi = 1.0 / static_cast<double>(1024 * 1024 * 1024);
    const auto traffic_in_gib = static_cast<double>(traffic) * inv_gigi;
    std::cout << "Traffic per run:    " << traffic << " B (" << traffic_in_gib << " GiB)\n" 
        << "Mean effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(mean) * 1E-6) << " GiB/s\n"
        << "Min. effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(max) * 1E-6) << " GiB/s\n"
        << "Max. effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(min) * 1E-6) << " GiB/s\n"
        << std::endl;
    
    return 0;
}

现在我有兴趣了解有效带宽如何表现。问题的大小。所以我做了一些运行，输出如下：

Starting run for n=1000000.
Starting runs with problem size n=1000000.
Thread count: 6.
Mean duration:      148 us
Min. duration:      117 us
Max. duration:      417 us.
Traffic per run:    12000000 B (0.0111759 GiB)
Mean effective B/W: 75.5126 GiB/s
Min. effective B/W: 26.8006 GiB/s
Max. effective B/W: 95.5203 GiB/s

----------------------

Starting run for n=10000000.
Starting runs with problem size n=10000000.
Thread count: 6.
Mean duration:      3311 us
Min. duration:      3262 us
Max. duration:      3382 us.
Traffic per run:    120000000 B (0.111759 GiB)
Mean effective B/W: 33.7538 GiB/s
Min. effective B/W: 33.0452 GiB/s
Max. effective B/W: 34.2608 GiB/s

----------------------

Starting run for n=100000000.
Starting runs with problem size n=100000000.
Thread count: 6.
Mean duration:      32481 us
Min. duration:      32137 us
Max. duration:      36431 us.
Traffic per run:    1200000000 B (1.11759 GiB)
Mean effective B/W: 34.4074 GiB/s
Min. effective B/W: 30.6768 GiB/s
Max. effective B/W: 34.7757 GiB/s

----------------------

作为参考，我使用了英特尔® MLC，它产生了以下输出：

Intel(R) Memory Latency Checker - v3.9a
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring idle latencies (in ns)...
        Numa node
Numa node        0  
       0      57.1  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  41302.4 
3:1 Reads-Writes :  37185.7 
2:1 Reads-Writes :  36779.7 
1:1 Reads-Writes :  36790.3 
Stream-triad like:  37241.2

问题： 对于

n=1000000

的问题大小，我们的有效带宽 (~75 GiB/s) 为何高于我系统的最大读写带宽 (~37 GiB/s)？如果您需要更多信息，请询问，我很乐意提供。预先感谢。

附加信息： 这是在 Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz 上运行，具有 6 核（无 SMT），缓存大小为 32 KiB、1.5 MiB 和 9 MiB，最高矢量标志为 AVX2。

Answer 1

页面错误处理程序将您的页面写入零，因此您的初始化循环实际上确实接触所有页面中的所有缓存行，从而使一些缓存命中成为可能。

12 MB (11.44 MiB) 仅比六核 Coffee Lake i5-8400 的 9 MiB L3 缓存大小稍大一点。您仍然可以在“幸运”运行时获得显着的 L3 命中，并且多线程 L3 带宽远高于 DRAM 带宽是正常的。

大多数带宽与工作集大小的图都符合一条曲线，并且不会采样非常接近缓存大小的很多点，但我认为即使您这样做了，它也不是非常陡峭的悬崖是正常的。特别是对于 L3 大小的限制，尽管曲线的部分原因是来自其他核心的系统争夺。但是，从略大于 L3 大小的数组中获得一些命中的可能性比使用 L2 缓存的可能性更大：不同的逐出算法，包括自 Ivy Bridge 以来英特尔 L3 中的自适应替换，也许可以决定将一些 L3 行保留在一组中，甚至当许多其他人被驱逐时。是的，有些跑步可能比其他跑步幸运得多，这是有道理的。

当我们每页只写一行时，我们如何获得缓存命中？

这些是从内核新分配的页面，因为 glibc 的分配器将使用

mmap(MAP_ANONYMOUS)

 进行大型映射。 （如果它们超出了该大小阈值，则将它们重新释放出来，而不仅仅是将它们放入用户空间的空闲列表中。）Linux

mmap

 是“懒惰”

1^{，没有物理页面支持虚拟分配直到你触摸它。}

如果第一次访问是只读的，它将被写入时复制映射到零的共享物理页。（因此，如果使用透明大页面，您可能会获得 L1d 命中 + TLB 未命中，或者 L3 命中。）请参阅

性能评估的惯用方式？以获取有关此内容的更多链接。

写入时（无论是第一次访问还是读取后），页错误处理程序将找到一个空闲的物理页，

将其归零，然后连接页表以指向它，并返回到用户空间进行存储指令重新运行。将页面清零会使其在各级缓存中变得很热。（或者如果使用 2M 大页面，则大多只是 L3）。

故障周围将对周围页面执行相同的操作，可能会将页面错误数量减少到顺序访问所触及的页面数量的 1/8。但它们仍然会由内核作为 init 循环的一部分写入。

没有办法从内核获取非清零页面，出于安全原因，如果您考虑将内核和其他用户内存页面的内容泄漏到某些用户空间进程的后果，这一点应该是显而易见的。（对于某些嵌入式系统用例，Linux 有一个

MAP_UNINITIALIZED

 标志，但在正常构建中禁用了对其的内核支持。）

如何避免缓存内存

您可以在 init 循环中使用

_mm_stream_ps

 NT 存储。如果数据以前很热，那么它会绕过缓存并逐出。 （在 Pentium-M 或更早的 IIRC 上不保证驱逐，但在所有 x86-64 上都可以。）

或调零后使用

_mm_clflushopt

。

使用

perf stat

或

time

，您可能会发现您的程序在内核中花费了一半以上的时间，

就像您最近的另一个问题，在我的系统上，它也花费了一半以上的时间在 sys

而不是

usr

。 （这适用于整个程序，而不仅仅是其定时区域，因此本身不是错误，而是一种查看其中某些情况的方法。）

脚注 1： 除非您使用 MAP_POPULATE

，但 glibc malloc 不会，因为除其他原因外，能够进行大量分配并且只触及您需要的部分很方便。大多数手动使用

mmap

 的程序都会做出相同的选择。

对于足够小的问题，简单的流循环显示出比 DRAM B/W 更高的有效 B/W

问题描述投票：0回答：1

1个回答

最新问题

对于足够小的问题，简单的流循环显示出比 DRAM B/W 更高的有效 B/W

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1