Why does false sharing still affect non-atomics, but much less than atomics?

Question

Consider the following example, which demonstrates that false sharing exists:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using type = std::atomic<std::int64_t>;

struct alignas(128) shared_t
{
  type  a;
  type  b;
} sh;

struct not_shared_t
{
  alignas(128) type a;
  alignas(128) type b;
} not_sh;

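One thread increments a by 1 and another thread increments b. With MSVC, the increments compile to lock xadd, even though the result is unused.

For the structure where a and b are separated, the values accumulated in a few seconds are about ten times greater for not_shared_t than for shared_t.

So far this is the expected result: separate cache lines stay hot in L1d cache, the increment bottlenecks on lock xadd throughput, and false sharing is a performance disaster that ping-pongs the cache line between cores. (Editor's note: later MSVC versions use lock inc when optimization is enabled. This may widen the gap between the contended and uncontended cases.)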

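Now I replace using type = std::atomic<std::int64_t>; with a plain std::int64_t.

(The non-atomic increment compiles to inc QWORD PTR [rcx]. The atomic load in the loop happens to stop the compiler from simply keeping the counter in a register until the loop exits.)

The count reached for not_shared_t is still greater than for shared_t, but now by less than a factor of two.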

|          type is          | variables are |      a=     |      b=     |
|---------------------------|---------------|-------------|-------------|
| std::atomic<std::int64_t> |    shared     |   59’052’951|   59’052’951|
| std::atomic<std::int64_t> |  not_shared   |  417’814’523|  416’544’755|
|       std::int64_t        |    shared     |  949’827’195|  917’110’420|
|       std::int64_t        |  not_shared   |1’440’054’733|1’439’309’339|

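Why is the non-atomic case so much closer in performance?

Here is the rest of the program, to complete the minimal reproducible example. (Also on Godbolt with MSVC, ready to compile/run.)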

std::atomic<bool> start, stop;

void thd(type* var)
{
  while (!start) ;
  while (!stop) (*var)++;
}

int main()
{
  std::thread threads[] = {
     std::thread( thd, &sh.a ),     std::thread( thd, &sh.b ),
     std::thread( thd, &not_sh.a ), std::thread( thd, &not_sh.b ),
  };

  start.store(true);

  std::this_thread::sleep_for(std::chrono::seconds(2));

  stop.store(true);
  for (auto& thd : threads) thd.join();

  std::cout
    << " shared: "    << sh.a     << ' ' << sh.b     << '\n'
    << "not shared: " << not_sh.a << ' ' << not_sh.b << '\n';
}

Answer 1

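Non-atomic memory increments can benefit from store forwarding when reloading their own stored value. This can happen even while the cache line is invalid. The core knows that the store will happen eventually, and the memory-ordering rules allow this core to see its own stores before they become globally visible.

Store forwarding lets you do as many increments as there are store-buffer entries before you stall, instead of needing exclusive access to the cache line to do an atomic RMW increment.

When this core does eventually gain ownership of the cache line, it can commit multiple stores at 1 per clock. That is 6x faster than the dependency chain created by a memory-destination increment: ~5 cycles of store/reload latency + 1 cycle of ALU latency. So in the non-atomic case, execution only puts new stores into the store buffer (SB) at 1/6th the rate at which the SB can drain while the core owns the line. This is why there isn't a huge gap between the shared and non-shared cases for non-atomics.

There will certainly be some memory-ordering machine clears, too. Those and/or a full SB are the likely reasons for lower throughput in the false-sharing case. See the answers and comments on "What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?" for another experiment somewhat like this one.

A lock inc or lock xadd forces the store buffer to drain before the operation, and includes committing to L1d cache as part of the operation. This makes store forwarding impossible, and the operation can only happen while the cache line is owned in the Exclusive or Modified MESI state.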
