clflush通过C函数使缓存行无效

Question

我试图使用clflush手动驱逐缓存行，以确定缓存和行大小。我没有找到任何关于如何使用该指令的指南。我所看到的，是一些使用更高级别功能的代码。

有一个内核函数void clflush_cache_range(void *vaddr, unsigned int size)，但我仍然不知道在我的代码中包含什么以及如何使用它。我不知道那个函数中的size是什么。

更重要的是，我怎样才能确定该行被驱逐以验证我的代码的正确性？

更新：

这是我想要做的初始代码。

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  /* FLUSH A LINE */
  /* each element is 4 bytes */
  /* assuming that cache line size is 64 bytes */
  /* array[0] till array[15] is flushed */
  /* even if line size is less than 64 bytes */
  /* we are sure that array[0] has been flushed */
  _mm_clflush( &array[ 0 ] );



  int tm = 0;
  register uint64_t time1, time2, time3;


  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );

  time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time3 );
  return 0;
}

在运行代码之前，我想手动验证它是否是正确的代码。我在正确的道路上吗？我是否正确使用_mm_clflush？

更新：

感谢Peter的评论，我将代码修改如下

  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );
  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time1 );

通过多次运行代码，我得到以下输出

$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252

第一轮似乎是合理的。但第二轮看起来很奇怪。通过从命令行运行代码，每次使用值初始化数组，然后我明确地逐出第一行。

UPDATE4：

我尝试了Hadi-Brais代码，这里是输出

naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles

稍微不同的延迟是可以接受的。然而，与21和14相比，命中延迟为63也是可观察到的。

UPDATE5：

当我查看Ubuntu时，没有启用节电功能。也许在BIOS中禁用频率更改，或者存在未命中配置

$ cat /proc/cpuinfo  | grep -E "(model|MHz)"
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz         : 2097.571
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz  
cpu MHz         : 2097.571
$ lscpu | grep MHz
CPU MHz:             2097.571

无论如何，这意味着频率被设置为其最大值，这是我必须关心的。通过多次运行，我看到了一些不同的值。这些正常吗？

$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles

Answer 1

您在代码中有多个错误，可能会导致您看到的无意义的测量结果。我已修复错误，您可以在下面的评论中找到解释。

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducable results */
/* This code still needs improvement to obtain more accurate measurements,
   and a lot of effort is required to do that—argh! */
/* Specifically, there is no single constant latency for the L1 because of
   the way it's designed, and more so for main memory. */
/* Things such as virtual addresses, physical addresses, TLB contents,
   code addresses, and interrupts may have an impact that needs to be
   investigated */
/* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
/* This code is written to run on Intel processors! */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];

  /* this is optional */
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  _mm_mfence();                      /* prevent clflush from being reordered by the CPU or the compiler in this direction */

  /* flush the line containing the element */
  _mm_clflush( &array[ 0 ] );

  //unsigned int aux;
  uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */

  /* rdtscp is not suitbale for measuing very small sections of code because
   the write to its parameter occurs after sampling the TSC and it impacts 
   compiler optimizations and code gen, thereby perturbing the measurement */

  _mm_mfence();                      /* this properly orders both clflush and rdtscp*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions + compiler barrier for rdtscp and the load */
  int temp = array[ 0 ];             /* array[0] is a cache miss */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp and the load*/
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions */
  msl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );             /* prevent the compiler from optimizing the load */
  printf( "miss section latency = %lu \n", msl );   /* the latency of everything in between the two rdtscp */

  _mm_mfence();                      /* this properly orders both clflush and rdtscp*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions + compiler barrier for rdtscp and the load */
  temp = array[ 0 ];                 /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative accesses to the L1D or lower level inclusive caches don't evict it */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp and the load */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions */
  hsl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );            /* prevent the compiler from optimizing the load */
  printf( "hit section latency = %lu \n", hsl );   /* the latency of everything in between the two rdtscp */


  _mm_mfence();                      /* this properly orders both clflush and rdtscp*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions + compiler barrier for rdtscp */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtscp */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtscp with respect to trailing instructions */
  osl = time2 - time1;

  printf( "overhead latency = %lu \n", osl ); /* the latency of everything in between the two rdtscp */


  printf( "Measured L1 hit latency = %lu TSC cycles\n", hsl - osl ); /* hsl is always larger than osl */
  printf( "Measured main memory latency = %lu TSC cycles\n", msl - osl ); /* msl is always larger than osl and hsl */

  return 0;
}

强烈推荐：Memory latency measurement with time stamp counter。

相关：How can I create a spectre gadget in practice?。

Answer 2

你知道你可以用cpuid查询线条大小，对吗？如果你真的想以编程方式找到它，那就这样做。（否则，假设它是64字节，因为它在PIII之后的所有内容上。）

但无论出于何种原因，如果想使用C中的clflush或clflushopt，请使用来自void _mm_clflush(void const *p)的void _mm_clflushopt(void const *p)或#include <immintrin.h>。（见Intel's insn set ref manual entry for clflush或clflushopt）。

GCC，clang，ICC和MSVC都支持英特尔的<immintrin.h>内在函数。

您也可以通过searching Intel's intrinsics guide for clflush找到它来查找该指令的内在函数的定义。

另请参阅https://stackoverflow.com/tags/x86/info以获取指南，文档和参考手册的更多链接。

更重要的是，我怎样才能确定该行被驱逐以验证我的代码的正确性？

查看编译器的asm输出，或者在调试器中单步执行。如果/当clflush执行时，该缓存行将在程序中的那一点被逐出。

clflush通过C函数使缓存行无效

问题描述投票：5回答：2

2个回答

最新问题

clflush通过C函数使缓存行无效

问题描述 投票：5回答：2

2个回答

最新问题

问题描述投票：5回答：2