Measuring memory latency with the time stamp counter


I have written the code below, which first flushes two array elements and then tries to read the elements in order to measure the hit/miss latencies.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the cache lines holding array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];   // read the first element => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second element => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );


    /* READ HIT */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];   // read the first element again => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );


    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );


    return 0;
}

And the output:

# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12

There is a problem with the output for reading array[70]: the measured TSC is neither a hit nor a miss, even though I flushed that element exactly like array[30]. One possibility is that when array[40] is accessed, the hardware prefetcher brings in array[70]; in that case the read should be a hit, but the measured TSC is much larger than a hit. You can verify that the hit TSC is about 20 when I try to read array[30] a second time.

And even if array[70] is not prefetched, its TSC should then look like a full cache miss.

What can be the reason for that?
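
One way to test the prefetching hypothesis (my own sketch, not from the original post) is to place the two probed elements on different 4 KiB pages, since the hardware prefetchers do not cross page boundaries; if the second read then shows a full miss latency, prefetching was the cause. The buffer size and the two-page spacing below are assumptions chosen for illustration:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

int main(void)
{
    /* three pages of ints; probes sit on page 0 and page 2 */
    int *buf = aligned_alloc(4096, 3 * 4096 * sizeof(int));
    for (int i = 0; i < 3 * 4096; i++)
        buf[i] = i;                        /* warm the whole buffer */

    volatile int *a = buf;                 /* probe 1: page 0 */
    volatile int *b = buf + 2 * 4096;      /* probe 2: two pages away */

    _mm_lfence();
    _mm_clflush((void *)a);
    _mm_clflush((void *)b);
    _mm_lfence();

    uint64_t t1, t2, miss_a, miss_b;

    _mm_lfence(); t1 = __rdtsc(); _mm_lfence();
    (void) *a;
    _mm_lfence(); t2 = __rdtsc(); _mm_lfence();
    miss_a = t2 - t1;

    _mm_lfence(); t1 = __rdtsc(); _mm_lfence();
    (void) *b;                             /* no prefetcher crosses the page gap */
    _mm_lfence(); t2 = __rdtsc(); _mm_lfence();
    miss_b = t2 - t1;

    printf("miss_a %lu  miss_b %lu\n",
           (unsigned long)miss_a, (unsigned long)miss_b);
    free(buf);
    return 0;
}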

UPDATE1:

To make a dummy read of the array, I tried (void) *((int*)array+i) per Peter's and Hadi's suggestions.

In the output I see many negative results; that is, the measured lfence overhead appears to be larger than the timed (void) *((int*)array+i) read itself.

UPDATE2:

I forgot to add volatile to the cast. Without it, gcc -O3 optimizes the unused read away entirely, so the timed region contains no load at all and can come out shorter than the fence overhead. With the volatile added, the results now make sense.
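
For reference, the corrected dummy read is the same expression with the volatile qualifier in the cast, which is what keeps the compiler from deleting the unused load:

(void) *((volatile int*)array + i);   /* volatile: the load must actually happen */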

c performance x86 cpu-architecture tsc
2 Answers
3 votes

First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with the printf calls in place, the reported values for diff3-ov range between 4 and 48 cycles (I have configured my system so that the TSC frequency is approximately equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we have an L1D cache hit. If it is larger than 8, then most probably the preceding call to printf kicked the target cache line out of the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain measured latencies higher than 8. @PeterCordes has suggested using (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most of the reported measurements for diff3-ov are exactly 8 cycles (which suggests a measurement error of about 4 cycles), and the only other values reported are 0, 4, and 12. So Peter's approach is strongly recommended.
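
A minimal sketch of the recommended change, assuming the same array setup and fencing scheme as in the question: every timed access is a volatile read, and nothing is printed until all three measurements are done, so printf cannot evict the probed lines in between:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t time_read(const int *array, int i)
{
    uint64_t t1, t2;
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    (void) *((volatile const int *)array + i);  /* volatile read; never optimized out */
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    return t2 - t1;
}

int main(void)
{
    int array[100];
    for (int i = 0; i < 100; i++)
        array[i] = i;                       /* bring the array into cache */

    _mm_lfence();
    _mm_clflush(&array[30]);
    _mm_clflush(&array[70]);
    _mm_lfence();

    uint64_t diff1 = time_read(array, 30);  /* miss */
    uint64_t diff2 = time_read(array, 70);  /* miss (or prefetched?) */
    uint64_t diff3 = time_read(array, 30);  /* hit */

    /* print only after all timed sections are done */
    printf("diff1 %lu  diff2 %lu  diff3 %lu\n",
           (unsigned long)diff1, (unsigned long)diff2, (unsigned long)diff3);
    return 0;
}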

In general, main memory access latency depends on many factors, including the state of the MMU caches and the impact of the page table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30], and the IPs of the load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 differs from cache miss2. A thorough investigation is possible, but it would require a lot of effort, as you might imagine. In general, if your core frequency is larger than 1.5 GHz (which is smaller than the TSC frequency on high-perf Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases though, cache miss2 seems to be close to the L3 or L2 latency ranges, which would be due to prefetching.
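
Since rdtsc counts reference (TSC) ticks rather than core clock cycles, comparing a measurement against per-level latencies given in core cycles requires scaling by the two frequencies. A worked example with assumed frequencies (2.1 GHz TSC, 3.0 GHz core; neither value is taken from the question):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double   tsc_ghz    = 2.1;   /* invariant TSC frequency (assumed)        */
    double   core_ghz   = 3.0;   /* core frequency during the test (assumed) */
    uint64_t tsc_ticks  = 100;   /* e.g. one of the measured miss latencies  */

    /* core cycles = TSC ticks * (core frequency / TSC frequency) */
    double core_cycles = (double)tsc_ticks * core_ghz / tsc_ghz;

    printf("%lu TSC ticks ~= %.0f core cycles\n",
           (unsigned long)tsc_ticks, core_cycles);  /* ~143: above the ~60-cycle
                                                       L3-miss floor quoted above */
    return 0;
}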


I have found that the following code gives a statistically more precise measurement on Haswell:

unsigned int dummy;                  /* receives IA32_TSC_AUX from __rdtscp */
uint64_t t1, t2, loadlatency;
int tmp;

t1 = __rdtscp(&dummy);               /* start timestamp; rdtscp waits for prior loads */
tmp = *((volatile int*)array + 30);  /* the timed load */
asm volatile ("add $1, %0\n\t"       /* chain of 11 adds dependent on the   */
              "add $1, %0\n\t"       /* load result: adds a fixed latency   */
              "add $1, %0\n\t"       /* on top of the load and keeps tmp    */
              "add $1, %0\n\t"       /* from being optimized away           */
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
          : "+r" (tmp));
t2 = __rdtscp(&dummy);
t2 = __rdtscp(&dummy);               /* back-to-back rdtscp; t2 keeps the later
                                        value, and the 60-cycle overhead below
                                        reflects this full sequence */
loadlatency = t2 - t1 - 60; // 60 is the overhead

loadlatency is 4 cycles with probability 97%, 8 cycles with probability 1.7%, and takes other values with probability 1.3%. All of those other values are larger than 8 and multiples of 4. I will try to add an explanation later.


1 vote

Some ideas:

  • Maybe [70] got prefetched into some level of cache besides L1?
  • Or maybe some optimization in the DRAM makes this access fast, e.g. the row buffer staying open after the access to [30].

You should investigate accesses other than [30] and [70] to see whether you get different numbers. E.g. do you get the same timings for a hit on [30] followed by [31] (which should have been fetched in the same line as [30], if you use aligned_alloc with 64-byte alignment)? And do other elements like [69] and [71] give the same timings as [70]? A sketch of that experiment follows below.
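
A sketch of that experiment (the aligned_alloc buffer and the probed indices follow the suggestion above; storing the timings and printing only at the end avoids the printf disturbance noted in the other answer):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

int main(void)
{
    /* 64-byte alignment so element 0 starts a cache line; 128 ints keeps
       the allocation size a multiple of the alignment */
    int *array = aligned_alloc(64, 128 * sizeof(int));
    for (int i = 0; i < 128; i++)
        array[i] = i;

    _mm_lfence();
    _mm_clflush(&array[30]);    /* flushes the line holding indices 16..31 */
    _mm_clflush(&array[70]);    /* flushes the line holding indices 64..79 */
    _mm_lfence();

    /* probe the flushed elements and their same-line neighbors */
    int probes[] = { 30, 31, 69, 70, 71 };
    enum { N = sizeof probes / sizeof probes[0] };
    uint64_t ticks[N];

    for (int k = 0; k < N; k++) {
        uint64_t t1, t2;
        _mm_lfence();
        t1 = __rdtsc();
        _mm_lfence();
        (void) *((volatile int *)array + probes[k]);
        _mm_lfence();
        t2 = __rdtsc();
        _mm_lfence();
        ticks[k] = t2 - t1;
    }

    for (int k = 0; k < N; k++)   /* print only after all probes are timed */
        printf("array[%d]: %lu TSC ticks\n", probes[k], (unsigned long)ticks[k]);

    free(array);
    return 0;
}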
