Problematic vectorization with column-wise addressing order (C)

For some reason, the code that addresses the data by columns is vectorized. But after reading the compiler's report, it is not clear what exactly was vectorized.

Column-order example

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define s_parameter     6
#define NMMax_Si        30000000

double* p_M[s_parameter];

void Inter(){
   long int k, s, t;
   double VR, VRR;
   double VRC[3];

   s = rand();
   t = rand();

   for (k = 0; k < 3; k++) { VRC[k] = p_M[k][s] - p_M[k][t]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

   printf ("%f", VR);
}

int main()
{
   int i;
   for (i = 0; i<s_parameter; i++) p_M[i] = (double*)aligned_alloc(64, NMMax_Si * sizeof(double));
   Inter();
   return 0;
}

Compiled with

gcc -g -lm -Wall -Wno-unused-but-set-variable -std=c17 -fopenmp -march=native -O3 -mavx2 -ftree-vectorize -fopt-info-vec-all main2.c

I get:

**src/main2.c:21:18: optimized: loop vectorized using 16 byte vectors**
src/main2.c:13:6: note: vectorized 1 loops in function.
src/main2.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main2.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main2.c:21:45: missed: statement clobbers memory: vect__7.13_58 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _54, vect_57, {  Nan,  Nan }, 1);
src/main2.c:21:57: missed: statement clobbers memory: vect__11.14_63 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _59, vect_57, {  Nan,  Nan }, 1);
src/main2.c:23:9: missed: statement clobbers memory: VR_34 = sqrt (VRR_25);
src/main2.c:25:4: missed: statement clobbers memory: printf ("%f", VR_33);

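1. With column-wise addressing order, what exactly gets vectorized? The row-order example below produces almost the same output, but without the "missed: statement clobbers memory" notes for the loop on line 21.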
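
One possible reading of the report, assuming __builtin_ia32_gatherdiv2df is GCC's builtin for the AVX2 two-lane gather of doubles with 64-bit indices: the k loop is processed two iterations at a time (16 bytes = 2 doubles), and because p_M[0][s] and p_M[1][s] live in unrelated allocations, each pair is fetched with a gather rather than one contiguous load. The extra "clobbers memory" notes would then likely come from a later pass that cannot analyze the gather calls the loop vectorizer has just generated, not from a failure of the loop itself. The scalar model below is only a sketch of that reading, not the compiler's actual output; gather2 and diff_columns are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a 2-lane gather: lane i loads cols[i][idx].
   The two addresses are unrelated, which is why a gather is used
   instead of one contiguous 16-byte load. */
void gather2(double *const cols[2], long idx, double out[2])
{
    for (int lane = 0; lane < 2; lane++)
        out[lane] = cols[lane][idx];
}

/* Column-order difference VRC[k] = p[k][s] - p[k][t] for k = 0..2,
   split the way a 16-byte vectorization would split it:
   k = 0,1 as one two-lane step, k = 2 as a scalar remainder. */
void diff_columns(double *const p[], size_t n_cols,
                  long s, long t, double vrc[3])
{
    double a[2], b[2];

    assert(n_cols >= 3);
    gather2(p, s, a);               /* p[0][s], p[1][s] */
    gather2(p, t, b);               /* p[0][t], p[1][t] */
    vrc[0] = a[0] - b[0];
    vrc[1] = a[1] - b[1];
    vrc[2] = p[2][s] - p[2][t];     /* remainder iteration */
}
```

In the row-order layout the two loads p_M[s][0] and p_M[s][1] are adjacent in memory, so a plain vector load suffices and no gather call appears in the report.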

Row-order example

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define s_parameter     6
#define NMMax_Si        30000000

double* p_M[NMMax_Si];

void Inter(){
   long int k, s, t;
   double VR, VRR;
   double VRC[3];

   s = rand();
   t = rand();

   for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

   printf ("%f", VR);
}

int main()
{
   int i;
   for (i = 0; i<NMMax_Si; i++) p_M[i] = (double*)aligned_alloc(64, s_parameter * sizeof(double));
   Inter();
   return 0;
}

With output:

src/main.c:21:18: optimized: loop vectorized using 16 byte vectors
src/main.c:13:6: note: vectorized 1 loops in function.
src/main.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main.c:23:9: missed: statement clobbers memory: VR_35 = sqrt (VRR_26);
src/main.c:25:4: missed: statement clobbers memory: printf ("%f", VR_34);

2. Does the row-order approach give a different result when it is vectorized?

3. Is there any way to vectorize the whole computation that determines the final value of VR?

   for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

4. Would extra zero data (padding) help improve the situation?

   for (k = 0; k < 4; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   // p_M[:][3] == 0
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2] + VRC[3] * VRC[3];
   VR = sqrt(VRR);
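
A sketch of the padded variant, assuming each row is stored as four doubles with the fourth fixed at 0 and aligned to 32 bytes, so that one row fits a single 256-bit load. Row4 and dist2_padded are hypothetical names. Note that VRC needs four elements in this version. The padding lane contributes 0 to the sum, so the result equals the three-component one. Whether this is actually faster would have to be measured, since the padding also increases memory traffic by a third.

```c
#include <assert.h>
#include <stdalign.h>

/* One point per row: x, y, z plus a padding lane that must stay 0. */
typedef struct {
    alignas(32) double v[4];
} Row4;

/* Squared distance over all four lanes. Because both padding lanes
   are 0, lane 3 adds nothing to the sum. */
double dist2_padded(const Row4 *a, const Row4 *b)
{
    double vrc[4];
    double vrr = 0.0;

    assert(a->v[3] == 0.0 && b->v[3] == 0.0);
    for (int k = 0; k < 4; k++)
        vrc[k] = a->v[k] - b->v[k];
    for (int k = 0; k < 4; k++)
        vrr += vrc[k] * vrc[k];
    return vrr;   /* VR = sqrt(vrr), as in the original code */
}
```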