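For some reason, the code that addresses the data in column order gets vectorized. However, after reading the compiler's report, it is not clear what exactly was vectorized.
Column-order example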
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#define s_parameter 6
#define NMMax_Si 30000000
double* p_M[s_parameter];
void Inter(){
    long int k, s, t;
    double VR, VRR;
    double VRC[3];
    s = rand() % NMMax_Si;  /* keep the index within the allocated range */
    t = rand() % NMMax_Si;
    for (k = 0; k < 3; k++) { VRC[k] = p_M[k][s] - p_M[k][t]; }
    VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
    VR = sqrt(VRR);
    printf ("%f", VR);
}
int main()
{
    int i;
    for (i = 0; i<s_parameter; i++) p_M[i] = (double*)aligned_alloc(64, NMMax_Si * sizeof(double));
    Inter();
    return 0;
}
Compiled with
gcc -g -lm -Wall -Wno-unused-but-set-variable -std=c17 -fopenmp -march=native -O3 -mavx2 -ftree-vectorize -fopt-info-vec-all main2.c
I get:
**src/main2.c:21:18: optimized: loop vectorized using 16 byte vectors**
src/main2.c:13:6: note: vectorized 1 loops in function.
src/main2.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main2.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main2.c:21:45: missed: statement clobbers memory: vect__7.13_58 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _54, vect_57, { Nan, Nan }, 1);
src/main2.c:21:57: missed: statement clobbers memory: vect__11.14_63 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _59, vect_57, { Nan, Nan }, 1);
src/main2.c:23:9: missed: statement clobbers memory: VR_34 = sqrt (VRR_25);
src/main2.c:25:4: missed: statement clobbers memory: printf ("%f", VR_33);
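1. What exactly gets vectorized when column-order addressing is used? The row-order example below produces almost the same output, but without the "missed: statement clobbers memory" notes for the loop at line 21.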
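One possible reading of the report above, sketched in plain C: the `2df` suffix of `__builtin_ia32_gatherdiv2df` and the "16 byte vectors" note both point to two doubles per vector, so the loop would be handled as one 2-lane step for k = 0, 1 (two gathers, one per row index, followed by a vector subtract) plus a scalar step for k = 2. The helper names `gather2` and `diff3_column_order` are hypothetical, and the split is an assumption that has not been checked against the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one 2-lane gather: lane i reads cols[i][row].
   The two addresses are unrelated, which is why a gather is needed. */
void gather2(double *const *cols, size_t row, double out[2]) {
    out[0] = cols[0][row];
    out[1] = cols[1][row];
}

/* Column-order difference loop, split the way the report seems to imply
   (an assumption): k = 0, 1 as one 2-lane step, k = 2 as a scalar step. */
void diff3_column_order(double *const *p_M, size_t s, size_t t, double VRC[3]) {
    double a[2], b[2];
    assert(p_M != NULL);
    gather2(p_M, s, a);              /* models the first gather statement  */
    gather2(p_M, t, b);              /* models the second gather statement */
    VRC[0] = a[0] - b[0];            /* lanes 0 and 1: one vector subtract */
    VRC[1] = a[1] - b[1];
    VRC[2] = p_M[2][s] - p_M[2][t];  /* leftover third component, scalar   */
}
```

Called with p_M pointing at three columns, diff3_column_order(p_M, s, t, VRC) fills VRC with the same values as the original loop.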
Row-order example
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#define s_parameter 6
#define NMMax_Si 30000000
double* p_M[NMMax_Si];
void Inter(){
    long int k, s, t;
    double VR, VRR;
    double VRC[3];
    s = rand() % NMMax_Si;  /* keep the index within the allocated range */
    t = rand() % NMMax_Si;
    for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
    VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
    VR = sqrt(VRR);
    printf ("%f", VR);
}
int main()
{
    int i;
    for (i = 0; i<NMMax_Si; i++) p_M[i] = (double*)aligned_alloc(64, s_parameter * sizeof(double));
    Inter();
    return 0;
}
With the output:
src/main.c:21:18: optimized: loop vectorized using 16 byte vectors
src/main.c:13:6: note: vectorized 1 loops in function.
src/main.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main.c:23:9: missed: statement clobbers memory: VR_35 = sqrt (VRR_26);
src/main.c:25:4: missed: statement clobbers memory: printf ("%f", VR_34);
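2. Does the row-order approach lead to a different result when it is vectorized?
3. Is there any way to vectorize the entire computation that produces the final value of VR?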
for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
VR = sqrt(VRR);
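For reference, the three statements above can be folded into a single loop, which at least hands the compiler the whole computation up to the square root in one place. This is only a sketch: `dist2_rows` is a hypothetical helper, it assumes the row-order layout, and whether GCC vectorizes a three-iteration floating-point reduction like this would have to be confirmed with -fopt-info-vec.

```c
#include <assert.h>
#include <stddef.h>

/* Squared distance between rows s and t of a row-order table.
   dist2_rows is a hypothetical name, not part of the original program. */
double dist2_rows(double *const *p_M, size_t s, size_t t) {
    assert(p_M != NULL);
    const double *a = p_M[s];   /* each row is contiguous in memory */
    const double *b = p_M[t];
    double VRR = 0.0;
    for (int k = 0; k < 3; k++) {
        double d = a[k] - b[k];
        VRR += d * d;           /* squares are added left to right */
    }
    return VRR;                 /* the caller applies sqrt() to obtain VR */
}
```

VR is then sqrt(dist2_rows(p_M, s, t)), with the square root remaining a single scalar operation.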
4. Would extra zero data (padding) help improve the situation?
for (k = 0; k < 4; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
// assumes VRC has 4 elements and p_M[:][3] == 0 (zero padding)
VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2] + VRC[3] * VRC[3];
VR = sqrt(VRR);
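To make the padding idea concrete, here is a minimal sketch that assumes each row holds 4 doubles with the fourth set to 0, and uses the GCC/Clang vector_size extension instead of relying on the auto-vectorizer. `v4d` and `dist2_padded` are hypothetical names; the sketch has not been benchmarked, so it shows the shape of the computation, not a measured improvement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Vector of 4 doubles (GCC/Clang extension; an assumption of this sketch). */
typedef double v4d __attribute__((vector_size(32)));

/* Squared distance between two rows padded to 4 doubles, where row[3] == 0. */
double dist2_padded(const double *row_s, const double *row_t) {
    v4d a, b;
    assert(row_s != NULL && row_t != NULL);
    memcpy(&a, row_s, sizeof a);   /* load 4 doubles, no alignment required */
    memcpy(&b, row_t, sizeof b);
    v4d d = a - b;                 /* one vector subtract; lane 3 is 0 - 0 */
    v4d q = d * d;                 /* one vector multiply; lane 3 stays 0  */
    return (q[0] + q[1]) + (q[2] + q[3]);  /* horizontal sum of the lanes */
}
```

VR is then sqrt(dist2_padded(p_M[s], p_M[t])). Because the padding lane contributes exactly 0, the sum matches the three-component expression.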