How do I interpret and fix errors when running an OpenACC program built with the NVIDIA nvc compiler?

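I translated a parallel C program that uses OpenMP pragmas into one that uses OpenACC pragmas so that it runs on the GPU: I removed the original OpenMP pragmas and added OpenACC directives wherever I thought the code could be parallelized. The serial version, with no OpenMP or OpenACC pragmas, works fine, and so does the OpenMP version. When I run the OpenACC version, built with the same compiler as the other two versions, I get a strange error.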

It is a large machine learning program, so I only show the functions to which I added OpenACC pragmas:

void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;

#pragma acc kernels loop
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }

    double d_min = INFINITY;
    int d_min_idx = -1;

#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
            d_min_idx = j;
        }
    }

    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);

#pragma acc kernels
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}


void function2(double **X, double *const *W, int num_samples, int num_features, int num_out, double alpha_min)
{
    int R = num_out >> 2, iter = 0;
    double alpha = 1.f;
    double *D = (double *)malloc(num_out * sizeof(double));

#pragma acc data copyin(X[0 : num_samples][0 : num_features], W[0 : num_out][0 : num_features]) create(D[0 : num_out])
    {
        for (; alpha > alpha_min; alpha -= 0.01, iter++)
        {
            for (int sample = 0; sample < num_samples; sample++)
            {
                const double *x = X[sample];
                kohonen_update_weights(x, W, D, num_out, num_features, alpha, R); /* presumably function1 above */
            }
            if (iter % 10 == 0 && R > 1)
            {
                /* the rest of this function is cut off in the post */
            }
        }
    }
}

To compile the serial version of the program, without the OpenACC pragmas, I use the following command:

$ nvc -Minfo=all -o program program.c

Running it gives the following output:


Test 1 completed in 0.01334 sec
Test 2 completed in 0.006111 sec
Test 3 completed in 0.003211 sec
(Note: Calculated times include: creating test sets, training model and writing files to disk.)


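I expected similar results from the OpenACC version. However, when I compile the program with the OpenACC pragmas using the command below and run it, it aborts with the fatal error that follows: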

$ nvc -acc=gpu -gpu=cc89 -Minfo=all -o program program.c


Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc35 -gpu=cc50 -gpu=cc60 -gpu=cc60 -gpu=cc70 -gpu=cc75 -gpu=cc80 -
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0
 File: /proj/build/23C/Linux_x86_64/rte/accel-uni/build/Linux_x86_64/../../src/cuda_fill.c
 Function: __pgi_uacc_cuda_fill:98
 Line: 44


$ nvidia-smi

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4070        Off | 00000000:01:00.0  On |                  N/A |
|  0%   29C    P8               5W / 200W |    340MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A      1551      G   /usr/lib/xorg/Xorg                          106MiB |
|    0   N/A  N/A      1804    C+G   ...libexec/gnome-remote-desktop-daemon      154MiB |
|    0   N/A  N/A      1899      G   /usr/bin/gnome-shell                         65MiB |

$ nvc --version

nvc 24.1-0 64-bit target on x86-64 Linux -tp znver4

$ nvaccelinfo


CUDA Driver Version:           12020
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  535.171.04  Tue Mar 19 20:30:00 UTC 2024

Device Number:                 0
Device Name:                   NVIDIA GeForce RTX 4070
Device Revision Number:        8.9
Global Memory Size:            12568887296
Number of Multiprocessors:     46
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    2520 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             10501 MHz
Memory Bus Width:              192 bits
L2 Cache Size:                 37748736 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
Default Target:                cc89



     65, Loop carried dependence of D-> prevents parallelization
         Loop carried backward dependence of D-> prevents vectorization
         Complex loop carried dependence of x->,W->-> prevents parallelization
         Generating NVIDIA GPU code
         65, #pragma acc loop seq
         68, #pragma acc loop vector(128) /* threadIdx.x */
             Generating implicit reduction(+:sum)
     65, Generating implicit copyout(D[:num_out]) [if not already present]
         Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
     68, Loop is parallelizable
     77, Generating implicit firstprivate(num_out,d_min_idx)
         Generating NVIDIA GPU code
         77, #pragma acc loop seq
             Generating reduction(min:d_min)
     77, Generating implicit copy(d_min) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     82, Accelerator restriction: induction variable live-out from loop: d_min_idx
     90, Complex loop carried dependence of x->,W->-> prevents parallelization
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
         90, #pragma acc loop seq
         92, #pragma acc loop seq
     90, Generating implicit copyin(x[:num_features]) [if not already present]
         Generating implicit copy(W[.I0000:min(num_out,(R+d_min_idx)+1)-.I0000][:num_features]) [if not already present]
     92, Complex loop carried dependence of x->,W->-> prevents parallelization
    104, Generating copyin(X[:num_samples][:num_features],W[:num_out][:num_features]) [if not already present]
         Generating create(D[:num_out]) [if not already present]

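This may not fix the runtime error you are seeing, but I have updated your code so that the loops are parallelized. You need to add "independent" to the "kernels loop" directives to tell the compiler that the loop iterations do not depend on each other. Because C allows pointers of the same type to alias the same data, the compiler must assume that they do, so it cannot parallelize these loops on its own. I also split the minimum search into two reduction loops, one for d_min and one for d_min_idx, because d_min_idx cannot be set safely inside the d_min reduction loop (that is what the "induction variable live-out" message at line 82 refers to).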


#include <math.h>

void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;

#pragma acc kernels loop independent
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
#pragma acc loop reduction(+:sum)
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }

    double d_min = INFINITY;
    int d_min_idx = num_out; /* sentinel above any valid index: a min reduction also folds in the initial value, so -1 would always win */

#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
        }
    }
#pragma acc parallel loop reduction(min : d_min_idx)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] == d_min && j < d_min_idx) /* keep the lowest matching index */
        {
            d_min_idx = j;
        }
    }

    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);

#pragma acc kernels loop collapse(2) independent
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}


% nvc -c -w -acc -Minfo=accel test.c
      8, Loop is parallelizable
         Generating NVIDIA GPU code
          8, #pragma acc loop gang /* blockIdx.x */
         12, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:sum)
      8, Generating implicit copyout(D[:num_out]) [if not already present]
         Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
     12, Loop is parallelizable
     21, Generating implicit firstprivate(num_out)
         Generating NVIDIA GPU code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(min:d_min)
     21, Generating implicit copy(d_min) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     29, Generating implicit firstprivate(d_min,num_out)
         Generating NVIDIA GPU code
         29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(min:d_min_idx)
     29, Generating implicit copy(d_min_idx) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     41, Loop is parallelizable
         Generating implicit copyin(x[:num_features]) [if not already present]
         Generating implicit copy(W[from_node:to_node-from_node][:num_features]) [if not already present]
     43, Loop is parallelizable
         Generating NVIDIA GPU code
         41, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         43,   /* blockIdx.x threadIdx.x collapsed */
