How do I explain and fix an error when running an OpenACC program built with the NVIDIA nvc compiler?


I translated a parallel C program that used OpenMP pragmas into one that uses OpenACC pragmas, so it can run on a GPU: I removed the original OpenMP pragmas and added OpenACC directives wherever I believed the code could be parallelized. The serial version, with no OpenMP or OpenACC pragmas, works fine, and so does the OpenMP version. But when I run the OpenACC version, built with the same nvc compiler as the other two versions, I hit a strange error.
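
To illustrate the kind of translation I mean, a generic example (not code from the actual program):

/* OpenMP original: */
#pragma omp parallel for
for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

/* OpenACC replacement: */
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];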

It is a large machine-learning program, so I will only show the functions to which I added OpenACC pragmas.
Function 1:

#include <math.h> /* INFINITY */

/* helpers defined in the full program; shown here so the excerpt is self-contained */
#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))

void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;

#pragma acc kernels loop
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }

    double d_min = INFINITY;
    int d_min_idx = -1;

#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
            d_min_idx = j;
        }
    }

    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);

#pragma acc kernels
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}

Function 2:

#include <stdlib.h> /* malloc, free */

void function2(double **X, double *const *W, int num_samples, int num_features, int num_out, double alpha_min)
{
    int R = num_out >> 2, iter = 0;
    double alpha = 1.0;
    double *D = (double *)malloc(num_out * sizeof(double));

#pragma acc data copyin(X[0 : num_samples][0 : num_features], W[0 : num_out][0 : num_features]) create(D[0 : num_out])
    {
        for (; alpha > alpha_min; alpha -= 0.01, iter++)
        {
            for (int sample = 0; sample < num_samples; sample++)
            {
                const double *x = X[sample];
                function1(x, W, D, num_out, num_features, alpha, R);
            }
            if (iter % 10 == 0 && R > 1)
                R--;
        }
    }
    free(D);
}
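
For context, the data clauses above assume X and W are genuine pointer-to-pointer matrices (arrays of row pointers). A minimal sketch of that allocation, with a hypothetical helper name, looks like this:

#include <stdlib.h>

/* Hypothetical helper (not from the real program): allocate a
   num_rows x num_cols matrix as an array of row pointers, the layout
   implied by clauses such as copyin(W[0 : num_out][0 : num_features]). */
double **alloc_matrix(int num_rows, int num_cols)
{
    double **m = (double **)malloc(num_rows * sizeof(double *));
    for (int r = 0; r < num_rows; r++)
        m[r] = (double *)calloc(num_cols, sizeof(double));
    return m;
}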

To compile the serial version of the program, without the OpenACC pragmas, I use the following command:

$ nvc -Minfo=all -o program program.c

and I get the following output, as expected:

Test 1 completed in 0.01334 sec
Test 2 completed in 0.006111 sec
Test 3 completed in 0.003211 sec
(Note: Calculated times include: creating test sets, training model and writing files to disk.)

When compiling and executing the OpenACC version with the -acc flag, I expected to get similar results. However, when I compile the program with its OpenACC pragmas using the following command:

$ nvc -acc=gpu -gpu=cc89 -Minfo=all -o program program.c

and run the executable, I get the following error:

Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc35 -gpu=cc50 -gpu=cc60 -gpu=cc60 -gpu=cc70 -gpu=cc75 -gpu=cc80 -
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0
 File: /proj/build/23C/Linux_x86_64/rte/accel-uni/build/Linux_x86_64/../../src/cuda_fill.c
 Function: __pgi_uacc_cuda_fill:98
 Line: 44

Some context, in case it helps:

$ nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070        Off | 00000000:01:00.0  On |                  N/A |
|  0%   29C    P8               5W / 200W |    340MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1551      G   /usr/lib/xorg/Xorg                          106MiB |
|    0   N/A  N/A      1804    C+G   ...libexec/gnome-remote-desktop-daemon      154MiB |
|    0   N/A  N/A      1899      G   /usr/bin/gnome-shell                         65MiB |
+---------------------------------------------------------------------------------------+

$ nvc --version

nvc 24.1-0 64-bit target on x86-64 Linux -tp znver4

$ nvaccelinfo

CUDA Driver Version:           12020
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  535.171.04  Tue Mar 19 20:30:00 UTC 2024

Device Number:                 0
Device Name:                   NVIDIA GeForce RTX 4070
Device Revision Number:        8.9
Global Memory Size:            12568887296
Number of Multiprocessors:     46
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    2520 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             10501 MHz
Memory Bus Width:              192 bits
L2 Cache Size:                 37748736 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
Default Target:                cc89

The output from the -Minfo=all flag in the compile command:

     65, Loop carried dependence of D-> prevents parallelization
         Loop carried backward dependence of D-> prevents vectorization
         Complex loop carried dependence of x->,W->-> prevents parallelization
         Generating NVIDIA GPU code
         65, #pragma acc loop seq
         68, #pragma acc loop vector(128) /* threadIdx.x */
             Generating implicit reduction(+:sum)
     65, Generating implicit copyout(D[:num_out]) [if not already present]
         Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
     68, Loop is parallelizable
     77, Generating implicit firstprivate(num_out,d_min_idx)
         Generating NVIDIA GPU code
         77, #pragma acc loop seq
             Generating reduction(min:d_min)
     77, Generating implicit copy(d_min) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     82, Accelerator restriction: induction variable live-out from loop: d_min_idx
     90, Complex loop carried dependence of x->,W->-> prevents parallelization
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
         90, #pragma acc loop seq
         92, #pragma acc loop seq
     90, Generating implicit copyin(x[:num_features]) [if not already present]
         Generating implicit copy(W[.I0000:min(num_out,(R+d_min_idx)+1)-.I0000][:num_features]) [if not already present]
     92, Complex loop carried dependence of x->,W->-> prevents parallelization
kohonen_som_tracer:
    104, Generating copyin(X[:num_samples][:num_features],W[:num_out][:num_features]) [if not already present]
         Generating create(D[:num_out]) [if not already present]

Edit: added the output of the -Minfo=all flag from the compile command above, in case it contains relevant information.

How can I fix this error? Is there something I'm missing or something I should look into?

Thanks in advance for your help!

parallel-processing openmp nvidia hpc openacc
1 Answer

While this probably won't fix the runtime error you're seeing, I've updated your code so that it will parallelize the loops. You need to add "independent" to the "kernels" regions to tell the compiler to ignore the dependencies. Since C allows pointers of the same type to alias the same data, the compiler must assume that they do, and therefore it can't auto-parallelize the loops.
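
As an aside, C99's restrict qualifier is another way to make the same no-aliasing promise directly in the code; a minimal sketch, not part of your program:

/* Illustrative only: "restrict" tells the compiler that out and in
   never overlap, so it can parallelize this loop without needing an
   explicit "independent" clause. */
void scale(int n, double *restrict out, const double *restrict in, double a)
{
#pragma acc kernels loop
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}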

Also, as written, "d_min_idx" may give incorrect results. Unfortunately, you can't find both the minimum value and its index in the same parallel loop, so it needs to be split into two loops. I've set it to find the first index matching the minimum value, in the event that multiple indices hold the same value.

#include <math.h>

#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))

void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;

#pragma acc kernels loop independent
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
#pragma acc loop reduction(+:sum)
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }

    double d_min = INFINITY;
    int d_min_idx = num_out; /* initial value must exceed any valid index, since the min reduction below folds it in */

#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
        }
    }
#pragma acc parallel loop reduction(min : d_min_idx)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] == d_min)
        {
            d_min_idx = j;
        }
    }

    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);

#pragma acc kernels loop collapse(2) independent
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}

Compiler feedback messages:

% nvc -c -w -acc -Minfo=accel test.c
function1:
      8, Loop is parallelizable
         Generating NVIDIA GPU code
          8, #pragma acc loop gang /* blockIdx.x */
         12, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:sum)
      8, Generating implicit copyout(D[:num_out]) [if not already present]
         Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
     12, Loop is parallelizable
     21, Generating implicit firstprivate(num_out)
         Generating NVIDIA GPU code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(min:d_min)
     21, Generating implicit copy(d_min) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     29, Generating implicit firstprivate(d_min,num_out)
         Generating NVIDIA GPU code
         29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(min:d_min_idx)
     29, Generating implicit copy(d_min_idx) [if not already present]
         Generating implicit copyin(D[:num_out]) [if not already present]
     41, Loop is parallelizable
         Generating implicit copyin(x[:num_features]) [if not already present]
         Generating implicit copy(W[from_node:to_node-from_node][:num_features]) [if not already present]
     43, Loop is parallelizable
         Generating NVIDIA GPU code
         41, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         43,   /* blockIdx.x threadIdx.x collapsed */
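
To sanity-check the updated function1 on a small input, a tiny driver along these lines can be used (sizes, values, and main itself are illustrative only, assuming the function1 above is in scope):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num_out = 4, num_features = 2;
    double x[2] = {0.5, 0.5};
    double *D = (double *)malloc(num_out * sizeof(double));
    double **W = (double **)malloc(num_out * sizeof(double *));
    for (int j = 0; j < num_out; j++)
    {
        W[j] = (double *)malloc(num_features * sizeof(double));
        for (int k = 0; k < num_features; k++)
            W[j][k] = (double)(j + k); /* arbitrary test weights */
    }

    function1(x, W, D, num_out, num_features, 0.1, 1);

    /* rows near the best-matching unit should have moved toward x */
    for (int j = 0; j < num_out; j++)
        printf("W[%d] = (%g, %g), D[%d] = %g\n", j, W[j][0], W[j][1], j, D[j]);

    for (int j = 0; j < num_out; j++)
        free(W[j]);
    free(W);
    free(D);
    return 0;
}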