Safe GPU programming

Question

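I recently learned how to program my AMD GPU with OpenCL in C. However, if I give the GPU a task that is too demanding, my whole system stops working properly and I have to reboot. I am using Linux (Manjaro, to be specific). How can I make sure my program leaves enough GPU power for other applications? The code only needs to run on my machine. This is my current code: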

#define CL_TARGET_OPENCL_VERSION 300

#include <CL/cl.h> // Include OpenCL headers
#include <stdio.h>
#include <limits.h>

int main() {
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_mem buffer;
    // create data
    const int DATA_SIZE = 1000000;
    float data[DATA_SIZE];
    int count;
    for(count = 0; count < DATA_SIZE; count++) data[count] = count;


    // Setup OpenCL
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    queue = clCreateCommandQueueWithProperties(context, device, NULL, NULL);

    // Define our kernel. It just calculates the sin of the input data.
    char *source = {
        "kernel void calcSin(global float *data) {\n"
        "   int id = get_global_id(0);\n"
        "   for (int i = 0; i < 400000; i++) {\n"
        "       data[id] = sin(data[id]);\n"
        "   }\n"
        "}\n"
    };

    // Compile the kernel
    program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "calcSin", NULL);

    // Create the memory object
    if (context == NULL) {
        printf("context is null\n");
        return 0;
    } else {
        printf("context is not null\n");
    }

    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * DATA_SIZE, NULL, NULL);

    // Write data to the buffer
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(float) * DATA_SIZE, data, 0, NULL, NULL);

    // Execute the kernel
    const size_t LENGTH = DATA_SIZE;
    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    size_t global_dimensions[] = {LENGTH,0,0};
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_dimensions, NULL, 0, NULL, NULL);

    // Read back the results
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, sizeof(cl_float)*LENGTH, data, 0, NULL, NULL);

    // Wait for everything to finish
    clFinish(queue);

    // Print the result
    // printf("Array of integers:\n");
    // for (int i = 0; i < DATA_SIZE; i++) {
    //     printf("%.2f ", data[i]);
    // }
    // printf("\n");

    // Clean up
    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

If I raise the number of sin() operations each core performs from 4*10^5 to 10^6, my computer has to be restarted. After the reboot, running

journalctl -r -b -1 -p 3

shows two errors, both starting with

apr 19 18:26:22 [my username] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*

I tried running the program with a higher niceness value using

nice -n 19 ./path/to/file

However, this did not solve the problem.

Answer 1

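The problem is that your kernel contains a lengthy loop of 400000 iterations, and those iterations are computed redundantly, in "serial", on every core of the GPU. You are computing x = sin(sin(sin(sin(...sin(x)...)))) nested 400000 times, which always ends up at (approximately) 0, and between every two sin calls you overwrite the value in VRAM. You are doing this for 1M elements in parallel. Of course this locks up your system.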
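To put a rough number on that workload, the arithmetic can be sketched in plain C. The function name total_sin_calls is hypothetical and introduced only for this illustration; the figures (1M work-items, 400000 or 10^6 loop iterations) come from the question. This only counts sin() evaluations and says nothing about how long the GPU actually needs for them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total number of sin() evaluations in one kernel
 * launch, given that every work-item runs the loop body `iterations` times. */
uint64_t total_sin_calls(uint64_t work_items, uint64_t iterations)
{
    /* Guard against 64-bit overflow of the product. */
    assert(iterations == 0 || work_items <= UINT64_MAX / iterations);
    return work_items * iterations;
}
```

With the values from the question, total_sin_calls(1000000, 400000) is 4*10^11 evaluations in a single launch, and 10^12 with the longer loop. Without the loop it would be 10^6. A single launch that runs this long is consistent with the amdgpu_job_timedout error shown in the question.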

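GPU parallelization = splitting the problem into as many independent threads as possible. In your case that means: every GPU thread computes only a single iteration of the loop, i.e. x = sin(x). Remove the loop from the kernel. The kernel is already parallelized over the 1M elements.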
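As a minimal sketch of that advice, the kernel string from the question could be reduced to the following. The names calc_sin_source and source_defines are hypothetical and introduced here only for illustration; the host code from the question (buffer creation, clSetKernelArg, clEnqueueNDRangeKernel over DATA_SIZE work-items) is assumed to stay unchanged.

```c
#include <assert.h>
#include <string.h>

/* OpenCL C source for the loop-free kernel: each work-item applies
 * sin() once to its own element and then exits. */
const char *calc_sin_source =
    "kernel void calcSin(global float *data) {\n"
    "    int id = get_global_id(0);\n"
    "    data[id] = sin(data[id]);\n"
    "}\n";

/* Hypothetical sanity check (not an OpenCL call): returns 1 if `text`
 * occurs in `source`. Useful because the name later passed to
 * clCreateKernel must match the kernel name in the source. */
int source_defines(const char *source, const char *text)
{
    assert(source != NULL && text != NULL);
    return strstr(source, text) != NULL;
}
```

The string would be passed to clCreateProgramWithSource in place of source, and "calcSin" to clCreateKernel as before. Note that this version applies sin() once per element, so its output is not the same as the 400000-fold nested result of the original kernel.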
