The following function prints the queued, submit, start, and end times:
// Requires a command queue created with CL_QUEUE_PROFILING_ENABLE.
// All profiling timestamps are reported in nanoseconds.
void PrintProfilingInfo(cl_event event)
{
    cl_int err_num = -1;
    cl_ulong t_queued;
    cl_ulong t_submitted;
    cl_ulong t_started;
    cl_ulong t_ended;
    cl_ulong t_completed;

    // queued time
    err_num = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,
                                      sizeof(cl_ulong), &t_queued, NULL);
    // submit time
    err_num = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
                                      sizeof(cl_ulong), &t_submitted, NULL);
    // start time
    err_num = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                      sizeof(cl_ulong), &t_started, NULL);
    // end time
    err_num = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                      sizeof(cl_ulong), &t_ended, NULL);
    // complete time (CL_PROFILING_COMMAND_COMPLETE requires OpenCL 2.0 or later)
    err_num = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_COMPLETE,
                                      sizeof(cl_ulong), &t_completed, NULL);

    // nanoseconds * 1e-3 = microseconds
    printf("queue -> submit : %fus\n", (t_submitted - t_queued) * 1e-3);
    printf("submit -> start : %fus\n", (t_started - t_submitted) * 1e-3);
    printf("start -> end : %fus\n", (t_ended - t_started) * 1e-3);
    printf("end -> finish : %fus\n", (t_completed - t_ended) * 1e-3);
}
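The interval arithmetic in the printf calls can be factored into a small helper. This is only a sketch: `interval_us` is a hypothetical name, and `uint64_t` stands in for `cl_ulong` (a 64-bit unsigned integer) so the snippet compiles without the OpenCL headers. It guards against the unsigned wrap-around that would occur if a later timestamp were ever smaller than an earlier one.

```c
#include <stdint.h>

/*
 * Converts the gap between two nanosecond timestamps to microseconds.
 * Returns -1.0 when end_ns < start_ns, because subtracting unsigned values
 * in that case would wrap around to a huge positive number.
 */
double interval_us(uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns < start_ns) {
        return -1.0;
    }
    return (double)(end_ns - start_ns) * 1e-3;
}
```

Inside PrintProfilingInfo it could be used as `printf("queue -> submit : %fus\n", interval_us(t_queued, t_submitted));`.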
The total execution time is measured as follows:
struct timeval t_start;
long long time_diff;
struct timeval end;

// wall-clock time just before the kernel is enqueued
gettimeofday(&t_start, NULL);
err_code = clEnqueueNDRangeKernel(cl_cmd_queue,
                                  kernel,
                                  2,
                                  NULL,
                                  global_work_size,
                                  local_work_size,
                                  0,
                                  NULL,
                                  &kernel_event);
// block until the kernel has finished
err_code = clWaitForEvents(1, &kernel_event);
gettimeofday(&end, NULL);

// elapsed wall-clock time in microseconds
time_diff = 1000000LL * (end.tv_sec - t_start.tv_sec) + end.tv_usec - t_start.tv_usec;
printf("%s ==>> OpenCL Gaussian Blur average cost: %lld us\n", func_name, time_diff);
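The microsecond difference between two gettimeofday samples can be isolated in a helper so that the start and end variables cannot be mixed up. This is a minimal sketch; `elapsed_us` is a hypothetical name, not part of any library. Note that gettimeofday reports wall-clock time, which can jump if the system clock is adjusted during the measurement.

```c
#include <sys/time.h>

/* Elapsed microseconds between two gettimeofday() samples. */
long long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    /* The tv_usec difference may be negative; the tv_sec term compensates. */
    return 1000000LL * (end->tv_sec - start->tv_sec)
         + (end->tv_usec - start->tv_usec);
}
```

With the variables from the snippet above, the call would be `time_diff = elapsed_us(&t_start, &end);`.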
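The results are as follows:
My question is:
I have read the official documentation, but it does not seem to explain in detail what happens during the interval between queued and submit.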
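For kernels with a small range, the dominant part of the runtime may not be the kernel execution itself but the PCIe data transfer. If you issue a non-blocking CPU->GPU memory copy before the kernel call, your clock measures that copy as well. To avoid this, call clFinish on the command queue before starting the clock.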
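Non-blocking enqueue calls return essentially instantly. Only the blocking calls, clFinish and clWaitForEvents, tell you the runtime of the enqueued memory copies and/or kernels.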
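The ordering described above (drain the queue, start the clock, enqueue, block, stop the clock) can be sketched as follows. To keep the sketch compilable without an OpenCL device, the three steps are passed in as callbacks; `step_fn`, `time_region_us`, `trace_t`, and the `fake_*` functions are all hypothetical names introduced for illustration. In real code the callbacks would wrap clFinish, clEnqueueNDRangeKernel, and clWaitForEvents respectively.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical callback type: each step wraps one OpenCL call, 0 = success. */
typedef int (*step_fn)(void *ctx);

/* Wall-clock time in microseconds. */
static long long now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return 1000000LL * tv.tv_sec + tv.tv_usec;
}

/*
 * Times enqueue + wait only. drain runs before the clock starts, so earlier
 * non-blocking work (for example a host-to-device copy) is excluded.
 * Returns elapsed microseconds, or -1 if any step reports an error.
 */
long long time_region_us(step_fn drain, step_fn enqueue, step_fn wait, void *ctx)
{
    long long t0;

    if (drain(ctx) != 0) return -1;    /* e.g. clFinish(queue) */
    t0 = now_us();
    if (enqueue(ctx) != 0) return -1;  /* e.g. clEnqueueNDRangeKernel(...) */
    if (wait(ctx) != 0) return -1;     /* e.g. clWaitForEvents(1, &event) */
    return now_us() - t0;
}

/* Stand-in steps so the sketch runs without a device: they record call order. */
typedef struct { int order[3]; int count; } trace_t;

static int record(trace_t *t, int id)
{
    if (t->count < 3) t->order[t->count] = id;
    t->count++;
    return 0;
}

int fake_drain(void *c)   { return record((trace_t *)c, 0); }
int fake_enqueue(void *c) { return record((trace_t *)c, 1); }
int fake_wait(void *c)    { return record((trace_t *)c, 2); }
int failing_step(void *c) { (void)c; return 1; }
```

Passing the steps as callbacks is only a device of this sketch; the point is the order of the calls relative to the two clock samples.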