NVIDIA Visual Profiler does not generate a timeline

Question (1 vote, 1 answer)

My question is almost the same as one [asked previously on SO][1]. That question never received an answer, so I am asking a separate one.

I am using the CUDA 7.0 toolkit on Windows 7 with Visual Studio 2013.

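I tried generating a timeline for the vector addition sample program, and it worked. But when I follow exactly the same steps to generate a timeline for my own code, the profiler keeps showing the message "Run application to generate timeline". I know that the kernel is being called and that everything works.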

The cudaDeviceReset() call is also there, placed after all the CUDA-related work is finished.

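Program: I have edited my original question to provide a minimal working example that reproduces the problem. The following code does not produce a timeline in nvvp, no matter where I place cudaDeviceReset().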

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

//OpenCV
#include <opencv2/highgui.hpp>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

#include <stdio.h>

using namespace cv;

__global__ void colorTransformation_kernel(int numChannels, int iw, int ih, unsigned char *ptr_source, unsigned char *ptr_dst)
{
    // Calculate our pixel's location
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    // Operate only if we are in the correct boundaries
    if (x >= 0 && x < iw && y >= 0 && y < ih)
    {   
        ptr_dst[numChannels*  (iw*y + x) + 0] = ptr_source[numChannels*  (iw*y + x) + 0];
        ptr_dst[numChannels*  (iw*y + x) + 1] = ptr_source[numChannels*  (iw*y + x) + 1];
        ptr_dst[numChannels*  (iw*y + x) + 2] = ptr_source[numChannels*  (iw*y + x) + 2];
    }
}

int main()
{
    while (1)
    { 
        Mat image(400, 400, CV_8UC3, Scalar(0, 0, 255));
        unsigned char *h_src = image.data;
        size_t numBytes = image.rows * image.cols * 3;
        int numChannels = 3;


        unsigned char *dev_src, *dev_dst, *h_dst;

        //Allocate memory on the device for SOURCE and DESTINATION and get their pointers
        cudaMalloc((void**)&dev_src, numBytes * sizeof(unsigned char));
        cudaMalloc((void**)&dev_dst, numBytes * sizeof(unsigned char));

        ////Copy the source image to the device i.e. GPU
        cudaMemcpy(dev_src, h_src, numBytes * sizeof(unsigned char), cudaMemcpyHostToDevice);

        ////KERNEL
        dim3 numOfBlocks(3 * (image.cols / 20), 3 * (image.rows / 20)); //multiplied by 3 because we have 3 channel image now
        dim3 numOfThreadsPerBlocks(20, 20);
        colorTransformation_kernel<<<numOfBlocks, numOfThreadsPerBlocks>>>(numChannels, image.cols, image.rows, dev_src, dev_dst);
        cudaDeviceSynchronize();

        //Get the processed image 
        Mat org_dijSDK_img(image.rows, image.cols, CV_8UC3);
        h_dst = org_dijSDK_img.data;
        cudaMemcpy(h_dst, dev_dst, numBytes * sizeof(unsigned char), cudaMemcpyDeviceToHost);

        //DISPLAY PROCESSED IMAGE           
        imshow("Processed dijSDK image", org_dijSDK_img);
        waitKey(33);

    }

    cudaDeviceReset();
    return 0;
}

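A very important clue: if I comment out the while (1) line, so that the code runs only once, nvvp generates the timeline. In my original project, however, I cannot get a timeline profile this way, because it involves multithreading and other things, so there is no image to process during the first run. I therefore need some way to generate a timeline for code that contains an infinite while loop.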

c++ cuda nvidia
1 Answer (1 vote)

The problem in my code was the endless while loop: because of it, cudaDeviceReset() was never called. There are two possible ways to handle this situation:

  1. If you only want to look at the timeline profile, simply comment out your while loop so that the program reaches the cudaDeviceReset() at the end of main() and nvvp can produce the timeline.
  2. There may be cases where the loop has to stay in the program. For example, in my original multithreaded project there is no image to process during the first 180 iterations of the while loop. To handle this, replace the while loop with a for loop that runs a finite number of times. For example, the following code gave me a timeline profile for 4 iterations. I am only posting the modified main():

int main()
{
    cudaStream_t stream_one;
    cudaStream_t stream_two;
    cudaStream_t stream_three;

    //while (1)
    for (int i = 0; i < 4; i++)
    {
        cudaStreamCreate(&stream_one);
        cudaStreamCreate(&stream_two);
        cudaStreamCreate(&stream_three);

        Mat image = imread("DijSDK_test_image.jpg", 1);
        //Mat image(1080, 1920, CV_8UC3, Scalar(0,0,255));
        size_t numBytes = image.rows * image.cols * 3;
        int numChannels = 3;

        int iw = image.rows;
        int ih = image.cols;
        size_t totalMemSize = numBytes * sizeof(unsigned char);
        size_t oneThirdMemSize = totalMemSize / 3;

        unsigned char *dev_src_1, *dev_src_2, *dev_src_3, *dev_dst_1, *dev_dst_2, *dev_dst_3, *h_src, *h_dst;

        //Allocate memory on the device for SOURCE and DESTINATION and get their pointers
        cudaMalloc((void**)&dev_src_1, (totalMemSize) / 3);
        cudaMalloc((void**)&dev_src_2, (totalMemSize) / 3);
        cudaMalloc((void**)&dev_src_3, (totalMemSize) / 3);
        cudaMalloc((void**)&dev_dst_1, (totalMemSize) / 3);
        cudaMalloc((void**)&dev_dst_2, (totalMemSize) / 3);
        cudaMalloc((void**)&dev_dst_3, (totalMemSize) / 3);

        //Get the processed image
        Mat org_dijSDK_img(image.rows, image.cols, CV_8UC3, Scalar(0, 0, 255));
        h_dst = org_dijSDK_img.data;

        //copy new data of image to the host pointer
        h_src = image.data;

        //Copy the source image to the device i.e. GPU
        cudaMemcpyAsync(dev_src_1, h_src, (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_one);
        cudaMemcpyAsync(dev_src_2, h_src + oneThirdMemSize, (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_two);
        cudaMemcpyAsync(dev_src_3, h_src + (2 * oneThirdMemSize), (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_three);

        //KERNEL--stream-1
        callMultiStreamingCudaKernel(dev_src_1, dev_dst_1, numChannels, iw, ih, &stream_one);

        //KERNEL--stream-2
        callMultiStreamingCudaKernel(dev_src_2, dev_dst_2, numChannels, iw, ih, &stream_two);

        //KERNEL--stream-3
        callMultiStreamingCudaKernel(dev_src_3, dev_dst_3, numChannels, iw, ih, &stream_three);

        //RESULT copy: GPU to CPU
        cudaMemcpyAsync(h_dst, dev_dst_1, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_one);
        cudaMemcpyAsync(h_dst + oneThirdMemSize, dev_dst_2, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_two);
        cudaMemcpyAsync(h_dst + (2 * oneThirdMemSize), dev_dst_3, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_three);

        // wait for results
        cudaStreamSynchronize(stream_one);
        cudaStreamSynchronize(stream_two);
        cudaStreamSynchronize(stream_three);

        //Assign the processed data to the display image.
        org_dijSDK_img.data = h_dst;

        //DISPLAY PROCESSED IMAGE
        imshow("Processed dijSDK image", org_dijSDK_img);
        waitKey(33);
    }

    cudaDeviceReset();
    return 0;
}
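If editing the loop by hand for every profiling session is inconvenient, the same idea (a finite number of iterations so that cudaDeviceReset() is reached) could be made switchable at run time. The following is only a minimal host-side sketch, not the code I actually used: profileIterationLimit, runFrameLoop, and the environment variable PROFILE_ITERATIONS are hypothetical names invented for this illustration, and the callback stands in for the body of the loop in main().

```cpp
#include <cstdlib>
#include <functional>

// Hypothetical helper (not part of CUDA or of the original code): reads an
// iteration cap from the environment variable PROFILE_ITERATIONS, a name
// invented for this sketch. Returns -1 (meaning "no cap") when the variable
// is unset, empty, not a plain integer, or not positive.
long profileIterationLimit()
{
    const char *value = std::getenv("PROFILE_ITERATIONS");
    if (value == nullptr) return -1;

    char *end = nullptr;
    long parsed = std::strtol(value, &end, 10);
    if (end == value || *end != '\0' || parsed <= 0) return -1;
    return parsed;
}

// Calls processFrame repeatedly: exactly `limit` times when limit is zero or
// positive, or forever when limit is negative. processFrame stands in for the
// per-image work (copies, kernel launches, display) done inside the loop.
// Returns the number of iterations that were executed.
long runFrameLoop(long limit, const std::function<void()> &processFrame)
{
    long iterations = 0;
    while (limit < 0 || iterations < limit)
    {
        processFrame();
        ++iterations;
    }
    return iterations;
    // In main(), cudaDeviceReset() would follow the call to runFrameLoop,
    // so it is reached whenever a cap is set.
}
```

Under these assumptions, main() would pass its existing loop body as the lambda to runFrameLoop(profileIterationLimit(), ...) and then call cudaDeviceReset(). Launching the program with PROFILE_ITERATIONS set to 4 would then correspond to the four-iteration run above, while leaving the variable unset keeps the loop endless for normal use.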