Proper way to use CUDA __shared__ memory for image filtering

Question

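I am writing CUDA C++ code for image filtering. CUDA splits the image data into blocks for parallel processing. For ordinary per-pixel processing this is of course fast. In image filtering, however, every pixel needs its neighboring pixels so that they can be convolved with the filter mask (filter kernel).

Now, for input pixels that lie on the boundary of a CUDA block, the neighboring pixels belong to adjacent blocks. This requires communication between different blocks of the processor, which slows processing down dramatically!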

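As I understand it, the way to optimize this case is to use shared memory and keep all the pixels needed to process the block in a __shared__ array.

I considered a padded __shared__ array that also holds each block's neighboring pixels, and then running the filter on it.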

Here is a piece of the code I wrote:

/// in the header:
#define MASK_WIDTH 5
#define TILE_SIZE 8
/// ...

__global__ void local_filt
    (  
        const unsigned char* inputImage,
        unsigned char* outputImage,
        const int * filterKernel,
        int height,
        int width
    )
{
    __shared__ unsigned char tile[TILE_SIZE + MASK_WIDTH - 1][TILE_SIZE + MASK_WIDTH - 1];
        
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x * TILE_SIZE;
    int by = blockIdx.y * TILE_SIZE;
    int row = by + ty;
    int col = bx + tx;
    int cx = MASK_WIDTH / 2;
    int cy = MASK_WIDTH / 2;

    if (row < height && col < width) {

        tile[ty + cy][tx + cx] = inputImage[row * width + col];
        
        if (ty < cy)
        {
             ///----------------------------
             // Some code to fill the border pixels of the tile
             ///----------------------------
        }
        
    }

    // Wait until every thread has written its part of the tile before any thread reads it.
    __syncthreads();

    if (row < height && col < width) 
    {
        unsigned char tmp_ext = 0;  /// For max val
        
        for (int i = 0; i < MASK_WIDTH; ++i) 
        {
            for (int j = 0; j < MASK_WIDTH; ++j) 
            {
                 if (tmp_ext < tile[ty + i][tx + j])
                     tmp_ext = tile[ty + i][tx + j] * filterKernel[i * MASK_WIDTH + j];  // filterKernel is a flat array
            }
        }
        outputImage[row * width + col] =  tmp_ext;
    }
}

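Copying the current block's pixels into tile[ty + cy][tx + cx] works without problems, but no matter how I try to place the adjacent blocks' pixels on the border of the tile, many tile pixels are still left unassigned.

This is the part of the code I mean: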

if (ty < cy)
{
    // Some code to fill the border pixels of the tile
}

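I tried many if-else branches to fill those particular tile pixels, but the output image is corrupted in those border regions.

To get a better idea of the problem, have a look at these images:

The question is how to correctly fill those border pixels (the green pixels in the second image) with pixels from the adjacent blocks so that the filtering runs faster.

I am not even sure this approach is sound, so please mention it if there is a better solution.

As mentioned above, I have some if-else branches like the following to fill the border of the tile: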

if (ty < cy + 1 && ty > 0)
{
   if (row - 2 * ty > -1) // For the first block which has no adjacent block on the left
      tile[cy - ty][tx + cx] = inputImage[(row - 2 * ty) * width + col]; // replica! We can use inputImage[0 * width + col] for repeat.
   else
      tile[cy - ty][tx + cx] = inputImage[ty * width + col];
}

I repeated similar if-else branches to cover top/bottom/left/right, without success.

Answer 1

Set the "border pixel" distinction aside and treat the "read-only neighbor" data as a single block.

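Judging from the images you shared, the tile looks like 12x12 (which matches TILE_SIZE + MASK_WIDTH - 1 = 8 + 5 - 1). Initializing the __shared__ neighbor data then takes 144 active threads, or a smaller number of threads that loop several times. Reads are most efficient when each row is loaded with the minimum number of memory accesses. That means each row should be loaded by a multiple of 12 threads. The mapping can easily be computed like this: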

const int nTileThreads = 32; // threads cooperating on the load; 64/128/256 should be ok too
const int tileWidth = 12;    // edge length of the padded tile
const int nTile = tileWidth * tileWidth;
const int numberOfLooping = (nTile + nTileThreads - 1) / nTileThreads; // ceil(nTile / nTileThreads)
__shared__ float tile[nTile];
for (int i = 0; i < numberOfLooping; i++)
{
    // straightforward, since the target (the shared tile) is dense
    const int tilePixelIndex = i * nTileThreads + threadIdx.x;

    // not straightforward, since the source is not dense (each tile row jumps imageWidth items)
    const int tileGlobalIndex = globalIndexOfFirstElement() + (tilePixelIndex / tileWidth) * imageWidth + (tilePixelIndex % tileWidth);
    if (tilePixelIndex < nTile)
    {
        tile[tilePixelIndex] = globalArray[tileGlobalIndex];
    }
}
__syncthreads();

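Here, globalIndexOfFirstElement() is simply the index of the top-left corner of the current tile. So if this is a 1D kernel, the x and y coordinates have to be converted (flattened) to 1D.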
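
A minimal sketch of what such a helper could look like for a 1D launch, assuming one block per TILE_SIZE x TILE_SIZE output tile, blocks numbered row by row, and a halo of MASK_WIDTH / 2 pixels on each side. The name firstElementIndex, the blockId parameter (blockIdx.x inside a kernel) and the HOST_DEVICE macro are hypothetical and not part of the answer; the macro only lets the same function compile as plain C++ so the arithmetic can be checked on the CPU.

```cpp
// HOST_DEVICE (hypothetical macro): lets the helper be called from kernels
// under nvcc and still compile with an ordinary C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr int MASK_WIDTH = 5;              // values taken from the question
constexpr int TILE_SIZE  = 8;
constexpr int HALO       = MASK_WIDTH / 2; // neighbor pixels needed on each side

// Hypothetical stand-in for globalIndexOfFirstElement().
// blockId is the flattened block number (blockIdx.x in a 1D launch).
// The result is signed on purpose: for tiles on the top or left image edge
// the corner lies outside the image and must not be dereferenced as-is.
HOST_DEVICE inline int firstElementIndex(int blockId, int imageWidth)
{
    const int tilesPerRow = (imageWidth + TILE_SIZE - 1) / TILE_SIZE;
    const int tileX = blockId % tilesPerRow;   // tile column
    const int tileY = blockId / tilesPerRow;   // tile row
    const int firstRow = tileY * TILE_SIZE - HALO;
    const int firstCol = tileX * TILE_SIZE - HALO;
    return firstRow * imageWidth + firstCol;
}
```

For a 32-pixel-wide image, block 5 starts at index 198 (row 6, column 6). Block 0 yields -66, and block 4 yields 190, which is an element near the end of row 5 rather than anything in row 6. Tiles on the image border therefore need separate handling.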

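The trick is to divide the image into tiles, with either a flattened or a 2D kernel, load each tile together with all of its neighbors in a single loop, and avoid reading past the image dimensions. For example, tiles on the image border must not read outside the image: along the middle of an edge such a read picks up the wrong elements, and at the corners it goes out of bounds. To get around this, you can compute the border tiles separately with a second, slower kernel that uses several if-else branches.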
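
One way to avoid a separate kernel for the border tiles is to clamp the source coordinates to the image, which repeats the edge pixels (the "repeat" option mentioned in the question's code comment). The following is a sketch under those assumptions, not the answerer's method: sourceIndex, clampInt, tileMaxFilter and the HOST_DEVICE macro are hypothetical names, the 2D launch configuration in the comment is assumed, and the 5x5 window maximum merely stands in for the asker's filter step. The kernel is only compiled by nvcc; the index helpers are plain C++ so they can be checked on the CPU.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr int MASK_WIDTH = 5;                          // values from the question
constexpr int TILE_SIZE  = 8;
constexpr int HALO       = MASK_WIDTH / 2;
constexpr int PADDED     = TILE_SIZE + MASK_WIDTH - 1; // 12
constexpr int N_TILE     = PADDED * PADDED;            // 144

// Clamp v into [lo, hi].
HOST_DEVICE inline int clampInt(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Flattened image index that element p (0..N_TILE-1) of the padded tile is
// loaded from. Coordinates outside the image are clamped to the nearest edge.
HOST_DEVICE inline int sourceIndex(int p, int blockX, int blockY, int height, int width)
{
    const int r = clampInt(blockY * TILE_SIZE - HALO + p / PADDED, 0, height - 1);
    const int c = clampInt(blockX * TILE_SIZE - HALO + p % PADDED, 0, width - 1);
    return r * width + c;
}

#ifdef __CUDACC__
// Assumed launch: dim3 block(TILE_SIZE, TILE_SIZE);
//                 dim3 grid((width + TILE_SIZE - 1) / TILE_SIZE,
//                           (height + TILE_SIZE - 1) / TILE_SIZE);
__global__ void tileMaxFilter(const unsigned char* in, unsigned char* out,
                              int height, int width)
{
    __shared__ unsigned char tile[N_TILE];
    const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    const int nThreads = blockDim.x * blockDim.y;

    // All threads of the block take part in the load, including threads whose
    // own output pixel falls outside the image.
    for (int p = tid; p < N_TILE; p += nThreads)
        tile[p] = in[sourceIndex(p, blockIdx.x, blockIdx.y, height, width)];
    __syncthreads();  // the tile must be complete before anyone reads it

    const int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    const int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    if (row < height && col < width) {
        unsigned char best = 0;  // placeholder operation: window maximum
        for (int i = 0; i < MASK_WIDTH; ++i)
            for (int j = 0; j < MASK_WIDTH; ++j) {
                const unsigned char v = tile[(threadIdx.y + i) * PADDED + threadIdx.x + j];
                if (v > best) best = v;
            }
        out[row * width + col] = best;
    }
}
#endif
```

With 64 threads per block, the 144 tile elements are loaded in three passes of the strided loop, and no thread needs to know whether it sits on a "border" of the block. Clamping costs two comparisons per coordinate, so whether it beats a dedicated border kernel is something to measure rather than assume.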
