CUDA call graph

3 votes

You can do this with the CUDA support in clang 3.8. First, compile your CUDA code to emit LLVM (example on Windows with CUDA 7.5 installed):

clang++ -c main.cu --cuda-gpu-arch=sm_35 -o main.ll -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include"

Then use the generated ll to build the call graph with opt:

opt.exe main.ll -analyze -dot-callgraph

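Note that opt is not part of the default binary distribution, so you may need to build it yourself (I had a 3.7.1 build, which was already able to handle the ll produced by 3.8).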

Example main.cu file:

#include <cuda_runtime.h>

__device__ int f() { return 1; }
__device__ float g(float* a) { return a[f()]; }
__device__ float h() { return 42.0f; }

__global__ void kernel(int a, float* b)
{
    int c = a + f();
    g(b);
    b[c] = h();
}

The generated dot file:

digraph "Call graph" {
  label="Call graph";
  Node0x1e3d438 [shape=record,label="{external node}"];
  Node0x1e3d438 -> Node0x1e3cfb0;
  Node0x1e3d438 -> Node0x1e3ce48;
  Node0x1e3d438 -> Node0x1e3d0a0;
  Node0x1e3d438 -> Node0x1e3d258;
  Node0x1e3d438 -> Node0x1e3cfd8;
  Node0x1e3d438 -> Node0x1e3ce98;
  Node0x1e3d438 -> Node0x1e3d000;
  Node0x1e3d438 -> Node0x1e3cee8;
  Node0x1e3d438 -> Node0x1e3d078;
  Node0x1e3d000 [shape=record,label="{__cuda_module_ctor}"];
  Node0x1e3d000 -> Node0x1e3ce98;
  Node0x1e3d000 -> Node0x1e3d168;
  Node0x1e3d078 [shape=record,label="{__cuda_module_dtor}"];
  Node0x1e3d078 -> Node0x1e3cee8;
  Node0x1e3cfb0 [shape=record,label="{^A?f@@YAHXZ}"];
  Node0x1e3d0a0 [shape=record,label="{^A?h@@YAMXZ}"];
  Node0x1e3ce48 [shape=record,label="{^A?g@@YAMPEAM@Z}"];
  Node0x1e3ce48 -> Node0x1e3cfb0;
  Node0x1e3d258 [shape=record,label="{^A?kernel@@YAXHPEAM@Z}"];
  Node0x1e3d258 -> Node0x1e3cfb0;
  Node0x1e3d258 -> Node0x1e3ce48;
  Node0x1e3d258 -> Node0x1e3d0a0;
  Node0x1e3d168 [shape=record,label="{__cuda_register_kernels}"];
  Node0x1e3cee8 [shape=record,label="{__cudaUnregisterFatBinary}"];
  Node0x1e3cee8 -> Node0x1e3d528;
  Node0x1e3cfd8 [shape=record,label="{__cudaRegisterFunction}"];
  Node0x1e3cfd8 -> Node0x1e3d528;
  Node0x1e3ce98 [shape=record,label="{__cudaRegisterFatBinary}"];
  Node0x1e3ce98 -> Node0x1e3d528;
}
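
If you want the edges by function name rather than by node address, a small host-side helper can resolve the labels. This is only a rough sketch: `callEdges` is a hypothetical name, and it assumes the dot text looks exactly like the output above (statements separated by `;`, node ids starting with `Node0x`, labels written as `label="{...}"`). It is not a general dot parser.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (caller, callee) pairs, using each node's
// label when one is defined and the raw node id otherwise.
std::vector<std::pair<std::string, std::string>>
callEdges(const std::string& dot) {
    std::map<std::string, std::string> labels;             // node id -> label
    std::vector<std::pair<std::string, std::string>> raw;  // id -> id edges
    std::istringstream in(dot);
    std::string stmt;
    while (std::getline(in, stmt, ';')) {                  // one statement per ';'
        std::size_t id = stmt.find("Node0x");
        if (id == std::string::npos) continue;             // header, "label=", "}"
        std::size_t idEnd = stmt.find_first_of(" \t\n[", id);
        std::string from = stmt.substr(id, idEnd - id);

        std::size_t lab = stmt.find("label=\"{");
        if (lab != std::string::npos) {                    // node definition
            std::size_t begin = lab + 8;                   // skip label="{
            std::size_t end = stmt.find("}\"", begin);
            labels[from] = stmt.substr(begin, end - begin);
            continue;
        }
        std::size_t arrow = stmt.find("->", idEnd);
        if (arrow == std::string::npos) continue;
        std::size_t to = stmt.find("Node0x", arrow);       // edge target
        if (to == std::string::npos) continue;
        std::size_t toEnd = stmt.find_first_of(" \t\n", to);
        raw.emplace_back(from, stmt.substr(to, toEnd - to));
    }
    // Second pass: nodes may be defined after the edges that mention them.
    std::vector<std::pair<std::string, std::string>> edges;
    for (const auto& e : raw) {
        auto a = labels.find(e.first);
        auto b = labels.find(e.second);
        edges.emplace_back(a != labels.end() ? a->second : e.first,
                           b != labels.end() ? b->second : e.second);
    }
    return edges;
}
```

Read the dot file into a `std::string`, pass it to `callEdges`, and print the pairs. Labels come back exactly as they appear in the file, so on Windows they are still the mangled names.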

0 votes

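Starting with CUDA 11.3, the device code linker nvlink can also generate an assembly-level call graph. Using the same example as the accepted answer, but with the device functions and the calling kernel split into two translation units and linked: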

$ cat callgraph_f.cu
__device__ int f() { return 1; }
__device__ float g(float* a) { return a[f()] ; }
__device__ float h() { return 42.0f ; }

$ cat callgraph.cu
#include <cuda_runtime.h>
extern __device__ int f();
extern __device__ float g(float* a);
extern __device__ float h();

__global__ void kernel (int a, float* b)
{
        int c = a + f();
        g(b);
        b[c] = h();
}

$ nvcc -dc callgraph_f.cu 
$ nvcc -dc callgraph.cu
$ nvcc -Xnvlink -dump-callgraph callgraph.o callgraph_f.o
callgraph for sm_52:
# A: s -> B // s (number A) potentially calls B
# s [N]     // s uses N registers
# ^s        // s is entry point
# &s        // s has address taken
1: ^kernel(int, float *) [6] -> 5 4 3
2: 
3: f() [5] ->
4: g(float *) [8] ->
5: h() [5] ->
regcount 8 for g(float *) propagated to entry kernel(int, float *)

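This differs slightly from the Clang-based call graph shown in the other answer. The Clang approach works on the IR generated by the high-level compiler before PTX assembly, whereas the nvlink call graph should reflect the code after all optimization passes have run, and it can also show register information for each function. Which one is more useful depends on your use case for the call graph.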
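
If you want to post-process the nvlink dump, for example to list which functions an entry point can reach, the text is easy enough to parse. The following is only a sketch based on the sample output above; I am not aware of a documented grammar for this dump, so treat the line layout (`N: [^&]name [regs] -> callees`) as an assumption. `CallNode` and `parseCallgraph` are hypothetical names.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One "N: name [regs] -> callees" line of the dump (hypothetical type).
struct CallNode {
    int id = 0;
    bool isEntry = false;       // line had a '^' prefix
    bool addressTaken = false;  // line had a '&' prefix
    std::string name;
    int registers = 0;
    std::vector<int> callees;   // numbers listed after "->"
};

std::vector<CallNode> parseCallgraph(const std::string& text) {
    const std::size_t npos = std::string::npos;
    std::vector<CallNode> nodes;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // Keep only lines of the form "<digits>:"; this skips the header,
        // the '#' legend and the "regcount ..." summary.
        std::size_t colon = line.find(':');
        if (colon == npos || colon == 0) continue;
        if (line.find_first_not_of("0123456789") != colon) continue;

        std::size_t lb = line.rfind(" [");            // register count
        if (lb == npos || lb < colon) continue;        // e.g. the empty "2: " line
        std::size_t rb = line.find(']', lb);
        if (rb == npos) continue;

        CallNode n;
        n.id = std::stoi(line.substr(0, colon));
        std::size_t start = line.find_first_not_of(' ', colon + 1);
        while (start < lb && (line[start] == '^' || line[start] == '&')) {
            if (line[start] == '^') n.isEntry = true;
            else n.addressTaken = true;
            ++start;
        }
        n.name = line.substr(start, lb - start);
        n.registers = std::stoi(line.substr(lb + 2, rb - lb - 2));

        std::size_t arrow = line.find("->", rb);
        if (arrow != npos) {
            std::istringstream rest(line.substr(arrow + 2));
            for (int callee; rest >> callee;) n.callees.push_back(callee);
        }
        nodes.push_back(n);
    }
    return nodes;
}
```

Capture the linker's output into a string and pass it to `parseCallgraph`. The callee numbers refer to the `id` field of the other nodes, so building an adjacency list from the result is straightforward.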
