How to estimate GPU performance using clGetDeviceInfo()


I am trying to automatically decide how to split a workload between CPU and GPU.

What I would like to do is iterate over all devices and estimate the total theoretical compute power in GFLOPS by a simple multiplication:

GFLOPS = clock_speed * number_of_cores

* Yes, I know this is very crude, since each operation takes a different number of clock cycles on each architecture, and there are inefficiencies due to cache misses etc. But it is still some rough estimate of raw compute power.

Now I can get

CL_DEVICE_MAX_CLOCK_FREQUENCY
CL_DEVICE_MAX_COMPUTE_UNITS

from
clGetDeviceInfo()

While the clock frequency of
1777 MHz
looks reasonable, the
28
compute units seem low for my
NVIDIA GeForce RTX 3060
, which according to
wikipedia
has Core Config: 3584:112:48:28:112

Looking at the documentation:

CL_DEVICE_MAX_COMPUTE_UNITS
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.

it looks like

GFLOPS = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_GROUP_SIZE

but that does not work for my Intel CPU, which has

CL_DEVICE_MAX_WORK_GROUP_SIZE = 4096
and would then come out as more powerful than the GPU.
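Plugging the numbers from the device listing below into that formula makes the problem concrete; a quick sanity check (the helper function name is made up for illustration):

```cpp
#include <cmath>

// Naive estimate from the formula above:
// GFLOPS = clock (MHz) * compute units * max work-group size / 1000
double naive_gflops(double clock_mhz, int compute_units, int work_group_size) {
    return 1.0e-3*clock_mhz*(double)compute_units*(double)work_group_size;
}

// RTX 3060:  naive_gflops(1777.0, 28, 1024) ~=  50950 GFLOPS
// i5-10400F: naive_gflops(4300.0, 12, 4096) ~= 211354 GFLOPS
// The CPU comes out ~4x "faster" than the GPU, while its real FP32
// throughput is more than an order of magnitude lower.
```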

I tried to print all the device parameters that

clGetDeviceInfo()
provides, but none of them seems to give me the required information:

OpenCL platform count 2 
DEVICE[0,0]: NVIDIA GeForce RTX 3060
        VENDOR:         NVIDIA Corporation
        DEVICE_VERSION: OpenCL 3.0 CUDA
        DRIVER_VERSION: 515.86.01
        C_VERSION:      OpenCL C 1.2 
        MAX_COMPUTE_UNITS:    28
        MAX_CLOCK_FREQUENCY:  1777  MHz 
        GLOBAL_MEM_SIZE:      12019 MB  
        LOCAL_MEM_SIZE:       48 kB  
        CONSTANT_BUFFER_SIZE: 64 kB  
        GLOBAL_MEM_CACHE_SIZE:     784 kB 
        GLOBAL_MEM_CACHELINE_SIZE: 128 
        MAX_WORK_ITEM_DIMENSIONS: 3  
        MAX_WORK_GROUP_SIZE:      1024 
        MAX_WORK_ITEM_SIZES:      [1024,1024,1024] 
        MIN_DATA_TYPE_ALIGN_SIZE: 128    
        
DEVICE[1,0]: pthread-Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
        VENDOR:         GenuineIntel
        DEVICE_VERSION: OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-skylake
        DRIVER_VERSION: 1.8
        C_VERSION:      OpenCL C 1.2 pocl 
        MAX_COMPUTE_UNITS:    12
        MAX_CLOCK_FREQUENCY:  4300  MHz 
        GLOBAL_MEM_SIZE:      13796 MB  
        LOCAL_MEM_SIZE:       256 kB  
        CONSTANT_BUFFER_SIZE: 256 kB  
        GLOBAL_MEM_CACHE_SIZE:     12288 kB 
        GLOBAL_MEM_CACHELINE_SIZE: 64 
        MAX_WORK_ITEM_DIMENSIONS: 3  
        MAX_WORK_GROUP_SIZE:      4096 
        MAX_WORK_ITEM_SIZES:      [4096,4096,4096] 
        MIN_DATA_TYPE_ALIGN_SIZE: 128   

Related: OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

Edit:

From reading some sources (Wikipedia, Nvidia documentation) I found out that each compute unit has 128 shader cores (or CUDA cores). But this is vendor-specific information that I, as a user, have to look up in external sources. I want to estimate the compute power automatically from the information that

clGetDeviceInfo()

provides.
1 Answer

There are 2 specs that mostly determine GPU performance:

  • FP32 TFlops (in some cases also FP64/FP16 TFlops)
  • memory bandwidth (in some cases also cache bandwidth)

OpenCL's
clGetDeviceInfo
does not provide these. But: you do get the number of CUs with
CL_DEVICE_MAX_COMPUTE_UNITS
and the peak core clock speed with
CL_DEVICE_MAX_CLOCK_FREQUENCY
, and together with the instructions per cycle (IPC) this can at least give you a good FP32 TFlops estimate:

int compute_units = (int)cl_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
int clock_frequency = (int)cl_device.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
int ipc = cl_device.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU ? 2 : 32; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs

// int cores_per_cu = ?; // unknown at this point, see below

int cores = compute_units*cores_per_cu;
float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency;
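Applied to the two devices from the question (28 CUs at 128 cores/CU for the Ampere GPU; 12 hyperthreads, i.e. 1/2 core per reported CU, for the CPU), the arithmetic can be exercised as a standalone function; this is a sketch of the same formula, not part of any API:

```cpp
#include <cmath>

// sketch of the TFlops arithmetic above, factored into a function
float estimate_tflops(int compute_units, float cores_per_cu, int ipc, int clock_mhz) {
    const float cores = (float)compute_units*cores_per_cu; // total core count
    return 1.0E-6f*cores*(float)ipc*(float)clock_mhz; // MHz * 2 ops/cycle -> TFlops
}

// RTX 3060 (Ampere): estimate_tflops(28, 128.0f,  2, 1777) ~= 12.74 TFlops
// i5-10400F (HT):    estimate_tflops(12,   0.5f, 32, 4300) ~=  0.83 TFlops
```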

That leaves one unknown: the number of cores per CU. This depends on the GPU microarchitecture and can be:

  • 1/2 (CPUs with hyperthreading)
  • 1 (CPUs without HT)
  • 8 (Intel iGPUs/dGPUs, ARM GPUs)
  • 64 (Nvidia P100/Volta/Turing/A100/A30, AMD GCN/CDNA)
  • 128 (Nvidia Maxwell/Pascal/Ampere/Hopper/Ada, AMD RDNA/RDNA2)
  • 192 (Nvidia Kepler)
  • 256 (AMD RDNA3)

You can work this out by checking the device vendor + name, which reveals the microarchitecture:

string name = trim(cl_device.getInfo<CL_DEVICE_NAME>()); // device name
string vendor = trim(cl_device.getInfo<CL_DEVICE_VENDOR>()); // device vendor
bool is_gpu = cl_device.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU; // distinguish GPUs from CPUs

bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {"gt 6", "gt 7", "gtx 6", "gtx 7", "quadro k", "tesla k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "quadro t", "tesla t", "quadro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported

float cores_per_cu_nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_64_cores_per_cu?64.0f:nvidia_192_cores_per_cu?192.0f:128.0f); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere, Hopper, Ada) or 64 cores/CU (P100, Volta, Turing, A100, A30)
float cores_per_cu_amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_256_cores_per_dualcu?256.0f:amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2) or 256 cores/dualCU (RDNA3), AMD CPUs (with SMT) have 1/2 core/CU
float cores_per_cu_intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
float cores_per_cu_apple = (float)(contains(to_lower(vendor), "apple"))*(128.0f); // Apple ARM GPUs usually have 128 cores/CU
float cores_per_cu_arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU

int cores_per_cu = (int)(cores_per_cu_nvidia+cores_per_cu_amd+cores_per_cu_intel+cores_per_cu_apple+cores_per_cu_arm+0.5f); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
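The snippet uses small string helpers (trim, to_lower, contains, contains_any) that are not part of OpenCL; they come from the answer author's utility header. A minimal sketch of what they are assumed to do:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>
using std::string;

// assumed behavior of the string helpers used above; not the original implementations
string trim(const string& s) { // strip leading/trailing whitespace
    const size_t a = s.find_first_not_of(" \t\r\n");
    if(a==string::npos) return "";
    const size_t b = s.find_last_not_of(" \t\r\n");
    return s.substr(a, b-a+1u);
}
string to_lower(string s) { // return an ASCII-lowercased copy
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) { return (char)std::tolower(c); });
    return s;
}
bool contains(const string& s, const string& match) { // substring test
    return s.find(match)!=string::npos;
}
bool contains_any(const string& s, const std::vector<string>& matches) { // true if any substring matches
    for(const string& match : matches) if(contains(s, match)) return true;
    return false;
}
```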

Find the full source code here. This estimate is correct for the vast majority of CPUs and GPUs. There are, however, some notable exceptions:

  • CPUs without hyperthreading are detected with only half of their cores.
  • The CPU ipc=32 only applies to CPUs with AVX2 support, which is the vast majority of modern CPUs. Very old CPUs that only support AVX have ipc=16, and some HEDT CPUs that support AVX-512 have ipc=64.
  • Some GPUs share the same name but can be 2 different microarchitectures, for example the GTX 860M (Kepler or Maxwell). These are not easy to tell apart and would require a more advanced lookup table.

Unfortunately, there is no way to figure out the memory bandwidth. For that you either have to maintain an extensive lookup table covering hundreds of GPUs, or run a quick benchmark.
