I am trying to automatically decide how to split a workload between CPU and GPU.
What I want to do is iterate over all devices and estimate their total theoretical compute power in GFLOPS with a simple multiplication:
GFLOPS = clock_speed * number_of_cores
* Yes, I know this is very crude, since every operation takes a different number of clock cycles on every architecture, and there are inefficiencies due to cache misses etc. But it is still some rough estimate of raw compute power.
Now I can get
CL_DEVICE_MAX_CLOCK_FREQUENCY
and CL_DEVICE_MAX_COMPUTE_UNITS
from clGetDeviceInfo().
While the clock frequency of
1777 MHz
seems reasonable, 28
compute units seems too low for my NVIDIA GeForce RTX 3060,
which according to Wikipedia has
Core Config: 3584 112:48:28:112
Looking at the documentation:
CL_DEVICE_MAX_COMPUTE_UNITS
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.
it seems like it should be
GFLOPS = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_GROUP_SIZE
but that does not work either: my Intel CPU has
CL_DEVICE_MAX_WORK_GROUP_SIZE = 4096
which would make it come out more powerful than the GPU.
I tried to print all the device parameters that
clGetDeviceInfo()
provides, but none of them seems to give me the information I need:
OpenCL platform count 2
DEVICE[0,0]: NVIDIA GeForce RTX 3060
VENDOR: NVIDIA Corporation
DEVICE_VERSION: OpenCL 3.0 CUDA
DRIVER_VERSION: 515.86.01
C_VERSION: OpenCL C 1.2
MAX_COMPUTE_UNITS: 28
MAX_CLOCK_FREQUENCY: 1777 MHz
GLOBAL_MEM_SIZE: 12019 MB
LOCAL_MEM_SIZE: 48 kB
CONSTANT_BUFFER_SIZE: 64 kB
GLOBAL_MEM_CACHE_SIZE: 784 kB
GLOBAL_MEM_CACHELINE_SIZE: 128
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 1024
MAX_WORK_ITEM_SIZES: [1024,1024,1024]
MIN_DATA_TYPE_ALIGN_SIZE: 128
DEVICE[1,0]: pthread-Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
VENDOR: GenuineIntel
DEVICE_VERSION: OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-skylake
DRIVER_VERSION: 1.8
C_VERSION: OpenCL C 1.2 pocl
MAX_COMPUTE_UNITS: 12
MAX_CLOCK_FREQUENCY: 4300 MHz
GLOBAL_MEM_SIZE: 13796 MB
LOCAL_MEM_SIZE: 256 kB
CONSTANT_BUFFER_SIZE: 256 kB
GLOBAL_MEM_CACHE_SIZE: 12288 kB
GLOBAL_MEM_CACHELINE_SIZE: 64
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 4096
MAX_WORK_ITEM_SIZES: [4096,4096,4096]
MIN_DATA_TYPE_ALIGN_SIZE: 128
Related: OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS
Edit:
From reading some sources (Wikipedia, Nvidia documentation) I found out that each compute unit has 128 shader cores (or CUDA cores). But this is vendor-specific information that I, as a user, had to look up in external sources. I would like to estimate the compute power automatically from the information provided by
clGetDeviceInfo().
There are 2 specs that mainly determine GPU performance: compute throughput (TFLOPs) and memory bandwidth. OpenCL
clGetDeviceInfo
does not provide these directly. But: you get the number of CUs via CL_DEVICE_MAX_COMPUTE_UNITS, the peak core clock speed (as per the data sheet) via CL_DEVICE_MAX_CLOCK_FREQUENCY, and together with the instructions per cycle (IPC), this at least gives you a good FP32 TFLOPs estimate:
int compute_units = (int)cl_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
int clock_frequency = (int)cl_device.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
bool is_gpu = cl_device.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
int ipc = is_gpu ? 2 : 32; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
// int cores_per_cu = ?; // unknown at this point, see below
int cores = compute_units*cores_per_cu;
float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency;
One unknown remains: the number of cores per CU. This depends on the GPU microarchitecture and can be:
1/2 (CPUs with hyperthreading)
1 (CPUs without HT)
8 (Intel iGPUs/dGPUs, ARM GPUs)
64 (Nvidia P100/Volta/Turing/A100/A30, AMD GCN/CDNA)
128 (Nvidia Maxwell/Pascal/Ampere/Hopper/Ada, AMD RDNA/RDNA2)
192 (Nvidia Kepler)
256 (AMD RDNA3)
You can figure this out by checking the device vendor + name, which reveals the microarchitecture:
string name = trim(cl_device.getInfo<CL_DEVICE_NAME>()); // device name
string vendor = trim(cl_device.getInfo<CL_DEVICE_VENDOR>()); // device vendor
bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {"gt 6", "gt 7", "gtx 6", "gtx 7", "quadro k", "tesla k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "quadro t", "tesla t", "quadro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported
float cores_per_cu_nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_64_cores_per_cu?64.0f:nvidia_192_cores_per_cu?192.0f:128.0f); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere, Hopper, Ada) or 64 cores/CU (P100, Volta, Turing, A100, A30)
float cores_per_cu_amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_256_cores_per_dualcu?256.0f:amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2) or 256 cores/dualCU (RDNA3), AMD CPUs (with SMT) have 1/2 core/CU
float cores_per_cu_intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
float cores_per_cu_apple = (float)(contains(to_lower(vendor), "apple"))*(128.0f); // Apple ARM GPUs usually have 128 cores/CU
float cores_per_cu_arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
int cores_per_cu = (int)(cores_per_cu_nvidia+cores_per_cu_amd+cores_per_cu_intel+cores_per_cu_apple+cores_per_cu_arm+0.5f); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
Find the full source code here. This estimate is correct for the vast majority of CPUs and GPUs. There are, however, some notable exceptions:
ipc=32
only applies to CPUs with AVX2 support, which is the vast majority of modern CPUs. Very old CPUs may only support AVX and have ipc=16
, and some HEDT CPUs support AVX-512 and have ipc=64
. Unfortunately, there is no way to figure out the memory bandwidth. For this you would either need an extensive lookup table covering hundreds of GPUs, or a quick benchmark.