I have access to an HPC system. Suppose I have three nodes/systems available. The details of each node are as follows:
scontrol show node
Arch=x86_64 CoresPerSocket=10
CPUAlloc=20 CPUTot=20 CPULoad=22.67
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
.
.
RealMemory=91000 AllocMem=0 FreeMem=77291 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cpu_normal_q
BootTime=2023-10-20T12:56:13 SlurmdStartTime=2023-10-20T12:57:43
CfgTRES=cpu=20,mem=91000M,billing=20
AllocTRES=cpu=20
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I have an R script that uses the doParallel package to perform parallel computation:
library(doParallel)
library(Matrix)

# Set the number of cores to be used
num_cores <- detectCores()

# Initialize a parallel backend using doParallel
cl <- makeCluster(num_cores)

# Register the cluster for parallel processing
registerDoParallel(cl)

# Get the number of cores being utilized
cores_utilized <- getDoParWorkers()

# Function to perform matrix multiplication and inversion
matrix_mult_inv <- function() {
  # Generate a random matrix
  mat <- matrix(rnorm(10000), nrow = 100)
  # Perform matrix multiplication
  result <- mat %*% mat
  # Compute the inverse of the result matrix
  inv_result <- solve(result)
  return(inv_result)
}

# Record the start time
start_time <- Sys.time()

# Perform the matrix multiplication and inversion in parallel
result <- foreach(i = 1:300, .combine = cbind) %dopar% {
  inv <- matrix_mult_inv()
  write.table(inv, paste("iteration_", i, ".txt", sep = ""))
  inv  # return the matrix so .combine = cbind has something to bind
}

# Record the end time
end_time <- Sys.time()

# Print the number of cores being utilized
print(paste("Number of cores being utilized:", cores_utilized))

# Print the time taken to run all the iterations
print(paste("Time taken:", end_time - start_time))

# Stop the parallel backend
stopCluster(cl)
The code is designed to run 300 iterations, performing a matrix multiplication and inversion in each one. The output is the total time taken to run all 300 iterations and the number of cores used.
My goal is to run this code in the HPC environment so that it uses 20 cores on each system simultaneously, giving me 60 cores in total. Is this possible?
I have also looked into parSapply from the snow package, but I think it ultimately comes down to the makeCluster function. I tried

cl <- makeCluster(num_nodes, type = "SOCK", explicit = TRUE,
                  outfile = "", nodes = c(#3 specific node names input here#),
                  cpus = cores_per_node)

but this only used 3 cores in total.
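For reference: with a SOCK/PSOCK cluster, parallel::makeCluster() starts one worker per hostname entry, so a vector of 3 node names yields exactly 3 workers. A minimal sketch of repeating each name once per core (localhost stands in here for the real node names, which are hypothetical placeholders):

```r
library(parallel)
library(doParallel)

# One PSOCK worker is launched per hostname entry, so each node name
# must appear once per core wanted on that node; on the cluster this
# would be e.g. rep(c("node1", "node2", "node3"), each = 20) for 60 workers.
node_names <- rep("localhost", 3)
cl <- makeCluster(node_names, type = "PSOCK")
registerDoParallel(cl)
print(getDoParWorkers())  # 3 here; 60 with the repeated node names
stopCluster(cl)
```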
Most parallel computing development in R has targeted laptops, making Windows, Mac, and Unix look the same (by doing different things under the hood). This causes confusion, and is often inefficient, when transitioning to a Unix cluster. The pbdMPI package was developed specifically for clusters and their standard MPI practices. Here is your code rewritten with pbdMPI, with some explanations in the comments. The main thing to realize is that clusters are designed to run batch jobs, and that your script.R is a generalization of serial code: several instances of the same code run asynchronously, differing only in their assigned rank. Most rank management is automated.
library(pbdMPI)
library(Matrix)

# Function to perform matrix multiplication and inversion
matrix_mult_inv = function() {
  # Generate a random matrix
  mat = matrix(rnorm(10000), nrow = 100)
  # Perform matrix multiplication
  result = mat %*% mat
  # Compute the inverse of the result matrix
  inv_result = solve(result)
  return(inv_result)
}

n_mat = 300
my_mats = comm.chunk(n_mat, form = "vector") # split the work among ranks
size = comm.size()
rank = comm.rank()
cat("Hello from rank", rank, "of", size, "\n") # announce who is working

mat_list = vector("list", length(my_mats))
for(j in seq_along(my_mats)) {
  mat_list[[j]] = matrix_mult_inv()
  ## write.table(mat_list[[j]], paste0("iteration_", my_mats[j], ".txt"))
  ## If you will write out, it's a good plan to use a separate file per iteration
}

## It is probably faster to combine to rank 0 in memory (if not too big);
## you can also combine to all ranks with allgather()
my_cbind_mats = do.call(cbind, mat_list)
all_mats_r0 = gather(my_cbind_mats) # a list on rank 0; NULL on the other ranks
## You can also write out at this point, in bigger batches
if(rank == 0) {
  all_mats_r0 = do.call(cbind, all_mats_r0) # bind the per-rank pieces
  print(dim(all_mats_r0))
  ## Here you can do further computation with all_mats_r0
}
## Or if you used allgather(), you'd have all matrices on all ranks (memory permitting)

## If you want to use parallel::mclapply() within ranks, you'll need
num_cores = Sys.getenv("SLURM_CPUS_PER_TASK")
comm.cat("my_cores:", num_cores, "\n", all.rank = TRUE)

finalize() # required! for a graceful exit
To run this on a Slurm cluster, save it in script.R, save the following script in script.sh, and submit it from a login node with sbatch script.sh.
#!/bin/bash
#SBATCH --job-name myjob
#SBATCH --account=<your-account>
#SBATCH --partition=<your-cpu-partition>
#SBATCH --mem=64g
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --time 00:20:00
#SBATCH -e ./myjob.e
#SBATCH -o ./myjob.o
pwd
module load r ## this may differ on your cluster
module list
time mpirun -np 16 Rscript script.R
In all of the above, comm.chunk() returns a different result on each rank, so that each rank works on different data. Timing the code is best done in the shell script (by prepending time to mpirun, as above), because the ranks run asynchronously and any internal timing may not reflect the collective performance. The time goes to the error output in myjob.e, and your regular output goes to myjob.o. For more information, see the pbdMPI package documentation, for example on managing printing with comm.print() and comm.cat().
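As an aside, the split that comm.chunk() computes can be illustrated in plain R. This sketch mirrors the "vector" form under a hypothetical 16-rank run (the real function reads the rank from MPI, and its exact chunk boundaries may differ slightly):

```r
# Plain-R sketch of dividing 300 iterations among 16 ranks;
# each rank would receive one of these index vectors as my_mats.
n_mat <- 300
size <- 16
chunks <- split(seq_len(n_mat), cut(seq_len(n_mat), size, labels = FALSE))
print(lengths(chunks))       # roughly equal pieces of 18-19 iterations
print(sum(lengths(chunks)))  # 300: every iteration assigned exactly once
```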
All of this should be configured for your needs. If you run some other multithreaded code within each rank (such as mclapply(), for its shared-memory benefits), you can increase the 1 core per session (--cpus-per-task in the script above). Just pay attention to the total number of cores on each node so that you don't oversubscribe.
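A hedged sketch of that hybrid setup, as it would run inside each rank. SLURM_CPUS_PER_TASK is only set inside a Slurm job, so it defaults to 1 here; the 4-iteration workload is a stand-in for the rank's share of matrices:

```r
library(parallel)

# Cores Slurm allocated to this task (rank); 1 when not running under Slurm.
num_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

# Fan the rank's own iterations out over its cores with fork-based
# (shared-memory) mclapply; mc.cores > 1 requires Unix, which a cluster is.
res <- mclapply(1:4, function(j) {
  mat <- matrix(rnorm(10000), nrow = 100)
  solve(mat %*% mat)
}, mc.cores = num_cores)

print(length(res))  # one inverted 100 x 100 matrix per iteration
```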