Segmentation fault (11) error while running a CFD solver on a Linux cluster


I am running a CFD solver (CFD++ by Metacomp Technologies) for many design points. Each design point has its own folder containing all the required files. To submit a single simulation on our Linux cluster, we normally place a bash submission script (Code 1, attached below) in that folder and submit it. Since I need to run this for many design points, I created a separate folder for each one, kept a single submission script outside them, and loop over all the design-point folders as shown in Code 2. When I run the simulations individually they complete the full number of iterations, but when I run them together through Code 2 I get the segmentation fault shown below. Please help me.
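For context, "running them individually" means submitting Code 1 from inside each folder. A minimal sketch of that per-job pattern (assuming Code 1 is saved as submit_job.sh inside each dpN folder; the script name is a placeholder) is:

for i in {19..326..1}
do
    # sbatch starts each job in the directory it was submitted from,
    # so every design point becomes its own independent Slurm job.
    cd "/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2/dp$i/" || continue
    sbatch submit_job.sh
done

Each of those independent jobs completes all iterations; it is the single-job loop in Code 2 below that fails.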

Code 1:

#!/bin/bash -login
# Propagate environment variables to the compute node
#SBATCH --export=ALL

#SBATCH --partition iist-all

# set the number of nodes and processes per node
#SBATCH --nodes=1


# set the number of tasks (processes) per node.
#SBATCH --ntasks-per-node=10

# set name of job
#SBATCH --job-name=job1

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH [email protected]

module purge
module load gnu7/7.3.0
module load openmpi3/3.1.0
module load metacomp/18.1

mcmetis pmetis 1 $SLURM_NPROCS

srun hostname -s | sort | uniq > hosts.$SLURM_JOB_ID

echo $SLURM_JOB_ID
echo $SLURM_NPROCS

#export METACOMP_LICENSE_FILE=27000@virgo
export [email protected]

#mpirun ompi310mcfd
#mpirun -np $SLURM_NPROCS -machinefile hosts.$SLURM_JOB_ID ompi310mcfd
#mpirun --display-map --display-allocation ompi310mcfd >& mcfd.log </dev/null 
time mpirun ompi310mcfd >& mcfd.log </dev/null 

exbc2do1 exbcsin.bin pltosout.bin 4
exbc2do1 exbcsin.bin pltosout.bin 5

infout1f 1
infout1f 2

cinfout2 mcfd.info2.bcs4 > pressure_bc4.txt
cinfout2 mcfd.info2.bcs5 > pressure_bc5.txt

cinfout3 mcfd.info3.bcs4 > heatflux_bc4.txt
cinfout3 mcfd.info3.bcs5 > heatflux_bc5.txt
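One variant of Code 1 I have considered (a sketch, not what I currently run): gating the post-processing utilities on the solver's exit status, so they never operate on incomplete output. Only plain bash is added; solver_status is my own variable name:

time mpirun ompi310mcfd >& mcfd.log </dev/null
solver_status=$?   # exit status of mpirun (the time keyword preserves it)
if [ $solver_status -eq 0 ]; then
    exbc2do1 exbcsin.bin pltosout.bin 4
    exbc2do1 exbcsin.bin pltosout.bin 5
    infout1f 1
    infout1f 2
    cinfout2 mcfd.info2.bcs4 > pressure_bc4.txt
    cinfout2 mcfd.info2.bcs5 > pressure_bc5.txt
    cinfout3 mcfd.info3.bcs4 > heatflux_bc4.txt
    cinfout3 mcfd.info3.bcs5 > heatflux_bc5.txt
else
    echo "mpirun exited with status $solver_status; skipping post-processing" >&2
fi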

Code 2:

#!/bin/bash -login
# Propagate environment variables to the compute node
#SBATCH --export=ALL

#SBATCH --partition iist-all

# set the number of nodes and processes per node
#SBATCH --nodes=1


# set the number of tasks (processes) per node.
#SBATCH --ntasks-per-node=28

# set name of job
#SBATCH --job-name=ddc_s2

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH [email protected]



module purge
module load gnu7/7.3.0
module load openmpi3/3.1.0
module load metacomp/18.1


srun hostname -s | sort | uniq > hosts.$SLURM_JOB_ID
echo $SLURM_JOB_ID
echo $SLURM_NPROCS
#export METACOMP_LICENSE_FILE=27000@virgo
export [email protected]


for i in {19..326..1}
# bash brace expansion: {start..end..increment}
do

    cd "/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2/dp$i/"


    
    mcmetis pmetis 1 $SLURM_NPROCS
  
    #mpirun ompi310mcfd
    #mpirun -np $SLURM_NPROCS -machinefile hosts.$SLURM_JOB_ID ompi310mcfd
    #mpirun --display-map --display-allocation ompi310mcfd >& mcfd.log </dev/null
    time mpirun -machinefile ../hosts.$SLURM_JOB_ID ompi310mcfd >& mcfd.log </dev/null

    
    exbc2do1 exbcsin.bin pltosout.bin 4
    exbc2do1 exbcsin.bin pltosout.bin 5

    infout1f 1
    infout1f 2

    cinfout2 mcfd.info2.bcs4 > pressure_bc4.txt
    cinfout2 mcfd.info2.bcs5 > pressure_bc5.txt

    cinfout3 mcfd.info3.bcs4 > heatflux_bc4.txt
    cinfout3 mcfd.info3.bcs5 > heatflux_bc5.txt


    python3 post_processing.py > post_processing.txt


    cd "/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2/"
    
    echo -e "dp$i completed \n \n \n \n \n \n \n \n \n \n \n "
done

cd "/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2/"

python3 post_processing_main.py > post_processing_main.txt
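A hardened version of the Code 2 loop that I could try (a sketch, untested; BASE and status are my own names, and the only changes are a guarded cd, an absolute path to the hosts file, and an exit-status check):

BASE="/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2"

for i in {19..326..1}
do
    # Skip a missing dpN folder instead of silently re-running in the previous one.
    cd "$BASE/dp$i" || { echo "dp$i: folder not found, skipping" >&2; continue; }

    # Repartition the mesh for this design point's grid.
    mcmetis pmetis 1 $SLURM_NPROCS

    time mpirun -machinefile "$BASE/hosts.$SLURM_JOB_ID" ompi310mcfd >& mcfd.log </dev/null
    status=$?
    if [ $status -ne 0 ]; then
        echo "dp$i: solver exited with status $status (see dp$i/mcfd.log)" >&2
        cd "$BASE"
        continue    # move on to the next design point instead of post-processing
    fi

    # ... post-processing and python3 post_processing.py exactly as in Code 2 ...

    cd "$BASE"
    echo "dp$i completed"
done

This would not explain the segmentation fault itself, but it would at least keep one failed design point from corrupting the runs after it.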

Here, post_processing.py and post_processing_main.py are Python scripts I wrote to extract useful data from the solver output.

Error message 1 (last part of the output file only):

---------------------------------------------
Max QN message size in commgr2y = 17472 bytes
---------------------------------------------
--------------------------------------------------------------
 CPU#   MEMALLO   NATCELLS   NE1CELLS   ALLCELLS
           (MB)
  MIN        33       7426       7601       7782
  MAX        41       7428       7979       8562
  ALL      1111     207942     216383     225155
28 CPUs (Ranks) used in this run
--------------------------------------------------------------
  step#           time        delta-t         rhsave         rhsmax
              x_rhsmax       y_rhsmax       z_rhsmax     cel_rhsmax
                eigmax         eigmin         cflglo         cflloc
                rvoave         cputim         clktim
      1  0.0000000e+00  4.3392773e+05  1.0818139e+10  6.1142817e+12
         2.1148776e+00  1.5977611e+00  0.0000000e+00         112894
         2.3045312e+09  1.0296349e+04  1.0000000e+15  5.0000000e-01
         0.0000000e+00  2.1000000e-01  0.0000000e+00
Elapsed time reported from movnodes is 0 seconds
Elapsed time reported from movnodez is 0 seconds
      2  0.0000000e+00  3.5745482e+05  8.7225555e+11  3.5212554e+14
         2.0227659e+00  1.6349497e+00  0.0000000e+00         105291
         2.7975563e+09  1.0296349e+04  1.0000000e+15  5.0000000e-01
         0.0000000e+00  1.1000000e-01  0.0000000e+00
      3  0.0000000e+00  3.8378202e+05  5.7762863e+11  1.9649679e+14
         2.0227659e+00  1.6349497e+00  0.0000000e+00         105291
         2.6056458e+09  1.0296349e+04  1.0000000e+15  5.0000000e-01
         0.0000000e+00  1.2000000e-01  0.0000000e+00
      4  0.0000000e+00  4.2751015e+05  2.2390567e+11  8.0079392e+13
         2.1148776e+00  1.5977611e+00  0.0000000e+00         112894
         2.3391258e+09  1.0296349e+04  1.0000000e+15  5.4522613e-01
         0.0000000e+00  1.7000000e-01  0.0000000e+00
mgl=1  mg_mxne=120
[cn15:126768] *** Process received signal ***
[cn15:126768] Signal: Segmentation fault (11)
[cn15:126768] Signal code:  (128)
[cn15:126768] Failing at address: (nil)
[cn15:126768] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fbcbd905630]
[cn15:126768] [ 1] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4660)[0x7fbcb0b94660]
[cn15:126768] [ 2] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fbcbc1a926c]
[cn15:126768] [ 3] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x305)[0x7fbcabde1205]
[cn15:126768] [ 4] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/libmpi.so.40(PMPI_Recv+0x175)[0x7fbcbd2e2065]
[cn15:126768] [ 5] ompi310mcfd[0x8dbd8f]
[cn15:126768] [ 6] ompi310mcfd[0x993089]
[cn15:126768] [ 7] ompi310mcfd[0xa2b268]
[cn15:126768] [ 8] ompi310mcfd[0xa29de2]
[cn15:126768] [ 9] ompi310mcfd[0x41b697]
[cn15:126768] [10] ompi310mcfd[0x40c3ef]
[cn15:126768] [11] ompi310mcfd[0x4342af]
[cn15:126768] [12] ompi310mcfd[0x405a6d]
[cn15:126768] [13] ompi310mcfd(__gxx_personality_v0+0x2d5)[0x4049c5]
[cn15:126768] [14] ompi310mcfd(__gxx_personality_v0+0x2a6)[0x404996]
[cn15:126768] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbcbc998555]
[cn15:126768] [16] ompi310mcfd(__gxx_personality_v0+0x1e9)[0x4048d9]
[cn15:126768] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 126768 on node cn15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I have only included the last part of the output file, where the error appears. The solver is supposed to run for 2000 iterations, but here it stopped after only 4. Some simulations (design points) fail with this same error, while others run all the way to 2000 iterations. Below I give one more error message and the tail of a fully completed simulation.

Error message 2 (last part of the output file only):

   1553  0.0000000e+00  1.6408007e+06  1.1023116e+08  1.0672477e+10
         1.4840060e+00  1.6172885e+00  0.0000000e+00         103630
         6.0945854e+08  9.4094477e+03  1.0000000e+15  2.4866261e+00
         0.0000000e+00  1.1000000e-01  1.0000000e+00
   1554  0.0000000e+00  1.6408416e+06  1.1135935e+08  1.0885199e+10
         1.6516934e+00  1.3657756e+00  0.0000000e+00         112657
         6.0944335e+08  9.4094477e+03  1.0000000e+15  2.4990449e+00
         0.0000000e+00  1.1000000e-01  0.0000000e+00
   1555  0.0000000e+00  1.6408802e+06  1.1269408e+08  1.1154374e+10
         1.6516934e+00  1.3657756e+00  0.0000000e+00         112657
         6.0942903e+08  9.4094477e+03  1.0000000e+15  2.5114637e+00
         0.0000000e+00  1.2000000e-01  0.0000000e+00
   1556  0.0000000e+00  1.6409162e+06  1.1416695e+08  1.1393675e+10
         1.6516934e+00  1.3657756e+00  0.0000000e+00         112657
         6.0941565e+08  9.4094477e+03  1.0000000e+15  2.5238825e+00
         0.0000000e+00  1.1000000e-01  0.0000000e+00
[cn15:126026] *** Process received signal ***
[cn15:126026] Signal: Segmentation fault (11)
[cn15:126026] Signal code:  (128)
[cn15:126026] Failing at address: (nil)
[cn15:126026] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f8f53b33630]
[cn15:126026] [ 1] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4660)[0x7f8f42db9660]
[cn15:126026] [ 2] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f8f523d726c]
[cn15:126026] [ 3] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x305)[0x7f8f42180205]
[cn15:126026] [ 4] /opt/ohpc/pub/mpi/openmpi3-gnu7/3.1.0/lib/libmpi.so.40(PMPI_Recv+0x175)[0x7f8f53510065]
[cn15:126026] [ 5] ompi310mcfd[0x8dbd8f]
[cn15:126026] [ 6] ompi310mcfd[0x992d65]
[cn15:126026] [ 7] ompi310mcfd[0xa28fdc]
[cn15:126026] [ 8] ompi310mcfd[0x41b697]
[cn15:126026] [ 9] ompi310mcfd[0x40c3ef]
[cn15:126026] [10] ompi310mcfd[0x4342af]
[cn15:126026] [11] ompi310mcfd[0x405a6d]
[cn15:126026] [12] ompi310mcfd(__gxx_personality_v0+0x2d5)[0x4049c5]
[cn15:126026] [13] ompi310mcfd(__gxx_personality_v0+0x2a6)[0x404996]
[cn15:126026] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8f52bc6555]
[cn15:126026] [15] ompi310mcfd(__gxx_personality_v0+0x1e9)[0x4048d9]
[cn15:126026] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 126026 on node cn15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Fully completed simulation (last part of the output file only):

   1995  0.0000000e+00  1.6153643e+06  6.3016055e+08  9.2328134e+10
         3.0567245e+00  1.2558381e+00  0.0000000e+00         187234
         6.1905540e+08  1.0123746e+04  1.0000000e+15  2.5240145e+00
         0.0000000e+00  1.2000000e-01  1.0000000e+00
   1996  0.0000000e+00  1.6130436e+06  6.3887208e+08  9.5184621e+10
         3.0567245e+00  1.2558381e+00  0.0000000e+00         187234
         6.1994604e+08  1.0123746e+04  1.0000000e+15  2.5399235e+00
         0.0000000e+00  1.0000000e-01  0.0000000e+00
   1997  0.0000000e+00  1.6108123e+06  6.4890969e+08  9.9572878e+10
         3.0512015e+00  1.2533241e+00  0.0000000e+00         187169
         6.2080481e+08  1.0123746e+04  1.0000000e+15  2.5558324e+00
         0.0000000e+00  1.0000000e-01  0.0000000e+00
   1998  0.0000000e+00  1.6086835e+06  6.5977869e+08  1.0795679e+11
         3.0512015e+00  1.2533241e+00  0.0000000e+00         187169
         6.2162633e+08  1.0123746e+04  1.0000000e+15  2.5717413e+00
         0.0000000e+00  1.0000000e-01  0.0000000e+00
   1999  0.0000000e+00  1.6066719e+06  6.7159249e+08  1.1362385e+11
         3.0512015e+00  1.2533241e+00  0.0000000e+00         187169
         6.2240463e+08  1.0123746e+04  1.0000000e+15  2.5876502e+00
         0.0000000e+00  1.1000000e-01  0.0000000e+00
   2000  0.0000000e+00  1.6047884e+06  6.8466500e+08  1.1792991e+11
         3.0492703e+00  1.2526297e+00  0.0000000e+00         187109
         6.2313511e+08  1.0123746e+04  1.0000000e+15  2.6035591e+00
         0.0000000e+00  1.1000000e-01  0.0000000e+00
Timestamp at Step#2000: Wed Feb 22 16:08:55 2023
FILE(pltosout1.c)LINE(566): 201504 nodes processed
pltosout.bin (plot output file) created at nt=2000, tau= 0.0000000e+00
Elapsed time for output to pltosout.bin is 0 seconds
FILE(cdepsoup1.c)LINE(467): 200860 cells processed
Elapsed time for stage2 from cdepsoup1 is 0 seconds
FILE(cdeproup1.c)LINE(443): 200860 cells processed
Elapsed time from cdeproup1 is 0 seconds
Elapsed time for stage3 from cdepsoup1 is 0 seconds
cdepsout.bin (restart file) created at nt=2000, tau= 0.0000000e+00
Elapsed time for output to cdepsout.bin is 0 seconds
Wed Feb 22 16:08:55 2023
--------------------------------------------------------------
 CPU#   MEMALLO   NATCELLS   NE1CELLS   ALLCELLS
           (MB)
  MIN        39       7173       7312       7456
  MAX        63       7175       7658       8193
  ALL      1179     200860     209125     217739
28 CPUs (Ranks) used in this run
--------------------------------------------------------------
#------------------------------------------
#System commands encountered at end of run:
609: #-------------------------------------
610: system begin
611: system end
#------------------------------------------
#----------------------------------------------------
#Listing of system commands remembered at end of run:
#----------------------------------------------------
##################FlexLM_License_Checkin_Begin##################
-------------------------------------------------------
Feature "CFD++_SOLV_Ser":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses = 8
Number of Licenses reserved in this run = 1
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Your FlexLM license will expire in 52 days on 15-apr-2023
Feature "CFD++_SOLV_Ser" has now been checked back in
-------------------------------------------------------
-------------------------------------------------------
Feature "CFD++_CP_Ser":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses = 8
Number of Licenses reserved in this run = 1
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Your FlexLM license will expire in 52 days on 15-apr-2023
Feature "CFD++_CP_Ser" has now been checked back in
-------------------------------------------------------
-------------------------------------------------------
Feature "CFD++_CR_Ser":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses = 8
Number of Licenses reserved in this run = 1
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Your FlexLM license will expire in 52 days on 15-apr-2023
Feature "CFD++_CR_Ser" has now been checked back in
-------------------------------------------------------
-------------------------------------------------------
Feature "CFD++_SOLV_Par2":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses= 240
Number of Licenses reserved in this run = 27
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Feature "CFD++_SOLV_Par2" has now been checked back in for 27 CPUs
-------------------------------------------------------
Your FlexLM license will expire in 52 days on 15-apr-2023
-------------------------------------------------------
Feature "CFD++_CP_Par2":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses= 240
Number of Licenses reserved in this run = 27
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Feature "CFD++_CP_Par2" has now been checked back in for 27 CPUs
-------------------------------------------------------
Your FlexLM license will expire in 52 days on 15-apr-2023
-------------------------------------------------------
Feature "CFD++_CR_Par2":
Version Limit = 18.1
Expiry Date = 15-apr-2023
Total Number of Licenses= 240
Number of Licenses reserved in this run = 27
Server Name = 172.20.0.252
Server Port#= 27000
Daemon Name = METACOMP
Feature "CFD++_CR_Par2" has now been checked back in for 27 CPUs
-------------------------------------------------------
Your FlexLM license will expire in 52 days on 15-apr-2023
FlexLM License Management Successfully Completed
###################FlexLM_License_Checkin_End###################
"CPU" time = 234.51 seconds
"CLOCK" time = 237 seconds
Wed Feb 22 16:08:55 2023
#################################################
process id = 25458
process invocation: ompi310mcfd
process name = mpimcfd
my_id=0, my_sz=28
CFD++: Version 18.1, Update 5
Compilation TIMESTAMP: Dec  4 2018 13:33:20
MCFD_LICEXT TIMESTAMP: Thu Nov 22 06:09:31 2018
-------------------------------------------------
Computer hostname is cn15
Computer system type is LINUX
-------------processID HostName List-------------
25458 cn15 Rank0
25459 cn15 Rank1
25460 cn15 Rank2
25461 cn15 Rank3
25462 cn15 Rank4
25463 cn15 Rank5
25465 cn15 Rank6
25467 cn15 Rank7
25470 cn15 Rank8
25472 cn15 Rank9
25473 cn15 Rank10
25476 cn15 Rank11
25477 cn15 Rank12
25481 cn15 Rank13
25483 cn15 Rank14
25485 cn15 Rank15
25486 cn15 Rank16
25487 cn15 Rank17
25489 cn15 Rank18
25491 cn15 Rank19
25493 cn15 Rank20
25497 cn15 Rank21
25499 cn15 Rank22
25501 cn15 Rank23
25502 cn15 Rank24
25503 cn15 Rank25
25505 cn15 Rank26
25508 cn15 Rank27
#################################################
Current working directory is:
/scratch2/nagarjun/CFD++/Design_domin_configaration/supersonic_2/dp25
-------CPU# Current Working Directory Variance List-------
#################################################
#################################################
Current shell is:
/bin/bash
--------------CPU# Current Shell Variance List------------
#################################################
#################################################
Current Software Version, Update:
CFD++: Version 18.1, Update 5
CPU# Current Software Version, Update Variances:-
#################################################
This software is protected by copyright laws
of the United States of America.
This software is subject to SBIR Data Rights clauses
and unauthorized disclosure and distribution are prohibited.

I am hoping for changes to Code 2 so that I no longer get this segmentation fault and all of my simulations run to the full iteration count. Please help me.
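If it helps, one diagnostic I can try on my side (a sketch; I have not confirmed the cause): the backtrace points into mca_btl_vader.so, Open MPI's shared-memory transport, and Open MPI allows excluding a component with the "^" prefix. I could repeat a failing run without it, with core dumps enabled for later inspection in gdb:

# Allow core files to be written so the crashing rank can be examined later.
ulimit -c unlimited

# "--mca btl ^vader" tells Open MPI to use any BTL except the vader
# shared-memory transport seen in the backtrace (falling back to e.g. TCP).
time mpirun --mca btl '^vader' -machinefile ../hosts.$SLURM_JOB_ID ompi310mcfd >& mcfd.log </dev/null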

Tags: linux, mpi, slurm, hpc, openmpi