I've run into a very strange bug where a call to MPI_Bcast sends out the wrong value. When I check the value being sent in the root process, it prints correctly, but all the other tasks print an old value.
I've tried searching for similar problems, but every result came down to people either not calling Bcast on all tasks, or trying to use it somewhere a gather was the better fit.
Since Bcast isn't sending the correct data, my lower-rank tasks end up stuck in an infinite loop. ^C makes everything exit properly, but I need the code to exit on its own (which it should do, if Bcast behaved).
A simplified version of the code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

int main(int argc, char *argv[])
{
    /* variable declarations here */
    /* MPI is initialized up here */
    if (rank == 0)
    {
        input = fopen(argv[1], "r");
        output = fopen("fakeOutput.txt", "w+");
        while (1)
        {
            count = 1; // reset counter
            if (exitFlag) break;
            command = (char*)calloc(89, sizeof(char));
            command[0] = '.';
            command[1] = '/';
            batLine = (char*)calloc(86, sizeof(char));
            for (i = 0; i < 16; i++)
            {
                if (fgets(batLine, 86, input) != NULL)
                {
                    if (loopCount > 0)
                    {
                        Continue = true;
                        MPI_Bcast(&Continue, 1, MPI_C_BOOL, 0, MPI_COMM_WORLD);
                    }
                    if (i == 0)
                    {
                        strcat(command, batLine);
                        printf("rank0 gets: %s\n", command);
                        fflush(stdout);
                    }
                    else
                    {
                        MPI_Send(batLine, 85, MPI_CHAR, i, i, MPI_COMM_WORLD);
                        printf("sent rank%d: %s\n", i, batLine);
                        fflush(stdout);
                        count++;
                    }
                }
                else
                {
                    Continue = false;
                    exitFlag = true; // flag to break out of while loop
                    free(batLine);
                    batLine = (char*)calloc(86, sizeof(char));
                    batLine[0] = 'e';
                    MPI_Send(batLine, 85, MPI_CHAR, i, i, MPI_COMM_WORLD);
                }
            }
            free(batLine);
            //system(command); // to run batch file line
            delay(500); // to simulate time of the command running
            MPI_Barrier(MPI_COMM_WORLD);
            fprintf(output, "%s", command); // rank 0 has first spectrum
            free(command);
            outputFile = (char*)calloc(33, sizeof(char));
            for (i = 1; i < count; i++) // task 0 doesn't send data, have to start at 1
            {
                MPI_Recv(outputFile, 33, MPI_CHAR, i, 16, MPI_COMM_WORLD, &stat2);
                printf("rank 0 received data from %d\n", stat2.MPI_SOURCE);
                fflush(stdout);
                fprintf(output, "%s\n", outputFile);
                printf("Data:%s\n", outputFile);
            }
            MPI_Barrier(MPI_COMM_WORLD);
            printf("continue after barrier:%d\n", Continue);
            free(outputFile);
            loopCount++;
            if (exitFlag)
            {
                Continue = false;
                MPI_Bcast(&Continue, 1, MPI_C_BOOL, 0, MPI_COMM_WORLD);
                printf("sent:%d\n", Continue);
                break;
            }
        }
        fclose(input);
        fclose(output);
        printf("files closed\n");
    }
    else
    {
        while (1)
        {
            command = (char*)calloc(89, sizeof(char));
            sentbatch = (char*)calloc(86, sizeof(char));
            spectrum = (char*)calloc(33, sizeof(char));
            command[0] = '.';
            command[1] = '/';
            MPI_Recv(sentbatch, 86, MPI_CHAR, 0, rank, MPI_COMM_WORLD, &stat);
            printf("rank%d was sent data from %d\n", rank, stat.MPI_SOURCE);
            fflush(stdout);
            if (strncmp(sentbatch, "e", 1) == 0)
            {
                noSend = true; // don't want to send back the placeholder data
            }
            strcat(command, sentbatch); // adds needed ./ before batch data
            free(sentbatch); // don't want to waste memory space
            //system(command); // should run batch line
            /* switch statement to give different delay times to different tasks here */
            MPI_Barrier(MPI_COMM_WORLD);
            if (noSend == false)
            {
                for (i = 0; i < 31; i++)
                {
                    spectrum[i] = command[i+13];
                }
                free(command);
                fflush(stdout);
                printf("sending:%s\n", spectrum);
                fflush(stdout);
                MPI_Send(spectrum, 33, MPI_CHAR, 0, 16, MPI_COMM_WORLD);
                free(spectrum);
            }
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Bcast(&lowContinue, 1, MPI_C_BOOL, 0, MPI_COMM_WORLD);
            fflush(stdout);
            printf("continue: %d\n", lowContinue);
            if (lowContinue == false)
                break;
        }
        printf("end for rank%d \n", rank);
    }
    MPI_Finalize();
    printf("closed mpi");
    return 0;
}
I'm copying the code out of a PuTTY window, so if any braces are missing, assume they exist. They all match in the code itself, but copying out of nano through PuTTY is the worst.
I know it may look odd at first, but the rank 0 loop starts before the lower ranks do, so with the input file I'm testing, the Bcast ends up getting called twice on all tasks. The code runs across 16 tasks and is only handed 26 lines.
There are no errors to show, but here is the end of the printed output, with duplicate lines removed:
rank 0 received data from 9
Data:spec-56321-GAC099N59V1_sp01-042
continue after barrier:0
sent:0
continue: 1 (this line appears 15 times)
files closed
The file fgets is scanning:
LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-030.flx spec-56321-GAC099N59V1_sp01-030.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-031.flx spec-56321-GAC099N59V1_sp01-031.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-032.flx spec-56321-GAC099N59V1_sp01-032.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-033.flx spec-56321-GAC099N59V1_sp01-033.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-035.flx spec-56321-GAC099N59V1_sp01-035.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-037.flx spec-56321-GAC099N59V1_sp01-037.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-038.flx spec-56321-GAC099N59V1_sp01-038.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-039.flx spec-56321-GAC099N59V1_sp01-039.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-040.flx spec-56321-GAC099N59V1_sp01-040.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-042.flx spec-56321-GAC099N59V1_sp01-042.nor f
And yes, I know I technically shouldn't use Continue as a variable name. It's capitalized, so it's distinct from continue, and I like variable names that make logical sense. (Either that or I end up using the clunkiest possible names for everything.)
You call MPI_Bcast on the root process (rank 0) only inside conditional blocks, while all the other ranks call it unconditionally on every pass through their loop. So whenever the root skips a Bcast that the other ranks do execute, the call counts stop matching: the other processors either wait forever or pair up with a stale broadcast, which is why they keep printing the old value.