The following code example simply calls MPI_Barrier in a loop. On a two-computer cluster of Intel machines it runs correctly. When run across an Intel machine and an AMD machine, it completes the first three loops without issue but consistently fails on the fourth.
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
    // Pass the real argc/argv; MPI_Init(&argc, nullptr) is non-portable.
    MPI_Init(&argc, &argv);

    const int count = 100;
    for (int i = 0; i < count; ++i)
    {
        std::cout << " Attempting Barrier " << i + 1 << std::endl;
        MPI_Barrier(MPI_COMM_WORLD);
        std::cout << " Completed Barrier " << i + 1 << std::endl;
    }

    MPI_Finalize();
    return 0;
}
command line: mpiexec -hosts 2 localhost amd_machine -wdir "\network\path" \path-to-exe
output:
[0] Attempting Barrier 1
[1] Attempting Barrier 1
[0] Completed Barrier 1
[0] Attempting Barrier 2
[1] Completed Barrier 1
[0] Completed Barrier 2
[1] Attempting Barrier 2
[0] Attempting Barrier 3
[0] Completed Barrier 3
[0] Attempting Barrier 4
[1] Completed Barrier 2
[1] Attempting Barrier 3
[1] Completed Barrier 3
[1] Attempting Barrier 4
job aborted:
[ranks] message
[0] terminated
[1] fatal error
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(MPI_COMM_WORLD) failed
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
Is -wdir a directory local to the machine in question? Just a thought.
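Following that thought: since errno 10060 is a connection timeout, it is worth ruling out that the working directory or executable path is only reachable from one host. A hedged sketch of the invocation, assuming MS-MPI's mpiexec; the share name below is hypothetical, substitute a UNC path that both hosts can actually resolve:

```shell
REM \\fileserver\mpi_share is an illustrative share, not the real path.
REM Both localhost and amd_machine must be able to open it.
mpiexec -hosts 2 localhost amd_machine -wdir "\\fileserver\mpi_share" "\\fileserver\mpi_share\barrier_test.exe"
```

Note that a UNC path starts with two backslashes; a single-backslash path like the one in the question would be interpreted relative to each machine's local drive.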