
I have written a C++ program that uses both OpenMP and OpenMPI. I want to use (let's say) 3 nodes (so size_Of_Cluster should be 3) and use OpenMP within each node to parallelize the for loop (there are 24 cores per node). In essence, I want MPI ranks to be assigned to nodes, one rank per node. The Slurm script I have written is as follows. (I have tried many variations but could not come up with the "correct" one; I would be grateful for any help.)

#!/bin/bash
#SBATCH -N 3
#SBATCH -n 72
#SBATCH -p defq
#SBATCH -A akademik
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH --job-name=MIXED

module load slurm
module load shared
module load gcc
module load openmpi

export OMP_NUM_THREADS=24

mpirun -n 3 --bynode ./program

Using srun did not help.

2 Comments
  • And what is the problem/question? Commented Jan 7, 2023 at 12:57
  • On my local Mac with an M1 Max chip, using just 3 cores (no OpenMP), the pure OpenMPI algorithm takes about 30 mins to complete. When I used OpenMP + OpenMPI and ran it with the script in the question, I expected a quicker completion time, but it did not run quicker. In one try (I do not remember the exact script now), it took about 45 mins. So I suspect that the ranks are not distributed among the nodes, but I am not sure. In short, all I desire is to assign ranks to nodes, and I would appreciate any help in this regard. Commented Jan 7, 2023 at 14:29

1 Answer


The relevant lines are:

#SBATCH -N 3
#SBATCH -n 72

export OMP_NUM_THREADS=24

This means you have 72 MPI processes, and each creates 24 threads. For that to be efficient you would need 24 × 72 = 1728 cores, which you don't have. You should specify:

#SBATCH -n 3

Then you will have 3 processes, with 24 threads per process.
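
For reference, a minimal corrected batch script along those lines might look like the following. This is only a sketch: it reuses the partition, account, modules, and binary name from the question's script, and the exact srun/mpirun invocation may need adjusting for your Slurm and MPI versions.

#!/bin/bash
# 3 nodes, 3 MPI tasks (one per node), 24 cores per task for the OpenMP threads
#SBATCH -N 3
#SBATCH -n 3
#SBATCH --cpus-per-task=24
#SBATCH -p defq
#SBATCH -A akademik
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH --job-name=MIXED

module load slurm
module load shared
module load gcc
module load openmpi

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Depending on the Slurm version, srun may need --cpus-per-task passed
# explicitly; with mpirun, "--map-by node" is the newer spelling of --bynode.
srun ./program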

You don't have to worry about the placement of the ranks on the nodes: that is done by the runtime. You could, for instance, let each process print the name returned by MPI_Get_processor_name to confirm.
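
For example, a small hypothetical test program (say check_placement.cpp, compiled with something like mpicxx -fopenmp check_placement.cpp -o check_placement) could look like this:

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Ask MPI which node (host) this rank was placed on.
    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(name, &len);

    // Report the node name and the number of OpenMP threads this rank will use.
    std::printf("rank %d runs on %s with %d OpenMP threads\n",
                rank, name, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

With 3 tasks and 24 cores per task, each of the three ranks should report a different node name and 24 OpenMP threads.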


2 Comments

Thank you very much for your kind answer. I tested the script with a simple program, and as you mentioned, the MPI_Get_processor_name function returned the desired output. However, when I ran my original program, it still took about 45 mins to complete. So it seems that the problem is not with the script; maybe I should optimize the OpenMP for loops. Best wishes.
Use -N 3 instead of -n 3 for sbatch. Avoid mpirun (its arguments depend on whether you use Open MPI, MPICH2, Intel MPI, et al.); try to make srun work. In one of my cloud cases, my binaries were compiled with Intel icc/icpc, and I had to do: I_MPI_PMI_LIBRARY=/opt/gridview/slurm/lib/libpmi2.so srun --mpi=pmi2
