I'm currently working as an intern, and I was asked to build a Singularity container for OpenMPI so that distributed programs can run in containers across multiple machines of our HPC cluster.

Basically, I'm wondering what the correct approach is. Should my container take OpenMPI, UCX and Mellanox OFED (which we use together for InfiniBand) from outside, using something like environment-modules, or should I compile OFED, UCX and OpenMPI inside the container? Hybrid solutions are also possible, such as compiling OpenMPI and UCX inside the container while taking the OFED libraries from the host.
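To make the hybrid option concrete, here is a minimal sketch of the idea: bind only the individual InfiniBand libraries from the host instead of a whole library directory, so that the host's glibc is never pulled into the container. The prefix allowlist and the `/opt/hostlibs` target directory are assumptions for illustration, not something our setup defines; `--bind src:dest` is the usual Singularity bind syntax.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist: name prefixes assumed to belong to the InfiniBand
# user-space stack. Adjust it to what the host actually provides.
IB_PREFIXES = ("libibverbs", "libmlx4", "libmlx5", "librdmacm")


def select_ib_libs(filenames, prefixes=IB_PREFIXES):
    """Keep only shared objects whose name starts with an allowed prefix."""
    return sorted(
        name for name in filenames
        if name.startswith(prefixes) and ".so" in name
    )


def bind_args(host_dir, container_dir, filenames):
    """Build one `--bind src:dest` pair per selected library."""
    args = []
    for name in select_ib_libs(filenames):
        src = PurePosixPath(host_dir) / name
        dest = PurePosixPath(container_dir) / name
        args += ["--bind", f"{src}:{dest}"]
    return args


if __name__ == "__main__":
    # Example listing; libc and libm are deliberately left on the host.
    listing = ["libc.so.6", "libibverbs.so.1", "libmlx5.so.1", "libm.so.6"]
    print(" ".join(bind_args("/usr/lib64", "/opt/hostlibs", listing)))
```

The resulting arguments could be passed to `singularity exec`, with `/opt/hostlibs` then appended to `$LD_LIBRARY_PATH` inside the container. Whether the selected libraries have further dependencies that also need binding would have to be checked, for example with `ldd`.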

I would like to know whether anyone has worked on a similar case before and can help me. What is the best practice? What limitations will I run into with each of the solutions I described?

So far I have built three containers: one with OFED, UCX and OpenMPI all compiled inside; one with nothing inside except a small setup for environment-modules; and a hybrid one, where I compile OpenMPI and UCX inside the container, install libraries using dnf, and take the few pieces of OFED that are still missing for InfiniBand from outside. All of them are based on RHEL8.2, the same OS as my host machine, and I don't know what issues I could face if I built a container on a different release such as RHEL7 or RHEL9.
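One concern I can at least reason about is glibc. As far as I know, RHEL7 ships glibc 2.17, RHEL8 ships 2.28 and RHEL9 ships 2.34, and a library built against a newer glibc generally cannot be loaded by an older one. The sketch below encodes that rough rule; it is a simplification (a library really only needs the symbol versions it uses), and the function names are made up for illustration.

```python
import platform

# Default glibc version per RHEL major release (to the best of my knowledge).
RHEL_GLIBC = {7: (2, 17), 8: (2, 28), 9: (2, 34)}


def parse_version(text):
    """Turn '2.28' into (2, 28) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def host_libs_loadable(host_glibc, container_glibc):
    """Rough rule: host-built libraries need a container glibc at least as new."""
    return container_glibc >= host_glibc


if __name__ == "__main__":
    # Report the libc of the system this script runs on.
    name, version = platform.libc_ver()
    print(f"this system: {name or 'unknown libc'} {version}")
    host = RHEL_GLIBC[8]
    for release, glibc in sorted(RHEL_GLIBC.items()):
        verdict = "ok" if host_libs_loadable(host, glibc) else "likely to fail"
        print(f"RHEL{release} container with RHEL8 host libraries: {verdict}")
```

Under this rule, binding RHEL8 host libraries into a RHEL7 container is the risky direction, while a RHEL9 container should be able to load them.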

To give more details about our stack: we use LSF as our job scheduler, though I haven't worked with it yet. Our machines mostly run RHEL8.2 and 8.6. Some older machines have mlx4_0 InfiniBand devices, while the newer ones have mlx5_0. On my host machine I have installed OpenMPI-5.0.3 and Singularity-ce-4.1.2, along with UCX-1.16.0 so that OpenMPI can use our InfiniBand network.
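Because the device name differs between the older and newer machines, whatever runs inside the container cannot hard-code it. A minimal sketch of how the device could be detected per node, assuming the kernel exposes the adapters under `/sys/class/infiniband` and that port 1 is the one in use (both are assumptions about our nodes); `UCX_NET_DEVICES` is the UCX variable for restricting which devices are used.

```python
import os


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return adapter names such as 'mlx5_0', or [] if the directory is absent."""
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        return []


def ucx_net_devices(devices, port=1):
    """Prefer mlx5 devices; fall back to whatever is present (e.g. mlx4)."""
    preferred = [d for d in devices if d.startswith("mlx5")] or list(devices)
    return ",".join(f"{d}:{port}" for d in preferred)


if __name__ == "__main__":
    value = ucx_net_devices(list_ib_devices())
    if value:
        print(f"export UCX_NET_DEVICES={value}")
    else:
        print("no InfiniBand device found")
```

The printed line could be evaluated in the job script before `mpirun` is called, so the same container image works on both generations of machines.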

Inside my container, I build MLNX_OFED_LINUX-4.9-7.1.0.0-rhel8.2-x86_64, UCX-1.16.0 and OpenMPI-5.0.3 from tarballs. I've created one module for OpenMPI-5.0.3 and one for UCX-1.16.0, which my container loads through environment-modules.

Any help is appreciated.

  • Update: I have been experimenting with a "hybrid" container where I bind directories from my machine, such as its library directories, into the container and then add them to $LD_LIBRARY_PATH or $PATH. This just gives me a glibc error (it seems that the container tries to use the glibc from the bound directory instead of its own). Commented Jun 24, 2024 at 8:27
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Commented Jun 24, 2024 at 10:20
