I’m new to Dask. I’m currently working on an HPC cluster managed by SLURM, with compute nodes (which execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I’m trying to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script that (see the sketch after this list):
- instantiated the SLURMCluster class with my custom setup,
- obtained a client via the get_client() method of the SLURMCluster object,
- used the map, submit, and gather methods of the Client object to distribute and manage tasks,
- closed the client and the cluster once all tasks were resolved.
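For reference, here is a minimal sketch of that script; the queue name, resource values, job count, and the `process` function are placeholders, not my real configuration:

```python
from dask_jobqueue import SLURMCluster

# Placeholder resources; my real setup uses different queue/cores/memory/walltime values.
cluster = SLURMCluster(
    queue="compute",
    cores=8,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)  # ask SLURM for 4 worker jobs

# Client attached to this cluster's scheduler (dashboard on port 8787 by default)
client = cluster.get_client()

def process(x):  # stand-in for my real task
    return x ** 2

futures = client.map(process, range(100))        # distribute tasks to workers
total = client.submit(sum, futures).result()     # e.g. a follow-up task on the results
results = client.gather(futures)                 # collect results back

# once all tasks are resolved, shut everything down
client.close()
cluster.close()
```

As soon as cluster.close() runs (or the script exits), the scheduler and its dashboard go away, which leads to the limitations I describe below.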
Then I executed the script on the login node, following the procedure below:
- I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
- I run the Python script on the login node.
- I can access the Dask dashboard while the script is running.
This workflow completed my tasks successfully, but it has some limitations:
- The Dask dashboard shuts down as soon as the Python script finishes.
- I lose access to all dashboard information after the script completes.
- This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
- It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.
My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues? For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?
Thanks for any guidance! I also asked this question on the Dask forum.