I’m new to Dask. I’m currently working on an HPC cluster managed by SLURM, with compute nodes (which execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I’m trying to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script that (see the sketch after this list):
- instantiated the SLURMCluster class with my custom setup,
- obtained a client via the get_client() method of the SLURMCluster object,
- used the map, submit, and gather methods of the Client object to distribute and manage tasks,
- closed the client and the cluster once all tasks were resolved.
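For reference, here is a minimal sketch of that script; the queue name, resource values, job count, and the `process` function are placeholders, not my real configuration:

```python
from dask_jobqueue import SLURMCluster

# Placeholder resources; my real setup uses different queue/cores/memory/walltime values.
cluster = SLURMCluster(
    queue="compute",
    cores=8,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)  # ask SLURM for 4 worker jobs

# Client attached to this cluster's scheduler (dashboard on port 8787 by default)
client = cluster.get_client()

def process(x):  # stand-in for my real task
    return x ** 2

futures = client.map(process, range(100))        # distribute tasks to workers
total = client.submit(sum, futures).result()     # e.g. a follow-up task on the results
results = client.gather(futures)                 # collect results back

# once all tasks are resolved, shut everything down
client.close()
cluster.close()
```

As soon as cluster.close() runs (or the script exits), the scheduler and its dashboard go away, which leads to the limitations I describe below.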
Then I executed the script on the login node, following the procedure below:
- I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
- I run the Python script on the login node.
- I can access the Dask dashboard while the script is running.
This workflow completed my tasks successfully, but it has some limitations:
- The Dask dashboard shuts down as soon as the Python script finishes.
- I lose access to all dashboard information after the script completes.
- This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
- It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.
My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues? For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?
Thanks for any guidance! I also asked this question on the Dask forum.