
I’m new to Dask. I’m currently working on an HPC system managed by SLURM, with compute nodes (which execute the jobs) and a login node (which I access over SSH to submit SLURM jobs). I’m trying to define a workflow for distributing my tasks across CPUs on different nodes.

My first attempt consisted of a script (sketched below the list) that:

  1. initiated a SLURMCluster with my custom setup,
  2. obtained a client using the get_client() method of the SLURMCluster object,
  3. used the map, submit, and gather methods of the Client object to distribute and manage tasks, and
  4. closed the client and the cluster once all tasks were resolved.
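
Roughly, the script looked like this (the SLURM options are placeholders, not my real configuration):

    from dask_jobqueue import SLURMCluster

    def process(x):
        return x ** 2  # placeholder for my real task

    # Placeholder SLURM setup: partition, cores, memory and walltime
    # should be replaced by the actual values for the cluster.
    cluster = SLURMCluster(
        queue="normal",
        cores=8,
        memory="16GB",
        walltime="01:00:00",
    )
    cluster.scale(jobs=4)  # ask SLURM for 4 worker jobs

    client = cluster.get_client()

    futures = client.map(process, range(100))
    results = client.gather(futures)

    client.close()
    cluster.close()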

Then I executed the script on the login node, following the procedure below:

  1. I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
  2. I run the Python script on the login node.
  3. I can access the Dask dashboard while the script is running.

This workflow completed my tasks successfully, but it has some limitations:

  • The Dask dashboard shuts down as soon as the Python script finishes.
  • I lose access to all dashboard information after the script completes.
  • This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
  • It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.

My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues? For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?

Thanks for any guidance! I also asked this question on the Dask forum.

1 Answer

As answered on the Dask Discourse Forum.

For keeping Dashboard information after the cluster shuts down, you’ve got the performance_report context manager.
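
Something like this (a minimal sketch; it assumes you already have a SLURMCluster object called cluster, and the filename is arbitrary):

    from dask.distributed import Client, performance_report

    client = Client(cluster)  # `cluster` is the existing SLURMCluster

    # Everything executed inside this block is recorded and written out
    # as a static HTML report you can open after the script has finished.
    with performance_report(filename="dask-report.html"):
        futures = client.map(lambda x: x ** 2, range(100))
        results = client.gather(futures)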

Another solution would be to have one script for starting the Dask cluster, and another script to submit jobs to it. You can save the Scheduler information into a JSON file to share it with another process.
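
For example (an untested sketch; the script names, resource values and file path are just placeholders):

    # start_cluster.py -- long-running script that keeps the scheduler
    # (and its dashboard) alive and writes its address to a JSON file.
    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    cluster = SLURMCluster(cores=8, memory="16GB")  # placeholder resources
    cluster.scale(jobs=4)
    client = Client(cluster)
    client.write_scheduler_file("scheduler.json")
    input("Cluster running; press Enter to shut it down...")

    # submit_work.py -- separate script that connects to the running
    # scheduler through the shared JSON file and submits tasks.
    from dask.distributed import Client

    client = Client(scheduler_file="scheduler.json")
    futures = client.map(lambda x: x ** 2, range(100))
    print(client.gather(futures))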

For the long-running process, a common workflow is to submit a “master” job, which creates the Scheduler and Client and submits other jobs for the Workers through SLURMCluster. You can also take a look at dask-mpi.
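
The “master” job script looks essentially like your original script, just submitted with sbatch instead of run on the login node, so the scheduler, dashboard and client all live inside a SLURM job. With dask-mpi, a minimal sketch would be (assuming dask_mpi is installed and the script is launched with something like srun -n 5 python script.py from inside a batch job):

    from dask_mpi import initialize
    from dask.distributed import Client

    # Rank 0 becomes the scheduler, rank 1 runs the client code below,
    # and the remaining MPI ranks become workers.
    initialize()
    client = Client()  # connects to the scheduler created by initialize()

    futures = client.map(lambda x: x ** 2, range(100))
    print(client.gather(futures))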
