Create a fully managed Slurm cluster with two A4 VMs

This quickstart explains how to create and connect to a Slurm cluster by using Cluster Director. The cluster that you create uses two A4 virtual machine (VM) instances, which are engineered to help your Slurm cluster efficiently handle large-scale model training and inference workloads.

Cluster Director is a managed service that simplifies and automates cluster deployment, reducing operational overhead and letting you focus on running your workload. If you want more control over the deployment and management of your cluster, then create a Slurm cluster by using Cluster Toolkit.



Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Hypercompute Cluster API, Compute Engine API, Filestore API, Google Cloud Managed Lustre API, Cloud Logging API, and Cloud Monitoring API:

    Enable the APIs
  5. Verify that your project and the Compute Engine default service account have the following Identity and Access Management (IAM) roles:
  6. If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To view the trusted image policies for your organization, see Set image access constraints.
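
If you prefer the command line to the console, the API and image-policy checks above can be sketched with the gcloud CLI. This is a minimal sketch, not an official procedure: the project ID is a placeholder, the RUN_GCLOUD switch is a hypothetical guard added for this example, and the service identifiers in APIS are assumptions based on common Google Cloud naming. Confirm them with `gcloud services list --available` before you rely on them.

```shell
# Placeholder project ID; replace it with your own.
PROJECT_ID="${PROJECT_ID:-my-project-id}"

# Assumed service identifiers for the six APIs named in this quickstart.
APIS="hypercomputecluster.googleapis.com compute.googleapis.com file.googleapis.com lustre.googleapis.com logging.googleapis.com monitoring.googleapis.com"

# Build the command first so that you can review it before it runs.
ENABLE_CMD="gcloud services enable $APIS --project=$PROJECT_ID"
echo "$ENABLE_CMD"

# RUN_GCLOUD is a hypothetical opt-in switch for this sketch.
if [ "${RUN_GCLOUD:-0}" = "1" ] && command -v gcloud >/dev/null 2>&1; then
  $ENABLE_CMD
  # Show the effective trusted image policy, if your organization sets one.
  gcloud org-policies describe compute.trustedImageProjects \
    --project="$PROJECT_ID" --effective
fi
```

With RUN_GCLOUD unset, the script only prints the command. Set RUN_GCLOUD=1 in an environment where the gcloud CLI is installed and authenticated to run it.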

Costs

This quickstart uses the following billable Google Cloud resources:

  • Compute Engine:

    • Two VMs with A4 machine types

    • One 100 GB Persistent Disk volume for the Slurm login node

    • One 100 GB Google Cloud Hyperdisk Balanced volume for the A4 VMs

  • Filestore: one Filestore instance with 10 TiB (10,240 GiB) of capacity

To generate a cost estimate based on your projected usage, use the pricing calculator.

Create a Slurm cluster

To create a Slurm cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. Click Create a cluster.

  3. In the dialog that appears, click Step-by-step configuration. The Create cluster page appears.

  4. In the Cluster name field, enter cluster001.

  5. In the Compute section, click Configure resources. In the Add resource configuration pane that appears, complete the following steps:

    1. In the GPU type list, select NVIDIA B200 180GB.

    2. In the Number of instances field, enter 2.

    3. In the Consumption options section, select the consumption option that you want to use to obtain resources.

    4. In the Location section, specify the Region and Zone where you want to create your A4 VMs, or where the reservation that you want to use to create your VMs exists.

    5. Click Done.

  6. In the navigation menu, click Storage.

  7. In the Storage section, click Edit storage configuration. In the Add storage configuration pane that appears, complete the following steps:

    1. In the Capacity section, select 10-100 TiB, with increments of 2.5 TiB.

    2. Click Done.

  8. Click Create. The Clusters page appears.

    Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available.

View the cluster creation request

To review the cluster creation request, complete the following steps:

  1. In the Clusters table, in the Name column, click cluster001. A page that gives the details of the cluster appears, and the Details tab is selected.

  2. In the Compute section, locate the Status row. When AI Hypercomputer sets its value to Ready, you can proceed to the next section.

Connect to your cluster through SSH

To connect to your cluster through SSH, complete the following steps:

  1. Click the Nodes tab.

  2. In the Login nodes table, find the row that contains the cluster001-login-001 node. In that row, in the Connect column, click the SSH button. The SSH-in-browser window appears.

  3. If prompted, then click Authorize. Connecting to your cluster can take some time to complete. When the terminal is ready, proceed to the next section.

Run sample jobs

In the SSH-in-browser window, complete the following steps:

  1. To verify that Slurm is running, run the following command:

    sinfo
    
  2. To submit a test job that returns the hostname of the node, run the following command:

    srun hostname
    
  3. To submit a batch job that sleeps for 30 seconds, run the following command:

    sbatch --wrap="sleep 30"
    
  4. To check the status of jobs in the queue, run the following command:

    squeue
    
  5. To view accounting data for jobs, run the following command:

    sacct
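
The individual commands above can also be combined into a reusable batch script. The following is a minimal sketch that assumes the default partition contains both A4 nodes. The file name hello_nodes.sh, the job name, and the output file pattern are illustrative choices, not part of this quickstart.

```shell
# Write a small batch script. The file name and resource requests are illustrative.
cat > hello_nodes.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello-nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello-nodes-%j.out

# Run one task on each allocated node and print that node's hostname.
srun hostname
EOF

# Submit only where Slurm is installed, such as on the login node.
if command -v sbatch >/dev/null 2>&1; then
  JOB_ID=$(sbatch --parsable hello_nodes.sh)
  JOB_ID=${JOB_ID%%;*}   # --parsable can append ";cluster_name" to the job ID
  squeue --jobs "$JOB_ID"
fi
```

When the job finishes, its output is written to hello-nodes-JOB_ID.out in the directory from which you submitted the job, and `sacct -j JOB_ID` shows its accounting record.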
    

You've successfully created a Slurm cluster, connected to it, and run sample jobs. If AI Hypercomputer still hasn't created the A4 VMs, then you can wait for the cluster to create the VMs, modify the cluster to add or remove VMs, or delete the cluster to avoid incurring any unnecessary charges.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete your project

The easiest way to eliminate billing is to delete the project that you created for this quickstart.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete your cluster

To delete the cluster that you created in this quickstart, along with its associated resources, complete the following steps:

  1. On the page that contains the details of your cluster, click Delete.

  2. In the dialog that appears, enter cluster001, and then click Delete to confirm.
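
As a rough command-line companion to the cleanup steps above, the following sketch checks for leftover VMs and prints the project deletion command. It assumes that the VMs in the cluster have names that begin with the cluster name, which this quickstart doesn't confirm. The project ID is a placeholder, and RUN_GCLOUD is a hypothetical opt-in switch for this example.

```shell
# Placeholder project ID; replace it with your own.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
CLUSTER_NAME="cluster001"

# Assumption: cluster VM names start with the cluster name.
LIST_CMD="gcloud compute instances list --project=$PROJECT_ID --filter=name~^$CLUSTER_NAME"

# Deleting a project is destructive, so this sketch only prints the command.
DELETE_PROJECT_CMD="gcloud projects delete $PROJECT_ID"
echo "To check for leftover VMs:  $LIST_CMD"
echo "To delete the project:      $DELETE_PROJECT_CMD"

# RUN_GCLOUD is a hypothetical opt-in switch; only the read-only listing runs.
if [ "${RUN_GCLOUD:-0}" = "1" ] && command -v gcloud >/dev/null 2>&1; then
  $LIST_CMD
fi
```

An empty listing suggests that the cluster's VMs are gone. Run the printed deletion command yourself only if you want to remove the whole project.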

What's next