dstack¶
dstack is an open-source, streamlined alternative to Kubernetes and Slurm, specifically designed for AI workloads. It simplifies container orchestration to accelerate the development, training, and deployment of AI models.
Installation¶
To use dstack with Verda Cloud, you first need to configure dstack integration by providing access credentials.
- Log in to your Verda Cloud account.
- Navigate to the Keys page in the sidebar.
- Find the Cloud API credentials area and click the + Create button.
Then, configure the backend via ~/.dstack/server/config.yml:
projects:
- name: main
backends:
- type: datacrunch
creds:
type: api_key
client_id: xfaHBqYEsArqhKWX-e52x3HH7w8T
client_secret: B5ZU5Qx9Nt8oGMlmMhNI3iglK8bjMhagTbylZy4WzncZe39995f7Vxh8
See the dstack documentation for more configuration options.
Once the backend is configured, install and start the dstack server. See the official installation guide at https://dstack.ai/docs/installation/ for details.
$ pip install "dstack[datacrunch]" -U
$ dstack server
Applying ~/.dstack/server/config.yml...
The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da"
The server is running at http://127.0.0.1:3000/
Create a fleet¶
Before you can submit your first run, you have to create a fleet.
type: fleet
name: default
# Allow provisioning of up to 2 instances
nodes: 0..2
# Deprovision instances above the minimum if they remain idle
idle_duration: 1h
resources:
# Allow provisioning of up to 8 GPUs
gpu: 0..8
Pass the fleet configuration to dstack apply:
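For example, assuming the configuration above is saved as fleet.dstack.yml (the filename is an assumption; any name works):
$ dstack apply -f fleet.dstack.yml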
Once the fleet is created, you can run dstack’s dev environments, tasks, and services.
Dev environments¶
A dev environment lets you provision an instance and access it from your desktop IDE (VS Code, Cursor, PyCharm, etc.) or via SSH.
Example configuration:
type: dev-environment
name: vscode
# If `image` is not specified, dstack uses its default image
python: "3.12"
#image: dstackai/base:py3.13-0.7-cuda-12.1
ide: vscode
resources:
gpu: B200:1..8
To run a dev environment, apply the configuration:
$ dstack apply -f .dstack.yml
Submit the run vscode? [y/n]: y
To open in VS Code Desktop, use this link:
vscode://vscode-remote/ssh-remote+vscode/workflow
Open the link to access the dev environment from your desktop IDE.
Tasks¶
A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.
Example training task configuration:
type: task
# The name is optional; if not specified, it's generated randomly
name: trl-sft
python: "3.12"
# Uncomment to use a custom Docker image
#image: huggingface/trl-latest-gpu
env:
- MODEL=Qwen/Qwen2.5-0.5B
- DATASET=stanfordnlp/imdb
commands:
- uv pip install trl
- |
trl sft \
--model_name_or_path $MODEL --dataset_name $DATASET \
--num_processes $DSTACK_GPUS_PER_NODE
resources:
# One to two GPUs
gpu: B200:1..2
shm_size: 24GB
To run the task, apply the configuration:
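For example, assuming the task configuration is saved as trl-sft.dstack.yml (the filename is an assumption):
$ dstack apply -f trl-sft.dstack.yml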
Services¶
A service allows you to deploy a model or any web app as a scalable and secure endpoint.
Example configuration:
type: service
name: deepseek-r1-nvidia
image: lmsysorg/sglang:latest
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
commands:
- python3 -m sglang.launch_server
--model-path $MODEL_ID
--port 8000
--trust-remote-code
port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
resources:
gpu: 24GB
To deploy the model, apply the configuration:
$ dstack apply -f deepseek.dstack.yml
Submit the run `deepseek-r1-nvidia`? [y/n]: y
Service is published at:
http://localhost:3000/proxy/services/main/deepseek-r1-nvidia/
Model deepseek-ai/DeepSeek-R1-Distill-Llama-8B is published at:
http://localhost:3000/proxy/models/main/
dstack can handle auto-scaling and authentication if the corresponding properties are set.
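As a sketch, replica auto-scaling could be added to the service configuration above like this (property names follow dstack's service reference; verify them against the version you run):
# Scale between 1 and 4 replicas, targeting 10 requests per second
replicas: 1..4
scaling:
  metric: rps
  target: 10
Authentication with the dstack token is enabled by default and can be toggled via the service's auth property.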
If deploying a model, once the service is up, you can access it via dstack’s UI in addition to the API endpoint.
Clusters¶
Managed support for Verda’s instant clusters is coming to dstack. In the meantime, if you’ve already created a cluster with Verda, you can access it via dstack by creating an SSH fleet: list the IP address of each node in the cluster, along with the SSH user and the SSH private key used to access the hosts.
Example configuration:
type: fleet
name: my-datacrunch-cluster
ssh_config:
user: ubuntu
identity_file: ~/.ssh/datacrunch_cluster_id_rsa
hosts:
- hostname: 12.34.567.890
blocks: auto
- hostname: 12.34.567.891
blocks: auto
- hostname: 12.34.567.892
blocks: auto
- hostname: 12.34.567.893
blocks: auto
# Set to `cluster` if the instances are interconnected
placement: cluster
To create the fleet, apply the configuration:
$ dstack apply -f my-datacrunch-fleet.dstack.yml
FLEET INSTANCE RESOURCES STATUS CREATED
my-datacrunch-cluster 0 8xH100 (80GB) 0/8 busy 3 mins ago
1 8xH100 (80GB) 0/8 busy 3 mins ago
2 8xH100 (80GB) 0/8 busy 3 mins ago
3 8xH100 (80GB) 0/8 busy 3 mins ago
Once the fleet is created, you can use it for running dev environments, tasks, and services.
With clusters, it’s possible to run distributed tasks.
Example distributed training task configuration:
type: task
name: train-distrib
# The size of the cluster
nodes: 2
python: "3.12"
env:
- NCCL_DEBUG=INFO
commands:
- git clone https://github.com/pytorch/examples.git pytorch-examples
- cd pytorch-examples/distributed/ddp-tutorial-series
- uv pip install -r requirements.txt
- |
torchrun \
--nproc-per-node=$DSTACK_GPUS_PER_NODE \
--node-rank=$DSTACK_NODE_RANK \
--nnodes=$DSTACK_NODES_NUM \
--master-addr=$DSTACK_MASTER_NODE_IP \
--master-port=12345 \
multinode.py 50 10
resources:
gpu: 24GB:1..8
# Uncomment if using multiple GPUs
#shm_size: 24GB
To run the task, apply the configuration:
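For example, assuming the configuration is saved as train-distrib.dstack.yml (the filename is an assumption):
$ dstack apply -f train-distrib.dstack.yml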
dstack automatically runs the container on each node while passing system environment variables, which you can use with torchrun, accelerate, or other distributed frameworks.
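A training script can also read these variables directly instead of receiving them via torchrun flags. A minimal sketch, assuming the DSTACK_* variables shown in the example above (the fallback defaults are ours, to simulate a single-node run when the variables are unset):

```python
import os

# dstack injects these variables into each node's task container;
# defaults cover a local single-node, single-GPU dry run.
node_rank = int(os.environ.get("DSTACK_NODE_RANK", 0))
num_nodes = int(os.environ.get("DSTACK_NODES_NUM", 1))
gpus_per_node = int(os.environ.get("DSTACK_GPUS_PER_NODE", 1))

# Derived topology: total workers and the first global rank on this node.
world_size = num_nodes * gpus_per_node
global_rank_base = node_rank * gpus_per_node

print(f"node {node_rank}/{num_nodes}, world size {world_size}")
```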
dstack’s Documentation¶
dstack supports a wide range of configurations, not only simplifying the development, training, and deployment of AI models but also optimizing cloud resource usage and reducing costs. Explore dstack’s official documentation for more details and configuration options.