# Serverless Endpoints

Deploy and manage RunPod Serverless endpoints with GPU CLI.
GPU CLI wraps RunPod Serverless with a config-first workflow for ComfyUI, vLLM, Whisper, and custom worker images.
Use this guide when you want a deployed RunPod endpoint. If you want a pod-based LLM workflow with a local Web UI, local API ports, and wake-on-request routing, use LLM Inference instead.
## Prerequisites
- GPU CLI installed and authenticated
- RunPod API key configured with `gpu auth login`
## Quick Start
1. **Create a serverless block**

   ```json
   {
     "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
     "serverless": {
       "template": "vllm",
       "gpu_type": "NVIDIA A100 80GB PCIe",
       "scaling": {
         "min_workers": 0,
         "max_workers": 3,
         "idle_timeout": 5
       }
     }
   }
   ```

2. **Deploy**

   ```bash
   gpu serverless deploy
   ```

   GPU CLI resolves the official template, applies your config, and creates or updates the RunPod endpoint.
3. **Call the endpoint**

   ```bash
   curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
     -H "Authorization: Bearer $RUNPOD_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"input": {"prompt": "a cat in space"}}'
   ```
## Templates

| Template | Use case |
|---|---|
| `auto` | Auto-detect from project files |
| `comfyui` | Image generation workflows |
| `vllm` | OpenAI-compatible LLM inference |
| `whisper` | Audio transcription |
| `custom-image` | Custom worker image |
## Configuration
The full schema lives in Configuration. The most important fields are:
- `serverless.template`
- `serverless.gpu_type`
- `serverless.gpu_types`
- `serverless.scaling`
- `serverless.volume`
- `serverless.runpod`
### Example: ComfyUI endpoint
```json
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "comfyui-models",
      "size_gb": 200
    },
    "runpod": {
      "flashboot": true
    }
  }
}
```
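Once deployed, workers built on the ComfyUI template generally expect a workflow graph inside the standard `input` envelope. A hedged sketch of building that request body in Python; the `workflow` key is an assumption about the worker image's handler, so check your template's documentation for the exact schema:

```python
import json


def comfyui_payload(workflow: dict) -> str:
    """Wrap a ComfyUI workflow graph in the serverless input envelope.

    NOTE: the {"input": {"workflow": ...}} shape is an assumption about
    the worker image's handler, not something GPU CLI guarantees.
    """
    return json.dumps({"input": {"workflow": workflow}})


# Placeholder graph for illustration only -- not a runnable ComfyUI workflow.
body = comfyui_payload({"3": {"class_type": "KSampler", "inputs": {"steps": 20}}})
print(body)
```

The resulting string is what you would pass as the request body to `/runsync` or `/run`, exactly as in the curl example above.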
### Example: vLLM endpoint

```json
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4"],
    "scaling": {
      "min_workers": 1,
      "max_workers": 3,
      "idle_timeout": 30
    },
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct",
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```
## Managing Endpoints

### List
```bash
gpu serverless list
gpu serverless list --all
gpu serverless list --json
```

### Status
`status` is endpoint-ID-driven today; pass an endpoint ID explicitly:
```bash
gpu serverless status <ENDPOINT_ID>
gpu serverless status <ENDPOINT_ID> --json
```

### Delete
Delete accepts an endpoint ID or name. If you omit it, the CLI shows an interactive picker.
```bash
gpu serverless delete
gpu serverless delete my-endpoint --force
```

### Warm
GPU warmup is the supported path today:
```bash
gpu serverless warm <ENDPOINT_ID> --gpu
gpu serverless warm <ENDPOINT_ID> --timeout 900
```

### Template delete
```bash
gpu serverless template delete
gpu serverless template delete tmpl_123 --force
```

### Logs
```bash
gpu serverless logs <ENDPOINT_ID>
```

`gpu serverless logs` currently points you to the RunPod dashboard rather than streaming filtered logs inside the CLI.
## Current Limitations
- `gpu serverless warm --cpu` is not implemented in the current runtime.
- Deploy-time `gpu serverless deploy --warm ...` is accepted by the CLI but not wired through the runtime.
- Deploy-time `gpu serverless deploy --write-ids ...` is accepted by the CLI but not wired through the runtime.
- `serverless.prewarm` exists in the schema but should be treated as reserved configuration today.
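Until `serverless.prewarm` is wired through, the config-level way to keep capacity resident is `scaling.min_workers` (as in the vLLM example above): RunPod keeps at least that many workers running, trading idle cost for lower first-request latency.

```json
{
  "serverless": {
    "scaling": {
      "min_workers": 1
    }
  }
}
```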
## Calling a vLLM Endpoint

### TypeScript
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.ENDPOINT_ID}/openai/v1`,
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```

### Python
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```

## Next Steps
- LLM Inference — Pod-based Ollama and vLLM with wake-on-request routing
- Configuration Reference — Full serverless config
- Commands Reference — CLI flags and limitations
- Troubleshooting — Current caveats and recovery steps