# Serverless Endpoints

Deploy and manage RunPod Serverless endpoints with GPU CLI.
GPU CLI wraps RunPod Serverless with a config-first workflow for ComfyUI, vLLM, Whisper, and custom worker images.
Use this guide when you want a deployed RunPod endpoint. If you want a pod-based LLM workflow with a local Web UI, local API ports, and wake-on-request routing, use LLM Inference instead.
## Prerequisites
- GPU CLI installed and authenticated
- RunPod API key configured with `gpu auth login`
## Quick Start
1. **Create a serverless block**

   ```json
   {
     "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
     "serverless": {
       "template": "vllm",
       "gpu_type": "NVIDIA A100 80GB PCIe",
       "scaling": {
         "min_workers": 0,
         "max_workers": 3,
         "idle_timeout": 5
       }
     }
   }
   ```

2. **Deploy**

   ```bash
   gpu serverless deploy
   ```

   GPU CLI resolves the official template, applies your config, and creates or updates the RunPod endpoint.
3. **Call the endpoint**

   ```bash
   curl https://api.runpod.ai/v2/{endpoint-id}/runsync \
     -H "Authorization: Bearer $RUNPOD_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"input": {"prompt": "a cat in space"}}'
   ```
## Templates

| Template | Use case |
|---|---|
| `auto` | Auto-detect from project files |
| `comfyui` | Image generation workflows |
| `vllm` | OpenAI-compatible LLM inference |
| `whisper` | Audio transcription |
| `custom-image` | Custom worker image |
## Configuration
The full schema lives in Configuration. The most important fields are:
- `serverless.template`
- `serverless.gpu_type`
- `serverless.gpu_types`
- `serverless.scaling`
- `serverless.volume`
- `serverless.runpod`
### Example: ComfyUI endpoint
```json
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "comfyui",
    "gpu_type": "NVIDIA GeForce RTX 4090",
    "scaling": {
      "min_workers": 0,
      "max_workers": 5,
      "idle_timeout": 10
    },
    "volume": {
      "name": "comfyui-models",
      "size_gb": 200
    },
    "runpod": {
      "flashboot": true
    }
  }
}
```
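Once deployed, workers built on the ComfyUI template generally expect a workflow graph inside the standard `input` envelope. A hedged sketch of building that request body in Python; the `workflow` key is an assumption about the worker image's handler, so check your template's documentation for the exact schema:

```python
import json


def comfyui_payload(workflow: dict) -> str:
    """Wrap a ComfyUI workflow graph in the serverless input envelope.

    NOTE: the {"input": {"workflow": ...}} shape is an assumption about
    the worker image's handler, not something GPU CLI guarantees.
    """
    return json.dumps({"input": {"workflow": workflow}})


# Placeholder graph for illustration only -- not a runnable ComfyUI workflow.
body = comfyui_payload({"3": {"class_type": "KSampler", "inputs": {"steps": 20}}})
print(body)
```

The resulting string is what you would pass as the request body to `/runsync` or `/run`, exactly as in the curl example above.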
### Example: vLLM endpoint

```json
{
  "$schema": "https://gpu-cli.sh/schema/v1/gpu.json",
  "serverless": {
    "template": "vllm",
    "gpu_type": "NVIDIA A100 80GB PCIe",
    "gpu_types": ["NVIDIA L4"],
    "scaling": {
      "min_workers": 1,
      "max_workers": 3,
      "idle_timeout": 30
    },
    "runpod": {
      "cached_model": "meta-llama/Llama-3.1-8B-Instruct",
      "env": {
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```
## Managing Endpoints

### List
```bash
gpu serverless list
gpu serverless list --all
gpu serverless list --json
```

### Status
`status` is endpoint-ID-driven today; pass an endpoint ID explicitly:
```bash
gpu serverless status <ENDPOINT_ID>
gpu serverless status <ENDPOINT_ID> --json
```

### Delete
Delete accepts an endpoint ID or name. If you omit it, the CLI shows an interactive picker.
```bash
gpu serverless delete
gpu serverless delete my-endpoint --force
```

### Warm
GPU warmup is the supported path today:
```bash
gpu serverless warm <ENDPOINT_ID> --gpu
gpu serverless warm <ENDPOINT_ID> --timeout 900
```

### Template delete
```bash
gpu serverless template delete
gpu serverless template delete tmpl_123 --force
```

### Logs
```bash
gpu serverless logs <ENDPOINT_ID>
```

`gpu serverless logs` currently points you to the RunPod dashboard rather than streaming filtered logs inside the CLI.
## Current Limitations
- `gpu serverless warm --cpu` is not implemented in the current runtime.
- Deploy-time `gpu serverless deploy --warm ...` is accepted by the CLI but not wired through the runtime.
- Deploy-time `gpu serverless deploy --write-ids ...` is accepted by the CLI but not wired through the runtime.
- `serverless.prewarm` exists in the schema but should be treated as reserved configuration today.
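Until `serverless.prewarm` is wired through, the config-level way to keep capacity resident is `scaling.min_workers` (as in the vLLM example above): RunPod keeps at least that many workers running, trading idle cost for lower first-request latency.

```json
{
  "serverless": {
    "scaling": {
      "min_workers": 1
    }
  }
}
```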
## Calling a vLLM Endpoint

### TypeScript
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.ENDPOINT_ID}/openai/v1`,
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```

### Python
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```

## Next Steps
- LLM Inference — Pod-based Ollama and vLLM with wake-on-request routing
- Configuration Reference — Full serverless config
- Commands Reference — CLI flags and limitations
- Troubleshooting — Current caveats and recovery steps