LLM Inference
Run Ollama, vLLM, or llama.cpp on GPU pods with either local proxy routing or org-scoped hosted publish
GPU CLI ships a pod-based LLM workflow for Ollama, vLLM, and llama.cpp.
Use it when you want:
- a local Web UI backed by a remote GPU pod
- a local forwarded API port
- an org-scoped hosted model deployment on an existing hosted proxy
- wake-on-request behavior after the pod cools down
- a reusable template/session flow without building a serverless endpoint
If you want a deployed RunPod endpoint instead, use Serverless Endpoints.
Choose a Workflow
gpu llm run
The guided path. It launches the LLM wizard, writes a local project directory, and starts the pod-backed service.
```bash
gpu llm run
```

Use `--publish local` to register the launched service with your local proxy router, or `--publish hosted --name <NAME>` to create an org-scoped hosted model deployment through the active hosted proxy.
gpu use ollama / gpu use vllm
The direct template path. Use this when you want to work with the official templates yourself.
```bash
gpu use ollama
gpu use vllm
```

Quick Start
Interactive wizard
```bash
gpu llm run
```

The wizard walks you through:
- choosing Ollama or vLLM
- selecting a model
- reviewing the generated template files and launch settings
Direct launch
Skip the wizard when you already know the engine and model.
```bash
gpu llm run --ollama --model deepseek-r1:8b -y
gpu llm run --vllm --url meta-llama/Llama-3.1-8B-Instruct -y
gpu llm run --vllm --model Qwen/Qwen2.5-0.5B-Instruct --publish hosted --name staging-qwen -y
```

Hosted publish currently creates one hosted deployment per model. If you want multiple hosted models, run the command once per model name.
Compatibility aliases
The preferred publish surface is:
- `--publish local`
- `--publish hosted --name <NAME>`
The older hidden flags remain as temporary compatibility aliases through 0.20.x only:
- `--register-proxy` -> `--publish local`
- `--deployment-name <NAME>` -> `--publish hosted --name <NAME>`
Use the target-based flags in scripts and operator runbooks. The aliases are scheduled for removal in 0.21.0.
Model lookup
```bash
gpu llm info deepseek-r1:70b
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct --json
```

Engine Differences
| Engine | Best for | API surface | Notes |
|---|---|---|---|
| Ollama | multi-model experimentation | Ollama API plus OpenAI-compatible /v1/* endpoints | supports multiple pulled models in one session |
| vLLM | single-model high-throughput inference | OpenAI-compatible API | optimized for one loaded model at a time |
Port Layout
Ollama
| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 11434 | Ollama API and OpenAI-compatible API |
vLLM
| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 8000 | OpenAI-compatible API |
For local runs, these ports are forwarded to localhost, so your client code talks to local URLs while the model runs on the remote pod.
Hosted publishes use the active hosted proxy's public /v1 base URL instead of a localhost port.
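For quick reference, the local port layout above can be captured in a small helper. This is an illustrative sketch, not part of GPU CLI: the function name is ours, and it assumes the default forwarded ports from the tables above, with the OpenAI-compatible surface under `/v1` on each engine's API port.

```typescript
// Build the local OpenAI-compatible base URL for a forwarded engine,
// using the default ports from the tables above (11434 for Ollama, 8000 for vLLM).
function openAiBaseUrl(engine: "ollama" | "vllm"): string {
  const port = engine === "ollama" ? 11434 : 8000;
  return `http://127.0.0.1:${port}/v1`;
}

console.log(openAiBaseUrl("ollama")); // http://127.0.0.1:11434/v1
console.log(openAiBaseUrl("vllm"));   // http://127.0.0.1:8000/v1
```

Any OpenAI-compatible client can then be pointed at the returned URL while the model itself runs on the remote pod.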
Generated Files and Session State
The LLM workflow generates a local project directory such as llm-ollama or llm-vllm and stores template session state in .gpu/template.json.
That gives you:
- a reusable generated template
- normal `gpu use` resume behavior
- persistent project-local config and startup files
Wake-on-Request and Persistent Proxy
By default, GPU CLI keeps the local proxy listening after the pod stops. When a new request arrives, GPU CLI resumes the pod and shows a loading page while the service comes back.
Default behavior
- pod cools down after `keep_alive_minutes`
- local forwarded port stays bound
- a new request resumes the pod
- requests that arrive while the pod is stopped or resuming get a loading page; clients should retry while the pod comes back
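The default behavior above can be modeled as a tiny state machine. This is an illustrative sketch of the documented request flow, not GPU CLI's internals; the type and function names are ours.

```typescript
type PodState = "running" | "stopped" | "resuming";

// Model of how the persistent proxy reacts to an incoming request:
// a running pod gets the request forwarded, while a stopped pod is
// resumed and the client sees a loading page until the service is back.
function onRequest(state: PodState): { next: PodState; response: "forward" | "loading_page" } {
  if (state === "running") return { next: "running", response: "forward" };
  // "stopped" starts a resume; "resuming" continues one already in flight.
  return { next: "resuming", response: "loading_page" };
}

console.log(onRequest("running").response); // forward
console.log(onRequest("stopped"));          // { next: "resuming", response: "loading_page" }
```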
Disable it
For one-off runs, you can opt out:
```bash
gpu run --no-persistent-proxy python app.py
```

Or in config:

```json
{
  "persistent_proxy": false
}
```

Activity Routing for Long-Lived Apps
Polling-heavy apps can stay awake forever if every request counts as activity. Use rich `ports` rules to define which requests reset the cooldown timer.
```json
{
  "keep_alive_minutes": 20,
  "persistent_proxy": true,
  "ports": [
    {
      "port": 8080,
      "description": "ui",
      "http": {
        "activity_paths": ["/api/chat", "/api/generate"],
        "ignore_paths": ["/health", "/queue", "/metrics"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      },
      "websocket": {
        "data_frames_are_activity": true,
        "ping_pong_is_activity": false
      }
    }
  ]
}
```

Rules of thumb

- Use `activity_paths` for the requests that represent real user work.
- Put health checks, queue polling, and metrics in `ignore_paths`.
- Leave WebSocket ping/pong as non-activity.
- Keep `health_check_paths` aligned with any service-level health routes.
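The HTTP rules above amount to a small request classifier. The sketch below is an illustrative model of how a proxy might decide whether a request resets the cooldown timer, mirroring the `activity_paths` / `ignore_paths` / `ignore_methods` keys; it is not GPU CLI's actual implementation, and the prefix-matching behavior is an assumption.

```typescript
interface HttpRules {
  activity_paths?: string[];
  ignore_paths?: string[];
  ignore_methods?: string[];
}

// Decide whether an HTTP request counts as activity (i.e. resets the
// cooldown timer). Ignore rules win; with an explicit activity_paths
// allow-list, only listed paths count.
function isActivity(method: string, path: string, rules: HttpRules): boolean {
  if (rules.ignore_methods?.includes(method.toUpperCase())) return false;
  if (rules.ignore_paths?.some((p) => path.startsWith(p))) return false;
  if (rules.activity_paths) {
    return rules.activity_paths.some((p) => path.startsWith(p));
  }
  // Without an allow-list, anything not ignored counts as activity.
  return true;
}

const rules: HttpRules = {
  activity_paths: ["/api/chat", "/api/generate"],
  ignore_paths: ["/health", "/queue", "/metrics"],
  ignore_methods: ["OPTIONS", "HEAD"],
};

console.log(isActivity("POST", "/api/chat", rules));    // real user work -> true
console.log(isActivity("GET", "/health", rules));       // health poll -> false
console.log(isActivity("OPTIONS", "/api/chat", rules)); // CORS preflight -> false
```

Under this model, a UI that polls `/queue` every second never keeps the pod awake, while a chat request does.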
Storage Notes
LLM workflows are built around persistent storage so large model downloads survive restarts. That is the main reason the generated flows feel faster after the first run.
See Configuration if you need to override volume behavior.
Calling the APIs
Local Ollama example

```bash
curl http://127.0.0.1:11434/api/tags
```

Local vLLM example

```bash
curl http://127.0.0.1:8000/v1/models
```

OpenAI SDK against local vLLM
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-used-locally",
  baseURL: "http://127.0.0.1:8000/v1",
});
```

OpenAI SDK against a hosted publish
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GPU_PROXY_KEY!,
  baseURL: "https://your-hosted-proxy.gpu-cli.sh/v1",
});
```

When to Use Serverless Instead
Choose Serverless Endpoints when you need:
- a deployed RunPod endpoint URL
- endpoint autoscaling instead of a pod-backed local proxy
- a production-facing remote API surface
Stay with `gpu llm` when you want interactive work, local forwarding, and wake-on-request behavior.