LLM Inference
Run Ollama, vLLM, or llama.cpp on GPU pods with either local proxy routing or org-scoped hosted publish
GPU CLI ships a pod-based LLM workflow for Ollama, vLLM, and llama.cpp.
Use it when you want:
- a local Web UI backed by a remote GPU pod
- a local forwarded API port
- an org-scoped hosted model deployment on an existing hosted proxy
- wake-on-request behavior after the pod cools down
- a reusable template/session flow without building a serverless endpoint
If you want a deployed RunPod endpoint instead, use Serverless Endpoints.
Choose a Workflow
gpu llm run
The guided path. It launches the LLM wizard, writes a local project directory, and starts the pod-backed service.
```bash
gpu llm run
```

Use `--publish local` to register the launched service with your local proxy router, or `--publish hosted --name <NAME>` to create an org-scoped hosted model deployment through the active hosted proxy.
gpu use ollama / gpu use vllm
The direct template path. Use this when you want to work with the official templates yourself.
```bash
gpu use ollama
gpu use vllm
```

Quick Start
Interactive wizard
```bash
gpu llm run
```

The wizard walks you through:
- choosing Ollama or vLLM
- selecting a model
- reviewing the generated template files and launch settings
Direct launch
Skip the wizard when you already know the engine and model.
```bash
gpu llm run --ollama --model deepseek-r1:8b -y
gpu llm run --vllm --url meta-llama/Llama-3.1-8B-Instruct -y
gpu llm run --vllm --model Qwen/Qwen2.5-0.5B-Instruct --publish hosted --name staging-qwen -y
```

Hosted publish currently creates one hosted deployment per model. If you want multiple hosted models, run the command once per model name.
Compatibility aliases
The preferred publish surface is:
- `--publish local`
- `--publish hosted --name <NAME>`
The older hidden flags remain as temporary compatibility aliases through 0.20.x only:
- `--register-proxy` -> `--publish local`
- `--deployment-name <NAME>` -> `--publish hosted --name <NAME>`
Use the target-based flags in scripts and operator runbooks. The aliases are scheduled for removal in 0.21.0.
Model lookup
```bash
gpu llm info deepseek-r1:70b
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct --json
```

Engine Differences
| Engine | Best for | API surface | Notes |
|---|---|---|---|
| Ollama | multi-model experimentation | Ollama API plus OpenAI-compatible /v1/* endpoints | supports multiple pulled models in one session |
| vLLM | single-model high-throughput inference | OpenAI-compatible API | optimized for one loaded model at a time |
Port Layout
Ollama
| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 11434 | Ollama API and OpenAI-compatible API |
vLLM
| Port | Purpose |
|---|---|
| 8080 | Web UI |
| 8000 | OpenAI-compatible API |
For local runs, these ports are forwarded to localhost, so your client code talks to local URLs while the model runs on the remote pod.
Hosted publishes use the active hosted proxy's public /v1 base URL instead of a localhost port.
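For quick reference, the local port layout above can be captured in a small helper. This is an illustrative sketch, not part of GPU CLI: the function name is ours, and it assumes the default forwarded ports from the tables above, with the OpenAI-compatible surface under `/v1` on each engine's API port.

```typescript
// Build the local OpenAI-compatible base URL for a forwarded engine,
// using the default ports from the tables above (11434 for Ollama, 8000 for vLLM).
function openAiBaseUrl(engine: "ollama" | "vllm"): string {
  const port = engine === "ollama" ? 11434 : 8000;
  return `http://127.0.0.1:${port}/v1`;
}

console.log(openAiBaseUrl("ollama")); // http://127.0.0.1:11434/v1
console.log(openAiBaseUrl("vllm"));   // http://127.0.0.1:8000/v1
```

Any OpenAI-compatible client can then be pointed at the returned URL while the model itself runs on the remote pod.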
Generated Files and Session State
The LLM workflow generates a local project directory such as llm-ollama or llm-vllm and stores template session state in .gpu/template.json.
That gives you:
- a reusable generated template
- normal `gpu use` resume behavior
- persistent project-local config and startup files
Wake-on-Request and Persistent Proxy
By default, GPU CLI keeps the local proxy listening after the pod stops. When a new request arrives, GPU CLI resumes the pod and shows a loading page while the service comes back.
Default behavior
- pod cools down after `keep_alive_minutes`
- local forwarded port stays bound
- a new request resumes the pod
- requests that arrive while the pod is stopped or resuming get a loading page; clients should retry while the pod comes back
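The default behavior above can be modeled as a tiny state machine. This is an illustrative sketch of the documented request flow, not GPU CLI's internals; the type and function names are ours.

```typescript
type PodState = "running" | "stopped" | "resuming";

// Model of how the persistent proxy reacts to an incoming request:
// a running pod gets the request forwarded, while a stopped pod is
// resumed and the client sees a loading page until the service is back.
function onRequest(state: PodState): { next: PodState; response: "forward" | "loading_page" } {
  if (state === "running") return { next: "running", response: "forward" };
  // "stopped" starts a resume; "resuming" continues one already in flight.
  return { next: "resuming", response: "loading_page" };
}

console.log(onRequest("running").response); // forward
console.log(onRequest("stopped"));          // { next: "resuming", response: "loading_page" }
```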
Disable it
For one-off runs, you can opt out:
```bash
gpu run --no-persistent-proxy python app.py
```

Or in config:

```json
{
  "persistent_proxy": false
}
```

Activity Routing for Long-Lived Apps
Polling-heavy apps can stay awake forever if every request counts as activity. Use rich `ports` rules to define which requests reset the cooldown timer.
```json
{
  "keep_alive_minutes": 20,
  "persistent_proxy": true,
  "ports": [
    {
      "port": 8080,
      "description": "ui",
      "http": {
        "activity_paths": ["/api/chat", "/api/generate"],
        "ignore_paths": ["/health", "/queue", "/metrics"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      },
      "websocket": {
        "data_frames_are_activity": true,
        "ping_pong_is_activity": false
      }
    }
  ]
}
```

Rules of thumb

- Use `activity_paths` for the requests that represent real user work.
- Put health checks, queue polling, and metrics in `ignore_paths`.
- Leave WebSocket ping/pong as non-activity.
- Keep `health_check_paths` aligned with any service-level health routes.
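The HTTP rules above amount to a small request classifier. The sketch below is an illustrative model of how a proxy might decide whether a request resets the cooldown timer, mirroring the `activity_paths` / `ignore_paths` / `ignore_methods` keys; it is not GPU CLI's actual implementation, and the prefix-matching behavior is an assumption.

```typescript
interface HttpRules {
  activity_paths?: string[];
  ignore_paths?: string[];
  ignore_methods?: string[];
}

// Decide whether an HTTP request counts as activity (i.e. resets the
// cooldown timer). Ignore rules win; with an explicit activity_paths
// allow-list, only listed paths count.
function isActivity(method: string, path: string, rules: HttpRules): boolean {
  if (rules.ignore_methods?.includes(method.toUpperCase())) return false;
  if (rules.ignore_paths?.some((p) => path.startsWith(p))) return false;
  if (rules.activity_paths) {
    return rules.activity_paths.some((p) => path.startsWith(p));
  }
  // Without an allow-list, anything not ignored counts as activity.
  return true;
}

const rules: HttpRules = {
  activity_paths: ["/api/chat", "/api/generate"],
  ignore_paths: ["/health", "/queue", "/metrics"],
  ignore_methods: ["OPTIONS", "HEAD"],
};

console.log(isActivity("POST", "/api/chat", rules));    // real user work -> true
console.log(isActivity("GET", "/health", rules));       // health poll -> false
console.log(isActivity("OPTIONS", "/api/chat", rules)); // CORS preflight -> false
```

Under this model, a UI that polls `/queue` every second never keeps the pod awake, while a chat request does.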
Storage Notes
LLM workflows are built around persistent storage so large model downloads survive restarts. That is the main reason the generated flows feel faster after the first run.
See Configuration if you need to override volume behavior.
Calling the APIs
Local Ollama example

```bash
curl http://127.0.0.1:11434/api/tags
```

Local vLLM example

```bash
curl http://127.0.0.1:8000/v1/models
```

OpenAI SDK against local vLLM
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-used-locally",
  baseURL: "http://127.0.0.1:8000/v1",
});
```

OpenAI SDK against a hosted publish
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GPU_PROXY_KEY!,
  baseURL: "https://your-hosted-proxy.gpu-cli.sh/v1",
});
```

When to Use Serverless Instead
Choose Serverless Endpoints when you need:
- a deployed RunPod endpoint URL
- endpoint autoscaling instead of a pod-backed local proxy
- a production-facing remote API surface
Stay with `gpu llm` when you want interactive work, local forwarding, and wake-on-request behavior.