GPU CLI

LLM Inference

Run Ollama, vLLM, or llama.cpp on GPU pods with either local proxy routing or org-scoped hosted publish

GPU CLI ships a pod-based LLM workflow for Ollama, vLLM, and llama.cpp.

Use it when you want:

  • a local Web UI backed by a remote GPU pod
  • a local forwarded API port
  • an org-scoped hosted model deployment on an existing hosted proxy
  • wake-on-request behavior after the pod cools down
  • a reusable template/session flow without building a serverless endpoint

If you want a deployed RunPod endpoint instead, use Serverless Endpoints.

Choose a Workflow

gpu llm run

The guided path. It launches the LLM wizard, writes a local project directory, and starts the pod-backed service.

gpu llm run

Use --publish local to register the launched service with your local proxy router, or --publish hosted --name <NAME> to create an org-scoped hosted model deployment through the active hosted proxy.

gpu use ollama / gpu use vllm

The direct template path. Use this when you want to work with the official templates yourself.

gpu use ollama
gpu use vllm

Quick Start

Interactive wizard

gpu llm run

The wizard walks you through:

  • choosing Ollama or vLLM
  • selecting a model
  • reviewing the generated template files and launch settings

Direct launch

Skip the wizard when you already know the engine and model.

gpu llm run --ollama --model deepseek-r1:8b -y
gpu llm run --vllm --url meta-llama/Llama-3.1-8B-Instruct -y
gpu llm run --vllm --model Qwen/Qwen2.5-0.5B-Instruct --publish hosted --name staging-qwen -y

Hosted publish currently creates one hosted deployment per model. If you want multiple hosted models, run the command once per model name.
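If you script this, one approach is to loop over model names and launch one publish per model. A minimal Python sketch, assuming the flags shown above (the model names, deployment names, and the wrapper function here are illustrative, not part of the CLI):

```python
import subprocess

def hosted_publish_cmd(engine_flag: str, model: str, name: str) -> list[str]:
    """Build one `gpu llm run` invocation that publishes a single model
    as an org-scoped hosted deployment (one deployment per model)."""
    return ["gpu", "llm", "run", engine_flag,
            "--model", model, "--publish", "hosted", "--name", name, "-y"]

# Example: publish several models, one hosted deployment each.
models = {
    "staging-qwen": "Qwen/Qwen2.5-0.5B-Instruct",
    "staging-llama": "meta-llama/Llama-3.1-8B-Instruct",
}
for name, model in models.items():
    cmd = hosted_publish_cmd("--vllm", model, name)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch
```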

Compatibility aliases

The preferred publish surface is:

  • --publish local
  • --publish hosted --name <NAME>

The older hidden flags remain as temporary compatibility aliases through 0.20.x only:

  • --register-proxy -> --publish local
  • --deployment-name <NAME> -> --publish hosted --name <NAME>

Use the target-based flags in scripts and operator runbooks. The aliases are scheduled for removal in 0.21.0.

Model lookup

gpu llm info deepseek-r1:70b
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct
gpu llm info --url meta-llama/Llama-3.1-8B-Instruct --json

Engine Differences

Engine | Best for | API surface | Notes
------ | -------- | ----------- | -----
Ollama | multi-model experimentation | Ollama API plus OpenAI-compatible /v1/* endpoints | supports multiple pulled models in one session
vLLM | single-model high-throughput inference | OpenAI-compatible API | optimized for one loaded model at a time

Port Layout

Ollama

Port | Purpose
---- | -------
8080 | Web UI
11434 | Ollama API and OpenAI-compatible API

vLLM

Port | Purpose
---- | -------
8080 | Web UI
8000 | OpenAI-compatible API

For local runs, these ports are forwarded to localhost, so your client code talks to local URLs while the model runs on the remote pod.

Hosted publishes use the active hosted proxy's public /v1 base URL instead of a localhost port.

Generated Files and Session State

The LLM workflow generates a local project directory such as llm-ollama or llm-vllm and stores template session state in .gpu/template.json.

That gives you:

  • a reusable generated template
  • normal gpu use resume behavior
  • persistent project-local config and startup files

Wake-on-Request and Persistent Proxy

By default, GPU CLI keeps the local proxy listening after the pod stops. When a new request arrives, GPU CLI resumes the pod and shows a loading page while the service comes back.

Default behavior

  • pod cools down after keep_alive_minutes
  • local forwarded port stays bound
  • a new request resumes the pod
  • requests that arrive while the pod is stopped or resuming get a loading page; clients should retry until the service is back
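Client code that hits a forwarded port right after a cooldown should tolerate the resume window. A minimal retry sketch with capped exponential backoff, assuming a local vLLM publish on port 8000 (the URL, timings, and helper names are illustrative, not CLI defaults):

```python
import time
import urllib.error
import urllib.request

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Capped exponential backoff schedule: base, 2*base, 4*base, ... up to cap."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]

def wait_for_service(url: str, attempts: int = 8) -> bool:
    """Poll a forwarded local endpoint until the pod has resumed.

    While the pod is stopped or resuming, the proxy may answer with a
    loading page instead of the real service, so keep retrying until a
    normal response comes back (or give up after `attempts` tries).
    A stricter client might also inspect the body, since a loading page
    can itself return 200."""
    for delay in backoff_delays(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # port not bound yet, or pod still resuming
        time.sleep(delay)
    return False

# Example: wait for a local vLLM publish before sending real traffic.
# wait_for_service("http://127.0.0.1:8000/v1/models")
```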

Disable it

For one-off runs, you can opt out:

gpu run --no-persistent-proxy python app.py

Or in config:

{
  "persistent_proxy": false
}

Activity Routing for Long-Lived Apps

Polling-heavy apps can stay awake forever if every request counts as activity. Use the rich per-port rules in the ports config to define which requests reset the cooldown timer.

{
  "keep_alive_minutes": 20,
  "persistent_proxy": true,
  "ports": [
    {
      "port": 8080,
      "description": "ui",
      "http": {
        "activity_paths": ["/api/chat", "/api/generate"],
        "ignore_paths": ["/health", "/queue", "/metrics"],
        "ignore_methods": ["OPTIONS", "HEAD"]
      },
      "websocket": {
        "data_frames_are_activity": true,
        "ping_pong_is_activity": false
      }
    }
  ]
}

Rules of thumb

  • Use activity_paths for the requests that represent real user work.
  • Put health checks, queue polling, and metrics in ignore_paths.
  • Leave WebSocket ping/pong as non-activity.
  • Keep health_check_paths aligned with any service-level health routes.
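The effect of these rules can be sketched as a small predicate. This is a simplified illustration of the config above, not the proxy's actual implementation (the real matching semantics, e.g. prefix vs exact path matching, may differ):

```python
def counts_as_activity(method: str, path: str, rules: dict) -> bool:
    """Decide whether an HTTP request should reset the cooldown timer,
    mirroring the http rules from the ports config above."""
    if method.upper() in rules.get("ignore_methods", []):
        return False          # e.g. CORS preflight OPTIONS never counts
    if path in rules.get("ignore_paths", []):
        return False          # health checks, queue polling, metrics
    activity = rules.get("activity_paths")
    if activity is not None:
        return path in activity  # explicit allowlist of "real work" routes
    return True               # no allowlist: any non-ignored request counts

rules = {
    "activity_paths": ["/api/chat", "/api/generate"],
    "ignore_paths": ["/health", "/queue", "/metrics"],
    "ignore_methods": ["OPTIONS", "HEAD"],
}
```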

Storage Notes

LLM workflows are built around persistent storage so large model downloads survive restarts. That is the main reason the generated flows feel faster after the first run.

Use Configuration if you need to override volume behavior.

Calling the APIs

Local Ollama example

curl http://127.0.0.1:11434/api/tags

Local vLLM example

curl http://127.0.0.1:8000/v1/models

OpenAI SDK against local vLLM

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-used-locally",
  baseURL: "http://127.0.0.1:8000/v1",
});

OpenAI SDK against a hosted publish

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GPU_PROXY_KEY!,
  baseURL: "https://your-hosted-proxy.gpu-cli.sh/v1",
});

When to Use Serverless Instead

Choose Serverless Endpoints when you need:

  • a deployed RunPod endpoint URL
  • endpoint autoscaling instead of a pod-backed local proxy
  • a production-facing remote API surface

Stay with gpu llm when you want interactive work, local forwarding, and wake-on-request behavior.
