VoxServe is a serving system for Speech Language Models (SpeechLMs). It provides low-latency, high-throughput inference for language models that operate on speech tokens, specifically text-to-speech (TTS) and speech-to-speech (STS) models.
- [2025-02] We released our paper: "VoxServe: A Streaming-Centric Serving System for Speech Language Models"
You can install VoxServe via pip:

```bash
pip install vox-serve
vox-serve --model <model-name> --port <port-number>
```

Or, you can clone the repository and start the inference server via `launch.py`:
```bash
git clone https://github.com/vox-serve/vox-serve.git
cd vox-serve
python -m vox_serve.launch --model <model-name> --port <port-number>
```

Then call the server like this:
```bash
# Generate audio from text
curl -X POST "http://localhost:<port-number>/generate" -F "text=Hello world" -F "streaming=true" -o output.wav

# For models that support audio input
curl -X POST "http://localhost:<port-number>/generate" -F "text=Hello world" -F "audio=@input.wav" -F "streaming=true" -o output.wav
```

We currently support the following TTS and STS models:
- `chatterbox`: Chatterbox TTS
- `cosyvoice2`: CosyVoice2-0.5B
- `csm`: CSM-1B
- `orpheus`: Orpheus-3B
- `qwen3-tts`: Qwen3-TTS-1.7B (custom voice mode only; other modes under development)
- `zonos`: Zonos-v0.1
- `glm`: GLM-4-Voice-9B
- `step`: Step-Audio-2-Mini
We are actively working on expanding model support.
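For programmatic use, the curl calls above can be sketched as a small Python client. This is an illustration only: the `/generate` endpoint and the `text`/`streaming` form fields come from the examples above, while the helper names (`encode_multipart`, `stream_to_file`, `generate_speech`) and the chunk size are hypothetical, not part of the VoxServe API.

```python
# Minimal streaming client sketch using only the Python standard library.
# Endpoint and form fields follow the curl examples in this README; all
# helper names here are illustrative, not the VoxServe API.
import io
import urllib.request
import uuid

def encode_multipart(fields):
    """Encode text fields as a multipart/form-data body, mirroring
    curl's -F "name=value" options. Returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(value.encode() + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

def stream_to_file(chunks, path):
    """Write byte chunks to `path` as they arrive; return bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total

def generate_speech(text, port=8000, out_path="output.wav", chunk_size=4096):
    """POST text to a local VoxServe server and stream the WAV response
    to disk, reading incrementally so a player could start consuming
    audio before generation finishes."""
    body, content_type = encode_multipart({"text": text, "streaming": "true"})
    req = urllib.request.Request(
        f"http://localhost:{port}/generate",
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        chunks = iter(lambda: resp.read(chunk_size), b"")
        return stream_to_file(chunks, out_path)
```

The incremental read matters for the streaming mode: rather than buffering the whole response, each chunk is written (or could be played) as soon as it arrives, which is what `streaming=true` is for.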
See the `./examples` folder for more usage examples.
