TL;DR: Use Azure Speech Translation
Warning: Large parts of this repository have been vibe coded using the Codex coding agent from OpenAI, and large parts have been reviewed in detail.
Watch the 12-minute presentation where I first justify why I made it and then demonstrate it in action.
This repository hosts a small FastAPI-based server that exposes Meta's SeamlessM4T-V2-Large model (Hugging Face) over a WebSocket interface.
This demo page lets you record some speech and have it translated using SeamlessM4T (probably a V1 model).
The goal of this repository is to run the full speech-to-speech translation pipeline locally without depending on any third-party cloud service. Clients stream audio frames to the `/ws` endpoint and receive translated audio in real time.
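For orientation, here is a minimal sketch of what such a WebSocket endpoint can look like. It assumes FastAPI and uses illustrative names only; it is not the repository's actual handler.

```python
"""Minimal sketch of a WebSocket audio endpoint, assuming FastAPI.

The handler body, buffering and reply logic are illustrative assumptions,
not the repository's actual implementation.
"""
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


@app.websocket("/ws")
async def translate_ws(ws: WebSocket) -> None:
    await ws.accept()
    buffered: list[bytes] = []
    try:
        while True:
            # Clients send raw audio as binary frames (e.g. 20 ms of PCM16).
            frame = await ws.receive_bytes()
            buffered.append(frame)
            # A real server would run VAD here and, once a pause is detected,
            # translate the buffered speech and stream the result back as
            # binary frames followed by an end-of-audio marker.
    except WebSocketDisconnect:
        pass
```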
Unfortunately, the model is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), which permits non-commercial use only.
Also, the model had trouble keeping up with real-time speed on my local hardware (AMD RX 7900 XTX).
For those reasons I abandoned the Seamless model and went with a commercial alternative instead: Azure Speech Translation.
The documentation below is mostly AI-generated.
- Accepts audio in `g711_ulaw` or `pcm16` with sample rates of 8, 16 or 24 kHz.
- Ingestion strategy: send 20 ms frames and let the server perform VAD-based segmentation (see the frame-size sketch after this list).
- Streams back translated audio chunks followed by an `end_of_audio` marker.
- Python 3.13 with strict typing, `ruff` linting and `mypy` type checking.
- Runs on CPU only or on ROCm 6.3 GPUs.
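For reference, the 20 ms frame size follows directly from the sample rate and encoding (PCM16 is 2 bytes per sample, G.711 μ-law is 1 byte per sample). A small sketch of the arithmetic:

```python
# Bytes per 20 ms frame for the supported sample rates.
FRAME_MS = 20

for sample_rate in (8_000, 16_000, 24_000):
    samples = sample_rate * FRAME_MS // 1000
    print(f"{sample_rate} Hz: {samples} samples -> "
          f"{samples * 2} bytes (pcm16), {samples} bytes (g711_ulaw)")
# 8000 Hz:  160 samples -> 320 bytes (pcm16), 160 bytes (g711_ulaw)
# 16000 Hz: 320 samples -> 640 bytes (pcm16), 320 bytes (g711_ulaw)
# 24000 Hz: 480 samples -> 960 bytes (pcm16), 480 bytes (g711_ulaw)
```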
This project uses uv for dependency
management. First install uv and then choose a backend:
```bash
# CPU only
uv sync --extra cpu

# AMD ROCm 6.3 GPU
uv sync --extra gpu-rocm
```

Development tools such as `ruff` and `mypy` live in the `dev` dependency group. Install them when contributing code:
```bash
uv pip install --group dev
```

- Once the test page is started it opens a WebSocket connection and streams audio in 20 ms chunks to the server.
- The backend performs voice activity detection on the incoming frames.
- When a pause is detected, the buffered speech is forwarded to the model for translation.
- As soon as translated audio is available, even partially, the server streams it back to the browser. The server may send audio faster than real time; the client should play it at normal speed until finished.
- While a translation is running, the server should avoid beginning another one by pausing WebSocket reads or otherwise ensuring that only one translation is in flight (see the sketch after this list).
- Clicking stop on the test page should send roughly one second of silence so the final segment triggers VAD processing.
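A rough sketch of the single-in-flight idea, assuming an asyncio handler. `detect_pause` and `translate_segment` are hypothetical helpers standing in for the logic in `app/vad.py` and `app/translate.py`, and the `end_of_audio` text marker is an assumption about the wire format.

```python
from collections.abc import AsyncIterator, Callable

from fastapi import WebSocket


async def handle_frames(
    ws: WebSocket,
    detect_pause: Callable[[bytes], bool],  # hypothetical VAD helper
    translate_segment: Callable[[bytes], AsyncIterator[bytes]],  # hypothetical translator
) -> None:
    buffered = bytearray()
    while True:
        frame = await ws.receive_bytes()
        buffered.extend(frame)
        if detect_pause(frame):
            # Awaiting the translation here pauses WebSocket reads, which is one
            # way to guarantee that only one translation is in flight at a time.
            async for chunk in translate_segment(bytes(buffered)):
                await ws.send_bytes(chunk)
            await ws.send_text("end_of_audio")  # marker format is an assumption
            buffered.clear()
```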
- `./lint.sh` - installs dev dependencies, runs `ruff` and `mypy`.
- The translation implementation lives in `app/translate.py`. `run_translate.sh` starts the server with a single worker and verbose logging.
- The VAD implementation lives in `app/vad.py`. `run_vad.sh` starts the server with a single worker and verbose logging.
Please ensure that code remains thoroughly typed and linted and that documentation is updated alongside code changes.
The author has focused on his own platform and has in some cases even hard-coded values:
- Operating system: Ubuntu 24.04.3 LTS
- GPU: Sapphire Radeon RX 7900 XTX Pulse Gaming OC 24 GB
- CPU: 11th Gen Intel Core i5-1135G7 x 8
- RAM: 48 GB
The test server (`cd test && ./run.sh`) assumes that endpoints are exposed using the author's Tailscale DNS name:

```bash
./tailscale_serve.sh
```