
Speech-to-Speech Translation Service

TL;DR: Use Azure Speech Translation

Warning: Large parts of this repository have been vibe coded using the coding agent Codex from OpenAI, and large parts have been reviewed in detail.

Watch the 12-minute presentation where I first justify why I made it and then demonstrate it in action.

This repository hosts a small FastAPI-based server that exposes Meta's SeamlessM4T-V2-Large model (Hugging Face) over a WebSocket interface.

This demo page lets you record some speech and have it translated using SeamlessM4T (probably a V1 model).

The goal of this repository is to run the full speech-to-speech translation pipeline locally without depending on any third-party cloud service. Clients stream audio frames to the /ws endpoint and receive translated audio in real time.

Unfortunately, the model is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), which permits non-commercial use only.

Also, the model had trouble keeping up with real-time speed on my local hardware (AMD RX 7900 XTX).

For those reasons I abandoned the Seamless model and went with a commercial alternative instead: Azure Speech Translation.

The documentation below is mostly AI-generated.

Features

  • Accepts audio in g711_ulaw or pcm16 with sample rates of 8, 16 or 24 kHz.
  • Ingestion strategy: send 20 ms frames and let the server perform VAD-based segmentation.
  • Streams back translated audio chunks followed by an end_of_audio marker (see the client sketch after this list).
  • Python 3.13 with strict typing, ruff linting and mypy type checking.
  • Runs on CPU only or on ROCm 6.3 GPUs.
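
To make the framing and message flow concrete, here is a minimal client sketch. It assumes the third-party websockets package, a server on localhost:8000 and a mono 16 kHz pcm16 WAV input; the endpoint path, the 20 ms frames and the end_of_audio marker come from this README, while the exact message shapes (raw binary frames in, binary audio plus a small JSON control message out) are assumptions made for illustration.

import asyncio
import json
import wave

import websockets  # third-party: pip install websockets

SAMPLE_RATE = 16_000                        # 8, 16 or 24 kHz are accepted
FRAME_BYTES = SAMPLE_RATE * 20 // 1000 * 2  # 20 ms of pcm16 audio, 2 bytes per sample


async def translate_file(path: str, url: str = "ws://localhost:8000/ws") -> bytes:
    with wave.open(path, "rb") as wav:
        pcm = wav.readframes(wav.getnframes())  # expects mono 16 kHz pcm16

    translated = bytearray()
    async with websockets.connect(url) as ws:
        # Stream the audio in 20 ms frames, roughly paced at real time.
        for offset in range(0, len(pcm), FRAME_BYTES):
            await ws.send(pcm[offset : offset + FRAME_BYTES])
            await asyncio.sleep(0.02)

        # About one second of silence lets the server-side VAD close the last segment.
        for _ in range(50):
            await ws.send(b"\x00" * FRAME_BYTES)
            await asyncio.sleep(0.02)

        # Collect translated audio until an end_of_audio marker arrives (assumed shape).
        async for message in ws:
            if isinstance(message, bytes):
                translated.extend(message)
            elif json.loads(message).get("type") == "end_of_audio":
                break
    return bytes(translated)


if __name__ == "__main__":
    asyncio.run(translate_file("input_16khz_mono.wav"))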

Installing dependencies

This project uses uv for dependency management. First install uv and then choose a backend:

# CPU only
uv sync --extra cpu

# AMD ROCm 6.3 GPU
uv sync --extra gpu-rocm

Development tools such as ruff and mypy live in the dev dependency group. Install them when contributing code:

uv pip install --group dev

Intended streaming behavior

  • Once the test page is started, it opens a WebSocket connection and streams audio to the server in 20 ms chunks.
  • The backend performs voice activity detection on the incoming frames.
  • When a pause is detected, the buffered speech is forwarded to the model for translation (see the sketch after this list).
  • As soon as translated audio is available, even partially, the server streams it back to the browser. The server may send audio faster than real time; the client should play it at normal speed until finished.
  • While a translation is running, the server should avoid starting another one, either by pausing WebSocket reads or by otherwise ensuring that only one translation is in flight.
  • Clicking stop on the test page should send roughly one second of silence so the final segment triggers VAD processing.
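
The sketch below makes that segmentation loop concrete from the server's side. It is not the code in app/vad.py or app/translate.py: is_speech, translate_segment, the 500 ms pause threshold and the end_of_audio message shape are placeholders chosen for illustration.

from fastapi import FastAPI, WebSocket

app = FastAPI()

PAUSE_FRAMES = 25  # ~500 ms of consecutive silent 20 ms frames ends a segment (assumed value)


def is_speech(frame: bytes) -> bool:
    """Placeholder for the real VAD decision."""
    raise NotImplementedError


async def translate_segment(pcm: bytes) -> bytes:
    """Placeholder for the SeamlessM4T translation call."""
    raise NotImplementedError


@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket) -> None:
    await ws.accept()
    buffer = bytearray()
    silent_frames = 0

    while True:
        frame = await ws.receive_bytes()  # one 20 ms frame per message
        if is_speech(frame):
            buffer.extend(frame)
            silent_frames = 0
        elif buffer:
            buffer.extend(frame)  # keep a little trailing silence in the segment
            silent_frames += 1

        if buffer and silent_frames >= PAUSE_FRAMES:
            # Not reading from the socket while the model runs is what keeps
            # only one translation in flight at a time.
            translated = await translate_segment(bytes(buffer))
            buffer.clear()
            silent_frames = 0
            await ws.send_bytes(translated)               # may arrive faster than real time
            await ws.send_json({"type": "end_of_audio"})  # assumed marker shape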

Development workflow

  • ./lint.sh - installs dev dependencies and runs ruff and mypy.
  • The translation implementation lives in app/translate.py.
  • run_translate.sh starts the server with a single worker and verbose logging.
  • The VAD implementation lives in app/vad.py.
  • run_vad.sh starts the server with a single worker and verbose logging.

Please ensure that code remains thoroughly typed and linted and that documentation is updated alongside code changes.
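
As a rough illustration of that bar, contributed code should look something like the fully annotated function below, which passes ruff and mypy in strict mode; the function itself is made up for this example and is not part of the repository.

from collections.abc import Iterable


def rms_level(samples: Iterable[int]) -> float:
    """Root-mean-square level of a pcm16 frame; note the fully annotated signature."""
    values = list(samples)
    if not values:
        return 0.0
    return (sum(v * v for v in values) / len(values)) ** 0.5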

Target platform

The author has focused on his own platform and has in some cases even hard-coded values:

  • Operating system: Ubuntu 24.04.3 LTS
  • GPU: Sapphire Radeon RX 7900 XTX Pulse Gaming OC 24 GB
  • CPU: 11th Gen Intel Core i5-1135G7 x 8
  • RAM: 48 GB

The test server (cd test && ./run.sh) assumes that endpoints are exposed using the author's Tailscale DNS name:

./tailscale_serve.sh

