Problem
Voxtral upstream (Mistral) supports zero-shot multilingual voice cloning per the arXiv paper ("Voxtral TTS is preferred for multilingual voice cloning due to its naturalness"), but mlx-audio's port (added in #606) currently only exposes the 20 precomputed voice presets that ship with the model — it has no public API to compute a fresh speaker embedding from an arbitrary reference WAV.
Concretely, in `mlx_audio/tts/models/voxtral_tts/voxtral_tts.py`:

- `post_load_hook` registers the `voice_embedding/*.safetensors` files shipped in the model repo as the universe of usable voices.
- `_get_voice_embedding(voice: str) -> mx.array | None` returns the embedding for one of those preset names; it returns `None` for any other name.
- The public `Model.generate(text, voice='casual_male', ...)` signature only accepts a preset name.
- There is no exposed path that takes a `ref_audio_path` (or an `mx.array` waveform), runs it through Voxtral's speaker encoder, and registers the result as a callable voice, even though that capability appears to be intrinsic to the model: internally a voice is just an embedding, and the format would presumably accept any vector of the right shape produced by the right encoder. (A minimal sketch of the current lookup follows this list.)
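To make the gap concrete, here is a minimal self-contained sketch of the preset-only lookup described above. The dict name `_voice_embeddings` and the two method names come from the source; everything else (the class shape, how the dict is populated, the embedding shape) is a placeholder assumption.

```python
import mlx.core as mx


class VoxtralTTSSketch:
    """Sketch of the lookup behavior in voxtral_tts.py, not the real class."""

    def __init__(self, preset_embeddings: dict[str, mx.array]):
        # In the real model, post_load_hook populates this dict from the
        # voice_embedding/*.safetensors files shipped with the checkpoint.
        self._voice_embeddings = preset_embeddings

    def _get_voice_embedding(self, voice: str) -> mx.array | None:
        # Any name outside the shipped presets falls through to None;
        # no code path accepts reference audio.
        return self._voice_embeddings.get(voice)


presets = {"casual_male": mx.zeros((192,))}  # embedding shape is made up
m = VoxtralTTSSketch(presets)
assert m._get_voice_embedding("casual_male") is not None
assert m._get_voice_embedding("my_clone") is None  # the gap this issue is about
```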
Why this matters
Downstream projects that want to use Voxtral for cloning (rather than just preset playback) have no API to call. Specifically: I'm building a local TTS server (afterwords) that hosts 56 cloned voices across 4 backends (Qwen3 0.6B/1.7B, Chatterbox, VoxCPM 1.5). Each backend takes a 15s reference WAV and produces a usable voice. Voxtral was the obvious 5th backend to add for its multilingual coverage and the cloning quality the paper claims — but with the current API, the only thing we can do with Voxtral is play back the 20 stock presets, which doesn't fit the project's per-voice profile model.
What would unblock this
Any of the following, in increasing order of effort:
1. Document where the gap is. A README/docstring note clarifying that Voxtral cloning is upstream-supported but not currently exposed by mlx-audio's bindings, with pointers to where the speaker encoder lives in the Mistral checkpoint, would let downstream projects make informed decisions.
2. Expose the speaker encoder as a separate callable. Even without a fully wired `add_voice` API, a `model.encode_speaker(audio: mx.array, sample_rate: int) -> mx.array` that returns an embedding of the right shape would let downstream projects manage the voice dictionary themselves.
3. Add a public `register_voice` / `add_voice_from_audio` API. The most direct fix: a method that takes a name plus reference audio, runs the encoder, and inserts the result into `_voice_embeddings` (or writes a `.safetensors` file). Then the existing `generate(voice=name)` path works for arbitrary clips. (A rough sketch of options 2 and 3 follows this list.)
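For concreteness, a rough sketch of what options 2 and 3 could look like together. Only the names `encode_speaker`, `register_voice`, and `_voice_embeddings` come from this proposal and the source; the encoder body is stubbed precisely because its location is the open question below.

```python
import mlx.core as mx


class VoxtralTTSModel:  # stand-in for the real Model class
    _voice_embeddings: dict[str, mx.array]

    def encode_speaker(self, audio: mx.array, sample_rate: int) -> mx.array:
        """Option 2 (hypothetical): run Voxtral's speaker encoder on a clip.

        Would resample `audio` to the encoder's expected rate and run its
        forward pass. Stubbed here because whether the encoder lives in the
        checkpoint or in an external model is exactly the open question.
        """
        raise NotImplementedError("wire this to the speaker encoder")

    def register_voice(self, name: str, audio: mx.array, sample_rate: int) -> None:
        """Option 3 (hypothetical): register a cloned voice under `name`."""
        # Insert into the same dict that post_load_hook fills, so the
        # existing generate(voice=name) path needs no changes at all.
        self._voice_embeddings[name] = self.encode_speaker(audio, sample_rate)
```

Downstream, that would let a server treat Voxtral like any other cloning backend: call `register_voice("narrator", ref_waveform, sample_rate)` once per voice profile, then `generate(text, voice="narrator")` per request.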
I'm happy to contribute a PR for option 2 or 3 if a maintainer can point me at the encoder. From a quick read of the source it's not obvious whether the speaker encoder lives in the Voxtral checkpoint itself (so we'd only need to wire up its forward pass) or whether it's an external model (e.g. a wav2vec/HuBERT/x-vector style speaker encoder) that produced the shipped embeddings offline. That distinction determines the right shape of the patch.
Environment
mlx-community/Voxtral-4B-TTS-2603-mlx-bf16