Problem
Voxtral upstream (Mistral) supports zero-shot multilingual voice cloning per the arXiv paper ("Voxtral TTS is preferred for multilingual voice cloning due to its naturalness"), but mlx-audio's port (added in #606) currently only exposes the 20 precomputed voice presets that ship with the model — it has no public API to compute a fresh speaker embedding from an arbitrary reference WAV.
Concretely, in `mlx_audio/tts/models/voxtral_tts/voxtral_tts.py`:

- `post_load_hook` registers the `voice_embedding/*.safetensors` files shipped in the model repo as the universe of usable voices.
- `_get_voice_embedding(voice: str) -> mx.array | None` returns the embedding for one of those preset names; it returns `None` for any other name.
- The public `Model.generate(text, voice='casual_male', ...)` signature only accepts a preset name.
- There is no exposed path that takes a `ref_audio_path` (or an `mx.array` waveform), runs it through Voxtral's speaker encoder, and registers the result as a callable voice, even though that capability appears to be intrinsic to the model: internally a voice is just an embedding, and the format would presumably accept any vector of the right shape produced by the right encoder. (A minimal sketch of the current lookup follows this list.)
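To make the gap concrete, here is a minimal self-contained sketch of the preset-only lookup described above. The dict name `_voice_embeddings` and the two method names come from the source; everything else (the class shape, how the dict is populated, the embedding shape) is a placeholder assumption.

```python
import mlx.core as mx


class VoxtralTTSSketch:
    """Sketch of the lookup behavior in voxtral_tts.py, not the real class."""

    def __init__(self, preset_embeddings: dict[str, mx.array]):
        # In the real model, post_load_hook populates this dict from the
        # voice_embedding/*.safetensors files shipped with the checkpoint.
        self._voice_embeddings = preset_embeddings

    def _get_voice_embedding(self, voice: str) -> mx.array | None:
        # Any name outside the shipped presets falls through to None;
        # no code path accepts reference audio.
        return self._voice_embeddings.get(voice)


presets = {"casual_male": mx.zeros((192,))}  # embedding shape is made up
m = VoxtralTTSSketch(presets)
assert m._get_voice_embedding("casual_male") is not None
assert m._get_voice_embedding("my_clone") is None  # the gap this issue is about
```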
Why this matters
Downstream projects that want to use Voxtral for cloning (rather than just preset playback) have no API to call. Specifically: I'm building a local TTS server (afterwords) that hosts 56 cloned voices across 4 backends (Qwen3 0.6B/1.7B, Chatterbox, VoxCPM 1.5). Each backend takes a 15s reference WAV and produces a usable voice. Voxtral was the obvious 5th backend to add for its multilingual coverage and the cloning quality the paper claims — but with the current API, the only thing we can do with Voxtral is play back the 20 stock presets, which doesn't fit the project's per-voice profile model.
What would unblock this
Any of the following, in increasing order of effort:
1. Document where the gap is. A README/docstring note clarifying that Voxtral cloning is upstream-supported but not currently exposed by mlx-audio's bindings, with pointers to where the speaker encoder lives in the Mistral checkpoint, would let downstream projects make informed decisions.
2. Expose the speaker encoder as a separate callable. Even without a fully wired `add_voice` API, a `model.encode_speaker(audio: mx.array, sample_rate: int) -> mx.array` that returns an embedding of the right shape would let downstream projects manage the voice dictionary themselves.
3. Add a public `register_voice` / `add_voice_from_audio` API. The most direct fix: a method that takes a name plus reference audio, runs the encoder, and inserts the result into `_voice_embeddings` (or writes a `.safetensors` file). Then the existing `generate(voice=name)` path works for arbitrary clips. (A rough sketch of options 2 and 3 follows this list.)
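For concreteness, a rough sketch of what options 2 and 3 could look like together. Only the names `encode_speaker`, `register_voice`, and `_voice_embeddings` come from this proposal and the source; the encoder body is stubbed precisely because its location is the open question below.

```python
import mlx.core as mx


class VoxtralTTSModel:  # stand-in for the real Model class
    _voice_embeddings: dict[str, mx.array]

    def encode_speaker(self, audio: mx.array, sample_rate: int) -> mx.array:
        """Option 2 (hypothetical): run Voxtral's speaker encoder on a clip.

        Would resample `audio` to the encoder's expected rate and run its
        forward pass. Stubbed here because whether the encoder lives in the
        checkpoint or in an external model is exactly the open question.
        """
        raise NotImplementedError("wire this to the speaker encoder")

    def register_voice(self, name: str, audio: mx.array, sample_rate: int) -> None:
        """Option 3 (hypothetical): register a cloned voice under `name`."""
        # Insert into the same dict that post_load_hook fills, so the
        # existing generate(voice=name) path needs no changes at all.
        self._voice_embeddings[name] = self.encode_speaker(audio, sample_rate)
```

Downstream, that would let a server treat Voxtral like any other cloning backend: call `register_voice("narrator", ref_waveform, sample_rate)` once per voice profile, then `generate(text, voice="narrator")` per request.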
I'm happy to contribute a PR for option 2 or 3 if a maintainer can point me at the encoder. From a quick read of the source it's not obvious whether the speaker encoder lives in the Voxtral checkpoint itself (so we'd only need to wire up its forward pass) or whether it's an external model (e.g. a wav2vec/HuBERT/x-vector style speaker encoder) that produced the shipped embeddings offline. That distinction determines the right shape of the patch.
Environment
mlx-community/Voxtral-4B-TTS-2603-mlx-bf16