Whisper: fall back to canonical openai/whisper-* processor when mlx-community repos lack one #712
Draft
contrapuntal wants to merge 1 commit
Conversation
Force-pushed from 80ff9ac to ad33fd5
…izzy#645)

mlx-community whisper conversions ship weights only — no preprocessor_config.json or tokenizer files — so WhisperProcessor.from_pretrained silently fails on load and the model crashes with `ValueError: Processor not found.` on first generate().

Recover by reading the architecture signature from config.json — the 5-tuple (n_audio_state, n_mels, n_audio_layer, n_text_layer, n_vocab) uniquely identifies each canonical openai/whisper variant, including all .en English-only models and large-v3-turbo (distinguished by 4 decoder layers vs 32). Map the signature to the corresponding openai/whisper-* repo and retry the processor load there. Identifying by dims rather than directory name handles the real mlx-community landscape — ~50+ repos with arbitrary suffixes (whisper-large-v3-mlx-4bit, whisper-base-mlx-q4, whisper-base.en-mlx-fp32, whisper-large-v3-asr-4bit, etc.) and user-renamed local directories.

Also tightens error handling:

* Catch OSError specifically on the local load (transformers' signal for missing files) rather than bare Exception. Other failures — corrupt JSON, permission errors — propagate so a fine-tuned local checkpoint can't be silently masked by the canonical OpenAI processor (a vocab mismatch would generate garbage transcription with no error signal).
* Catch ImportError specifically on the transformers import.

Documents HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE in the load-helper docstring so users in air-gapped environments know how to suppress the fallback's network round-trip.

Handles both openai/mlx config keys (n_audio_state, n_mels, …) and HF Transformers keys (d_model, num_mel_bins, …).

Fixes Blaizzy#645
Force-pushed from ad33fd5 to 060b813
Loading any mlx-community whisper repo (`whisper-large-v3-mlx`, `whisper-base-mlx-4bit`, `whisper-base.en-mlx`, etc.) crashes on first transcription with `ValueError: Processor not found.` These repos ship weights only, so `WhisperProcessor.from_pretrained` raises during load — leaving `_processor = None` after only a warning.

This PR adds a fallback: read the architecture signature from `config.json` and retry the processor load against the canonical `openai/whisper-*` repo that produced this architecture. Processor files are architecture-independent (~4 MB), so a one-time download recovers transcription with no user intervention.

### Architecture-keyed lookup
The 5-tuple `(n_audio_state, n_mels, n_audio_layer, n_text_layer, n_vocab)` from `config.json` uniquely identifies each canonical openai/whisper variant. `vocab_size = 51864` flags `.en` English-only models; `n_mels = 128` flags the large-v3 family; `n_text_layer = 4` flags large-v3-turbo. large-v1 and large-v2 share dims (their processor files are interchangeable), so the lookup maps that signature to large-v2.

Identifying by dims rather than directory name handles the real mlx-community landscape uniformly — ~50+ repos with arbitrary suffixes like `-4bit`, `-8bit`, `-q4`, `-fp32`, `-asr-*`, plus user-renamed local directories. Both openai/mlx config keys (`n_audio_state`, `n_mels`, …) and HF Transformers keys (`d_model`, `num_mel_bins`, …) are read.
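A rough sketch of what the lookup amounts to (the helper name `resolve_canonical_repo` and the subset of table entries shown are illustrative assumptions, not the PR's actual code; the dims are the commonly cited OpenAI values, not verified against the patch):

```python
import json
from pathlib import Path

# Illustrative subset of the signature table:
# (n_audio_state, n_mels, n_audio_layer, n_text_layer, n_vocab) -> canonical repo
_CANONICAL_BY_DIMS = {
    (384, 80, 4, 4, 51865): "openai/whisper-tiny",
    (384, 80, 4, 4, 51864): "openai/whisper-tiny.en",
    (512, 80, 6, 6, 51865): "openai/whisper-base",
    (1280, 128, 32, 32, 51866): "openai/whisper-large-v3",
    (1280, 128, 32, 4, 51866): "openai/whisper-large-v3-turbo",
}

def resolve_canonical_repo(model_dir: str) -> str | None:
    """Map a checkpoint's architecture signature to a canonical openai/whisper-* repo."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    # Accept both openai/mlx keys and HF Transformers keys.
    signature = (
        cfg.get("n_audio_state", cfg.get("d_model")),
        cfg.get("n_mels", cfg.get("num_mel_bins")),
        cfg.get("n_audio_layer", cfg.get("encoder_layers")),
        cfg.get("n_text_layer", cfg.get("decoder_layers")),
        cfg.get("n_vocab", cfg.get("vocab_size")),
    )
    return _CANONICAL_BY_DIMS.get(signature)
```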
### Error handling

Catches `OSError` specifically on the local load (transformers' signal for missing files) instead of bare `Exception`. Other failures — corrupt JSON, permission errors — propagate so a fine-tuned local checkpoint can't be silently masked by the canonical OpenAI processor; a vocab mismatch would generate garbage transcription with no error signal otherwise. The `transformers` import catches `ImportError` specifically.
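A minimal sketch of that control flow, reusing the illustrative `resolve_canonical_repo` helper above (the name `load_processor` is also an assumption, not the PR's real helper):

```python
def load_processor(model_dir: str):
    """Illustrative load-with-fallback flow; not the PR's actual implementation."""
    try:
        from transformers import WhisperProcessor
    except ImportError:
        return None  # transformers missing: keep the existing warn-and-skip path

    try:
        # OSError is transformers' signal for missing processor/tokenizer files.
        return WhisperProcessor.from_pretrained(model_dir)
    except OSError:
        canonical = resolve_canonical_repo(model_dir)
        if canonical is None:
            return None  # unknown architecture: preserve the old warn + None behavior
        # One-time ~4 MB fetch of the architecture-matched canonical processor.
        return WhisperProcessor.from_pretrained(canonical)
    # Anything else (corrupt JSON, permission errors, ValueError) propagates, so a
    # fine-tuned local checkpoint can't be silently masked by the canonical files.
```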
### Network behavior

When the fallback fires, the canonical processor is fetched from HF Hub (~4 MB). Set `HF_HUB_OFFLINE=1` or `TRANSFORMERS_OFFLINE=1` in air-gapped environments — transformers raises a clear offline-mode error rather than waiting on a network timeout. Documented in the load-helper docstring.
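For example, an air-gapped run can flip the offline switches before anything imports transformers (setting them in the shell works equally well):

```python
# Keep the fallback from reaching out to HF Hub in offline environments.
# Must be set before transformers / huggingface_hub are first imported.
import os
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```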
### Behavior

Before, on `mlx-community/whisper-base-mlx-4bit` (or any non-canonical name):

After:
Repos whose architecture isn't a recognized canonical variant, or that lack a readable `config.json`, preserve the existing "warn and set `_processor = None`" behavior.

Tests: 11 unittest cases covering dim-based resolution (tiny / quantized / large-v3 / large-v3-turbo / `.en` / HF Transformers config format), behavior preservation (missing config / unknown dims / local success), and error propagation (`ValueError` propagates, canonical fallback failure leaves `_processor = None`).

Fixes #645
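Purely for illustration (the PR's real test names and helpers differ; this assumes the `resolve_canonical_repo` sketch from earlier is importable), one dim-based resolution case might look like:

```python
import json
import tempfile
import unittest
from pathlib import Path

# Assumes the illustrative resolve_canonical_repo helper sketched above is in scope.

class TestCanonicalResolution(unittest.TestCase):
    def test_large_v3_turbo_resolved_by_decoder_layer_count(self):
        with tempfile.TemporaryDirectory() as d:
            cfg = {"n_audio_state": 1280, "n_mels": 128, "n_audio_layer": 32,
                   "n_text_layer": 4, "n_vocab": 51866}
            (Path(d) / "config.json").write_text(json.dumps(cfg))
            self.assertEqual(resolve_canonical_repo(d),
                             "openai/whisper-large-v3-turbo")

if __name__ == "__main__":
    unittest.main()
```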