How to continue self-supervised pretraining of w2v-BERT (e.g. w2v-BERT 2.0) using unlabeled speech data? #3015
akbar20gh started this conversation in Feature Request
I'm exploring whether it’s possible to continue self-supervised pretraining (CPT) of w2v-BERT models — especially w2v-BERT 2.0 — using only unlabeled speech data (without text transcripts).
My goal is to adapt the acoustic encoder to a specific speech domain before fine-tuning it for ASR.
I'm aware that for wav2vec 2.0, continued pretraining on unlabeled, domain-specific audio is achievable via the same contrastive / masked-prediction objective used in its original pretraining.
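For reference, this is the kind of continued-pretraining step I mean, shown for wav2vec 2.0 (a minimal sketch, assuming the Hugging Face transformers `Wav2Vec2ForPreTraining` class and its masking / negative-sampling helpers; the random audio and the `facebook/wav2vec2-base` starting checkpoint are placeholders for real domain data and whatever checkpoint one continues from):

```python
# Minimal sketch of one continued-pretraining step for wav2vec 2.0, assuming the
# Hugging Face `transformers` implementation (Wav2Vec2ForPreTraining).
# Dummy random audio stands in for real unlabeled domain data.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

checkpoint = "facebook/wav2vec2-base"  # placeholder: the checkpoint you continue from
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForPreTraining.from_pretrained(checkpoint)
model.train()

# One batch of (dummy) 16 kHz domain audio.
raw_audio = [np.random.randn(16_000).astype(np.float32) for _ in range(2)]
inputs = feature_extractor(raw_audio, sampling_rate=16_000, return_tensors="pt", padding=True)

batch_size, raw_len = inputs.input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Sample masked time steps and negatives for the contrastive / masked-prediction objective.
# (mask_prob/mask_length chosen for this short dummy clip; real configs differ.)
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.2, mask_length=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.long),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
outputs.loss.backward()  # contrastive + diversity loss; step an optimizer as usual
```

What I'm missing is the equivalent of this loop for w2v-BERT, where (as far as I can tell) no public pretraining head or script exists.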
My questions are:
Is there a recommended or supported way to perform such continued self-supervised training for w2v-BERT models (either v1 or v2)?
Are the pretraining scripts or configs for this setup (masked prediction + contrastive objective) publicly available?
If not yet available, are there implementation notes or references you could share for reproducing the pretraining pipeline for w2v-BERT?
The intended workflow is:
Start from a released pretrained checkpoint (facebook/w2v-bert-2.0)
Continue self-supervised learning with new unlabeled domain-specific audio
Then fine-tune with paired (speech, text) data for downstream ASR (sketched below).
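To make step 3 concrete, here is roughly the fine-tuning I have in mind (a minimal sketch, assuming the transformers `Wav2Vec2BertForCTC` class; the 32-symbol vocabulary, dummy audio, and dummy label ids are placeholders for a real tokenizer and real paired data):

```python
# Minimal sketch of CTC fine-tuning from the released facebook/w2v-bert-2.0 checkpoint,
# assuming the Hugging Face `transformers` Wav2Vec2BertForCTC class.
# A hypothetical 32-symbol vocabulary and dummy audio/labels stand in for real data.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

checkpoint = "facebook/w2v-bert-2.0"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2BertForCTC.from_pretrained(
    checkpoint,
    vocab_size=32,               # size of your tokenizer's vocabulary (placeholder)
    ctc_loss_reduction="mean",
    pad_token_id=0,
)
model.train()

audio = [np.random.randn(32_000).astype(np.float32)]   # ~2 s of dummy 16 kHz audio
inputs = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")
labels = torch.tensor([[5, 8, 12, 3, 7]])               # token ids from your tokenizer (placeholder)

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()  # CTC loss; step an optimizer (e.g. AdamW) as usual
```

Ideally the checkpoint loaded here would be the domain-adapted one produced by step 2, rather than the released checkpoint directly.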
Any guidance, scripts, or clarification would be greatly appreciated.
Thanks for maintaining such an impactful model and for open-sourcing it.