How to continue self-supervised pretraining of w2v-BERT (e.g. w2v-BERT 2.0) using unlabeled speech data? #3015
akbar20gh started this conversation in Feature Request
I'm exploring whether it’s possible to continue self-supervised pretraining (CPT) of w2v-BERT models — especially w2v-BERT 2.0 — using only unlabeled speech data (without text transcripts).
My goal is to adapt the acoustic encoder to a specific speech domain before fine-tuning it for ASR.
I'm aware that for wav2vec 2.0, continued pretraining on unlabeled, domain-specific audio is achievable via the same contrastive / masked-prediction objective used in its original pretraining.
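For reference, this is the kind of continued-pretraining step I mean, shown for wav2vec 2.0 (a minimal sketch, assuming the Hugging Face transformers `Wav2Vec2ForPreTraining` class and its masking / negative-sampling helpers; the random audio and the `facebook/wav2vec2-base` starting checkpoint are placeholders for real domain data and whatever checkpoint one continues from):

```python
# Minimal sketch of one continued-pretraining step for wav2vec 2.0, assuming the
# Hugging Face `transformers` implementation (Wav2Vec2ForPreTraining).
# Dummy random audio stands in for real unlabeled domain data.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

checkpoint = "facebook/wav2vec2-base"  # placeholder: the checkpoint you continue from
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForPreTraining.from_pretrained(checkpoint)
model.train()

# One batch of (dummy) 16 kHz domain audio.
raw_audio = [np.random.randn(16_000).astype(np.float32) for _ in range(2)]
inputs = feature_extractor(raw_audio, sampling_rate=16_000, return_tensors="pt", padding=True)

batch_size, raw_len = inputs.input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Sample masked time steps and negatives for the contrastive / masked-prediction objective.
# (mask_prob/mask_length chosen for this short dummy clip; real configs differ.)
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.2, mask_length=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.long),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
outputs.loss.backward()  # contrastive + diversity loss; step an optimizer as usual
```

What I'm missing is the equivalent of this loop for w2v-BERT, where (as far as I can tell) no public pretraining head or script exists.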
My questions are:
Is there a recommended or supported way to perform such continued self-supervised training for w2v-BERT models (either v1 or v2)?
Are the pretraining scripts or configs for this setup (masked prediction + contrastive objective) publicly available?
If not yet available, are there implementation notes or references you could share for reproducing the pretraining pipeline for w2v-BERT?
The intended workflow is:
Start from a released pretrained checkpoint (facebook/w2v-bert-2.0)
Continue self-supervised learning with new unlabeled domain-specific audio
Then fine-tune with paired (speech, text) data for downstream ASR (sketched below).
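To make step 3 concrete, here is roughly the fine-tuning I have in mind (a minimal sketch, assuming the transformers `Wav2Vec2BertForCTC` class; the 32-symbol vocabulary, dummy audio, and dummy label ids are placeholders for a real tokenizer and real paired data):

```python
# Minimal sketch of CTC fine-tuning from the released facebook/w2v-bert-2.0 checkpoint,
# assuming the Hugging Face `transformers` Wav2Vec2BertForCTC class.
# A hypothetical 32-symbol vocabulary and dummy audio/labels stand in for real data.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

checkpoint = "facebook/w2v-bert-2.0"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2BertForCTC.from_pretrained(
    checkpoint,
    vocab_size=32,               # size of your tokenizer's vocabulary (placeholder)
    ctc_loss_reduction="mean",
    pad_token_id=0,
)
model.train()

audio = [np.random.randn(32_000).astype(np.float32)]   # ~2 s of dummy 16 kHz audio
inputs = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")
labels = torch.tensor([[5, 8, 12, 3, 7]])               # token ids from your tokenizer (placeholder)

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()  # CTC loss; step an optimizer (e.g. AdamW) as usual
```

Ideally the checkpoint loaded here would be the domain-adapted one produced by step 2, rather than the released checkpoint directly.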
Any guidance, scripts, or clarification would be greatly appreciated.
Thanks for maintaining such an impactful model and for open-sourcing it.