Add SOT multi-talker ASR with Whisper #6405

Open
cyhuang-tw wants to merge 3 commits into espnet:master from cyhuang-tw:sot-whisper-clean

Conversation

@cyhuang-tw
Contributor

What did you change?

Add Serialized Output Training (SOT) for multi-talker ASR using Whisper encoder/decoder with tiktoken tokenization.

Core pipeline (espnet2/):

  • SOTWhisperModel: extends ESPnetASRModel for SOT training
  • SOTWhisperPreprocessor: tiktoken-based tokenizer handling timestamps and speaker change tokens, with support for reusing existing BPE tokens as special tokens
  • SOTBeamSearch: beam search with probability-based timestamp forcing and speaker separator preservation
  • SOTConstraintScorer: enforces valid SOT output structure (timestamp pairing, non-decreasing order, separator handling)
  • sot_postprocess: repetition truncation for hallucination prevention
  • SOTASRTask: task registration with native Whisper encoder/decoder
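
As an illustration of the repetition truncation mentioned for sot_postprocess, here is a minimal sketch. It is not the code from this PR: the function name `truncate_repetition` and the parameters `max_repeats` and `max_ngram` are hypothetical, and the rule (cut at the first n-gram that repeats back-to-back more than `max_repeats` times) is only one plausible way to do this.

```python
from typing import List


def truncate_repetition(
    tokens: List[str], max_repeats: int = 3, max_ngram: int = 4
) -> List[str]:
    """Cut the sequence at the first n-gram that loops too many times.

    Hypothetical helper: keeps one copy of the looping n-gram and drops
    everything after it.
    """
    for i in range(len(tokens)):
        for n in range(1, max_ngram + 1):
            gram = tokens[i : i + n]
            repeats = 1
            j = i + n
            # Count how many times the n-gram repeats back-to-back.
            while tokens[j : j + n] == gram:
                repeats += 1
                j += n
            if repeats > max_repeats:
                return tokens[: i + n]
    return tokens
```

Scanning positions in the outer loop means the earliest loop in the output wins, which matches the intuition that everything after a hallucinated loop is unreliable.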

AMI SOT recipe (egs2/ami/sot_asr1/):

  • Pipeline following ESPnet asr.sh convention (stages 1/5/11/12/13)
  • Lhotse CutSet to Kaldi-format data preparation
  • meeteval-based utterance-group cpWER evaluation
  • Configs for Whisper-small and Whisper-tiny (testing)
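
For readers unfamiliar with cpWER (concatenated minimum-permutation WER), the sketch below shows the idea in plain Python. It does not call meeteval and is not the recipe's evaluate_sot.py; `edit_distance` and `cp_wer` are hypothetical names, and the brute-force search over permutations is only practical for a handful of speakers.

```python
from itertools import permutations
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref: Dict[str, str], hyp: Dict[str, str]) -> float:
    """Each dict maps a speaker label to that speaker's concatenated words."""
    ref_streams: List[List[str]] = [t.split() for t in ref.values()]
    hyp_streams: List[List[str]] = [t.split() for t in hyp.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[]] * (n - len(ref_streams))
    hyp_streams += [[]] * (n - len(hyp_streams))
    total_ref = sum(len(r) for r in ref_streams)
    if total_ref == 0:
        raise ValueError("reference contains no words")
    # Try every speaker assignment and keep the one with the fewest errors.
    best = min(
        sum(edit_distance(r, hyp_streams[k]) for r, k in zip(ref_streams, perm))
        for perm in permutations(range(n))
    )
    return best / total_ref
```

Because the hypothesis speaker labels are arbitrary, the metric scores the best label assignment rather than requiring the labels to match.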

Whisper encoder/decoder fixes:

  • Encoder: derive n_mels from model instead of importing removed N_MELS constant, fixing v3/turbo compatibility (128 vs 80 mel bins)
  • Decoder: handle positional embedding overflow for sequences exceeding 448 tokens

Unit tests: 22 tests covering model, preprocessor, task, and postprocessor.


Why did you make this change?

SOT (Serialized Output Training) enables multi-talker ASR by serializing multiple speakers' transcripts into a single output sequence with speaker change tokens. This approach allows standard encoder-decoder models like Whisper to handle multi-talker speech without requiring separate diarization.
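
A minimal sketch of the serialization step, assuming utterances are ordered by start time and that `<sc>` is the speaker change symbol. The `Utterance` class and `serialize_sot` function are hypothetical, and inserting the separator only when the speaker changes is one possible convention, not necessarily the one used in this recipe.

```python
from dataclasses import dataclass
from typing import List, Optional

SPEAKER_CHANGE = "<sc>"  # assumed separator symbol


@dataclass
class Utterance:
    speaker: str
    start: float  # start time in seconds
    text: str


def serialize_sot(utterances: List[Utterance], sep: str = SPEAKER_CHANGE) -> str:
    """Order utterances by start time and mark speaker changes with `sep`."""
    parts: List[str] = []
    prev_speaker: Optional[str] = None
    for utt in sorted(utterances, key=lambda u: u.start):
        if prev_speaker is not None and utt.speaker != prev_speaker:
            parts.append(sep)
        parts.append(utt.text)
        prev_speaker = utt.speaker
    return " ".join(parts)
```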


Is your PR small enough?

This PR adds 34 files with ~3,800 lines. While above the typical guideline, this is a new recipe with a new task type that includes core pipeline components, an AMI recipe, and comprehensive unit tests. The components are tightly coupled (the recipe depends on the task, model, preprocessor, inference, and scorer), making it impractical to split without creating circular dependencies between PRs.


Additional Context

Add Serialized Output Training (SOT) for multi-talker ASR using
native OpenAI Whisper encoder/decoder with tiktoken tokenization.

Core pipeline:
- SOTWhisperModel: extends ESPnetASRModel with min-CE loss over
  case variants for case-invariant training
- SOTWhisperPreprocessor: tiktoken-based tokenizer handling
  timestamps and speaker change tokens, with support for reusing
  existing BPE tokens as special tokens
- SOTBeamSearch: beam search with probability-based timestamp
  forcing and speaker separator preservation
- SOTConstraintScorer: enforces valid SOT output structure
  (timestamp pairing, non-decreasing order, separator handling)
- sot_postprocess: repetition truncation for hallucination prevention
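
The "min-CE loss over case variants" item can be read as: score the decoder output against each casing of the reference and train on whichever casing is cheapest. The sketch below illustrates that reading in plain Python. It is an assumption about the idea, not the PR's implementation; `cross_entropy` and `min_ce_over_variants` are hypothetical names, and each variant is assumed to come with logits from its own teacher-forced pass.

```python
import math
from typing import List, Sequence, Tuple

Logits = Sequence[Sequence[float]]  # one row of vocabulary scores per target step


def cross_entropy(logits: Logits, targets: Sequence[int]) -> float:
    """Mean token-level cross-entropy computed with a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def min_ce_over_variants(variants: List[Tuple[Logits, Sequence[int]]]) -> float:
    """Return the smallest loss among the (logits, targets) variants."""
    return min(cross_entropy(logits, targets) for logits, targets in variants)
```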

AMI SOT recipe (egs2/ami/sot_asr1):
- Pipeline following ESPnet asr.sh convention (stages 1/5/11/12/13)
- Lhotse CutSet to Kaldi-format data preparation
- meeteval-based utterance-group cpWER evaluation
- Configs for Whisper-small (production) and Whisper-tiny (testing)

Whisper encoder/decoder fixes:
- Encoder: derive n_mels from model instead of importing removed
  N_MELS constant, fixing v3/turbo compatibility
- Decoder: handle positional embedding overflow for long sequences

Unit tests: 22 tests covering model, preprocessor, task, and
postprocessor.
Copilot AI review requested due to automatic review settings April 1, 2026 23:53

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@dosubot (bot) added the labels size:XXL (This PR changes 1000+ lines, ignoring generated files.), ASR (Automatic speech recognition), ESPnet2, New Features, and Recipe on Apr 1, 2026
@cyhuang-tw changed the title from "Add SOT multi-talker ASR with native OpenAI Whisper" to "Add SOT multi-talker ASR with Whisper" on Apr 1, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a Serialized Output Training (SOT) multi-talker ASR recipe for the AMI dataset using native OpenAI Whisper components. Key additions include a custom SOTWhisperModel with uppercase min-CE loss, a tiktoken-based preprocessor, and a SOTConstraintScorer for structured decoding. The changes also refine the Whisper decoder to handle positional embeddings for longer sequences and update the encoder to dynamically determine Mel-bin counts. Feedback suggests improving the robustness of the inference script by avoiding generic exception handling and simplifying the positional embedding logic in the decoder for better idiomaticity.

Comment thread espnet2/asr/decoder/whisper_decoder.py Outdated
Comment on lines 179 to 184
+ max_pos = self.decoders.positional_embedding.size(0)
+ pos_len = min(tgt.size(1), max_pos)
  x = (
-     self.decoders.token_embedding(tgt)
-     + self.decoders.positional_embedding[: tgt.size(1)]
+     self.decoders.token_embedding(tgt[:, -pos_len:])
+     + self.decoders.positional_embedding[:pos_len]
  )
Contributor

high

The positional embedding slice logic is slightly complex. Using tgt.size(1) directly in the slice index is safer and more idiomatic in PyTorch when handling sequence lengths, as it avoids potential off-by-one errors or unnecessary min/max operations if the sequence length is guaranteed to be within bounds.

x = (
            self.decoders.token_embedding(tgt)
            + self.decoders.positional_embedding[: tgt.size(1)]
        )
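
The two versions in this thread behave differently only when the target is longer than the positional table (448 entries according to the PR description). The framework-free sketch below shows the length arithmetic of the clamped version, with a plain list standing in for the token tensor; `clamp_to_positions` is a hypothetical helper, not code from the PR.

```python
from typing import List, Tuple

MAX_POSITIONS = 448  # positional table size quoted in the PR description


def clamp_to_positions(
    tgt: List[int], max_pos: int = MAX_POSITIONS
) -> Tuple[List[int], int]:
    """Keep at most `max_pos` trailing tokens, mirroring tgt[:, -pos_len:]."""
    pos_len = min(len(tgt), max_pos)
    return tgt[-pos_len:], pos_len
```

Without the clamp, a slice such as `positional_embedding[: tgt.size(1)]` can return at most `max_pos` rows, so a longer target would presumably hit a shape mismatch when the two embeddings are added. The simpler form is therefore equivalent only if sequences are guaranteed to stay within the table.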

Comment thread espnet2/bin/sot_inference.py Outdated
Comment on lines +563 to +566
except Exception as e:
    logging.warning(f"Utterance {keys} failed: {e}")
    hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
    results = [(" ", ["<space>"], [2], hyp)] * nbest
Contributor

high

Catching a generic Exception and returning a dummy result can mask critical runtime errors (e.g., OOM, device errors) that should be allowed to propagate to ensure the pipeline fails fast.
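
One way to act on this comment is to catch only an error type that is known to be local to a single utterance and let everything else propagate. The sketch below assumes such a type exists; `RecoverableDecodeError` and `decode_or_placeholder` are hypothetical names, not part of ESPnet.

```python
import logging
from typing import Callable, List


class RecoverableDecodeError(Exception):
    """Hypothetical error for failures confined to one utterance."""


def decode_or_placeholder(decode: Callable[[], List[str]], key: str) -> List[str]:
    """Run `decode`; fall back to an empty hypothesis only for known failures."""
    try:
        return decode()
    except RecoverableDecodeError as err:
        logging.warning("Utterance %s failed: %s", key, err)
        return []
    # Anything else (MemoryError, RuntimeError, ...) propagates to the caller.
```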

Contributor

Copilot AI left a comment

Pull request overview

Adds Serialized Output Training (SOT) support for multi-talker ASR using native OpenAI Whisper components (encoder/decoder + tiktoken), plus an AMI recipe and related utilities/tests.

Changes:

  • Introduces SOT-specific core components: model, tiktoken preprocessor, constraint scorer, postprocessing, and inference entrypoints.
  • Extends task/model registration to support sot_whisper and adds CLI scripts for SOT training/inference.
  • Adds an egs2/ami/sot_asr1 recipe with data prep, decoding, scoring utilities, and configs; plus new unit tests.

Reviewed changes

Copilot reviewed 30 out of 34 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
test/espnet2/train/test_sot_preprocessor.py Adds unit coverage for SOT tiktoken preprocessor behaviors (prefix, timestamps, added tokens).
test/espnet2/tasks/test_sot_asr.py Adds smoke tests for the new SOT task CLI/parser behaviors.
test/espnet2/asr/test_sot_espnet_model.py Adds unit tests for SOT Whisper model initialization and forward/backward.
test/espnet2/asr/postprocess/test_sot_postprocess.py Adds tests for repetition-truncation postprocessing utilities.
espnet2/train/sot_preprocessor.py Implements tiktoken-based SOT text parsing/tokenization and token list generation.
espnet2/tasks/sot_asr.py Registers a SOT-specific ASR task choosing SOT model + preprocessor by default.
espnet2/tasks/asr.py Adds sot_whisper into the generic ASR model choice registry.
espnet2/bin/sot_train.py Adds a SOT training entrypoint wiring to SOTASRTask.
espnet2/bin/sot_inference.py Adds SOT decoding with constraints, timestamp-forcing beam search, and postprocessing.
espnet2/asr/sot_espnet_model.py Adds SOTWhisperModel with optional uppercase min-CE attention loss.
espnet2/asr/scorers/sot_constraint_scorer.py Adds a constraint scorer enforcing valid SOT output structure during decoding.
espnet2/asr/postprocess/sot_postprocess.py Adds hallucination mitigation (repetition truncation) and SOT output reconstruction.
espnet2/asr/encoder/whisper_encoder.py Fixes Whisper encoder mel-bin handling by deriving n_mels from the model.
espnet2/asr/decoder/whisper_decoder.py Adds a workaround for positional embedding overflow in long sequences.
egs2/ami/sot_asr1/utils Recipe linkage to template utils.
egs2/ami/sot_asr1/steps Recipe linkage to template steps.
egs2/ami/sot_asr1/pyscripts Recipe linkage to template pyscripts.
egs2/ami/sot_asr1/scripts/toy_pipeline_test.sh Adds an end-to-end toy pipeline script exercising token list, training, and decoding.
egs2/ami/sot_asr1/run_decode.py Adds a simple inference helper script bypassing the DataLoader.
egs2/ami/sot_asr1/run.sh Adds the AMI SOT recipe pipeline (prep/train/decode/score).
egs2/ami/sot_asr1/local/prepare_sot.py Implements Lhotse CutSet → Kaldi-format SOT data preparation.
egs2/ami/sot_asr1/local/generate_token_list.py Adds a CLI wrapper for generating token_list from tiktoken.
egs2/ami/sot_asr1/local/generate_config_yaml.py Adds a helper to generate an inference config.yaml without training.
egs2/ami/sot_asr1/local/evaluate_sot.py Adds meeteval-based cpWER scoring for SOT outputs.
egs2/ami/sot_asr1/local/added_tokens.txt Adds the recipe’s speaker-separator token list file.
egs2/ami/sot_asr1/conf/tuning/train_sot_tiny.yaml Adds a tiny Whisper SOT training config for quick tests.
egs2/ami/sot_asr1/conf/tuning/train_sot_small.yaml Adds a small Whisper SOT training config for recipe runs.
egs2/ami/sot_asr1/conf/tuning/decode_sot.yaml Adds decoding defaults for SOT inference.
egs2/ami/sot_asr1/conf/slurm.conf Adds recipe scheduler config.
egs2/ami/sot_asr1/conf/queue.conf Adds recipe scheduler config.
egs2/ami/sot_asr1/conf/pbs.conf Adds recipe scheduler config.
egs2/ami/sot_asr1/cmd.sh Adds recipe command backend selection.

Comment thread espnet2/asr/sot_espnet_model.py Outdated
sym_eos: str = "<|endoftext|>",
autocast_frontend: bool = False,
extract_feats_in_collect_stats: bool = True,
lang_token_id: int = -1,

Copilot AI Apr 1, 2026

With the default lang_token_id=-1, the condition self.lang_token_id is not None is always true, so the model prepends -1 to every target sequence. In PyTorch, -1 indexes the last embedding row, which silently corrupts training targets. Use a sentinel-aware check (e.g., self.lang_token_id != -1 / >= 0) or change the default to None so language-token prepending only happens when explicitly enabled.

Suggested change
- lang_token_id: int = -1,
+ lang_token_id: Optional[int] = None,
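
To make the sentinel issue concrete, here is a small sketch using plain lists instead of tensors. `maybe_prepend_lang_token` is a hypothetical helper, not the model code; it treats `None` as "disabled" and rejects negative IDs outright.

```python
from typing import List, Optional


def maybe_prepend_lang_token(ys: List[int], lang_token_id: Optional[int]) -> List[int]:
    """Prepend the language token only when one is explicitly configured."""
    if lang_token_id is None:
        return list(ys)
    if lang_token_id < 0:
        raise ValueError(f"lang_token_id must be non-negative, got {lang_token_id}")
    return [lang_token_id] + list(ys)


# The old default passes an `is not None` check, which is the reported bug.
old_default = -1
old_check_passes = old_default is not None
```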

Comment thread espnet2/asr/sot_espnet_model.py Outdated
Comment on lines +167 to +175
if hasattr(self, "lang_token_id") and self.lang_token_id is not None:
    ys_pad = torch.cat(
        [
            self.lang_token_id.repeat(ys_pad.size(0), 1).to(ys_pad.device),
            ys_pad,
        ],
        dim=1,
    )
    ys_pad_lens += 1

Copilot AI Apr 1, 2026

With the default lang_token_id=-1, the condition self.lang_token_id is not None is always true, so the model prepends -1 to every target sequence. In PyTorch, -1 indexes the last embedding row, which silently corrupts training targets. Use a sentinel-aware check (e.g., self.lang_token_id != -1 / >= 0) or change the default to None so language-token prepending only happens when explicitly enabled.

Comment thread espnet2/asr/sot_espnet_model.py Outdated
report_wer: bool = True,
sym_space: str = "<space>",
sym_blank: str = "<blank>",
transducer_multi_blank_durations: List = [],

Copilot AI Apr 1, 2026

This introduces a mutable default argument ([]), which can be shared across instances and lead to unexpected state leakage. Use None as the default and assign an empty list inside __init__ when needed.
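
A short illustration of the pitfall and the usual fix. The class names `BadConfig` and `GoodConfig` are hypothetical stand-ins for the model's constructor.

```python
from typing import List, Optional


class BadConfig:
    # The default list is created once, at function definition time,
    # and is shared by every instance that relies on the default.
    def __init__(self, durations: List[int] = []):
        self.durations = durations


class GoodConfig:
    # None marks "not provided"; each instance gets its own fresh list.
    def __init__(self, durations: Optional[List[int]] = None):
        self.durations = [] if durations is None else list(durations)
```

Appending to `BadConfig().durations` changes what later instances see, while `GoodConfig` instances stay independent.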

Comment on lines +129 to +135
last_was_timestamp = (
    len(current_block) >= 1 and current_block[-1] >= self.timestamp_begin
)
penultimate_was_timestamp = (
    len(current_block) < 2 or current_block[-2] >= self.timestamp_begin
)


Copilot AI Apr 1, 2026

The scorer treats any token ID >= timestamp_begin as a timestamp. If the separator token is allocated above the timestamp range (e.g., <sc> as a newly-added token at 51865+), it will be misclassified as a timestamp and break the pairing/non-decreasing constraints. Exclude custom special tokens (e.g., self._custom_special_above_ts) from timestamp detection and the timestamps list so separators do not participate in timestamp constraints.


# A4: Non-decreasing timestamps within current block
timestamps = [t for t in current_block if t >= self.timestamp_begin]
if timestamps:

Copilot AI Apr 1, 2026

The scorer treats any token ID >= timestamp_begin as a timestamp. If the separator token is allocated above the timestamp range (e.g., <sc> as a newly-added token at 51865+), it will be misclassified as a timestamp and break the pairing/non-decreasing constraints. Exclude custom special tokens (e.g., self._custom_special_above_ts) from timestamp detection and the timestamps list so separators do not participate in timestamp constraints.
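
A sketch of the exclusion this comment asks for, written as standalone functions rather than scorer methods. The names `is_timestamp` and `timestamps_non_decreasing` are hypothetical, and the token IDs in the usage note are illustrative values only.

```python
from typing import FrozenSet, List


def is_timestamp(
    token_id: int, timestamp_begin: int, custom_special_ids: FrozenSet[int]
) -> bool:
    """True for timestamp tokens, excluding custom specials allocated above them."""
    return token_id >= timestamp_begin and token_id not in custom_special_ids


def timestamps_non_decreasing(
    block: List[int], timestamp_begin: int, custom_special_ids: FrozenSet[int]
) -> bool:
    """Check the ordering constraint over real timestamp tokens only."""
    stamps = [t for t in block if is_timestamp(t, timestamp_begin, custom_special_ids)]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))
```

With `timestamp_begin=50364` and a separator at ID 51865, the block `[51865, 50364, 10, 50400]` passes the check once the separator is excluded, but fails if every ID at or above `timestamp_begin` is counted as a timestamp.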

Comment thread espnet2/train/sot_preprocessor.py Outdated
Comment on lines +229 to +231
# Append added tokens
for token in extra_tokens:
    lines.append(token)

Copilot AI Apr 1, 2026

generate_token_list() appends all extra_tokens unconditionally, even when an ‘added’ token already exists as a single BPE token in the base vocab (which this preprocessor explicitly supports reusing). This can create duplicate token strings in token_list and artificially enlarge len(token_list), potentially triggering unintended decoder embedding expansion and downstream ID mismatches. Align token-list generation with the runtime mapping: only append tokens that require new IDs; for reused single-token BPE specials, do not append (or otherwise ensure the token list’s indices remain consistent with the IDs actually used).
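
A minimal sketch of the deduplicating behaviour described here, assuming the base vocabulary is available as a list of token strings. `build_token_list` is a hypothetical stand-in for generate_token_list(), not its actual signature.

```python
from typing import Iterable, List


def build_token_list(base_vocab: List[str], extra_tokens: Iterable[str]) -> List[str]:
    """Append only those extra tokens that are not already in the vocabulary."""
    tokens = list(base_vocab)
    seen = set(tokens)
    for token in extra_tokens:
        if token not in seen:
            tokens.append(token)
            seen.add(token)
    return tokens
```

Reused tokens keep their existing index, so the list length grows only by the number of genuinely new tokens.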

Comment on lines +17 to +28
RECIPE_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/sot_asr1
ESPNET_ROOT=/work/nvme/bbjs/chuang14/espnet-owsm-dtai
export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"

# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate espnet-owsm

cd "${RECIPE_DIR}"

# Source data from dicow_asr1 (already pre-segmented)
DICOW_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/dicow_asr1

Copilot AI Apr 1, 2026

This script hard-codes absolute filesystem paths and a specific conda environment, making it unusable for other users and unsuitable for inclusion in a general recipe. Convert these to relative paths (derived from the script location / pwd) and/or configurable arguments/env vars, and avoid conda activate inside the script (document environment requirements instead).

Suggested change
- RECIPE_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/sot_asr1
- ESPNET_ROOT=/work/nvme/bbjs/chuang14/espnet-owsm-dtai
- export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"
- # Activate conda environment
- eval "$(conda shell.bash hook)"
- conda activate espnet-owsm
- cd "${RECIPE_DIR}"
- # Source data from dicow_asr1 (already pre-segmented)
- DICOW_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/dicow_asr1
+ # Determine script location and default recipe/repo roots.
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ RECIPE_DIR="${RECIPE_DIR:-$(cd "${SCRIPT_DIR}/.." && pwd)}"
+ ESPNET_ROOT="${ESPNET_ROOT:-$(cd "${RECIPE_DIR}/../.." && pwd)}"
+ export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"
+ # NOTE: This script assumes that a suitable Python/ESPnet environment
+ # (e.g., the "espnet-owsm" conda environment) is already activated
+ # before invoking this script. Environment activation is intentionally
+ # not performed here to keep the script portable.
+ cd "${RECIPE_DIR}"
+ # Source data from dicow_asr1 (already pre-segmented)
+ DICOW_DIR="${DICOW_DIR:-${ESPNET_ROOT}/egs2/ami/dicow_asr1}"

Comment on lines +17 to +28
RECIPE_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/sot_asr1
ESPNET_ROOT=/work/nvme/bbjs/chuang14/espnet-owsm-dtai
export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"

# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate espnet-owsm

cd "${RECIPE_DIR}"

# Source data from dicow_asr1 (already pre-segmented)
DICOW_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/dicow_asr1

Copilot AI Apr 1, 2026

This script hard-codes absolute filesystem paths and a specific conda environment, making it unusable for other users and unsuitable for inclusion in a general recipe. Convert these to relative paths (derived from the script location / pwd) and/or configurable arguments/env vars, and avoid conda activate inside the script (document environment requirements instead).

Suggested change
- RECIPE_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/sot_asr1
- ESPNET_ROOT=/work/nvme/bbjs/chuang14/espnet-owsm-dtai
- export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"
- # Activate conda environment
- eval "$(conda shell.bash hook)"
- conda activate espnet-owsm
- cd "${RECIPE_DIR}"
- # Source data from dicow_asr1 (already pre-segmented)
- DICOW_DIR=/work/nvme/bbjs/chuang14/espnet-owsm-dtai/egs2/ami/dicow_asr1
+ # Base directories: derive from this script location by default, but allow
+ # overriding via environment variables (RECIPE_DIR, ESPNET_ROOT, DICOW_DIR).
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-${0}}")" && pwd)"
+ RECIPE_DIR="${RECIPE_DIR:-$(cd "${SCRIPT_DIR}/.." && pwd)}"
+ ESPNET_ROOT="${ESPNET_ROOT:-$(cd "${RECIPE_DIR}/../../.." && pwd)}"
+ export PYTHONPATH="${ESPNET_ROOT}:${PYTHONPATH:-}"
+ # NOTE: This script assumes that an appropriate Python environment
+ # (e.g., the espnet-owsm conda environment) is already activated
+ # before running. Do not run `conda activate` inside this script.
+ cd "${RECIPE_DIR}"
+ # Source data from dicow_asr1 (already pre-segmented)
+ DICOW_DIR="${DICOW_DIR:-$(cd "${RECIPE_DIR}/.." && pwd)/dicow_asr1}"

Comment on lines +207 to +212
sys.path.insert(0, "/work/nvme/bbjs/chuang14/mtasr/TS-ASR-Whisper/src")
from txt_norm import get_text_norm

text_norm = get_text_norm(args.text_norm)
logger.info(f"Text normalizer: {args.text_norm}")


Copilot AI Apr 1, 2026

This introduces a hard-coded local path dependency outside the repository, which will fail for other environments. If text normalization is required, it should be implemented within the recipe/repo (or declared as an installable dependency), and the import path should be resolved via standard Python packaging rather than an absolute sys.path insertion.

Suggested change
- sys.path.insert(0, "/work/nvme/bbjs/chuang14/mtasr/TS-ASR-Whisper/src")
- from txt_norm import get_text_norm
- text_norm = get_text_norm(args.text_norm)
- logger.info(f"Text normalizer: {args.text_norm}")
+ try:
+     from txt_norm import get_text_norm
+ except ImportError as e:
+     logger.error(
+         "Requested text normalizer '%s' but the 'txt_norm' package "
+         "is not available: %s",
+         args.text_norm,
+         e,
+     )
+     logger.error(
+         "Install 'txt_norm' as a dependency or adjust PYTHONPATH, "
+         "or rerun with '--text_norm none' to disable normalization."
+     )
+ else:
+     text_norm = get_text_norm(args.text_norm)
+     logger.info(f"Text normalizer: {args.text_norm}")
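
An alternative shape for the same idea is to resolve the normalizer in one helper that either returns a callable or fails with a clear message, so no later code runs with `text_norm` undefined. In this sketch `load_text_normalizer` is a hypothetical helper; the module name `txt_norm`, its `get_text_norm` function, and the `none` option are taken from the snippet above.

```python
import importlib
from typing import Callable


def load_text_normalizer(name: str, module_name: str = "txt_norm") -> Callable[[str], str]:
    """Return a text normalizer, or an identity function when name is 'none'."""
    if name == "none":
        return lambda text: text
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # Fail early with an actionable message instead of continuing.
        raise RuntimeError(
            f"Text normalizer '{name}' needs the '{module_name}' package; "
            "install it or pass 'none' to disable normalization."
        ) from err
    return module.get_text_norm(name)
```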

Comment thread egs2/ami/sot_asr1/run.sh Outdated
Comment on lines +69 to +72
--*) name=$(echo "$1" | sed 's/^--//' | sed 's/-/_/g')
eval "${name}=\"$2\""
shift 2 ;;
*) break ;;

Copilot AI Apr 1, 2026

Using eval for option parsing allows command injection via option values (e.g., values containing command substitutions). Since this is a user-facing recipe script, prefer sourcing the standard utils/parse_options.sh (the repo already provides utils/) or implement parsing without eval.

Suggested change
- --*) name=$(echo "$1" | sed 's/^--//' | sed 's/-/_/g')
-     eval "${name}=\"$2\""
-     shift 2 ;;
- *) break ;;
+ --*)
+     name=$(echo "$1" | sed 's/^--//' | sed 's/-/_/g')
+     # Ensure the option name maps to a safe variable identifier
+     if ! [[ "${name}" =~ ^[A-Za-z0-9_]+$ ]]; then
+         log "ERROR: Invalid option name '${1}'"
+         exit 1
+     fi
+     # Ensure there is a value following the option
+     if [ $# -lt 2 ]; then
+         log "ERROR: Option '${1}' requires an argument"
+         exit 1
+     fi
+     # Safely assign the value to the variable without using eval
+     printf -v "${name}" '%s' "$2"
+     shift 2
+     ;;
+ *)
+     break ;;

@sw005320
Contributor

sw005320 commented Apr 2, 2026

@cyhuang-tw, can you tell me why you need such a large change from https://github.com/espnet/espnet/tree/master/egs2/librimix/sot_asr1?

@codecov

codecov Bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.16%. Comparing base (b08bffa) to head (426e7cc).
⚠️ Report is 1230 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6405       +/-   ##
===========================================
+ Coverage   56.24%   70.16%   +13.91%     
===========================================
  Files         897      787      -110     
  Lines       84919    73371    -11548     
===========================================
+ Hits        47763    51480     +3717     
+ Misses      37156    21891    -15265     
Flag                        Coverage Δ
test_integration_espnet2    46.77% <50.00%> (+0.87%) ⬆️
test_integration_espnetez   ?
test_python_espnet2         61.22% <100.00%> (+10.32%) ⬆️
test_python_espnet3         17.45% <50.00%> (?)
test_python_espnetez        ?
test_utils                  ?

@sw005320
Contributor

sw005320 commented Apr 3, 2026

@cyhuang-tw, could you respond to my question?

@cyhuang-tw
Contributor Author

> @cyhuang-tw, could you respond to my question?

I am currently working on reusing existing files to reduce redundancy and verify functionality. I have identified several files that can be removed. I will update the PR once I finish these changes.

@Fhrozen Fhrozen added this to the v.202607 milestone Apr 7, 2026
Add a Serialized Output Training (SOT) recipe for multi-talker ASR
on the AMI meeting corpus using Whisper encoder/decoder.

Recipe (egs2/ami/sot_asr1/):
- Follows the same asr.sh-based pattern as egs2/librimix/sot_asr1
- Data preparation and utterance-group cpWER evaluation via meeteval
- Whisper-small training config

Whisper timestamp support (espnet2/):
- Add predict_timestamps option to OpenAIWhisperTokenIDConverter to
  omit <|notimestamps|> from the decoder prefix, enabling Whisper
  SOT training with timestamp prediction
- Thread predict_timestamps through CommonPreprocessor_multi
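
To illustrate what the predict_timestamps option changes, the sketch below builds a Whisper-style decoder prefix as a list of token strings. It is not the OpenAIWhisperTokenIDConverter code: `build_decoder_prefix` is a hypothetical function, and the default language and task values are assumptions.

```python
from typing import List


def build_decoder_prefix(
    language: str = "en", task: str = "transcribe", predict_timestamps: bool = False
) -> List[str]:
    """Return the special tokens that start a decoder sequence."""
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not predict_timestamps:
        # Tells the decoder not to emit timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix
```

With `predict_timestamps=True` the prefix ends at the task token, which leaves the decoder free to emit timestamp tokens in the serialized output.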
@sw005320
Contributor

@cyhuang-tw, any update?

@cyhuang-tw
Contributor Author

Thank you for following up. I have significantly reduced the commit size by removing parts that can be replaced by existing functions. The commit is now mostly related to local recipe files. The only change I added to the core espnet2 module is a flag that controls whether timestamps are included in the serialized outputs.

@sw005320
Contributor

Please add results and upload models

@sw005320
Contributor

How about making a new PR?
The history can be twisted now.
Also, the PR description is not correct.

@cyhuang-tw
Contributor Author

> How about making a new PR? The history can be twisted now. Also, the PR description is not correct.

Thanks for the suggestion. I will make a new PR and include results with models.

@sw005320
Contributor

@cyhuang-tw, this is a reminder

Labels

size:XXL (This PR changes 1000+ lines, ignoring generated files.), ASR (Automatic speech recognition), ESPnet2, New Features, Recipe

4 participants