
[WIP]LTX modular + 1:1 match + improve agent debugging skills #13360

Open
yiyixuxu wants to merge 3 commits into main from ltx23-parity-fixes

Conversation

@yiyixuxu
Collaborator

No description provided.

yiyi@huggingface.co and others added 3 commits March 27, 2026 18:40
Seven fixes to achieve bit-identical output between the diffusers LTX-2.3
pipeline and the reference Lightricks/LTX-2 implementation in bf16 on GPU:

1. encode_video: use truncation (.astype) instead of .round() for float→uint8,
   matching the reference's .to(torch.uint8) behavior
2. Scheduler sigma computation: compute time_shift and stretch_shift_to_terminal
   in torch float32 instead of numpy float64 to match reference precision
3. Initial sigmas: use torch.linspace (float32) instead of np.linspace (float64)
   to produce bit-identical sigma schedules
4. CFG formula: use reference formula cond + (scale-1)*(cond-uncond) instead of
   uncond + scale*(cond-uncond) to match bf16 arithmetic order
5. Euler step: upcast model_output to sample dtype before multiplying by dt,
   avoiding bf16 precision loss from 0-dim tensor type promotion rules
6. x0→velocity division: use sigma.item() (Python float) instead of 0-dim tensor,
   matching reference's to_velocity which uses sigma.item() internally
7. RoPE: remove float32 upcast in apply_interleaved_rotary_emb and
   apply_split_rotary_emb, cast cos/sin to input dtype instead — reference
   computes RoPE in model dtype (bf16) without upcasting

Also updates RMSNorm to use torch.nn.functional.rms_norm for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g skill

Model fixes:
- Cross-attention timestep: always use the cross-modality sigma instead of
  gating on use_cross_timestep (matching the reference preprocessor, which
  always uses cross_modality.sigma)
- This was the root cause of the remaining 3.56 pixel diff — the diffusers
  model used timestep.flatten() (2304 per-token values) instead of
  audio_sigma.flatten() (1 scalar) for cross-attention modulation
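The shape mismatch described above can be sketched as follows (batch, sequence length, and sigma value are illustrative):

```python
import torch

# Hypothetical illustration of the root cause: cross-attention modulation
# must use the single cross-modality sigma, not 2304 per-token timesteps.
batch, seq = 1, 2304
timestep = torch.full((batch, seq), 0.5)  # per-token (wrong for cross-attn)
audio_sigma = torch.tensor([0.5])         # one scalar (reference behavior)

per_token = timestep.flatten()   # 2304 values
scalar = audio_sigma.flatten()   # 1 value
```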

Pipeline fixes:
- Per-token timestep shape (B,S) instead of (B,) for main time_embed
- f32 sigma for prompt_adaln (not bf16)
- Audio decoder: .squeeze(0).float() to match reference output format

Parity-testing skill updates:
- Add Phase 2 (optional GPU/bf16) with same capture-inject methodology
- Add 9 new pitfalls (#19-#27) from bf16 debugging
- Decode test now includes final output format (encode_video, audio)
- Add model interface mapping as required artifact from component tests
- Add test directory + lab_book setup questions
- Add example test script templates

Result: diffusers pipeline produces pixel-identical video (0.0 diff) and
bit-identical audio waveform vs reference pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the LTX-2.3 modular pipeline structure:
- modular_pipelines/ltx2/: encoders, modular_blocks, modular_pipeline
- Registration in __init__.py, auto_pipeline.py, modular_pipeline mapping
- Checkpoint utilities for parity testing
- Supports T2V with CFG guidance (pixel-identical to reference)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
