
[WIP]LTX modular + 1:1 match + improve agent debugging skills #13360

Open
yiyixuxu wants to merge 3 commits into main from ltx23-parity-fixes

Conversation

@yiyixuxu
Collaborator

No description provided.

yiyi@huggingface.co and others added 3 commits March 27, 2026 18:40
Seven fixes to achieve bit-identical output between the diffusers LTX-2.3
pipeline and the reference Lightricks/LTX-2 implementation in bf16 on GPU:

1. encode_video: use truncation (.astype) instead of .round() for float→uint8,
   matching the reference's .to(torch.uint8) behavior
2. Scheduler sigma computation: compute time_shift and stretch_shift_to_terminal
   in torch float32 instead of numpy float64 to match reference precision
3. Initial sigmas: use torch.linspace (float32) instead of np.linspace (float64)
   to produce bit-identical sigma schedules
4. CFG formula: use reference formula cond + (scale-1)*(cond-uncond) instead of
   uncond + scale*(cond-uncond) to match bf16 arithmetic order
5. Euler step: upcast model_output to sample dtype before multiplying by dt,
   avoiding bf16 precision loss from 0-dim tensor type promotion rules
6. x0→velocity division: use sigma.item() (Python float) instead of 0-dim tensor,
   matching reference's to_velocity which uses sigma.item() internally
7. RoPE: remove float32 upcast in apply_interleaved_rotary_emb and
   apply_split_rotary_emb, cast cos/sin to input dtype instead — reference
   computes RoPE in model dtype (bf16) without upcasting

Also updates RMSNorm to use torch.nn.functional.rms_norm for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g skill

Model fixes:
- Cross-attention timestep: always use the cross-modality sigma instead of
  gating on use_cross_timestep (matching the reference preprocessor, which
  always uses cross_modality.sigma)
- This was the root cause of the remaining 3.56 pixel diff — the diffusers
  model used timestep.flatten() (2304 per-token values) instead of
  audio_sigma.flatten() (1 scalar) for cross-attention modulation
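The shape mismatch described above can be sketched as follows (batch, sequence length, and sigma value are illustrative):

```python
import torch

# Hypothetical illustration of the root cause: cross-attention modulation
# must use the single cross-modality sigma, not 2304 per-token timesteps.
batch, seq = 1, 2304
timestep = torch.full((batch, seq), 0.5)  # per-token (wrong for cross-attn)
audio_sigma = torch.tensor([0.5])         # one scalar (reference behavior)

per_token = timestep.flatten()   # 2304 values
scalar = audio_sigma.flatten()   # 1 value
```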

Pipeline fixes:
- Per-token timestep shape (B,S) instead of (B,) for main time_embed
- f32 sigma for prompt_adaln (not bf16)
- Audio decoder: .squeeze(0).float() to match reference output format

Parity-testing skill updates:
- Add Phase 2 (optional GPU/bf16) with same capture-inject methodology
- Add 9 new pitfalls (#19-#27) from bf16 debugging
- Decode test now includes final output format (encode_video, audio)
- Add model interface mapping as required artifact from component tests
- Add test directory + lab_book setup questions
- Add example test script templates

Result: diffusers pipeline produces pixel-identical video (0.0 diff) and
bit-identical audio waveform vs reference pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the LTX-2.3 modular pipeline structure:
- modular_pipelines/ltx2/: encoders, modular_blocks, modular_pipeline
- Registration in __init__.py, auto_pipeline.py, modular_pipeline mapping
- Checkpoint utilities for parity testing
- Supports T2V with CFG guidance (pixel-identical to reference)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
