Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.
Sample first-last-frame-to-video (FLF2V) script:

```python
import torch

from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.utils import load_image

model_id = "Lightricks/LTX-2"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42
width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)

video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v.mp4",
)
```
Unfortunately the pipeline isn't quite working as of ed52c0d:

Official FLF2V sample: ltx2_flf2v_official.mp4
Current output: ltx2_cond_flf2v.mp4

Not sure why the video colors are messed up at the first and last frames (where the conditions are); will debug.
I think the color issue is now fixed: ltx2_cond_flf2v_fixed.mp4
The condition pipeline also works with the distilled checkpoint:

FLF2V distilled script:

```python
import torch

from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.utils import load_image

model_id = "rootonchair/LTX-2-19b-distilled"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42
width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

# Stage 1: generate low-resolution video and audio latents with the distilled sigmas.
video_latent, audio_latent = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

# Upsample the stage 1 video latents to 2x spatial resolution.
latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_id,
    subfolder="latent_upsampler",
    torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)

upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

# Stage 2: refine the upsampled latents at 2x resolution with the stage 2 distilled sigmas.
video, audio = pipe(
    conditions=conditions,
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width * 2,
    height=height * 2,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=3,
    # noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)

video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v_distilled.mp4",
)
```

Official FLF2V distilled sample: ltx2_flf2v_distilled_official.mp4
Current output: ltx2_cond_flf2v_distilled.mp4
Thanks for the work, @dg845! Looking at the outputs, is there a discrepancy in the resolutions? The main bird subject seems a bit compressed to me in our implementation.
sayakpaul left a comment
Thanks for this work! It looks very clean and the implementation also seems faithful to the original one. I left some comments, LMK if they're clear.
Additionally, I would like to see some extensions being used in the pipeline. For example, multiple conditions (multiple images) with different indices. Would it be possible?
```python
conditions,
image,
video,
cond_index,
strength,
```
Should we only accept condition as an input to simplify logic? If so, I think check_inputs() would then only accept condition (which could be a single item or a list of conditions). WDYT?
The current logic follows LTXConditionPipeline in also accepting image and video arguments (and therefore also needing cond_index and strength arguments). However, I agree that only accepting conditions would probably be better because it is less ambiguous. (One reservation I have is that you would have to import LTX2VideoCondition every time, but maybe that's not a big deal.)
> (One reservation I have is that you would have to import LTX2VideoCondition every time, but maybe that's not a big deal.)
That's just a one-time import, no? Then one could create the conditions either as a list of LTX2VideoCondition or a single LTX2VideoCondition (in the case of a single condition).
If so, I think that's fine?
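For illustration, a conditions-only call with multiple image conditions at different indices might look like the sketch below (middle_image, the index 8, and the 0.8 strength are hypothetical; everything else follows the FLF2V script above):

```python
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition

# Hypothetical sketch: three image conditions placed at different latent frame indices.
conditions = [
    LTX2VideoCondition(frames=first_image, index=0, strength=1.0),   # first frame
    LTX2VideoCondition(frames=middle_image, index=8, strength=0.8),  # an intermediate frame (assumed index)
    LTX2VideoCondition(frames=last_image, index=-1, strength=1.0),   # last frame
]

video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)
```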
```python
num_frames = (num_frames - 1) // scale_factor * scale_factor + 1
return num_frames


def latent_idx_from_index(self, frame_idx: int, index_type: str = "latent") -> int:
```
It's currently just a single type. I guess we can just do it inside the caller instead of having a separate function?
I was also thinking of supporting a "data" index_type, where the index is interpreted in data (pixel) space rather than latent space, as LTXConditionPipeline appears to support, but I don't quite understand the frame_index logic in LTXConditionPipeline yet. My current understanding is that the original LTX-2 code only supports latent indices (but I might be mistaken).
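For reference, if a "data" index_type were added, the conversion would presumably be a small helper like the sketch below, assuming LTX-style temporal packing where the first pixel frame gets its own latent frame and each later latent frame covers the VAE temporal compression ratio (assumed to be 8 here) of pixel frames. This is a hypothetical helper, not the actual implementation:

```python
def latent_idx_from_data_idx(frame_idx: int, temporal_ratio: int = 8) -> int:
    # Hypothetical helper: map a pixel-space (data) frame index to a latent frame index,
    # mirroring the assumed latent frame count formula (num_frames - 1) // temporal_ratio + 1.
    if frame_idx == 0:
        return 0
    return (frame_idx - 1) // temporal_ratio + 1
```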
> but I don't quite understand the frame_index logic in LTXConditionPipeline yet. My current understanding is that the original LTX-2 code only supports latent indices (but I might be mistaken).
If so, let's keep it inline then?
```python
negative_prompt_embeds: Optional[torch.Tensor] = None,
negative_prompt_attention_mask: Optional[torch.Tensor] = None,
decode_timestep: Union[float, List[float]] = 0.0,
decode_noise_scale: Optional[Union[float, List[float]]] = None,
```
A bit unrelated: I haven't seen decode_noise_scale being used in LTX-2. If that's indeed the case, WDYT of cleaning it out of the LTX-2 pipelines?
My understanding is that decode_noise_scale is used if the video VAE supports timestep conditioning:
(See diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2_condition.py, lines 1507 to 1516 at 70dff16.)
The LTX-2 VAE currently uses timestep_conditioning=False. It's unclear to me whether the LTX-2 code intends to support it, as the video decoder model still accepts a timestep_conditioning argument, but the VAE decoding code doesn't support the timestep argument that would be necessary if timestep_conditioning=True.
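For context, the timestep-conditioned decode path in the existing LTX pipelines looks roughly like the sketch below (paraphrased, assuming scalar decode_timestep / decode_noise_scale for brevity; not the exact LTX-2 code):

```python
if not self.vae.config.timestep_conditioning:
    timestep = None
else:
    # Mix fresh noise into the latents with weight decode_noise_scale and pass the
    # matching decode timestep to the VAE decoder.
    noise = torch.randn(latents.shape, generator=generator, device=device, dtype=latents.dtype)
    if decode_noise_scale is None:
        decode_noise_scale = decode_timestep
    timestep = torch.tensor([decode_timestep], device=device, dtype=latents.dtype)
    scale = torch.tensor([decode_noise_scale], device=device, dtype=latents.dtype)[:, None, None, None, None]
    latents = (1 - scale) * latents + scale * noise

video = self.vae.decode(latents, timestep, return_dict=False)[0]
```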
Yes, I am on the same page. Do you think it could make sense to remove this logic in a separate PR then? It will simplify the code a bit.
```python
# Convert the noise_pred_video velocity model prediction into a sample (x0) prediction
denoised_sample = latents - noise_pred_video * sigma
# Apply the (packed) conditioning mask to the denoised (x0) sample, which will blend the conditions
# with the denoised sample according to the conditioning strength (a strength of 1.0 means we fully
```
> a strength of 1.0 means we fully
(nit): However, from the code, it's not clear how this strength is incorporated. Consider expanding on the comment a little bit.
The conditioning strengths are used here (see diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2_condition.py, lines 866 to 869 at 70dff16).
Perhaps it would be more clear to say that the conditioning_mask itself specifies the strength with which the conditions are applied?
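To illustrate the idea (a sketch only, not the exact pipeline code; condition_latents is a hypothetical name for the packed conditioning tokens):

```python
# Illustrative sketch: conditioning_mask holds per-token strengths in [0, 1]
# (1.0 keeps the condition tokens exactly, 0.0 keeps the denoised sample).
denoised_sample = latents - noise_pred_video * sigma
denoised_sample = conditioning_mask * condition_latents + (1.0 - conditioning_mask) * denoised_sample
```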
The FLF2V script in #13058 (comment) gives an example using multiple conditions that's not possible in
I believe the discrepancy is because the original LTX-2 center crops images, but
Should we try to implement this then? Perhaps club it inside #13084. I think that the results are better with a center crop. WDYT?
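For reference, a user-side workaround is to center crop the conditioning images to the target aspect ratio before building the conditions. The helper below is a hypothetical sketch and not part of this PR:

```python
from PIL import Image

from diffusers.utils import load_image


def center_crop_to_aspect(image: Image.Image, target_width: int, target_height: int) -> Image.Image:
    # Hypothetical helper: crop the largest centered region with the target aspect
    # ratio, then resize to the target resolution.
    width, height = image.size
    target_ratio = target_width / target_height
    if width / height > target_ratio:
        new_width = int(round(height * target_ratio))
        left = (width - new_width) // 2
        image = image.crop((left, 0, left + new_width, height))
    else:
        new_height = int(round(width / target_ratio))
        top = (height - new_height) // 2
        image = image.crop((0, top, width, top + new_height))
    return image.resize((target_width, target_height), Image.LANCZOS)


first_image = center_crop_to_aspect(
    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"),
    768,
    512,
)
```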
What does this PR do?
This PR adds LTX2ConditionPipeline, a pipeline which supports visual conditioning at arbitrary frames for the LTX-2 model (paper, code, weights), following the original code. This is an analogue of LTXConditionPipeline for LTX-2, as both the original LTX models and LTX-2 support a similar conditioning scheme.

Fixes #12926
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul
@yiyixuxu