[Feat] add I2VGenXL for image-to-video generation #6665
Conversation
|
@yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?
Which comment is being referred to here?
|
Still a couple of failing tests here.
|
@DN6 could you look into it once?
Sure!
|
Could I please see the link to the comment that is being referred to here? I am really unable to make sense of what’s meant by embedding cleanup.
@sayakpaul This comment. But I addressed it, and the tests should also be fixed.
|
Cool, thanks! I think we still need to update the pipeline IDs here as well as the model card. We can then merge!
if self.norm_type == "ada_norm_zero":
    ff_output = gate_mlp.unsqueeze(1) * ff_output
- elif self.use_ada_layer_norm_single:
+ elif self.norm_type == "ada_norm_single":
|
Great job here! Much cleaner now :-)
    layers_per_block: int = 2,
    norm_num_groups: Optional[int] = 32,
    cross_attention_dim: int = 1024,
    num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
):
    super().__init__()
    self.sample_size = sample_size
|
Should not be needed ideally, because one can access it with self.config.sample_size, but ok!
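(For context, a minimal sketch of why the attribute is redundant, using a toy class rather than the actual I2VGenXLUNet: arguments captured by diffusers' @register_to_config end up on self.config.)

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin

class TinyModel(ModelMixin, ConfigMixin):
    # Toy class for illustration only; not the actual I2VGenXLUNet.
    @register_to_config
    def __init__(self, sample_size: int = 32):
        super().__init__()

model = TinyModel(sample_size=64)
print(model.config.sample_size)  # 64 -- no separate self.sample_size attribute is needed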
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,
|
attention_head_dim=num_attention_heads looks a bit weird - I think the class TransformerTemporalModel has bad naming here, no?
|
This came as a consequence of:
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
See #6665 (comment) for more details.
|
Ohhh, I saw where this is coming from. I think we should only keep the attention_head_dim argument for I2VGenXLUNet, and then we can do:
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=attention_head_dim,
    in_channels=block_out_channels[0],
    num_layers=1,
    norm_num_groups=norm_num_groups,
)
|
TransformerTemporalModel does not have a bad naming issue, and this is a new model, so we definitely do not want the bad naming here (forcing the user to pass what is actually attention_head_dim as num_attention_heads is very much a bad naming practice).
|
I think we should keep both, i.e., attention_head_dim and num_attention_heads, because if we only stick with attention_head_dim, we will end up with something like:
down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=attention_head_dim[i],
    downsample_padding=1,
    dual_cross_attention=False,
)
See how num_attention_heads is being assigned here. This again looks pretty bad naming-wise.
|
See my comments here: https://github.com/huggingface/diffusers/pull/6665/files#r1477715567
I think we can still do:
down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=num_attention_heads[i],
    downsample_padding=1,
    dual_cross_attention=False,
)
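(To make the num_attention_heads[i] indexing above concrete, here is a minimal standalone sketch of the broadcast step diffusers UNets typically do before the down-block loop; the helper name is made up and this is not the merged code.)

from typing import Tuple, Union

def broadcast_heads(num_attention_heads: Union[int, Tuple[int, ...]], num_down_blocks: int) -> Tuple[int, ...]:
    # Hypothetical helper: expand a single int to one value per down block so
    # that num_attention_heads[i] is valid inside the block-construction loop.
    if isinstance(num_attention_heads, int):
        return (num_attention_heads,) * num_down_blocks
    if len(num_attention_heads) != num_down_blocks:
        raise ValueError("num_attention_heads must have one entry per down block.")
    return tuple(num_attention_heads)

# broadcast_heads(64, 4) -> (64, 64, 64, 64)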
logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
def _to_tensor(inputs, device):
|
Let's please not forget to clean this up after we merge (cc @sayakpaul @yiyixuxu).
This function is only used once - no need to move it into a function
    layers_per_block: int = 2,
    norm_num_groups: Optional[int] = 32,
    cross_attention_dim: int = 1024,
    num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
|
this should be attention_head_dim, no? @DN6 @sayakpaul
|
I don't think so. I am unable to find the comment from @patrickvonplaten, but look at how it's handled in UNet3D:
We can basically get rid of attention_head_dim and use num_attention_heads throughout to rectify the incorrect naming.
Edit: Found the comments.
|
This PR breaks a lot of existing pipelines, as it redefines norm_types from what is very commonly used (and was introduced a long time ago).
|
I welcome you to create an issue thread with a reproducible code snippet. We will fix it ASAP and, if needed, do a patch release :-)
|
I just did; I wanted to make a note here first (it took me a bit to trace the error down).
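(To illustrate the kind of downstream code affected: the diff at the top of this thread replaces the boolean use_ada_layer_norm_single attribute with a norm_type string check, so third-party code that inspected the old flags needs a compatibility guard along these lines. This is a hypothetical sketch, not a diffusers API, and not necessarily the exact breakage reported above.)

def uses_single_ada_norm(block) -> bool:
    # Hypothetical downstream helper: support both the old boolean flag and the
    # new norm_type string convention introduced by this refactor.
    if hasattr(block, "use_ada_layer_norm_single"):
        return bool(block.use_ada_layer_norm_single)
    return getattr(block, "norm_type", None) == "ada_norm_single"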
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=num_attention_heads[i],
|
So I think there is a mistake here:
get_down_block() expects num_attention_heads to be the correct num_attention_heads.
Look at how num_attention_heads is handled inside UNet3DConditionModel:
First, it fixes the "bad" model config name as soon as it receives the argument by doing num_attention_heads = attention_head_dim here, so at this point the variable num_attention_heads is "corrected" for the rest of the code.
From there, we can just pass it around as it is, e.g. it is passed as num_attention_heads in get_down_block().
In our case, we:
- also have the "bad" model config, i.e. for this model num_attention_heads is actually attention_head_dim
- however, we only "correct" it for TransformerTemporalModel here. So for the rest of the code, num_attention_heads still means attention_head_dim - and that's incorrect, e.g. we pass "attention_head_dim" as num_attention_heads to get_down_block()
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,
    in_channels=block_out_channels[0],
    num_layers=1,
    norm_num_groups=norm_num_groups,
)
|
I am quite confused now.
> First, it fixes the "bad" model config name as soon as it receives the argument by doing num_attention_heads = attention_head_dim here, so at this point the variable num_attention_heads is "corrected" for the rest of the code.

This is what we didn't want to do in this UNet, i.e., the comments in #6665 (comment) suggested that there shouldn't be any attention_head_dim. We want to get rid of those corrections made in UNet3DConditionModel for num_attention_heads.
Do you mean we initialize another new variable called attention_head_dim based on what's passed to num_attention_heads?
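(A minimal sketch of the "correct the name once at the top of __init__" handling being described, mirroring the UNet3DConditionModel pattern; the helper and argument names are assumptions, not the merged I2VGenXLUNet code.)

from typing import Optional, Tuple, Union

def resolve_num_attention_heads(
    attention_head_dim: Union[int, Tuple[int, ...]],
    num_attention_heads: Optional[Union[int, Tuple[int, ...]]] = None,
) -> Union[int, Tuple[int, ...]]:
    # Hypothetical helper: the checkpoint config key attention_head_dim historically
    # stores the number of heads, so if num_attention_heads is not given explicitly,
    # reuse that value under the correct name and use it for the rest of __init__.
    return num_attention_heads if num_attention_heads is not None else attention_head_dim

# e.g. num_attention_heads = resolve_num_attention_heads(attention_head_dim=64)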
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
What does this PR do?
Fixes: #6186.
Test code
For memory optimization, use enable_model_cpu_offload().
TODO
- einops dependency
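(Since the test snippet itself isn't reproduced here, a rough usage sketch with CPU offload enabled; the checkpoint ID, input image URL, prompt, and output handling are assumptions, not the PR's test code.)

import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

# Checkpoint ID assumed; the thread above notes the pipeline IDs were still being finalized.
pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()  # memory optimization mentioned above

image = load_image("https://example.com/input.png")  # placeholder input image
result = pipeline(
    prompt="a red panda eating bamboo",
    image=image,
    num_inference_steps=50,
    guidance_scale=9.0,
    generator=torch.Generator("cpu").manual_seed(0),
)
export_to_gif(result.frames[0], "i2vgen_xl_sample.gif")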