fix: filter weight decay for LayerNorm, biases, and special tokens by Luodian · Pull Request #66 · EvolvingLMMs-Lab/OneVision-Encoder

Luodian · 2026-01-10T08:01:52Z

Summary

Implement proper weight decay filtering following MAE/DeiT/OpenCLIP best practices
Exclude 1D parameters and special tokens from weight decay

Problem

The current training applies weight_decay=0.05 to ALL parameters, including:

1D parameters: LayerNorm weights, RMSNorm weights, all biases
Special tokens: learned probe, pos_embed, cls_token, etc.

This is suboptimal because:

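LayerNorm/RMSNorm scale parameters control feature scaling; decaying them pulls the scales toward zero and limits the model's ability to rescale features
Biases are few in number and add little overfitting risk, so weight decay on them brings no regularization benefit
Learned tokens like probe (shape 1,1,C) act as learned inputs rather than weight matrices and shouldn't be shrunk toward zero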
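
As a rough illustration of the shrinkage concern: under decoupled weight decay (as in AdamW), each step multiplies a parameter by (1 - lr * weight_decay) before the gradient update is applied. The sketch below isolates that factor for a parameter whose gradient is zero. Only weight_decay=0.05 comes from this PR; the learning rate, the step count, and the helper name `decay_only_factor` are assumptions chosen for illustration.

```python
import math


def decay_only_factor(lr: float, weight_decay: float, steps: int) -> float:
    """Scale factor applied by decoupled weight decay alone (zero gradient, constant lr)."""
    return (1.0 - lr * weight_decay) ** steps


# weight_decay=0.05 is the value quoted in the PR; lr and steps are illustrative assumptions.
factor = decay_only_factor(lr=1e-3, weight_decay=0.05, steps=10_000)
print(f"a gain initialized to 1.0 would be scaled by about {factor:.2f}")
print(f"for comparison, exp(-0.5) = {math.exp(-0.5):.2f}")
```

Under these assumed values the factor is close to exp(-0.5), roughly 0.61. In real training the gradient pushes back, so this shows the strength of the pull rather than where a LayerNorm gain would actually end up.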

Solution

Add build_adamw_param_groups() function that:

Shape-based rule: Exclude param.ndim < 2 (catches all 1D params)
Name-based rule: Exclude .bias suffix
Token-based rule: Exclude probe, pos_embed, cls_token, mask_token, query_tokens, latents
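
The three rules above can be sketched as follows. This is a minimal sketch, not the code from this PR: the signature, the substring matching for token names, and the handling of frozen parameters are assumptions. It is written against anything shaped like `model.named_parameters()`, i.e. (name, tensor) pairs where the tensor exposes `.ndim` and `.requires_grad`.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Special-token name fragments, taken from the list in the PR description.
NO_DECAY_KEYWORDS = (
    "probe", "pos_embed", "cls_token", "mask_token", "query_tokens", "latents",
)


def build_adamw_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    weight_decay: float,
) -> List[Dict[str, Any]]:
    """Split parameters into a decayed group and a non-decayed group.

    Assumed signature; the actual helper in training/train.py may differ.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters are left out of both groups
        if (
            param.ndim < 2                                   # shape-based rule
            or name.endswith(".bias")                        # name-based rule
            or any(key in name for key in NO_DECAY_KEYWORDS)  # token-based rule
        ):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With PyTorch, the returned list would be passed as the first argument to `torch.optim.AdamW`, where each group's `weight_decay` overrides the optimizer-level default. Note that substring matching is a simplification: a name such as `latents_proj.weight` would also match `latents`, so a stricter match may be preferable in practice.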

Evidence from Major Repos

Repository	Pattern
MAE	`p.ndim == 1` + `no_weight_decay_list`
DeiT	timm-style `no_weight_decay()`
OpenCLIP	Two-group AdamW, WD=0 for `p.ndim < 2`
SigLIP	WD only on `kernel` (weight matrices)

Files Changed

training/train.py - Add helper function and modify optimizer setup

Edge Cases Handled

probe in pooling head: shape (1, 1, C) with ndim=3, so the shape rule misses it; explicitly excluded by name
Conv2d bias: patch embedding uses bias=False, so there is no bias parameter to handle
MultiheadAttention biases: 1D, caught by ndim < 2 rule
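
To make the edge cases concrete, the sketch below reports which rule fires for a few parameters, using only names and shapes. The helper `exclusion_rule`, the parameter names, and the width C=768 are hypothetical; `in_proj_bias` is the name PyTorch's `nn.MultiheadAttention` gives its input projection bias, which does not end in `.bias` and is therefore caught only by the shape rule.

```python
from typing import Optional, Tuple

SPECIAL_TOKENS = (
    "probe", "pos_embed", "cls_token", "mask_token", "query_tokens", "latents",
)


def exclusion_rule(name: str, shape: Tuple[int, ...]) -> Optional[str]:
    """Return the first rule that excludes a parameter from decay, or None if it is decayed."""
    if len(shape) < 2:
        return "shape"
    if name.endswith(".bias"):
        return "name"
    if any(token in name for token in SPECIAL_TOKENS):
        return "token"
    return None


C = 768  # hypothetical embedding width
cases = {
    "head.probe": (1, 1, C),                     # ndim=3: only the token rule applies
    "attn.in_proj_bias": (3 * C,),               # 1D: caught by the shape rule
    "patch_embed.proj.weight": (C, 3, 16, 16),   # Conv2d kernel: still decayed
}
for name, shape in cases.items():
    print(f"{name}: {exclusion_rule(name, shape)}")
```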

Problem:
- Weight decay was applied to ALL parameters including:
  - 1D parameters (LayerNorm/RMSNorm weights, biases)
  - Special learned tokens (probe, pos_embed, cls_token)
- This is suboptimal as these parameters don't benefit from shrinkage

Solution:
- Add build_adamw_param_groups() function following MAE/DeiT/OpenCLIP patterns
- Exclude from weight decay:
  - Parameters with ndim < 2 (catches all 1D params like norm weights, biases)
  - Parameters ending in '.bias'
  - Special tokens: probe, pos_embed, cls_token, mask_token, query_tokens, latents
- Apply per-group weight decay instead of global optimizer-level decay

References:
- MAE: facebookresearch/mae/util/lr_decay.py
- OpenCLIP: excludes 'p.ndim < 2' and 'bn/ln/bias' from decay
- DeiT: timm-style no_weight_decay() returning {pos_embed, cls_token}

anxiangsir force-pushed the main branch 2 times, most recently from 3b0d868 to f7e7781 on January 28, 2026 18:46

Luodian wants to merge 1 commit into main from fix/weight-decay-filtering
