forked from karpathy/nanoGPT
Open
Description
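In the forward pass at inference time, why is the output multiplier (1/N) not applied as it is at training time? The training-time branch scales x by mup_output_alpha / mup_width_multiplier, but the inference-time branch appears to call lm_head without that factor.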
Inference time:
Line 225 in b2a5e60:
    logits = self.lm_head(x[:, [-1], :])  # note: using list [-1] to preserve the time dim
Training time:
Line 219 in b2a5e60:
    x *= self.config.mup_output_alpha / self.config.mup_width_multiplier
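To make the concern concrete, here is a minimal pure-Python sketch of what the missing factor would do. It is not the repository's code: the toy `lm_head`, the weight matrix, the hidden state, and the values chosen for `mup_output_alpha` and `mup_width_multiplier` are all hypothetical. It only assumes that the head is a bias-free linear layer, so scaling its input scales the logits by the same factor.

```python
import math


def lm_head(x, weight):
    """Toy bias-free linear head: one logit per row of `weight` (hypothetical stand-in)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical config values, chosen only for illustration.
mup_output_alpha = 1.0
mup_width_multiplier = 4.0
scale = mup_output_alpha / mup_width_multiplier

# Hypothetical final hidden state of the last position and head weights.
x_last = [0.5, -1.0, 2.0]
weight = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [2.0, 1.0, 0.5],
]

# Inference path as quoted in the issue: no multiplier before the head.
logits_unscaled = lm_head(x_last, weight)

# Training path as quoted in the issue: multiplier applied before the head.
logits_scaled = lm_head([scale * xi for xi in x_last], weight)

probs_unscaled = softmax(logits_unscaled)
probs_scaled = softmax(logits_scaled)

print("same greedy token:", argmax(logits_unscaled) == argmax(logits_scaled))
print("probs without multiplier:", probs_unscaled)
print("probs with multiplier:   ", probs_scaled)
```

Under these assumptions the greedy (argmax) token is the same on both paths, because a positive scale does not reorder the logits. The softmax distribution does change, though, so sampling without the multiplier behaves like sampling at a different temperature than the one the model was trained with.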