Layer Normalization
Post-LN Transformer
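The original Transformer architecture, in which layer normalization is applied after the residual connection around each attention and feed-forward sublayer, i.e. x = LayerNorm(x + Sublayer(x)). Training this arrangement typically requires more careful learning-rate tuning, often with a warmup schedule, than the Pre-LN variant.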
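The ordering described above can be illustrated with a minimal sketch in plain Python. It assumes a single token vector, a layer norm without learned gain or bias, and two hypothetical stand-in functions (`toy_attention`, `toy_feed_forward`) in place of real attention and feed-forward sublayers. It shows only where the normalization sits, not a full Transformer layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.
    Learned gain and bias are omitted to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_sublayer(x, sublayer):
    """Post-LN ordering: add the residual first, then normalize."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def post_ln_block(x, attention, feed_forward):
    """One block: the attention sublayer, then the feed-forward sublayer,
    each followed by the residual add and layer normalization."""
    x = post_ln_sublayer(x, attention)
    return post_ln_sublayer(x, feed_forward)

# Hypothetical stand-ins for the real sublayers (assumptions, not real attention).
def toy_attention(x):
    return [0.5 * v for v in x]

def toy_feed_forward(x):
    return [max(0.0, v) for v in x]

y = post_ln_block([1.0, 2.0, 3.0, 4.0], toy_attention, toy_feed_forward)
```

Because normalization is the last operation in each sublayer, the block's output is always normalized. A Pre-LN block would instead compute `x + sublayer(layer_norm(x))`, leaving the residual path unnormalized.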