AI Glossary
The Complete Dictionary of Artificial Intelligence
Position-wise Feed-Forward Network
Neural network in the Transformer architecture that is applied independently to each position of the sequence, performing a nonlinear transformation after the attention mechanism.
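A minimal PyTorch sketch (class name and dimensions are illustrative, not from the source): the same two linear layers are applied to every position of a (batch, seq_len, d_model) tensor.

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers applied identically at every sequence position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.fc2 = nn.Linear(d_ff, d_model)   # contract: d_ff -> d_model
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so every position is transformed with the same weights.
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(2, 16, 512)        # (batch, seq_len, d_model)
print(PositionwiseFFN()(x).shape)  # torch.Size([2, 16, 512])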
GELU Activation
Gaussian Error Linear Unit, an activation function used in Transformer FFNs; it weights each input by the standard Gaussian CDF, giving a smooth, deterministic nonlinearity originally motivated by combining ideas from dropout and ReLU.
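For reference, the exact definition and its common tanh approximation: GELU(x) = x · Φ(x) = (x/2)·[1 + erf(x/√2)] ≈ (x/2)·[1 + tanh(√(2/π)·(x + 0.044715·x³))].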
Two-layer MLP
Standard multilayer architecture of FFNs in Transformers consisting of two linear transformations with a nonlinear activation function between them.
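In the notation of the original Transformer paper, with ReLU as the activation: FFN(x) = max(0, x·W1 + b1)·W2 + b2, where W1 maps d_model → d_ff and W2 maps d_ff → d_model.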
Hidden Dimension Expansion
Dimensionality increase in the first layer of the FFN (typically 4x the model dimension) before reduction in the second layer, giving the network greater expressive capacity.
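Worked example with the original Transformer base configuration: d_model = 512 and d_ff = 4 × 512 = 2048, so W1 holds 512 × 2048 ≈ 1.05M parameters and W2 another 2048 × 512 ≈ 1.05M, roughly 2.1M parameters per FFN before biases.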
Feed-Forward Dimension
Intermediate dimension of the FFN in Transformers, typically four times the model dimension, used to increase representation capacity.
Position-independent Processing
Fundamental feature of FFNs: the same weights are applied to all positions, unlike the attention mechanism, which mixes information across positions.
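Because the same weights are reused at every position, applying the FFN and then permuting the sequence gives the same result as permuting first; a small check reusing the PositionwiseFFN sketch above (names are illustrative):

import torch

ffn = PositionwiseFFN()                 # sketch from the entry above
x = torch.randn(1, 8, 512)
perm = torch.randperm(8)
print(torch.allclose(ffn(x)[:, perm], ffn(x[:, perm]), atol=1e-6))  # True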
Swish Activation
Alternative activation function to GELU in FFNs, defined as x * sigmoid(βx); smooth and non-monotonic, it offers performance comparable to GELU and, unlike ReLU, is differentiable everywhere.
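Definition: Swish_β(x) = x · σ(βx), where σ is the logistic sigmoid; with β = 1 this is the SiLU activation, available in PyTorch as torch.nn.SiLU.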
GLU Variants
Gated Linear Units and their variants (GeGLU, SwiGLU), used as alternatives to the standard FFN; they introduce a gating mechanism that selectively controls information flow.
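A SwiGLU sketch in the style used by models such as LLaMA (names and the reduced d_ff are illustrative; the third weight matrix is why d_ff is often shrunk to keep the parameter count comparable):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: (SiLU(x W_gate) * (x W_up)) W_down."""
    def __init__(self, d_model: int = 512, d_ff: int = 1376):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gate modulates the "up" projection elementwise before projecting back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))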
Feed-Forward Sublayer
Component of the Transformer block that contains the FFN, together with its residual connection and layer normalization, which stabilize training.
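A sketch of the sublayer in the original post-norm arrangement, reusing the PositionwiseFFN sketch above (dropout rate and names are illustrative):

import torch.nn as nn

class FFNSublayer(nn.Module):
    """Post-norm FFN sublayer: LayerNorm(x + Dropout(FFN(x)))."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, p: float = 0.1):
        super().__init__()
        self.ffn = PositionwiseFFN(d_model, d_ff)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        # Residual connection around the FFN, then layer normalization.
        return self.norm(x + self.dropout(self.ffn(x)))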
Linear Transformation Matrices
Weight matrices W1 and W2 of the FFN: W1 projects from the model dimension to the expanded hidden dimension, and W2 projects back to the original model dimension.
FFN Dropout
Regularization mechanism applied after the activation in Transformer FFNs, randomly zeroing activations during training to prevent overfitting.
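A minimal sketch of that placement, with the layers passed in as arguments (purely illustrative):

def ffn_with_inner_dropout(x, fc1, fc2, act, dropout):
    # Dropout zeroes a random fraction of the expanded activations during training.
    return fc2(dropout(act(fc1(x))))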
Inner Layer Normalization
Application of layer normalization before (pre-norm) or after (post-norm) the FFN in the Transformer architecture; the choice affects training stability, with pre-norm generally the more stable option in deep models.
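Schematically, the two orderings differ only in where normalization sits relative to the residual branch (a minimal sketch, with the FFN and norm modules passed in):

def ffn_block(x, ffn, norm, pre_norm=True):
    # Pre-norm: normalize the sublayer input, then add the residual.
    # Post-norm: add the residual first, then normalize (original Transformer).
    return x + ffn(norm(x)) if pre_norm else norm(x + ffn(x))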
Mixture of Experts FFN
Extension of the standard FFN that uses multiple FFN experts selectively activated by a routing network, increasing parameter capacity without a proportional increase in per-token computation.
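A compact top-k routing sketch reusing the PositionwiseFFN above (expert count, k and all names are illustrative; production implementations add load-balancing losses and capacity limits):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Routes each token to its top-k FFN experts and mixes their outputs."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [PositionwiseFFN(d_model, d_ff) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):
        tokens = x.reshape(-1, x.shape[-1])                    # flatten to (n_tokens, d_model)
        scores = F.softmax(self.router(tokens), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)             # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e                                    # (n_tokens, k) selection mask
            hit = mask.any(dim=-1).nonzero(as_tuple=True)[0]   # tokens routed to expert e
            if hit.numel():
                w = (weights * mask).sum(dim=-1)[hit].unsqueeze(-1)
                out[hit] += w * expert(tokens[hit])
        return out.reshape(x.shape)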
ReLU-based FFN
FFN variant using ReLU as the activation function, as in the original Transformer; simpler and cheaper, but typically slightly outperformed by GELU in most modern Transformer applications.
Feed-Forward Projection
Linear projection operation in the FFN that maps representations between spaces of different dimensionality; combined with the intervening nonlinearity, it lets the network capture complex relationships.
Adaptive FFN
FFN architecture that conditions its computation (for example its weights or which experts are active) on the input, improving flexibility for specific tasks.