AI Glossary
A Complete Dictionary of Artificial Intelligence
Scaling Law
Empirical relationship, usually expressed as a power law, that predicts the performance of a language model from three key factors: model size (number of parameters), volume of training data, and the amount of compute used.
Chinchilla Law
Empirical rule from DeepMind's Chinchilla experiments stating that, for a fixed compute budget, model size and the volume of training data should be scaled in equal proportion, contrary to earlier assumptions that favored growing the model faster than the data.
Computational Power (Compute)
Total amount of computation spent on training, usually measured in FLOPs (floating-point operations) rather than FLOPS (floating-point operations per second, which measures hardware throughput). It is the third pillar of scaling laws and determines the duration and feasibility of training large language models.
Isomorphic Scaling
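Scaling strategy in which model size (N) and data volume (D) grow in proportion to each other (D ∝ N, not N ≈ D in absolute terms), so that neither becomes the bottleneck for a given compute budget. The Chinchilla results are commonly summarized as roughly 20 training tokens per parameter.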
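Proportional scaling can be made concrete with a small calculation. The sketch below relies on two assumptions that this glossary does not state: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a fixed ratio of about 20 tokens per parameter. Both are rules of thumb, and the function name and defaults are hypothetical.

```python
import math

def compute_optimal_allocation(compute_flops,
                               tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a FLOP budget C into a model size N and a token count D.

    Assumes C = k * N * D and D = r * N (both are rules of thumb),
    which gives N = sqrt(C / (k * r)).
    """
    n_params = math.sqrt(
        compute_flops / (flops_per_param_token * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget (hypothetical): 1e21 FLOPs.
n, d = compute_optimal_allocation(1e21)
print(f"N ~ {n:.3e} parameters, D ~ {d:.3e} tokens")

# Quadrupling the budget doubles both N and D under these assumptions.
n4, d4 = compute_optimal_allocation(4e21)
print(f"N grows by {n4 / n:.2f}x, D grows by {d4 / d:.2f}x")
```

Because N and D both scale with the square root of the budget, their ratio stays constant, which is what "proportional" means here.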
Test Loss
Performance metric, typically the cross-entropy loss measured on held-out data, used as the dependent variable in scaling laws to quantify a model's effectiveness on unseen data.
Scaling Exponent
Exponent in the power-law equation (e.g., α in L(N) ∝ N^(-α)) that determines how quickly test loss decreases as a variable such as model size or data volume increases.
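In practice the exponent is estimated from measurements, not known in advance. A minimal sketch, assuming loss follows a pure power law with no irreducible term: fit a straight line to (log N, log L) by ordinary least squares and read α off the slope. The data points are synthetic and the constants 12.0 and 0.08 are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y). Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                        # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b

# Synthetic, noise-free data generated from L = 12 * N**-0.08 (hypothetical).
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [12.0 * n ** -0.08 for n in model_sizes]

a, b = fit_power_law(model_sizes, losses)
alpha = -b  # the scaling exponent is the negated slope
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On this noise-free data the fit should recover values very close to 12 and 0.08. With real measurements, a constant offset for irreducible loss usually has to be modeled as well.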
Scaling Transfer
Phenomenon where scaling laws observed on smaller models and more limited datasets can be extrapolated to accurately predict the performance of much larger models.
Compute Budget Optimization
Process of allocating resources between model size, data, and training time to maximize final performance under a total compute budget constraint, guided by scaling laws.
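One way to picture this optimization, as a sketch only and not a prescribed method: assume a parametric loss of the form L(N, D) = E + A/N^α + B/D^β, tie N and D together through the approximation C ≈ 6·N·D, and scan candidate model sizes for the lowest predicted loss. The functional form, the 6·N·D rule, and every constant below are assumptions introduced for illustration.

```python
import math

# Hypothetical constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted test loss under the assumed parametric form."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def best_allocation(compute_flops, flops_per_param_token=6.0, steps=2000):
    """Scan model sizes on a log grid; tokens follow from D = C / (k * N)."""
    best = None
    lo, hi = math.log(1e6), math.log(1e13)
    for i in range(steps + 1):
        n = math.exp(lo + (hi - lo) * i / steps)
        d = compute_flops / (flops_per_param_token * n)
        candidate = (loss(n, d), n, d)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best  # (predicted loss, parameters, tokens)

best_loss, n_opt, d_opt = best_allocation(1e21)
print(f"N ~ {n_opt:.2e}, D ~ {d_opt:.2e}, predicted loss ~ {best_loss:.3f}")
```

Shifting the budget in either direction away from the optimum (a larger model on fewer tokens, or the reverse) raises the predicted loss.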
Sub-Optimal Scaling Regime
A situation where a model is trained with an imbalance between its size and the data volume, for example a large model on little data, leading to performance lower than that predicted by optimal scaling laws.
Power Law
A mathematical relationship of the form Y = aX^b that underpins AI scaling laws, describing how a performance metric (Y) systematically varies with an input resource (X) such as the number of parameters.
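Taking logarithms of both sides shows why power laws appear as straight lines on log-log plots, which is how exponents are usually read off. The symbols are those of the definition; for a loss curve, b is negative.

```latex
\[
Y = aX^{b}
\quad\Longrightarrow\quad
\log Y = \log a + b \log X
\]
\[
\frac{Y(2X)}{Y(X)} = \frac{a(2X)^{b}}{aX^{b}} = 2^{b}
\]
```

The second line says that doubling the resource always multiplies the metric by the same factor 2^b, regardless of the starting point.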
Number of Parameters (Model Size)
A fundamental variable in scaling laws, representing the total number of trainable weights in a neural network, which is directly correlated with the model's capacity to memorize and generalize.
Training Data Volume (Dataset Size)
The quantity of unique tokens or words used to train a model, the increase of which is essential to avoid overfitting and to realize the full performance potential predicted by scaling laws.
Predictive Performance
A model's ability to make accurate predictions on new data, quantified by test loss, and which is the target variable that scaling laws seek to optimize.
Kaplan's Hypothesis
A scaling theory preceding the Chinchilla law, which postulated that, as the compute budget grows, performance improves most efficiently by increasing model size much faster than the number of training tokens.
Pareto Frontier in Scaling
The set of resource allocations (model size, data, compute) for which test loss cannot be lowered without spending more of at least one resource, and no resource can be saved without raising the loss, illustrating the trade-offs in scaling.
Loss Convergence
The tendency of test loss to decrease and stabilize as resources (model, data, compute) are increased, following a predictable trajectory defined by scaling laws.
Data Scaling
Axis of the Chinchilla law that examines how increasing the volume and diversity of training data affects model performance at a given model size.
Model Scaling
Process of increasing the number of parameters in a language model, which, according to scaling laws, must be accompanied by a proportional increase in data to achieve optimal performance.