Transformer Scaling Laws

Chinchilla Scaling Law

Empirical principle established by DeepMind stating that, for compute-optimal training, model size and training data volume should be scaled in equal proportion, with a ratio of approximately 20 training tokens per parameter.
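As a rough sketch of how the 20:1 rule turns a compute budget into a model size and a token count, the snippet below assumes the commonly used approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 * N * D). That approximation, the function name, and the parameter names are assumptions made for illustration; they are not part of the definition above.

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens), assuming
    D = r * N and C = k * N * D (both are illustrative assumptions)."""
    # C = k * N * (r * N)  =>  N = sqrt(C / (k * r))
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_allocation(5.88e23)
print(f"parameters ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions, a budget of 5.88e23 FLOPs corresponds to roughly 70 billion parameters trained on roughly 1.4 trillion tokens.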

Power Law

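Mathematical relationship of the form L(N, D, C) = A * N^(-α) * D^(-β) * C^(-γ), with α, β, γ > 0, so that loss L decreases predictably as the number of parameters N, the dataset size D, and the computational budget C increase. In practice, such laws are usually fitted to one variable at a time, for example L(N) ∝ N^(-α), while the other resources are not the limiting factor.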
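To make the shape of such a law concrete, here is a minimal sketch of the single-variable case, written as L(N) = A * N^(-α) with an explicit minus sign so that α is positive. The constants are hypothetical and chosen only for illustration.

```python
def power_law_loss(n_params, coefficient, exponent):
    """L(N) = A * N^(-alpha): loss falls as N grows when alpha > 0."""
    return coefficient * n_params ** (-exponent)

A, ALPHA = 10.0, 0.1  # hypothetical constants, not fitted values

small = power_law_loss(1e8, A, ALPHA)
large = power_law_loss(2e8, A, ALPHA)

# Doubling N multiplies the loss by 2 ** (-ALPHA), whatever the starting size.
ratio = large / small
print(ratio)
```

The constant ratio is the defining property of a power law: each doubling of the resource buys the same relative reduction in loss.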

Scaling Transfer

Phenomenon where scaling laws observed on smaller models can accurately predict the performance of much larger models, even before their complete training.
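The idea can be sketched as a straight-line fit in log-log space on small models, followed by extrapolation to a larger size. The data below is synthetic, generated from a known law purely for illustration, and the function names are hypothetical; real measurements are noisy and extrapolation can fail outside the fitted range.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line in log-log space: log L = log A - alpha * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (A, alpha)

def predict_loss(n_params, coefficient, alpha):
    """Evaluate the fitted law L(N) = A * N^(-alpha)."""
    return coefficient * n_params ** (-alpha)

# Synthetic "small model" results (illustration only).
small_sizes = [1e6, 1e7, 1e8]
small_losses = [8.0 * n ** (-0.07) for n in small_sizes]

a_hat, alpha_hat = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(1e10, a_hat, alpha_hat)  # 100x beyond the largest fit point
print(a_hat, alpha_hat, predicted)
```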

Optimal Computational Budget

Resource allocation (FLOPs) that maximizes model performance for a given computational cost, by judiciously balancing model size and training data quantity.

Data Saturation

Point beyond which increasing training data volume no longer yields a significant improvement in model performance for a given model size, indicating that model capacity, rather than data, has become the limiting factor.
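One way to picture saturation is with an additive loss form such as L(N, D) = E + A / N^α + B / D^β, a parametric shape used in the compute-optimal scaling literature. The sketch below uses made-up constants (all default values are assumptions) and compares the gain from doubling the data early versus late for a fixed model size.

```python
def additive_loss(n_params, n_tokens, e=1.7, a=400.0, alpha=0.34,
                  b=400.0, beta=0.28):
    """L(N, D) = E + A / N^alpha + B / D^beta with illustrative constants."""
    return e + a / n_params ** alpha + b / n_tokens ** beta

def data_floor(n_params, e=1.7, a=400.0, alpha=0.34):
    """Loss that a model of fixed size approaches as data grows without bound."""
    return e + a / n_params ** alpha

N = 1e8  # fixed model size
gain_early = additive_loss(N, 1e9) - additive_loss(N, 2e9)    # doubling 1B tokens
gain_late = additive_loss(N, 1e12) - additive_loss(N, 2e12)   # doubling 1T tokens
print(gain_early, gain_late, data_floor(N))
```

With the model size held fixed, each further doubling of the data buys a smaller reduction in loss, and the loss never drops below the floor set by the model-size term.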

Scaling Exponent

Exponent (α, β, γ) in the power law that quantifies how quickly loss falls as the number of parameters, the data size, or the computational budget, respectively, is increased.
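A small exponent means that large resource increases are needed for modest gains. As a minimal sketch, assuming a pure single-variable law L ∝ N^(-α) with α > 0, the factor by which a resource must grow to reach a target loss ratio can be computed as follows (the function name and the value α = 0.1 are hypothetical):

```python
def scale_factor_for_loss_ratio(loss_ratio, exponent):
    """Return k such that (k * N)^(-alpha) / N^(-alpha) == loss_ratio."""
    # k^(-alpha) = loss_ratio  =>  k = loss_ratio^(-1 / alpha)
    return loss_ratio ** (-1.0 / exponent)

factor = scale_factor_for_loss_ratio(0.5, 0.1)
print(factor)
```

With α = 0.1, halving the loss would require scaling the resource by a factor of about 1024, which is why even small differences between exponents matter at scale.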

Compute-Bound Regime

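Training regime in which performance is primarily limited by the available computational resources. Earlier analyses (see Kaplan's Law) concluded that extra compute in this regime is best spent mostly on model size, whereas the Chinchilla results favor growing model size and data together.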

Data-Bound Regime

Training phase where performance is primarily limited by the quantity and quality of available data, making increasing data volume more effective than increasing model size.

Predicted Test Loss

Value of the loss on a test dataset, estimated in advance using scaling laws based on model size, data size, and computational budget.

Critical Scaling

Model size threshold from which performance gains follow a steeper scaling law, often observed in very large language models.

Emergence via Scaling

Appearance of new capabilities (reasoning, understanding) that did not exist in smaller models and emerge spontaneously when model size exceeds a certain critical threshold.

Scaling Efficiency

Measure of performance obtained per unit of resource (parameter, data, or FLOP), allowing comparison of different allocation strategies for a given budget.

Chinchilla Isomorphism Hypothesis

Postulate that, as the computational budget grows, model parameter count and number of training tokens should be increased in proportion to each other to achieve optimal performance.

Kaplan's Law

Set of initial scaling laws proposed by OpenAI that suggested performance was primarily a function of model size, with less importance given to data volume.

Pareto Frontier in Scaling

Set of resource allocations (model size vs. data) for which no objective, such as loss or compute cost, can be improved without worsening another, defining the efficient trade-offs in scaling.
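As an illustration, the frontier can be computed by discarding every configuration that is dominated by another one. The sketch below treats each candidate as a (name, cost, loss) tuple where lower is better for both; the candidate names and numbers are invented for the example.

```python
def pareto_frontier(configs):
    """Return the configs not dominated in (cost, loss); lower is better for both."""
    frontier = []
    for name, cost, loss in configs:
        # A config is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            (c <= cost and l <= loss) and (c < cost or l < loss)
            for _, c, l in configs
        )
        if not dominated:
            frontier.append((name, cost, loss))
    return sorted(frontier, key=lambda item: item[1])

# Hypothetical candidates: (name, relative compute cost, validation loss).
candidates = [
    ("small-long", 1.0, 3.2),
    ("medium", 2.0, 2.9),
    ("medium-wasteful", 2.5, 3.0),
    ("large", 4.0, 2.7),
    ("large-undertrained", 4.0, 3.1),
]
frontier = pareto_frontier(candidates)
print([name for name, _, _ in frontier])
```

Here "medium-wasteful" and "large-undertrained" drop out because another candidate reaches a lower loss at the same or lower cost.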

Scaling Performance Metric

Quantitative indicator (validation loss, perplexity, benchmark score) used to measure model effectiveness and track how it improves as different resources are scaled.

Predictability of Scaling

Ability of scaling laws to accurately anticipate the performance of models not yet trained, based on extrapolation of trends observed on smaller models.

Multi-Objective Optimization in Scaling

Process aimed at finding the best compromise between multiple conflicting objectives (performance, cost, latency) when determining the optimal model and data size.
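One simple way to handle conflicting objectives is to treat some of them as hard constraints and optimize the remaining one. The sketch below picks the lowest-loss configuration that stays within a cost limit and a latency limit; the field names, limits, and numbers are hypothetical, and other formulations (such as weighted sums of objectives) are equally possible.

```python
def select_config(configs, max_cost, max_latency_ms):
    """Return the lowest-loss config within the cost and latency limits, or None."""
    feasible = [
        c for c in configs
        if c["cost"] <= max_cost and c["latency_ms"] <= max_latency_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["loss"])

# Hypothetical candidates with relative cost, serving latency, and validation loss.
options = [
    {"name": "small", "cost": 1.0, "latency_ms": 20, "loss": 3.2},
    {"name": "medium", "cost": 2.0, "latency_ms": 45, "loss": 2.9},
    {"name": "large", "cost": 4.0, "latency_ms": 120, "loss": 2.7},
]
choice = select_config(options, max_cost=4.0, max_latency_ms=50)
print(choice["name"])
```

Although "large" has the lowest loss, the latency limit excludes it, so the selection falls back to "medium".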
