AI Glossary
The complete dictionary of Artificial Intelligence
Quantization Aware Training (QAT)
Training method that simulates low-precision quantization in the forward pass (often called fake quantization), allowing the model to adapt its weights and minimize the accuracy loss induced by quantization.
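As an illustration, the idea can be sketched on a single scalar weight. This is a minimal sketch, not a reference implementation: the names `fake_quant` and `train_qat`, the fixed scale, and the use of a straight-through estimator (treating the rounding step as identity in the backward pass) are assumptions chosen for the example.

```python
def fake_quant(w, scale, bits=8):
    """Quantize then dequantize: the forward pass sees the rounding error."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    q = max(qmin, min(qmax, round(w / scale)))  # snap to an integer level, clamp to range
    return q * scale


def train_qat(xs, ys, w=0.0, scale=0.1, bits=4, lr=0.01, epochs=200):
    """Fit y ~ w * x with SGD while the forward pass uses the quantized weight."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            wq = fake_quant(w, scale, bits)   # simulated low-precision weight
            err = wq * x - y
            grad = 2.0 * err * x              # straight-through: d(wq)/d(w) treated as 1
            w -= lr * grad                    # the full-precision "shadow" weight is updated
    return w, fake_quant(w, scale, bits)


xs = [1.0, 2.0, 3.0]
ys = [0.5 * x for x in xs]
w_fp, w_q = train_qat(xs, ys)
```

The full-precision weight keeps moving until its quantized value reproduces the targets, which is the adaptation the definition refers to.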
Low-Rank Adaptation (LoRA)
Parameter-efficient fine-tuning method that freezes the weights of a pre-trained model and injects pairs of small trainable low-rank matrices whose product forms the weight update, drastically reducing the number of trainable parameters while typically keeping performance close to that of full fine-tuning.
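A minimal sketch of the forward pass and of the parameter savings, using plain Python lists. The helper names and the `alpha / r` scaling convention are assumptions for illustration; real implementations operate on tensors inside a deep learning framework.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    r = len(A)                          # rank of the update = number of rows of A
    base = matvec(W, x)                 # frozen pre-trained path
    delta = matvec(B, matvec(A, x))     # low-rank path: d_in -> r -> d_out
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


def lora_param_counts(d_out, d_in, r):
    """Return (parameters in W, trainable parameters in A and B)."""
    return d_out * d_in, r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # r x d_in with r = 1
B = [[1.0], [2.0]]             # d_out x r
y = lora_forward(W, A, B, [1.0, 2.0])
full, trainable = lora_param_counts(4096, 4096, 8)
```

For a 4096 x 4096 layer with rank 8, the sketch reports 65,536 trainable parameters against 16,777,216 frozen ones, about 0.4%.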
8-bit Floating Point Representation (FP8)
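Low-precision numerical format that uses 8 bits to represent floating-point numbers, commonly in the E4M3 or E5M2 layout. It enables significant speedups on GPUs with native FP8 support; keeping the training of large models stable generally relies on scaling factors and higher-precision accumulation.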
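To make the bit layout concrete, here is a small decoder. It assumes the E4M3 variant as described in the OCP 8-bit floating-point specification (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities); the glossary entry itself does not name a variant, and the function name is hypothetical.

```python
def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (0..255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F     # 4 exponent bits
    man = byte & 0x07            # 3 mantissa bits
    if exp == 0x0F and man == 0x07:
        return float("nan")      # the only NaN pattern; E4M3 has no infinities
    if exp == 0:
        # subnormal numbers: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


largest = decode_e4m3(0x7E)      # largest finite value
one = decode_e4m3(0x38)
smallest = decode_e4m3(0x01)     # smallest positive subnormal
```

With only 3 mantissa bits, neighbouring values differ by up to 12.5%, which is why scaling tensors into the representable range matters so much in practice.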
4-bit Integer Quantization (INT4)
Aggressive compression technique that represents model weights as 4-bit integers (16 distinct levels), requiring advanced quantization algorithms and often partial retraining to compensate for the significant information loss.
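The storage side can be sketched as follows: weights are mapped to integers in [-8, 7] and two of them are packed per byte. This is a simplified illustration that assumes symmetric quantization with a single scale; the function names are hypothetical, and the advanced algorithms the definition mentions are not shown.

```python
def quantize_int4(weights):
    """Symmetric quantization to signed 4-bit integers with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(q):
    """Pack signed 4-bit values two per byte (first value in the low nibble)."""
    if len(q) % 2:
        q = list(q) + [0]        # pad to an even count
    return bytes(((hi & 0x0F) << 4) | (lo & 0x0F)
                 for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(packed, n):
    """Recover the first n signed 4-bit values from packed bytes."""
    out = []
    for b in packed:
        for nib in (b & 0x0F, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)   # undo two's complement
    return out[:n]


q, scale = quantize_int4([1.4, -0.6, 0.0, 0.2])
packed = pack_int4([-8, 7, 0, -1, 3])
restored = unpack_int4(packed, 5)
```

Five weights fit in three bytes here, compared with ten bytes in FP16.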
Quantization Bias Compensation (Q-Bias)
Post-quantization adjustment technique that estimates and corrects the systematic shift in layer outputs introduced by precision reduction, often by modifying normalization layers or the bias terms of linear layers.
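One common way to do this for a linear layer is to subtract the expected output error from the bias. The sketch below assumes that the mean input `x_mean` has been estimated from calibration data; the function name is hypothetical and this is only one possible form of bias compensation.

```python
def corrected_bias(W, Wq, b, x_mean):
    """Return a bias that cancels the expected output shift (Wq - W) @ E[x]."""
    new_b = []
    for row, row_q, bias in zip(W, Wq, b):
        # expected error of this output unit under the mean input
        shift = sum((wq - w) * m for w, wq, m in zip(row, row_q, x_mean))
        new_b.append(bias - shift)
    return new_b


W = [[0.26, -0.49]]      # original weights
Wq = [[0.25, -0.50]]     # weights after quantization
b = [0.10]
x_mean = [2.0, 1.0]      # assumed calibration statistic
b_fixed = corrected_bias(W, Wq, b, x_mean)

# On the mean input, the quantized layer now matches the original layer.
y_ref = sum(w * m for w, m in zip(W[0], x_mean)) + b[0]
y_q = sum(w * m for w, m in zip(Wq[0], x_mean)) + b_fixed[0]
```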
Quantization Grid Search Optimization
Systematic exploration of quantization configurations (per-layer, per-group, mixed-precision) to identify the scheme offering the best trade-off between model size, speed, and accuracy for a given architecture.
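A toy version of such a search is sketched below. It scans bit widths and group sizes, discards configurations above a storage budget, and keeps the one with the lowest reconstruction error. The cost model (one 16-bit scale per group) and the use of mean squared weight error as a stand-in for accuracy are assumptions; a real search would evaluate the model on a task.

```python
from itertools import product


def quantize_groups(weights, bits, group_size):
    """Symmetric per-group quantize/dequantize; returns the reconstructed weights."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out.extend(max(-qmax - 1, min(qmax, round(w / scale))) * scale
                   for w in group)
    return out


def grid_search(weights, bit_options, group_options, max_bits_per_weight):
    """Return the lowest-error configuration that fits the storage budget."""
    best = None
    for bits, group_size in product(bit_options, group_options):
        cost = bits + 16 / group_size     # assumed: one 16-bit scale per group
        if cost > max_bits_per_weight:
            continue
        deq = quantize_groups(weights, bits, group_size)
        mse = sum((a - b) ** 2 for a, b in zip(weights, deq)) / len(weights)
        if best is None or mse < best["mse"]:
            best = {"bits": bits, "group_size": group_size,
                    "bits_per_weight": cost, "mse": mse}
    return best


weights = [0.01 * i for i in range(-16, 16)]
tight = grid_search(weights, [4, 8], [8, 32], max_bits_per_weight=6.0)
loose = grid_search(weights, [4, 8], [8, 32], max_bits_per_weight=9.0)
```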
Speculative Inference
Technique for accelerating generative inference, also known as speculative decoding, in which a small 'draft' model quickly proposes several tokens that the large target model then verifies in parallel, reducing the number of sequential forward passes of the costly model.
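The propose-then-verify loop can be sketched with toy greedy "models". Here `target_next` and `draft_next` are hypothetical stand-ins that map a token sequence to the next token, and the sketch covers only the greedy case; sampling-based variants use a probabilistic acceptance rule that is not shown.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (sequence, number of target passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1) The cheap draft model proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) The target model scores every prefix. A real system obtains these
        #    k + 1 predictions from one batched forward pass, so count one pass.
        target_passes += 1
        checks = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix on which both models agree, then append
        #    the target's own token at the first disagreement (or a bonus token).
        n_ok = 0
        while n_ok < k and draft[n_ok] == checks[n_ok]:
            n_ok += 1
        seq.extend(draft[:n_ok])
        seq.append(checks[n_ok])
    return seq[:len(prompt) + num_tokens], target_passes


def target_next(s):
    return (s[-1] + 1) % 10            # toy target: counts upward modulo 10


def draft_next(s):
    return 0 if s[-1] == 5 else (s[-1] + 1) % 10   # toy draft: wrong after a 5


out, passes = speculative_decode(target_next, draft_next, [0], num_tokens=12)
```

The output is identical to what the target model would produce on its own; the gain is that it needs fewer sequential target passes than generated tokens whenever the draft is often right.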
Truncated Singular Value Decomposition (Truncated SVD)
Application of singular value decomposition (SVD) followed by truncation of the smallest singular values, approximating a weight matrix by a low-rank factorization and thus reducing parameters and computation with a controlled error.
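In symbols, for a weight matrix W of size m x n and a retained rank k (notation chosen here for illustration), the approximation and its error can be written as:

```latex
\begin{align}
W &= U \Sigma V^{\top} = \sum_{i=1}^{\min(m,n)} \sigma_i \, u_i v_i^{\top},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 \\
W_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top} = U_k \Sigma_k V_k^{\top} \\
\lVert W - W_k \rVert_F &= \sqrt{\sum_{i > k} \sigma_i^{2}} \\
\text{parameters: } \; mn \;&\longrightarrow\; k\,(m + n)
\end{align}
```

The third line is the Eckart-Young result: the error is exactly the energy of the discarded singular values, which is what makes it controllable. As a worked example, with m = n = 4096 and k = 64 the parameter count drops from 16,777,216 to 524,288, a 32-fold reduction.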
Block-wise Quantization
Quantization strategy that divides weight tensors into smaller blocks and quantizes each block independently with its own scale, better preserving the value distribution and reducing the overall error compared to per-tensor quantization.
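The benefit is easiest to see with an outlier. In the sketch below (symmetric 4-bit quantization, hypothetical helper names), a single large value forces a coarse scale on the whole tensor, whereas block-wise quantization confines the damage to the block that contains it.

```python
def quant_dequant(values, bits=4):
    """Symmetric quantize/dequantize with one scale for all given values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def blockwise_quant_dequant(values, block_size, bits=4):
    """Apply quant_dequant independently to consecutive blocks."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quant_dequant(values[start:start + block_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.01, -0.02, 0.03, -0.04, 0.02, 0.01, -0.03, 100.0]  # last value is an outlier
err_tensor = mse(weights, quant_dequant(weights))
err_block = mse(weights, blockwise_quant_dequant(weights, block_size=4))
```

The price is one extra scale per block, so smaller blocks trade storage for accuracy.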
Structured Sparse Weights
Result of a form of pruning that imposes regular patterns (by row, column, or block) on the pruned weights, so that hardware acceleration on CPUs/GPUs can exploit the sparsity efficiently, unlike unstructured sparsity.
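One widely used regular pattern is N:M sparsity, for example 2:4, where at most two of every four consecutive weights are non-zero. The entry does not name this pattern, so it is an assumed example here; the sketch applies magnitude-based 2:4 pruning to a single row.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m, zero the rest."""
    pruned = list(row)
    for start in range(0, len(row), m):
        group = range(start, min(start + m, len(row)))
        # indices of the n entries with the largest absolute value in this group
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:n]
        for i in group:
            if i not in keep:
                pruned[i] = 0.0
    return pruned


row = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
sparse_row = prune_n_of_m(row)
```

Because every group has the same number of non-zeros, the result can be stored compactly with fixed-size metadata, which is what makes such patterns hardware-friendly.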