Quantization and Optimization

Quantization Aware Training (QAT)

Training method that simulates low-precision quantization in the forward pass, so the model adapts its weights to minimize the accuracy loss that quantization introduces.
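A minimal sketch of the idea, assuming a single scalar weight, a symmetric signed integer grid, and a straight-through estimator (the gradient computed for the quantized weight is applied directly to the full-precision weight). The names `fake_quantize` and `train_qat` are hypothetical, chosen for illustration only.

```python
def fake_quantize(x, scale, num_bits=4):
    """Quantize-dequantize: snap x onto a signed integer grid, then map back to float."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def train_qat(xs, ys, scale=0.1, num_bits=4, lr=0.01, steps=50):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight, updated by the optimizer
    for _ in range(steps):
        w_q = fake_quantize(w, scale, num_bits)  # simulated low-precision weight
        # Gradient of sum((w_q * x - y)^2) with respect to w_q.
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and apply it to w.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w, fake_quantize(w, scale, num_bits)


xs = [1.0, 2.0, 3.0]
ys = [0.37 * x for x in xs]
latent_w, quantized_w = train_qat(xs, ys)
print(latent_w, quantized_w)
```

The latent weight settles near a boundary between two grid points, so the quantized weight ends up on one of the grid values closest to the target rather than on the target itself.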

Low-Rank Adaptation (LoRA)

Parameter-efficient fine-tuning method that freezes the weights of a pre-trained model and injects pairs of small trainable low-rank matrices, whose product is added to the frozen weights. This drastically reduces the number of trainable parameters while largely preserving performance.
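The low-rank update can be sketched as follows, assuming a single linear layer with frozen weight `W` (shape d_out x d_in), trainable factors `A` (r x d_in) and `B` (d_out x r), and an `alpha / r` scaling factor. The helper names are hypothetical and the matrices are plain Python lists to keep the example dependency-free.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)  # rank of the update
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Parameters in A and B, versus d_out * d_in for full fine-tuning."""
    return r * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]
B = [[1.0], [2.0]]
print(lora_forward(W, A, B, [2.0, 1.0]))
print(trainable_params(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

When `B` is all zeros, which is a common initialization, the layer reproduces the frozen model exactly, so fine-tuning starts from the pre-trained behavior.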

8-bit Floating Point Representation (FP8)

Low-precision numerical format that uses 8 bits to represent a floating-point number. It enables significant speedups on GPUs with native FP8 support and, with appropriate scaling, can maintain the training stability of large models.
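To make the 8-bit budget concrete, here is a sketch of decoding one FP8 byte. The entry does not say which FP8 variant it means, so this assumes the E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN bit pattern per sign). The function name `decode_e4m3` is hypothetical.

```python
def decode_e4m3(byte):
    """Decode one byte as an E4M3 floating-point value (assumed layout, see above)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only non-finite encoding in this layout
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6
        return sign * (mantissa / 8) * 2 ** (-6)
    # Normal: implicit leading 1, exponent bias of 7
    return sign * (1 + mantissa / 8) * 2 ** (exponent - 7)


finite = [decode_e4m3(b) for b in range(256)]
finite = [v for v in finite if v == v]  # drop NaN (NaN != NaN)
print("largest finite value:", max(finite))
print("smallest positive value:", min(v for v in finite if v > 0))
```

The narrow range and the coarse spacing between representable values are why FP8 pipelines typically rescale tensors before casting them.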

4-bit Integer Quantization (INT4)

Aggressive compression technique that represents model weights in 4 bits. It requires advanced quantization algorithms, and often partial retraining, to compensate for the significant information loss.
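A minimal sketch of the storage side, assuming symmetric round-to-nearest quantization with one scale for the whole list and two signed 4-bit values packed per byte (low nibble first). This is the simplest possible scheme, not one of the advanced algorithms the entry refers to, and all function names are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(q):
    """Pack pairs of signed 4-bit integers into bytes (low nibble first)."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(packed, n):
    """Recover the first n signed 4-bit integers from packed bytes."""
    out = []
    for b in packed:
        for nibble in (b & 0xF, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:n]


weights = [0.7, -0.31, 0.02, -0.66, 0.18]
q, scale = quantize_int4(weights)
packed = pack_int4(q)
restored = [v * scale for v in unpack_int4(packed, len(q))]
print(len(packed), "bytes for", len(weights), "weights")
print(restored)
```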

Quantization Bias Compensation (Q-Bias)

Post-quantization adjustment technique that measures the systematic error (bias) introduced by precision reduction and corrects it, often by modifying normalization layers or the bias terms of linear layers.
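One way to correct the bias of a linear layer can be sketched as follows. It assumes a layer y = W x + b, a quantized weight matrix `W_q`, and a small set of calibration inputs: the expected output shift is (W_q - W) times the mean input, and subtracting it from the bias restores the original mean output. The function name and the calibration setup are illustrative assumptions, not a prescribed procedure.

```python
def corrected_bias(W, W_q, bias, calib_inputs):
    """Return a bias that cancels the mean output shift caused by quantizing W."""
    n = len(calib_inputs)
    mean_x = [sum(col) / n for col in zip(*calib_inputs)]  # per-feature input mean
    shift = [
        sum((wq - w) * m for w, wq, m in zip(row, row_q, mean_x))
        for row, row_q in zip(W, W_q)
    ]
    return [b - s for b, s in zip(bias, shift)]


W = [[0.26, -0.13]]    # original weights
W_q = [[0.3, -0.1]]    # the same weights after coarse quantization
bias = [0.5]
calib = [[1.0, 2.0], [3.0, 4.0]]
print(corrected_bias(W, W_q, bias, calib))
```

The correction only removes the average error over the calibration data; per-input errors remain.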

Quantization Grid Search Optimization

Systematic exploration of quantization configurations (per-layer, per-group, mixed-precision) to identify the scheme that offers the best balance between model size, speed, and accuracy for a given architecture.
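A toy version of such a search, under several stated assumptions: the grid covers only bit width and group size, reconstruction error (MSE) on the weights stands in for model accuracy, and storage cost is the payload bits plus one 16-bit scale per group. A real search would evaluate the model on a validation set instead. All names are hypothetical.

```python
from itertools import product


def quantize_groups(weights, bits, group_size):
    """Quantize-dequantize each group of weights with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out.extend(max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in group)
    return out


def grid_search(weights, bit_options, group_options, max_bits_per_weight):
    """Return the lowest-error configuration that fits the storage budget, or None."""
    best = None
    for bits, group_size in product(bit_options, group_options):
        cost = bits + 16 / group_size  # assumed: one 16-bit scale per group
        if cost > max_bits_per_weight:
            continue
        deq = quantize_groups(weights, bits, group_size)
        mse = sum((w - d) ** 2 for w, d in zip(weights, deq)) / len(weights)
        if best is None or mse < best["mse"]:
            best = {"bits": bits, "group_size": group_size,
                    "bits_per_weight": cost, "mse": mse}
    return best


weights = [0.01 * i - 0.3 for i in range(64)]
print(grid_search(weights, [2, 4, 8], [8, 32], max_bits_per_weight=6.0))
```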

Speculative Inference

Technique for accelerating generative inference, also known as speculative decoding, in which a small 'draft' model quickly proposes several tokens that the large target model then validates in parallel, reducing the number of costly sequential passes through the large model.
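The propose-then-verify loop can be sketched with toy models, assuming greedy decoding (a drafted token is accepted only if it matches the target's own top choice) and models represented as plain functions from a token list to the next token. In a real system the verification of all drafted positions is one batched forward pass; here it is simulated with a loop and counted as one pass. All names are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
    """Generate tokens that match target-only greedy decoding, using a draft model."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks every drafted position (one parallel pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)  # always the target's own choice
            if i == k or proposal[i] != expected:
                break  # first mismatch (correction) or end of draft (bonus token)
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def target_next(seq):
    return (seq[-1] + 1) % 10  # toy target: count upward modulo 10


def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10  # toy draft: wrong after a 4


output, passes = speculative_decode(target_next, draft_next, [0], 12, k=3)
print(output, "using", passes, "target passes")
```

Because every kept token is the target's own choice, the output is identical to what the target would produce alone; the gain is in the number of sequential target passes.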

Truncated Singular Value Decomposition (Truncated SVD)

Approximation of a weight matrix obtained by computing its SVD and discarding the smallest singular values, which yields a lower-rank matrix and reduces parameters and computation with controlled error.
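The "controlled error" can be stated precisely. The symbols below are introduced here for illustration: W is an m x n weight matrix of rank r with singular values sorted in decreasing order, and k is the number of singular values kept.

```latex
W = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

W_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad k < r

\lVert W - W_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2

\text{parameters: } mn \;\longrightarrow\; k(m + n)
```

The error in the Frobenius norm is exactly the energy of the discarded singular values, so k can be chosen by deciding how much of that energy to give up. Storing the two factors is cheaper than storing W whenever k(m + n) < mn.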

Block-wise Quantization

Quantization strategy that divides weight tensors into smaller blocks and quantizes each block independently, with its own scale. This tracks the local value distribution more closely and reduces the overall error compared to quantizing the whole tensor with a single scale.
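A small comparison illustrates why separate scales help, assuming symmetric 4-bit absmax quantization and a weight list that contains one outlier. With a single global scale the outlier forces every small weight to zero; with blocks, only the outlier's own block is affected. Function names are hypothetical.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric absmax quantization of a list, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def blockwise_quantize_dequantize(values, block_size, bits=4):
    """Quantize each block of block_size values with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 8.0]  # last value is an outlier
global_err = mse(weights, quantize_dequantize(weights))
block_err = mse(weights, blockwise_quantize_dequantize(weights, block_size=4))
print("global:", global_err, "block-wise:", block_err)
```

The price is one extra scale per block, so smaller blocks trade storage for accuracy.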

Structured Sparse Weights

Form of pruning that removes weights in regular patterns (by row, column, or block). Unlike unstructured sparsity, where zeros are scattered irregularly, these patterns can be exploited efficiently by hardware acceleration on CPUs and GPUs.
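A sketch of the row-wise case, assuming rows are ranked by L2 norm and the weakest ones are dropped entirely. The ranking criterion and the function name are illustrative assumptions. Because whole rows disappear, the result is a smaller dense matrix that ordinary dense kernels can process, which is where the hardware efficiency comes from.

```python
def prune_rows(W, keep):
    """Keep the `keep` rows with the largest L2 norm; return them and their indices."""
    norms = [sum(w * w for w in row) ** 0.5 for row in W]
    ranked = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [W[i] for i in kept], kept


W = [
    [0.1, 0.0],
    [2.0, -1.0],
    [0.0, 0.05],
    [1.0, 1.0],
]
pruned, kept = prune_rows(W, keep=2)
print(pruned, kept)
```

The returned indices tell the next layer which of its input columns to keep so that the shapes still match.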
