Quantization and Optimization

Quantization Aware Training (QAT)

Training method that simulates low-precision quantization in the forward pass (often called fake quantization), so the model adapts its weights during training to minimize the accuracy loss that quantization would otherwise cause.
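
A minimal sketch of the idea, assuming a single scalar weight, a fixed scale of 0.1, and the straight-through estimator (the rounding step is treated as the identity when computing the gradient). The names `fake_quantize` and `train_qat` are illustrative and do not come from any library.

```python
def fake_quantize(w, scale, bits=8):
    """Quantize then dequantize: the result is a float that carries rounding error."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(x, y, scale=0.1, lr=0.1, steps=50):
    """Fit one weight so that fake_quantize(w) * x approximates y."""
    w = 0.0
    for _ in range(steps):
        err = fake_quantize(w, scale) * x - y  # forward pass sees the quantized weight
        grad = 2 * err * x                     # straight-through: rounding treated as identity
        w -= lr * grad                         # update the full-precision weight
    return w


w = train_qat(x=1.0, y=0.5)
w_q = fake_quantize(w, 0.1)  # the value that would be deployed after quantization
```

The full-precision weight keeps moving until its quantized version fits the target, which is the adaptation the definition describes.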

Low-Rank Adaptation (LoRA)

Parameter-efficient fine-tuning method that freezes the weights of a pre-trained model and injects pairs of small trainable low-rank matrices whose product is added to the frozen weights, drastically reducing the number of trainable parameters while largely preserving performance.
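
As a rough illustration, the forward pass can be written as W x + (alpha / r) B A x, where W stays frozen and only A and B are trained. The sketch below uses plain Python lists; the shapes, the scaling factor `alpha`, and the helper names are assumptions chosen for the example.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path: B (d x r) times A (r x k) times x
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d, k, r):
    """Parameters in A (r x k) plus B (d x r)."""
    return r * (d + k)


random.seed(0)
d, k, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the adapted model starts identical to the base
x = [1.0, -2.0, 0.5, 3.0]
y0 = lora_forward(W, A, B, x, alpha, r)
```

For a 4096 x 4096 layer with r = 8, `trainable_params` gives 65,536 values against 16,777,216 in the full matrix.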

8-bit Floating Point Representation (FP8)

Low-precision numerical format that uses 8 bits to represent floating-point numbers, typically in the E4M3 or E5M2 layout. It enables significant speedups on GPUs with native FP8 support and, when combined with appropriate scaling, can keep the training of large models stable.
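
To make the bit layout concrete, here is a small decoder for one common FP8 variant. It assumes the E4M3 encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern per sign); the glossary entry itself does not say which variant is meant, and `decode_e4m3` is a name made up for this sketch.

```python
def decode_e4m3(byte):
    """Decode one byte as an E4M3 floating-point value (assumed layout, see above)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")                               # the only NaN pattern per sign
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** -6          # subnormal numbers
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


largest = decode_e4m3(0x7E)   # exponent 1111, mantissa 110
one = decode_e4m3(0x38)       # exponent 0111, mantissa 000
smallest = decode_e4m3(0x01)  # smallest positive subnormal
```

With only 3 mantissa bits, neighbouring representable values are far apart, which is why FP8 training pipelines usually rescale tensors into the representable range.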

4-bit Integer Quantization (INT4)

Aggressive compression technique that stores model weights in 4 bits (16 distinct levels). It requires advanced quantization algorithms, and often calibration or partial retraining, to compensate for the significant information loss.
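
The storage side can be sketched as follows, assuming simple symmetric round-to-nearest quantization into the range [-8, 7] and two values packed per byte. Real INT4 schemes use more elaborate algorithms; the function names here are illustrative only.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(q):
    """Pack two signed 4-bit values per byte (low nibble first)."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(data, n):
    """Recover n signed 4-bit values from packed bytes."""
    out = []
    for b in data:
        for nibble in (b & 0xF, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:n]


weights = [0.7, -0.3, 0.12, -0.7, 0.0]
q, scale = quantize_int4(weights)
packed = pack_int4(q)
restored = [v * scale for v in unpack_int4(packed, len(q))]
```

Five weights fit in three bytes, but 0.12 comes back as roughly 0.1, which illustrates the information loss the definition mentions.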

Quantization Bias Compensation (Q-Bias)

Post-quantization adjustment technique that estimates the systematic error (bias) introduced by precision reduction and corrects it, often by modifying normalization layers or the bias terms of linear layers.
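
One common form of this correction, sketched here for a linear layer, subtracts the expected output shift (W_q - W) E[x] from the layer bias. The sketch assumes the mean input `mean_x` has been estimated from calibration data; the helper names are hypothetical.

```python
def linear(W, bias, x):
    """Plain linear layer: W x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, bias)]


def corrected_bias(W, W_q, bias, mean_x):
    """Shift each bias term to cancel the expected error caused by quantizing W."""
    out = []
    for row, row_q, b in zip(W, W_q, bias):
        shift = sum((wq - w) * m for w, wq, m in zip(row, row_q, mean_x))
        out.append(b - shift)
    return out


W = [[0.26, -0.49]]    # original weights
W_q = [[0.25, -0.50]]  # weights after quantization
bias = [0.1]
mean_x = [2.0, 1.0]    # assumed mean activation from calibration data

new_bias = corrected_bias(W, W_q, bias, mean_x)
before = linear(W, bias, mean_x)[0]
after = linear(W_q, new_bias, mean_x)[0]
```

On the mean input, the quantized layer with the corrected bias reproduces the output of the original layer; for other inputs the correction holds only on average.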

Quantization Grid Search Optimization

Systematic search over quantization configurations (per-layer, per-group, mixed-precision) to identify the scheme that offers the best trade-off between model size, speed, and accuracy for a given architecture.
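
A toy version of such a search is sketched below. It assumes a very simple cost model (each group stores one 16-bit scale), uses mean squared weight error as a stand-in for accuracy, and picks the smallest configuration that stays under an error budget. A real search would evaluate task accuracy and latency instead.

```python
import itertools
import random


def quant_error(weights, bits, group_size):
    """Mean squared error of symmetric per-group quantization."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        total += sum((w - round(w / scale) * scale) ** 2 for w in group)
    return total / len(weights)


def size_bits(n, bits, group_size, scale_bits=16):
    """Storage cost: quantized values plus one scale per group (assumed cost model)."""
    n_groups = -(-n // group_size)  # ceiling division
    return n * bits + n_groups * scale_bits


def grid_search(weights, bit_options, group_options, max_mse):
    """Return (cost, bits, group_size, mse) of the cheapest config within the budget."""
    best = None
    for bits, group in itertools.product(bit_options, group_options):
        mse = quant_error(weights, bits, group)
        cost = size_bits(len(weights), bits, group)
        if mse <= max_mse and (best is None or cost < best[0]):
            best = (cost, bits, group, mse)
    return best


random.seed(0)
weights = [random.gauss(0, 1) for _ in range(256)]
best = grid_search(weights, bit_options=[2, 4, 8], group_options=[32, 128, 256], max_mse=1e-3)
```

Loosening `max_mse` lets lower bit widths through, which is the size-versus-accuracy trade-off the search is meant to expose.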

Speculative Inference

Acceleration technique for generative inference, also known as speculative decoding, in which a small 'draft' model quickly proposes several tokens that the large target model then verifies in parallel, reducing the number of costly sequential target-model steps.
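
The control flow can be sketched with toy deterministic "models", assuming greedy decoding on both sides. Here `target` and `draft` are hypothetical functions that map a token list to the next token, and the parallel verification is simulated with a loop and counted as a single target pass per round.

```python
def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns the tokens and the number of target passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # The draft model proposes k tokens one after another.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # The target checks all positions; a real system batches this into one pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # a match, a correction, or the bonus token
            if i == k or proposal[i] != expected:
                break
        tokens += accepted
    return tokens[:len(prompt) + n_new], target_passes


def target(ctx):
    return (ctx[-1] + 1) % 10  # toy target: count upward modulo 10


def draft(ctx):
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10  # agrees except after token 2


output, passes = speculative_decode(target, draft, prompt=[0], n_new=12)
```

The output matches what the target alone would generate, but it takes fewer target passes than tokens because most draft proposals are accepted.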

Truncated Singular Value Decomposition (Truncated SVD)

Approximation of a weight matrix by computing its singular value decomposition (SVD) and discarding the smallest singular values, which yields a lower-rank factorization that reduces parameters and computation with a controlled error.
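
A short NumPy sketch of the factorization follows, assuming a random 64 x 32 matrix and a target rank of 8 purely for illustration. The Frobenius-norm error of the approximation equals the square root of the sum of the squared discarded singular values, which is what makes the error controllable.

```python
import numpy as np


def truncated_svd(W, rank):
    """Return factors A (m x rank) and B (rank x n) with A @ B approximating W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # fold the kept singular values into the left factor
    B = Vt[:rank, :]
    return A, B, S


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
A, B, S = truncated_svd(W, rank=8)

error = np.linalg.norm(W - A @ B)        # Frobenius norm of the residual
expected = np.sqrt(np.sum(S[8:] ** 2))   # energy in the discarded singular values
params_full = W.size
params_low = A.size + B.size
```

A random matrix has a flat spectrum, so the error here is large; trained weight matrices often have faster-decaying singular values, although this varies by layer.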

Block-wise Quantization

Quantization strategy that divides weight tensors into smaller blocks and quantizes each block independently with its own scale, which better preserves the local value distribution and reduces the overall error compared to quantizing the whole tensor with a single scale.
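
The effect is easiest to see with an outlier. The sketch below assumes symmetric 4-bit round-to-nearest quantization and a block size of 4; the weights are made up so that one large value dominates the global scale.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with a single scale for the given values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def blockwise(values, block_size, bits=4):
    """Quantize each block independently, each with its own scale."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[i:i + block_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 10.0]  # one outlier at the end
global_err = mse(weights, quantize_dequantize(weights))
block_err = mse(weights, blockwise(weights, block_size=4))
```

With a single scale, the outlier forces every small weight to zero; with blocks, only the block containing the outlier pays that price.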

Structured Sparse Weights

Form of pruning that removes weights in regular patterns (by row, column, or block), so that hardware acceleration on CPUs and GPUs can be exploited efficiently, unlike unstructured sparsity, where zeros are scattered irregularly.
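
Row pruning is the simplest case to sketch. The example assumes rows are ranked by their L2 norm and the weakest ones are dropped entirely, leaving a smaller dense matrix; the ranking criterion and the function name are illustrative choices.

```python
def prune_rows(W, keep):
    """Keep the `keep` rows with the largest L2 norm; return them with their indices."""
    norms = [sum(w * w for w in row) ** 0.5 for row in W]
    ranked = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore the original row order
    return [W[i] for i in kept], kept


W = [
    [0.1, 0.0],
    [2.0, -1.0],
    [0.0, 0.05],
    [1.0, 1.0],
]
pruned, kept = prune_rows(W, keep=2)
```

Because whole rows disappear, the result is an ordinary dense matrix of half the size, which standard matrix kernels handle without any sparse indexing.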
