
AI Glossary

The complete dictionary of Artificial Intelligence

162 categories · 2,032 subcategories · 23,060 terms

Post-Training Quantization (PTQ)

Precision reduction technique applied to an already trained model, without requiring complete retraining. It converts high-precision weights and activations (e.g., FP32) to lower-precision representations (e.g., INT8) to optimize inference.
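As a minimal sketch, symmetric per-tensor PTQ can be expressed in a few lines of NumPy (function names here are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of FP32 weights to INT8."""
    scale = np.abs(w).max() / 127.0                    # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)      # 1 byte per weight instead of 4
w_hat = dequantize(q, scale)     # approximation error bounded by scale / 2
```

Real toolchains also calibrate activation ranges on sample data, but the weight path above captures the core idea.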


Quantization-Aware Training (QAT)

Method where quantization and dequantization operations are integrated into the computational graph during training. This allows the model to adapt to precision loss, minimizing performance degradation compared to PTQ.
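The core of QAT is a "fake quantization" op: quantize then immediately dequantize, so the forward pass sees quantization error while values stay in floating point. A minimal NumPy sketch (the backward pass in a real framework would treat the rounding as identity, the straight-through estimator):

```python
import numpy as np

def fake_quant(x, n_bits=8):
    """Quantize then dequantize: the op inserted into the training graph
    so the model learns to tolerate the precision loss."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

# During QAT, forward passes use fake-quantized weights and activations;
# gradients flow through round() as if it were the identity function.
w = np.random.randn(8, 8)
w_q = fake_quant(w)
```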


Binarized Neural Networks (BNN)

Extreme form of quantization where weights and/or activations are constrained to a single binary value (+1 or -1). It enables significant computational and memory gains by replacing multiplications with additions/subtractions.
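A toy illustration of the arithmetic saving: with binary weights, a dot product collapses to signed additions of the inputs. Sketch in NumPy:

```python
import numpy as np

def binarize(w):
    """Constrain weights to {+1, -1} via the sign function
    (zeros mapped to +1, a common convention)."""
    return np.where(w >= 0, 1.0, -1.0)

w = np.array([0.3, -1.2, 0.0, 2.5])
x = np.array([1.0, 2.0, 3.0, 4.0])
wb = binarize(w)                          # [ 1., -1.,  1.,  1.]

# No multiplications: each input is just added or subtracted.
out = np.sum(np.where(wb > 0, x, -x))     # identical to wb @ x
```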


Structured Pruning

Compression technique that removes entire weight structures, such as filters, channels, or attention heads, rather than individual weights. It is more effective for accelerating computation on modern hardware than unstructured pruning.
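A common criterion is the L1 norm of each structure: channels with the smallest total magnitude are assumed to contribute least. A minimal sketch for a linear layer (illustrative, not a library API):

```python
import numpy as np

def prune_channels(w, keep_ratio=0.5):
    """Drop entire output channels (rows) with the smallest L1 norm.
    w: (out_channels, in_features) weight matrix."""
    n_keep = max(1, int(w.shape[0] * keep_ratio))
    norms = np.abs(w).sum(axis=1)                  # L1 norm per output channel
    keep = np.sort(np.argsort(norms)[-n_keep:])    # indices of the strongest channels
    return w[keep], keep

w = np.random.randn(8, 16)
w_small, kept = prune_channels(w, keep_ratio=0.5)
# The result is a genuinely smaller dense matrix, so any hardware benefits.
```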


Unstructured Pruning

Compression method that eliminates individual weights in the network, typically those with the smallest magnitude. Although it can reduce model size, it requires specialized hardware support (sparsity) to accelerate computation.
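Magnitude pruning, the standard unstructured scheme, can be sketched as a thresholding mask (note the result is the same dense shape with scattered zeros, which is why generic hardware gains nothing without sparsity support):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.7):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]   # k-th smallest magnitude
    mask = np.abs(w) > threshold
    return w * mask, mask

w = np.random.randn(10, 10)
w_sparse, mask = magnitude_prune(w, sparsity=0.7)      # 70% of entries are now zero
```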


Low-Rank Matrix Factorization

Compression technique that decomposes a large weight matrix into two or more smaller matrices. It reduces the number of parameters and matrix multiplication operations, thus accelerating dense and convolutional layers.
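A standard way to obtain the factorization is a truncated SVD: keeping only the top singular components gives the best low-rank approximation in the Frobenius norm. Sketch:

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Approximate w (m x n) as a @ b, with a (m x r) and b (r x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # absorb singular values into the first factor
    b = vt[:rank]
    return a, b

w = np.random.randn(64, 64)
a, b = low_rank_factorize(w, rank=8)
# Parameter count: 64*64 = 4096  ->  64*8 + 8*64 = 1024
# At inference, x @ w is replaced by (x @ a) @ b.
```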


Knowledge Distillation

Compression process where a small model (the student) is trained to reproduce the outputs of a larger model (the teacher), typically its softened probability distributions rather than hard labels. The student can approach the teacher's accuracy at a fraction of the size and inference cost.
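The classic distillation objective compares temperature-softened teacher and student distributions. A minimal NumPy sketch (in practice this term is combined with the ordinary hard-label loss and the gradients are scaled by the temperature):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, T)    # soft targets from the teacher
    q = softmax(student_logits, T)
    return -np.sum(p * np.log(q + 1e-12))

teacher = np.array([8.0, 2.0, 1.0])
student = np.array([5.0, 1.5, 0.5])
loss = distillation_loss(student, teacher)
```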


Huffman Encoding for Weights

Lossless compression method that applies the Huffman coding algorithm to model weights. It assigns shorter binary codes to the most frequent weights, reducing file size on disk without affecting inference speed.
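This works best after quantization or weight sharing, when the weights take only a few distinct values with skewed frequencies. A compact sketch of the code construction using Python's standard library (the merging strategy with per-symbol code dictionaries is just one way to implement it):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                               # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)              # two rarest subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Quantized weights: 0 is very common, -1 is rare
weights = [0, 0, 0, 0, 0, 1, 1, -1]
code = huffman_code(weights)    # the common value 0 gets a 1-bit code
```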


Weight Sharing

Compression technique that groups weights into clusters and replaces each weight with the index of its cluster centroid. This reduces the number of bits needed to store each weight and enables the use of lookup tables during inference.
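Clustering is typically done with 1-D k-means over the weight values. A self-contained NumPy sketch (quantile initialization and a fixed number of Lloyd iterations are simplifying assumptions):

```python
import numpy as np

def share_weights(w, n_clusters=4, n_iters=10):
    """1-D k-means over weight values: store per-weight cluster indices
    plus a tiny codebook of centroids."""
    flat = w.ravel()
    centroids = np.quantile(flat, np.linspace(0, 1, n_clusters))   # initial codebook
    for _ in range(n_iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()               # update centroid
    return idx.reshape(w.shape), centroids

w = np.random.randn(8, 8)
indices, codebook = share_weights(w, n_clusters=4)   # 2 bits per weight + codebook
w_hat = codebook[indices]                            # reconstruction via lookup table
```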


Tucker Decomposition

Form of tensor decomposition applied to weight tensors (e.g., the 4-D kernels of convolutional layers) to compress them. It decomposes a tensor into a smaller core tensor and factor matrices, significantly reducing the number of parameters and computational cost.


CP Decomposition (CANDECOMP/PARAFAC)

Tensor decomposition method that expresses a tensor as a sum of rank-one vector products. It is used to compress convolutional layers by approximating the weight tensor with a reduced number of components.


Variable Neural Network (VNN)

Model architecture where the number of active channels in each layer can vary dynamically based on resource constraints. It allows for flexible trade-offs between accuracy and computational cost at runtime.


Blockwise Quantization

Technique that divides weight or activation tensors into smaller blocks and applies independent quantization to each block. It better captures local magnitude variations, reducing overall quantization error.
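The benefit is clearest when magnitudes vary wildly across a tensor: a single global scale would crush the small values to zero. Sketch with a deliberately mixed-magnitude example:

```python
import numpy as np

def blockwise_quantize(x, block_size=4, n_bits=8):
    """Quantize a 1-D tensor in independent blocks, each with its own scale."""
    qmax = 2 ** (n_bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)        # guard against all-zero blocks
    q = np.round(blocks / scales).clip(-qmax, qmax).astype(np.int8)
    return q, scales

# One block of large values, one block of tiny values.
x = np.array([120.0, -80.0, 50.0, -10.0, 0.05, -0.02, 0.01, 0.03])
q, scales = blockwise_quantize(x, block_size=4)
# Each block gets a scale matched to its own magnitude range,
# so the tiny values in the second block survive quantization.
```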


8-bit Floating Point Representation (FP8)

Low-precision data format using 8 bits to represent floating-point numbers, with different variants (E4M3, E5M2) for training and inference. It offers superior trade-offs compared to integer formats for certain AI workloads.


Structured N:M Sparsity

Pruning scheme where, for every block of M weights, exactly N weights are preserved (N < M). This regular pattern is designed to be efficiently accelerated by specialized matrix computation units (Tensor Cores) in modern GPUs.
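For the common 2:4 case, every group of four consecutive weights keeps exactly its two largest-magnitude entries. A NumPy sketch of the masking step (the hardware would store only the surviving values plus compact indices):

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Enforce N:M sparsity: in every block of M consecutive weights,
    keep only the N with the largest magnitude."""
    blocks = w.reshape(-1, m)
    order = np.argsort(np.abs(blocks), axis=1)       # rank weights within each block
    mask = np.zeros_like(blocks, dtype=bool)
    rows = np.arange(blocks.shape[0])[:, None]
    mask[rows, order[:, -n:]] = True                 # keep the top-N per block
    return (blocks * mask).reshape(w.shape)

w = np.array([0.1, -0.9, 0.4, 0.2, 1.5, 0.05, -0.3, 0.7])
w_sparse = nm_prune(w, n=2, m=4)
# -> [0.0, -0.9, 0.4, 0.0, 1.5, 0.0, 0.0, 0.7]
```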
