AI Glossary
A complete dictionary of Artificial Intelligence
Post-Training Quantization (PTQ)
Precision reduction technique applied to an already trained model, without requiring complete retraining. It converts high-precision weights and activations (e.g., FP32) to lower-precision representations (e.g., INT8) to optimize inference.
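A minimal sketch of symmetric per-tensor INT8 PTQ in NumPy (function names are illustrative, not any particular library's API):

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights to INT8 with a single symmetric scale per tensor."""
    scale = np.abs(w).max() / 127.0           # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction used at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # pretrained weights, no retraining
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```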
Quantization-Aware Training (QAT)
Method where quantization and dequantization operations are integrated into the computational graph during training. This allows the model to adapt to precision loss, minimizing performance degradation compared to PTQ.
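The core mechanism is "fake quantization": a quantize-dequantize pair inserted into the forward pass, with a straight-through estimator in the backward pass. A forward-pass sketch in NumPy (names are illustrative; real frameworks wire the gradient through automatically):

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize then dequantize so training sees the rounding error.
    During backprop, frameworks treat round() as identity
    (the straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(3, 3)
w_q = fake_quant(w)            # used in place of w inside the training graph
print(np.abs(w - w_q).max())   # the residual the model learns to tolerate
```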
Binarized Neural Networks (BNN)
Extreme form of quantization where weights and/or activations are each constrained to one of two binary values (+1 or -1). It enables significant computational and memory gains by replacing multiplications with additions and subtractions.
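A sketch of XNOR-Net-style weight binarization, where a scalar alpha preserves the weight magnitude (names are illustrative):

```python
import numpy as np

def binarize(w):
    """Constrain weights to {+1, -1}, keeping a scalar alpha = mean(|w|)."""
    return np.where(w >= 0, 1.0, -1.0), np.abs(w).mean()

x = np.random.randn(8)
w = np.random.randn(8)
wb, alpha = binarize(w)
# With binary weights, a dot product is just signed accumulation of inputs:
approx = alpha * np.sum(np.where(wb > 0, x, -x))
print(approx, "vs full-precision", x @ w)
```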
Structured Pruning
Compression technique that removes entire weight structures, such as filters, channels, or attention heads, rather than individual weights. It is more effective for accelerating computation on modern hardware than unstructured pruning.
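A sketch of L1-norm filter pruning on a convolution weight tensor (illustrative; real pipelines also adjust the next layer's input channels and fine-tune afterwards):

```python
import numpy as np

def prune_filters(w, keep_ratio=0.5):
    """Drop whole filters from a (out_ch, in_ch, kH, kW) tensor, ranked by L1 norm."""
    norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(w.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # strongest filters, in order
    return w[keep], keep   # a smaller *dense* tensor: no sparse kernels needed

w = np.random.randn(16, 3, 3, 3)
w_small, kept = prune_filters(w)
print(w.shape, "->", w_small.shape)
```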
Unstructured Pruning
Compression method that eliminates individual weights in the network, typically those with the smallest magnitude. Although it can reduce model size, it requires hardware or kernels with dedicated sparsity support to translate into faster computation.
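A sketch of global magnitude pruning (names are illustrative):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero the smallest-magnitude weights. The shape is unchanged, so the
    zeros only become speedups on sparsity-aware hardware or kernels."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(64, 64)
w_pruned, mask = magnitude_prune(w)
print("nonzero fraction:", mask.mean())
```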
Low-Rank Matrix Factorization
Compression technique that decomposes a large weight matrix into two or more smaller matrices. It reduces the number of parameters and matrix multiplication operations, thus accelerating dense and convolutional layers.
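A sketch using truncated SVD, a common way to build such a factorization (names are illustrative):

```python
import numpy as np

def low_rank(w, rank):
    """Factor an (m, n) matrix into (m, r) @ (r, n) via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

w = np.random.randn(512, 512)
a, b = low_rank(w, rank=32)
print("params:", w.size, "->", a.size + b.size)   # 262144 -> 32768
# At inference, x @ w is replaced by the cheaper (x @ a) @ b.
```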
Knowledge Distillation
Compression process where a small model (the student) is trained to reproduce the outputs of a larger model (the teacher), typically by matching its temperature-softened class probabilities rather than hard labels. The student retains much of the teacher's accuracy at a fraction of its size and inference cost.
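A NumPy sketch of the soft-target term of the distillation loss from Hinton et al. (2015) (names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between temperature-softened teacher and student outputs.
    The T**2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean() * T**2

teacher = np.random.randn(8, 10)   # teacher logits for a batch
student = np.random.randn(8, 10)   # student logits
print(distillation_loss(student, teacher))
```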
Huffman Encoding for Weights
Lossless compression method that applies the Huffman coding algorithm to model weights. It assigns shorter binary codes to the most frequent weights, reducing file size on disk without affecting inference speed.
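A self-contained sketch using Python's heapq; in practice the codes are built over quantized or clustered weight values, whose distribution is highly skewed:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, [f1 + f2, tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

# Quantized weights after pruning: the value 0 dominates, so it gets the shortest code.
weights = [0, 0, 0, 0, 0, 1, 1, -1, 2]
code = huffman_code(weights)
bits = sum(len(code[v]) for v in weights)
print(code, f"-> {bits} bits vs {8 * len(weights)} bits fixed-width")
```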
Weight Sharing
Compression technique that groups weights into clusters and replaces each weight with the index of its cluster centroid. This reduces the number of bits needed to store each weight and enables the use of lookup tables during inference.
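A sketch with a few Lloyd (k-means) iterations; 16 clusters means one 4-bit index per weight plus a tiny codebook (names are illustrative):

```python
import numpy as np

def share_weights(w, n_clusters=16):
    """Cluster weight values; store per-weight cluster indices plus centroids."""
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(10):                     # Lloyd iterations
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return idx.reshape(w.shape).astype(np.uint8), centroids

w = np.random.randn(8, 8).astype(np.float32)
idx, codebook = share_weights(w)
w_hat = codebook[idx]                        # lookup-table reconstruction
print("reconstruction error:", np.abs(w - w_hat).max())
```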
Tucker Decomposition
Form of tensor decomposition applied to weight tensors (e.g., the 4D kernels of convolutional layers) to compress them. It decomposes a tensor into a smaller core tensor and factor matrices, significantly reducing the number of parameters and the computational cost.
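A sketch assuming the tensorly library with its default NumPy backend (the rank choice here is arbitrary):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

w = np.random.randn(64, 32, 3, 3)        # conv kernel: (out_ch, in_ch, kH, kW)

# Compress the two channel modes; the small spatial modes keep full rank.
core, factors = tucker(tl.tensor(w), rank=[16, 8, 3, 3])

print("params:", w.size, "->", core.size + sum(f.size for f in factors))
w_hat = tl.tucker_to_tensor((core, factors))
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```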
CP Decomposition (CANDECOMP/PARAFAC)
Tensor decomposition method that expresses a tensor as a sum of rank-one vector products. It is used to compress convolutional layers by approximating the weight tensor with a reduced number of components.
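A sketch, again assuming tensorly; R, the number of rank-one components, controls the compression/accuracy trade-off:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

w = np.random.randn(64, 32, 3, 3)            # conv kernel
R = 16
cp = parafac(tl.tensor(w), rank=R)           # w ≈ sum of R rank-one terms
w_hat = tl.cp_to_tensor(cp)

print("params:", w.size, "->", sum(f.size for f in cp.factors))  # R*(64+32+3+3)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```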
Variable Neural Network (VNN)
Model architecture where the number of active channels in each layer can vary dynamically based on resource constraints. It allows for flexible trade-offs between accuracy and computational cost at runtime.
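A minimal sketch of runtime width selection on one dense layer (illustrative only; such networks are normally trained jointly at several widths so that each prefix of channels works on its own):

```python
import numpy as np

def variable_layer(x, w, width=1.0):
    """Use only the first fraction `width` of the layer's output channels."""
    n_active = max(1, int(w.shape[1] * width))
    return x @ w[:, :n_active]

w = np.random.randn(128, 256)
x = np.random.randn(1, 128)
print(variable_layer(x, w, width=1.0).shape)    # full accuracy, full cost
print(variable_layer(x, w, width=0.25).shape)   # cheaper, lower accuracy
```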
Blockwise Quantization
Technique that divides weight or activation tensors into smaller blocks and applies independent quantization to each block. It better captures local magnitude variations, reducing overall quantization error.
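A sketch of per-block INT8 quantization of a flat weight vector (names are illustrative); note how a single outlier only degrades its own block:

```python
import numpy as np

def blockwise_quant_int8(w, block_size=64):
    """One symmetric scale per block instead of per tensor."""
    pad = (-w.size) % block_size
    blocks = np.pad(w, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales, pad

def blockwise_dequant(q, scales, pad):
    w = (q.astype(np.float32) * scales).ravel()
    return w[: w.size - pad] if pad else w

w = np.random.randn(1000).astype(np.float32)
w[3] = 50.0                                   # outlier hurts only block 0
q, s, pad = blockwise_quant_int8(w)
print("max abs error:", np.abs(w - blockwise_dequant(q, s, pad)).max())
```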
8-bit Floating Point Representation (FP8)
Low-precision data format using 8 bits to represent floating-point numbers, with different variants (E4M3, E5M2) suited to training and inference. It offers a better trade-off between dynamic range and precision than integer formats for certain AI workloads.
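A decoder sketch showing why the two variants differ: E4M3 trades range for precision, E5M2 the reverse (special NaN/inf encodings are ignored for brevity; the 448 limit follows the OCP FP8 spec):

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Interpret an 8-bit pattern as sign / exponent / mantissa."""
    sign = -1.0 if byte >> 7 else 1.0
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp_field == 0:                        # subnormal numbers
        return sign * man_field / (1 << man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)

# E4M3: finer mantissa, largest finite value 448
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3))   # 448.0
# E5M2: wider exponent, largest finite value 57344
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2))   # 57344.0
```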
Structured N:M Sparsity
Pruning scheme where, for every block of M weights, exactly N weights are preserved (N < M). This regular pattern is designed to be efficiently accelerated by specialized matrix computation units (Tensor Cores) in modern GPUs.
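A sketch producing a 2:4 pattern along the last axis, the layout accelerated by NVIDIA's sparse tensor cores since Ampere (names are illustrative):

```python
import numpy as np

def n_m_sparsify(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every aligned group of m."""
    groups = w.reshape(-1, m)
    top = np.argsort(np.abs(groups), axis=1)[:, -n:]   # indices of the n kept
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(4, 8)
print(n_m_sparsify(w))   # exactly 2 nonzeros in every block of 4
```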