AI Glossary
A complete dictionary of Artificial Intelligence
Post-Training Quantization (PTQ)
Precision reduction technique applied to an already trained model, without requiring complete retraining. It converts high-precision weights and activations (e.g., FP32) to lower-precision representations (e.g., INT8) to optimize inference.
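A minimal sketch of symmetric per-tensor INT8 PTQ in NumPy (function names are illustrative, not any particular library's API):

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights to INT8 with a single symmetric scale per tensor."""
    scale = np.abs(w).max() / 127.0           # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction used at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # pretrained weights, no retraining
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```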
Quantization-Aware Training (QAT)
Method where quantization and dequantization operations are integrated into the computational graph during training. This allows the model to adapt to precision loss, minimizing performance degradation compared to PTQ.
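The core mechanism is "fake quantization": a quantize-dequantize pair inserted into the forward pass, with a straight-through estimator in the backward pass. A forward-pass sketch in NumPy (names are illustrative; real frameworks wire the gradient through automatically):

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize then dequantize so training sees the rounding error.
    During backprop, frameworks treat round() as identity
    (the straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(3, 3)
w_q = fake_quant(w)            # used in place of w inside the training graph
print(np.abs(w - w_q).max())   # the residual the model learns to tolerate
```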
Binarized Neural Networks (BNN)
Extreme form of quantization where weights and/or activations are each constrained to one of two binary values (+1 or -1). It enables significant computational and memory gains by replacing multiplications with additions and subtractions.
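A sketch of XNOR-Net-style weight binarization, where a scalar alpha preserves the weight magnitude (names are illustrative):

```python
import numpy as np

def binarize(w):
    """Constrain weights to {+1, -1}, keeping a scalar alpha = mean(|w|)."""
    return np.where(w >= 0, 1.0, -1.0), np.abs(w).mean()

x = np.random.randn(8)
w = np.random.randn(8)
wb, alpha = binarize(w)
# With binary weights, a dot product is just signed accumulation of inputs:
approx = alpha * np.sum(np.where(wb > 0, x, -x))
print(approx, "vs full-precision", x @ w)
```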
Structured Pruning
Compression technique that removes entire weight structures, such as filters, channels, or attention heads, rather than individual weights. It is more effective for accelerating computation on modern hardware than unstructured pruning.
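A sketch of L1-norm filter pruning on a convolution weight tensor (illustrative; real pipelines also adjust the next layer's input channels and fine-tune afterwards):

```python
import numpy as np

def prune_filters(w, keep_ratio=0.5):
    """Drop whole filters from a (out_ch, in_ch, kH, kW) tensor, ranked by L1 norm."""
    norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(w.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # strongest filters, in order
    return w[keep], keep   # a smaller *dense* tensor: no sparse kernels needed

w = np.random.randn(16, 3, 3, 3)
w_small, kept = prune_filters(w)
print(w.shape, "->", w_small.shape)
```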
Unstructured Pruning
Compression method that eliminates individual weights in the network, typically those with the smallest magnitude. Although it can reduce model size, it requires hardware or kernels with dedicated sparsity support to translate into faster computation.
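A sketch of global magnitude pruning (names are illustrative):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero the smallest-magnitude weights. The shape is unchanged, so the
    zeros only become speedups on sparsity-aware hardware or kernels."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(64, 64)
w_pruned, mask = magnitude_prune(w)
print("nonzero fraction:", mask.mean())
```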
Low-Rank Matrix Factorization
Compression technique that decomposes a large weight matrix into two or more smaller matrices. It reduces the number of parameters and matrix multiplication operations, thus accelerating dense and convolutional layers.
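A sketch using truncated SVD, a common way to build such a factorization (names are illustrative):

```python
import numpy as np

def low_rank(w, rank):
    """Factor an (m, n) matrix into (m, r) @ (r, n) via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

w = np.random.randn(512, 512)
a, b = low_rank(w, rank=32)
print("params:", w.size, "->", a.size + b.size)   # 262144 -> 32768
# At inference, x @ w is replaced by the cheaper (x @ a) @ b.
```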
Knowledge Distillation
Compression process where a small model (the student) is trained to reproduce the outputs of a larger model (the teacher), typically by matching its temperature-softened class probabilities rather than hard labels. The student retains much of the teacher's accuracy at a fraction of its size and inference cost.
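A NumPy sketch of the soft-target term of the distillation loss from Hinton et al. (2015) (names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between temperature-softened teacher and student outputs.
    The T**2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean() * T**2

teacher = np.random.randn(8, 10)   # teacher logits for a batch
student = np.random.randn(8, 10)   # student logits
print(distillation_loss(student, teacher))
```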
Huffman Encoding for Weights
Lossless compression method that applies the Huffman coding algorithm to model weights. It assigns shorter binary codes to the most frequent weights, reducing file size on disk without affecting inference speed.
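A self-contained sketch using Python's heapq; in practice the codes are built over quantized or clustered weight values, whose distribution is highly skewed:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, [f1 + f2, tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

# Quantized weights after pruning: the value 0 dominates, so it gets the shortest code.
weights = [0, 0, 0, 0, 0, 1, 1, -1, 2]
code = huffman_code(weights)
bits = sum(len(code[v]) for v in weights)
print(code, f"-> {bits} bits vs {8 * len(weights)} bits fixed-width")
```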
Weight Sharing
Compression technique that groups weights into clusters and replaces each weight with the index of its cluster centroid. This reduces the number of bits needed to store each weight and enables the use of lookup tables during inference.
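A sketch with a few Lloyd (k-means) iterations; 16 clusters means one 4-bit index per weight plus a tiny codebook (names are illustrative):

```python
import numpy as np

def share_weights(w, n_clusters=16):
    """Cluster weight values; store per-weight cluster indices plus centroids."""
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(10):                     # Lloyd iterations
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return idx.reshape(w.shape).astype(np.uint8), centroids

w = np.random.randn(8, 8).astype(np.float32)
idx, codebook = share_weights(w)
w_hat = codebook[idx]                        # lookup-table reconstruction
print("reconstruction error:", np.abs(w - w_hat).max())
```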
Tucker Decomposition
Form of tensor decomposition applied to weight tensors (e.g., the 4D kernels of convolutional layers) to compress them. It decomposes a tensor into a smaller core tensor and factor matrices, significantly reducing the number of parameters and the computational cost.
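A sketch assuming the tensorly library with its default NumPy backend (the rank choice here is arbitrary):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

w = np.random.randn(64, 32, 3, 3)        # conv kernel: (out_ch, in_ch, kH, kW)

# Compress the two channel modes; the small spatial modes keep full rank.
core, factors = tucker(tl.tensor(w), rank=[16, 8, 3, 3])

print("params:", w.size, "->", core.size + sum(f.size for f in factors))
w_hat = tl.tucker_to_tensor((core, factors))
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```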
CP Decomposition (CANDECOMP/PARAFAC)
Tensor decomposition method that expresses a tensor as a sum of rank-one vector products. It is used to compress convolutional layers by approximating the weight tensor with a reduced number of components.
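A sketch, again assuming tensorly; R, the number of rank-one components, controls the compression/accuracy trade-off:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

w = np.random.randn(64, 32, 3, 3)            # conv kernel
R = 16
cp = parafac(tl.tensor(w), rank=R)           # w ≈ sum of R rank-one terms
w_hat = tl.cp_to_tensor(cp)

print("params:", w.size, "->", sum(f.size for f in cp.factors))  # R*(64+32+3+3)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```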
Variable Neural Network (VNN)
Model architecture where the number of active channels in each layer can vary dynamically based on resource constraints. It allows for flexible trade-offs between accuracy and computational cost at runtime.
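A minimal sketch of runtime width selection on one dense layer (illustrative only; such networks are normally trained jointly at several widths so that each prefix of channels works on its own):

```python
import numpy as np

def variable_layer(x, w, width=1.0):
    """Use only the first fraction `width` of the layer's output channels."""
    n_active = max(1, int(w.shape[1] * width))
    return x @ w[:, :n_active]

w = np.random.randn(128, 256)
x = np.random.randn(1, 128)
print(variable_layer(x, w, width=1.0).shape)    # full accuracy, full cost
print(variable_layer(x, w, width=0.25).shape)   # cheaper, lower accuracy
```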
Blockwise Quantization
Technique that divides weight or activation tensors into smaller blocks and applies independent quantization to each block. It better captures local magnitude variations, reducing overall quantization error.
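A sketch of per-block INT8 quantization of a flat weight vector (names are illustrative); note how a single outlier only degrades its own block:

```python
import numpy as np

def blockwise_quant_int8(w, block_size=64):
    """One symmetric scale per block instead of per tensor."""
    pad = (-w.size) % block_size
    blocks = np.pad(w, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales, pad

def blockwise_dequant(q, scales, pad):
    w = (q.astype(np.float32) * scales).ravel()
    return w[: w.size - pad] if pad else w

w = np.random.randn(1000).astype(np.float32)
w[3] = 50.0                                   # outlier hurts only block 0
q, s, pad = blockwise_quant_int8(w)
print("max abs error:", np.abs(w - blockwise_dequant(q, s, pad)).max())
```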
8-bit Floating Point Representation (FP8)
Low-precision data format using 8 bits to represent floating-point numbers, with different variants (E4M3, E5M2) suited to training and inference. It offers a better trade-off between dynamic range and precision than integer formats for certain AI workloads.
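A decoder sketch showing why the two variants differ: E4M3 trades range for precision, E5M2 the reverse (special NaN/inf encodings are ignored for brevity; the 448 limit follows the OCP FP8 spec):

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Interpret an 8-bit pattern as sign / exponent / mantissa."""
    sign = -1.0 if byte >> 7 else 1.0
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp_field == 0:                        # subnormal numbers
        return sign * man_field / (1 << man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)

# E4M3: finer mantissa, largest finite value 448
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3))   # 448.0
# E5M2: wider exponent, largest finite value 57344
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2))   # 57344.0
```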
Structured N:M Sparsity
Pruning scheme where, for every block of M weights, exactly N weights are preserved (N < M). This regular pattern is designed to be efficiently accelerated by specialized matrix computation units (Tensor Cores) in modern GPUs.
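A sketch producing a 2:4 pattern along the last axis, the layout accelerated by NVIDIA's sparse tensor cores since Ampere (names are illustrative):

```python
import numpy as np

def n_m_sparsify(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every aligned group of m."""
    groups = w.reshape(-1, m)
    top = np.argsort(np.abs(groups), axis=1)[:, -n:]   # indices of the n kept
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(4, 8)
print(n_m_sparsify(w))   # exactly 2 nonzeros in every block of 4
```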