AI Glossary
The complete dictionary of Artificial Intelligence
FP16 Operations
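Half-precision (16-bit) floating-point calculations, using 1 sign bit, 5 exponent bits and 10 mantissa bits. On Tensor Cores, peak FP16 throughput is several times that of FP32 on standard CUDA cores (about 8x on Volta V100 by NVIDIA's peak figures), and the smaller operands significantly reduce memory bandwidth and energy consumption.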
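As a rough illustration of the storage and precision trade-off, the sketch below round-trips values through IEEE 754 half precision using Python's standard `struct` module (format code `e`). The helper name `to_fp16` is ours and purely illustrative; the code runs on the CPU and says nothing about Tensor Core speed.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Storage: 2 bytes per FP16 value versus 4 bytes per FP32 value.
fp16_bytes = struct.calcsize("<e")
fp32_bytes = struct.calcsize("<f")

# Precision: with a 10-bit mantissa, 0.1 is only stored approximately.
approx = to_fp16(0.1)
rel_error = abs(approx - 0.1) / 0.1

# Range: 65504 is the largest finite FP16 value and survives the round trip.
largest = to_fp16(65504.0)

print(fp16_bytes, fp32_bytes)  # 2 4
print(approx, rel_error)
print(largest)                 # 65504.0
```

The relative error stays below 2^-10, which is what a 10-bit mantissa allows.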
TensorFloat-32 (TF32)
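NVIDIA hybrid numerical format introduced with Ampere Tensor Cores, using 8 exponent bits (like FP32) and 10 mantissa bits (like FP16), for 19 significant bits including the sign. It keeps the dynamic range of FP32 at roughly the precision of FP16, a practical compromise for matrix math in deep learning.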
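The bit layout can be sketched by taking an FP32 value and clearing its 13 lowest mantissa bits, which leaves 1 sign bit, 8 exponent bits and 10 mantissa bits. This is a CPU emulation under a stated assumption: `to_tf32` is a hypothetical helper, and it truncates, whereas the rounding performed by real hardware may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 precision by truncating an FP32 value to 10 mantissa bits.

    Assumption: plain truncation; actual Tensor Core rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign + 8 exponent bits + top 10 of 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision like FP16: 2**-10 above 1.0 is kept, 2**-11 is lost.
kept = to_tf32(1.0 + 2 ** -10)
lost = to_tf32(1.0 + 2 ** -11)

# Range like FP32: 1e38 is far beyond the FP16 maximum (65504) but fine here.
big = to_tf32(1e38)

print(kept, lost, big)
```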
Warp Matrix Multiply-Accumulate (WMMA)
CUDA C++ API (namespace `nvcuda::wmma`) that lets a warp of 32 threads cooperatively load matrix tiles into register-backed fragments, perform matrix multiply-accumulate operations directly on Tensor Cores, and store the result.
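The semantics of one warp-level operation, D = A x B + C on a 16x16x16 tile, can be emulated on the CPU as below. This is only a sketch of what the operation computes: `load_fragment` and `mma` are hypothetical stand-ins loosely modeled on `wmma::load_matrix_sync` and `wmma::mma_sync`, inputs are rounded to FP16, and Python floats (double precision) stand in for the FP32 accumulator.

```python
import struct

TILE = 16  # 16x16x16 is one of the tile shapes offered by the CUDA WMMA API

def to_fp16(x):
    """Round to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def load_fragment(matrix, as_fp16=True):
    """Stand-in for a fragment load: copy one tile, optionally rounding to FP16."""
    return [[to_fp16(v) if as_fp16 else float(v) for v in row] for row in matrix]

def mma(a_frag, b_frag, c_frag):
    """Stand-in for the multiply-accumulate step: return A @ B + C for one tile."""
    return [[c_frag[i][j] + sum(a_frag[i][k] * b_frag[k][j] for k in range(TILE))
             for j in range(TILE)]
            for i in range(TILE)]

# A holds small values exactly representable in FP16; B is the identity matrix.
a = [[((i + k) % 4) * 0.5 for k in range(TILE)] for i in range(TILE)]
b = [[1.0 if k == j else 0.0 for j in range(TILE)] for k in range(TILE)]
c = [[1.0] * TILE for _ in range(TILE)]  # accumulator input, kept in higher precision

d = mma(load_fragment(a), load_fragment(b), load_fragment(c, as_fp16=False))
print(d[0][:4])  # first entries of A @ I + 1
```

Because B is the identity, every element of D equals the matching element of A plus one, which makes the result easy to check by eye.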
CUDA Kernels for Tensor Cores
GPU programs specifically optimized to leverage Tensor Core instructions, using WMMA primitives or high-level libraries for maximum matrix throughput.
Matrix Fragmentation
Technique of partitioning matrices into small fixed-size tiles (fragments), each held collectively in the registers of a warp's threads, so that Tensor Core units can process the tiles in parallel and stay busy.
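A minimal sketch of the tile-level partitioning follows. It only models how a large product is decomposed into tile-sized pieces; the way elements of a single fragment are spread across the 32 threads of a warp is opaque in WMMA and is not modeled here. All function names are illustrative.

```python
def split_into_tiles(matrix, tile):
    """Map (tile_row, tile_col) to the corresponding tile x tile sub-matrix."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows % tile == 0 and cols % tile == 0, "dimensions must be multiples of tile"
    return {
        (r // tile, c // tile): [row[c:c + tile] for row in matrix[r:r + tile]]
        for r in range(0, rows, tile)
        for c in range(0, cols, tile)
    }

def tiled_matmul(a, b, tile):
    """Multiply a (M x K) by b (K x N) one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    a_tiles, b_tiles = split_into_tiles(a, tile), split_into_tiles(b, tile)
    out = [[0.0] * n for _ in range(m)]
    for ti in range(m // tile):          # on a GPU, each (ti, tj) output tile
        for tj in range(n // tile):      # would typically be owned by one warp
            for tk in range(k // tile):  # accumulate partial products along K
                at, bt = a_tiles[(ti, tk)], b_tiles[(tk, tj)]
                for i in range(tile):
                    for j in range(tile):
                        out[ti * tile + i][tj * tile + j] += sum(
                            at[i][x] * bt[x][j] for x in range(tile))
    return out

def naive_matmul(a, b):
    """Reference result without tiling."""
    return [[sum(a[i][x] * b[x][j] for x in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

a = [[float(i + j) for j in range(32)] for i in range(32)]
b = [[float(i - j) for j in range(16)] for i in range(32)]
result = tiled_matmul(a, b, tile=16)
print(result == naive_matmul(a, b))  # True
```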
Tensor Core Utilization
Metric measuring the percentage of cycles where Tensor Cores perform useful calculations, crucial for evaluating optimization effectiveness and identifying bottlenecks.
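As a worked example of the arithmetic, the sketch below computes the metric from two cycle counts and, as a related sanity check, the fraction of peak throughput reached by a GEMM. The counter values, the timing and the 312 TFLOPS peak are illustrative assumptions, not measurements, and the function names are ours.

```python
def tensor_core_utilization(active_cycles: int, elapsed_cycles: int) -> float:
    """Percentage of elapsed cycles in which the Tensor Cores were doing work."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return 100.0 * active_cycles / elapsed_cycles

def fraction_of_peak(m: int, n: int, k: int, seconds: float, peak_tflops: float) -> float:
    """Achieved / peak throughput for an M x N x K GEMM (2*M*N*K operations)."""
    achieved_tflops = 2.0 * m * n * k / seconds / 1e12
    return achieved_tflops / peak_tflops

# Hypothetical profiler counters: 600k active cycles out of 1M elapsed.
util = tensor_core_utilization(600_000, 1_000_000)

# Hypothetical timing: a 4096^3 GEMM in 1 ms on a device with a 312 TFLOPS peak.
frac = fraction_of_peak(4096, 4096, 4096, seconds=1e-3, peak_tflops=312.0)

print(util)            # 60.0
print(round(frac, 3))  # 0.441
```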
INT8 Quantization for Inference
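Conversion of neural network weights and activations to 8-bit integers. On Tensor Cores that support INT8, peak throughput is typically twice that of FP16; the often-quoted "up to 32x" compares peak INT8 Tensor Core throughput with FP32 on standard CUDA cores (as on the A100). Precision degradation is kept under control through calibration or quantization-aware training.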
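One common scheme, shown here as a minimal sketch, is symmetric per-tensor quantization: a single scale maps the largest absolute value to 127. The function names and sample weights are illustrative; production toolchains usually add calibration and per-channel scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # [2, -127, 50, 0, 100]
print(max_error)  # never more than scale / 2
```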
cuBLASLt Tensor Core Library
Lightweight companion library to cuBLAS dedicated to GEMM (General Matrix Multiply) operations, offering high-performance routines optimized for Tensor Cores and a flexible API for choosing matrix layouts, algorithms and mixed-precision formats.
Shared Memory Tiling
Strategy for organizing data in GPU shared memory into optimal tiles for Tensor Core access, minimizing bank conflicts and maximizing bandwidth.
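The effect of tile layout on bank conflicts can be estimated with simple address arithmetic. The sketch below assumes 32 banks of 4-byte words and a warp in which thread t reads element [t][column] of a row-major tile; `conflict_degree` is a hypothetical helper, not a CUDA function.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks on current NVIDIA GPUs, each 4 bytes wide

def conflict_degree(row_stride_words: int, column: int = 0, threads: int = 32) -> int:
    """Worst-case number of threads of a warp that hit the same bank.

    Thread t reads word t * row_stride_words + column of a row-major tile.
    """
    banks = Counter((t * row_stride_words + column) % NUM_BANKS for t in range(threads))
    return max(banks.values())

unpadded = conflict_degree(row_stride_words=32)  # 32 x 32 tile of 4-byte words
padded = conflict_degree(row_stride_words=33)    # same tile, one padding word per row

print(unpadded, padded)  # 32 1
```

Padding each row by one word shifts successive rows to different banks, turning a 32-way conflict on column reads into conflict-free access at the cost of a little shared memory.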
Warp-level Matrix Scheduling
Scheduling of matrix operations at the warp level to maximize Tensor Core pipeline utilization, accounting for latencies and data dependencies.
Tensor Core Register Pressure
Constraint related to the limited number of registers per SM, affecting the ability to parallelize Tensor Core operations and requiring a balance between occupancy and efficient unit utilization.
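The trade-off can be made concrete with a back-of-the-envelope occupancy estimate. The sketch below assumes 65,536 32-bit registers and a limit of 64 resident warps per SM, figures that apply to several recent NVIDIA architectures but should be checked for the target GPU. It deliberately ignores register allocation granularity, shared memory and block-size limits.

```python
def max_resident_warps(regs_per_thread: int, regs_per_sm: int = 65536,
                       max_warps_per_sm: int = 64, warp_size: int = 32) -> int:
    """Upper bound on resident warps per SM imposed by register usage alone."""
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)

def occupancy(regs_per_thread: int, max_warps_per_sm: int = 64) -> float:
    """Resident warps as a fraction of the SM's warp limit (register-bound estimate)."""
    warps = max_resident_warps(regs_per_thread, max_warps_per_sm=max_warps_per_sm)
    return warps / max_warps_per_sm

for regs in (32, 64, 128, 255):
    print(regs, max_resident_warps(regs), occupancy(regs))
```

A kernel that keeps large accumulator fragments in registers may sit at 128 or more registers per thread; in this simplified model that caps occupancy at 25 percent or less, which can still be the right choice if each warp keeps the Tensor Cores busy.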
Deep Learning Benchmarks
Test suites like MLPerf that evaluate Tensor Core optimization performance on real neural network training and inference workloads.
Automatic Mixed Precision (AMP)
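Technique that automatically selects the precision of each operation: eligible operations (such as matrix multiplications and convolutions) run in reduced precision on Tensor Cores, while precision-sensitive operations and a master copy of the weights stay in FP32 for numerical stability, usually combined with loss scaling to preserve small gradients.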
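Two numerical issues that mixed-precision training has to handle can be reproduced on the CPU. In this sketch, `to_fp16` is an illustrative helper built on the standard `struct` module, Python floats (double precision) stand in for FP32, and the scale factor of 1024 is an arbitrary example value.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 1) Loss scaling: a tiny gradient underflows to zero in FP16 ...
grad = 1e-8
unscaled = to_fp16(grad)             # below the smallest FP16 subnormal (about 6e-8)
loss_scale = 1024.0
scaled = to_fp16(grad * loss_scale)  # ... but is representable once scaled up
recovered = scaled / loss_scale      # unscale in higher precision

# 2) Master weights: a small update vanishes if the weight itself lives in FP16.
weight, update = 1.0, 1e-4
fp16_weight = to_fp16(to_fp16(weight) + update)  # rounds back to 1.0
master_weight = weight + update                  # kept in higher precision

print(unscaled, recovered)
print(fp16_weight, master_weight)
```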
Tensor Core Memory Coalescing
Memory access optimization to align with Tensor Core access patterns, grouping transactions into contiguous accesses to maximize throughput.
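A simple way to reason about coalescing is to count how many 32-byte memory sectors a warp touches for a given access pattern. The sketch below is a simplified model with hypothetical helper names; it assumes thread t reads the element at index t * stride and ignores caching effects.

```python
SECTOR_BYTES = 32  # global memory is served in 32-byte sectors on NVIDIA GPUs

def sectors_touched(stride_elems: int, elem_bytes: int = 4,
                    threads: int = 32, base: int = 0) -> int:
    """Distinct sectors touched when thread t reads element t * stride_elems."""
    return len({(base + t * stride_elems * elem_bytes) // SECTOR_BYTES
                for t in range(threads)})

def efficiency(stride_elems: int, elem_bytes: int = 4, threads: int = 32) -> float:
    """Useful bytes divided by bytes actually fetched."""
    fetched = sectors_touched(stride_elems, elem_bytes, threads) * SECTOR_BYTES
    return threads * elem_bytes / fetched

print(sectors_touched(1), efficiency(1))    # 4 1.0    contiguous FP32 reads
print(sectors_touched(32), efficiency(32))  # 32 0.125 one sector per thread
print(sectors_touched(1, elem_bytes=2))     # 2        contiguous FP16 reads
```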
Sparse Matrix Support
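Ampere Tensor Cores' ability to efficiently process matrices with 2:4 fine-grained structured sparsity (at most two nonzero values in every group of four consecutive elements), offering up to 2x the throughput of the equivalent dense operation for neural networks pruned to that pattern.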
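The 2:4 pattern (two values kept in every group of four) can be sketched as a pruning and compression step. The magnitude-based selection used here is one common heuristic, assumed for illustration, and the function names and the values-plus-indices layout are ours rather than the hardware's actual storage format.

```python
def compress_2_of_4(row):
    """Keep the two largest-magnitude values of each group of four.

    Returns the kept values and their positions (0..3) within each group.
    """
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return values, indices

def decompress(values, indices):
    """Rebuild the pruned dense row from the compressed representation."""
    row = [0.0] * (len(values) * 2)
    for n, (v, i) in enumerate(zip(values, indices)):
        row[(n // 2) * 4 + i] = v
    return row

row = [0.1, -0.9, 0.05, 0.7, 0.3, 0.0, -0.2, 0.25]
values, indices = compress_2_of_4(row)
pruned = decompress(values, indices)

print(values)   # [-0.9, 0.7, 0.3, 0.25]
print(indices)  # [1, 3, 0, 3]
print(pruned)   # [0.0, -0.9, 0.0, 0.7, 0.3, 0.0, 0.0, 0.25]
```

The compressed form stores half the values plus a small index per value, which is where the reduction in both memory traffic and multiply-accumulate work comes from.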