AI Terminology
A complete dictionary of Artificial Intelligence
Memory Coalescing
GPU optimization technique in which accesses from adjacent threads in a warp to contiguous memory addresses are combined into a single wide transaction, reducing the number of memory transactions and increasing effective throughput.
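Coalescing itself happens in GPU hardware, but the layout principle behind it can be sketched on the CPU: when consecutive "threads" (here, consecutive loop iterations) touch consecutive addresses, the hardware can serve them with wide transactions. A structure-of-arrays layout produces that pattern; an array-of-structures does not. A minimal illustrative sketch (all names hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: fields of one particle are adjacent, but the
// same field across particles is strided (stride = sizeof(Particle)).
struct Particle { float x, y, z, w; };

// Structure-of-arrays: each field is contiguous across particles, so
// "thread i reads x[i]" touches consecutive addresses -- the access
// pattern GPU hardware can coalesce into a single wide transaction.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Reduction over an AoS layout: every load skips 12 bytes of unused
// data, so only a quarter of each fetched cache line is useful.
float sum_x_aos(const std::vector<Particle>& p) {
    float s = 0.0f;
    for (const auto& e : p) s += e.x;
    return s;
}

// Same reduction over SoA: unit-stride, fully "coalescable" access.
float sum_x_soa(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float v : p.x) s += v;
    return s;
}
```

Both functions compute the same result; only the memory traffic differs, which is exactly the property coalescing exploits.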
Cache Blocking
Data partitioning strategy into cache-sized blocks to maximize local data reuse and minimize cache misses.
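The classic instance is a blocked (tiled) matrix multiply: instead of streaming whole rows and columns through cache repeatedly, each tile is reused while it is still resident. A minimal sketch, with a caller-chosen tile edge `bs` standing in for the cache-derived block size:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked matrix multiply: C += A * B, all n x n, row-major.
// The three outer loops walk over bs x bs tiles; the inner loops
// multiply one tile pair while its data stays hot in cache.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t bs /* tile edge */) {
    for (std::size_t ii = 0; ii < n; ii += bs)
      for (std::size_t kk = 0; kk < n; kk += bs)
        for (std::size_t jj = 0; jj < n; jj += bs)
          // One tile pair; std::min clamps the ragged edge tiles.
          for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
            for (std::size_t k = kk; k < std::min(kk + bs, n); ++k) {
              double a = A[i * n + k];
              for (std::size_t j = jj; j < std::min(jj + bs, n); ++j)
                C[i * n + j] += a * B[k * n + j];
            }
}
```

In practice `bs` is tuned so that three tiles (one each of A, B, C) fit in the targeted cache level.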
NUMA-Aware Allocation
Memory allocation that takes the Non-Uniform Memory Access architecture into account, placing data on the node closest to the cores that access it most frequently in order to reduce access latency.
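On Linux the default "first-touch" policy places a page on the NUMA node of the thread that first writes it, so a common NUMA-aware idiom is to have each worker initialize the slice it will later process. A sketch of that idiom (function and worker-count names hypothetical); real code would allocate the buffer untouched (e.g. via `mmap`) and pin threads to cores, whereas `std::vector` is used here only to keep the example self-contained:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// First-touch NUMA idiom: each worker writes ("touches") its own
// slice, so under Linux's default policy the backing pages land on
// that worker's NUMA node. Production code would also pin threads,
// e.g. with pthread_setaffinity_np, so "its node" stays fixed.
void first_touch_init(std::vector<double>& data, unsigned nworkers) {
    std::size_t chunk = data.size() / nworkers;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < nworkers; ++w) {
        std::size_t lo = w * chunk;
        std::size_t hi = (w + 1 == nworkers) ? data.size() : lo + chunk;
        workers.emplace_back([&data, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                data[i] = 0.0;   // the first write decides the page's node
        });
    }
    for (auto& t : workers) t.join();
}
```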
Memory Pooling
Pre-allocation of a large memory block subdivided into reusable objects, eliminating the overhead of frequent dynamic allocations/deallocations.
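A minimal fixed-size pool can be sketched as one upfront allocation plus a free list, so allocate/release become O(1) pointer operations (class name hypothetical; a production pool would also align each slot, e.g. to `alignof(std::max_align_t)`):

```cpp
#include <cstddef>
#include <vector>

// Fixed-size object pool: one big allocation up front, then
// allocate/release are O(1) pushes/pops on a free list -- no
// per-object malloc/free overhead and no fragmentation.
class FixedPool {
public:
    FixedPool(std::size_t obj_size, std::size_t count)
        : storage_(obj_size * count) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * obj_size);
    }
    void* allocate() {
        if (free_.empty()) return nullptr;   // pool exhausted
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void* p) { free_.push_back(static_cast<char*>(p)); }
    std::size_t available() const { return free_.size(); }
private:
    std::vector<char> storage_;   // the single pre-allocated block
    std::vector<char*> free_;     // stack of free slots
};
```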
Zero-Copy Optimization
Technique allowing operations to access data in place, without intermediate copies between memory spaces, reducing CPU usage and memory-bandwidth consumption.
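One common zero-copy pattern is mapping a file into the address space with POSIX `mmap`, so the process reads kernel page-cache pages directly instead of copying them through a `read()` buffer. A POSIX-only sketch (function name hypothetical; the final `std::string` copy exists only to keep the demo self-contained, real consumers would operate on the mapping in place):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Zero-copy file access on POSIX: mmap exposes the kernel page
// cache directly in our address space, so no read() copy into a
// user buffer is needed before the data can be used.
std::string read_via_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    off_t len = lseek(fd, 0, SEEK_END);
    if (len <= 0) { close(fd); return {}; }
    void* p = mmap(nullptr, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping survives the close
    if (p == MAP_FAILED) return {};
    std::string out(static_cast<const char*>(p), (size_t)len);
    munmap(p, (size_t)len);
    return out;
}
```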
Register Tiling
Use of processor registers to hold small tiles of data during computation, minimizing accesses to the slower levels of the memory hierarchy.
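A small instance of the idea on a dot product: four independent accumulators form the register "tile", so partial sums never round-trip through memory, and the independent chains also expose instruction-level parallelism (function name hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Register-tiled dot product: the four accumulators live in
// registers for the whole loop; each array element is loaded once
// and no intermediate sum is spilled to memory.
double dot_register_tiled(const std::vector<double>& a,
                          const std::vector<double>& b) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // the register "tile"
    std::size_t n = a.size(), i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += a[i] * b[i];     // remainder elements
    return s;
}
```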
Prefetching Instructions
Special instructions that load data into cache ahead of its actual use, hiding memory latency by overlapping computation with memory access.
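Compilers expose these instructions through intrinsics; a sketch using the GCC/Clang `__builtin_prefetch` builtin (x86 also offers `_mm_prefetch`), with a hypothetical prefetch distance that would be tuned per machine:

```cpp
#include <cstddef>
#include <vector>

// Software prefetching: while summing element i, request the
// element DIST iterations ahead, overlapping its memory latency
// with useful work. On compilers without the builtin the hint is
// simply compiled out and the loop is unchanged.
double sum_with_prefetch(const std::vector<double>& v) {
    constexpr std::size_t DIST = 16;   // prefetch distance (tunable)
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + DIST < v.size())
            __builtin_prefetch(&v[i + DIST], /*rw=*/0, /*locality=*/1);
#endif
        s += v[i];
    }
    return s;
}
```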
Memory Footprint Reduction
Set of techniques (quantization, pruning, compression) aimed at reducing the memory size of AI models without significant performance degradation.
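Of these, quantization is the easiest to sketch: symmetric int8 quantization stores 32-bit floats as 8-bit integers plus one shared scale, a roughly 4x smaller tensor at the cost of bounded rounding error (type and function names hypothetical; real schemes add per-channel scales, zero points, and calibration):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric int8 quantization: map [-maxabs, +maxabs] onto
// [-127, 127] with a single scale; dequantized value = q[i] * scale.
struct QuantizedVec {
    std::vector<std::int8_t> q;
    float scale;
};

QuantizedVec quantize_int8(const std::vector<float>& x) {
    float maxabs = 0.0f;
    for (float v : x) maxabs = std::max(maxabs, std::fabs(v));
    float scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
    QuantizedVec out{{}, scale};
    out.q.reserve(x.size());
    for (float v : x)
        out.q.push_back(static_cast<std::int8_t>(std::lround(v / scale)));
    return out;
}

float dequantize_at(const QuantizedVec& qv, std::size_t i) {
    return qv.q[i] * qv.scale;
}
```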
Shared Memory Utilization
Optimization of GPU shared memory usage as a fast and reusable data space between threads of the same block.
Memory Bandwidth Saturation
State in which memory access demand exceeds the capacity of the memory bus, making memory, rather than compute, the bottleneck of overall performance.
Page Migration
Dynamic movement of memory pages between NUMA nodes based on access patterns to optimize data locality.
Memory-Aware Scheduling
Task scheduling that takes memory constraints and access patterns into account to minimize contention and maximize parallelism.
Cache-Oblivious Algorithms
Algorithms designed to perform efficiently on any cache hierarchy without requiring specific cache size parameters.
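The standard example is a divide-and-conquer matrix transpose: it recursively halves the longer dimension, so at some recursion depth the sub-block fits whatever cache exists, achieving locality at every level of the hierarchy without naming a cache size. A sketch (function name and base-case threshold hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Cache-oblivious out-of-place transpose of the [r0,r1) x [c0,c1)
// block of an n x n row-major matrix: split the longer dimension
// until the block is tiny, then transpose it directly. No cache
// parameter appears anywhere -- the defining property of the class.
void transpose_rec(const std::vector<int>& src, std::vector<int>& dst,
                   std::size_t n,
                   std::size_t r0, std::size_t r1,
                   std::size_t c0, std::size_t c1) {
    if (r1 - r0 <= 8 && c1 - c0 <= 8) {          // small base-case block
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                dst[c * n + r] = src[r * n + c];
    } else if (r1 - r0 >= c1 - c0) {             // split the row range
        std::size_t rm = r0 + (r1 - r0) / 2;
        transpose_rec(src, dst, n, r0, rm, c0, c1);
        transpose_rec(src, dst, n, rm, r1, c0, c1);
    } else {                                     // split the column range
        std::size_t cm = c0 + (c1 - c0) / 2;
        transpose_rec(src, dst, n, r0, r1, c0, cm);
        transpose_rec(src, dst, n, r0, r1, cm, c1);
    }
}
```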
Memory Hierarchy Optimization
Overall strategy for placing data across the levels of the memory hierarchy according to access frequency and latency sensitivity.
Tensor Core Memory Layout
Specific organization of tensors in memory to maximize the efficiency of matrix operations on NVIDIA Tensor Cores.
Memory Access Divergence
Phenomenon where threads in a GPU warp access non-contiguous memory addresses, degrading performance through serialization of accesses.
HBM (High Bandwidth Memory) Integration
3D-stacked memory architecture offering superior bandwidth for memory-intensive AI workloads; exploiting it fully requires access patterns tuned to its wide, multi-channel interface.
Memory-Mapped I/O Optimization
Technique that maps device buffers or files into the process address space so they can be accessed directly, avoiding intermediate copies and CPU overhead in AI data pipelines.