CUDA Programming - AI Glossary

Kernel

CUDA function, declared with the __global__ qualifier, that is executed on the GPU by many threads in parallel. A kernel is launched from the CPU (host) with a grid and block configuration that determines how many threads run it.
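
As a minimal sketch of a kernel and its launch, the example below adds two vectors. The names vectorAdd and vector_add_demo are hypothetical, the sizes are arbitrary, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be only partly used
}

// Hypothetical host helper; returns the number of mismatched elements.
int vector_add_demo() {
    const int n = 1000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    const size_t bytes = n * sizeof(float);

    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;  // enough blocks to cover n elements
    vectorAdd<<<grid, block>>>(da, db, dc, n);

    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    int bad = 0;
    for (float v : c) if (v != 3.0f) ++bad;
    return bad;
}
```

Calling vector_add_demo() from main and compiling with nvcc should be enough to try it on a machine with a CUDA-capable GPU.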

Thread

Basic execution unit in CUDA, representing a single sequence of instructions executed on a GPU processor core. Threads are organized into blocks and execute the same code on different data.

Block

Collection of threads that can communicate with each other via shared memory and synchronize their execution. Blocks are organized into a grid; all threads of a given block execute on the same Streaming Multiprocessor (SM), while different blocks may be scheduled on different SMs.

Grid

Set of thread blocks that constitute the complete execution configuration of a CUDA kernel. The grid represents the highest hierarchical structure of thread organization in CUDA.
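
The thread, block, and grid hierarchy can be illustrated with a 2D launch in which every thread computes its own global position. This is a sketch only; writeIndex and grid_index_demo are hypothetical names and the image dimensions are arbitrary.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread derives its 2D position from its block and thread indices
// and stores the row-major linear index of that position.
__global__ void writeIndex(int* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) out[y * width + x] = y * width + x;
}

// Hypothetical host helper; returns the number of wrong entries.
int grid_index_demo() {
    const int width = 100, height = 60;
    std::vector<int> host(width * height, -1);
    const size_t bytes = host.size() * sizeof(int);

    int* dev;
    cudaMalloc(&dev, bytes);

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // blocks needed to cover the array
    writeIndex<<<grid, block>>>(dev, width, height);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    int bad = 0;
    for (int i = 0; i < width * height; ++i) if (host[i] != i) ++bad;
    return bad;
}
```

Because neither dimension is a multiple of 16, some threads fall outside the array, which is why the kernel carries a bounds check.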

Warp

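Group of 32 threads that an SM schedules and executes together in SIMT (Single Instruction Multiple Thread) mode. Threads in a warp are issued the same instruction at a time; when they take different branches, the divergent paths are executed one after another, which reduces efficiency.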
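
Threads of one warp can also exchange register values directly. The sketch below sums the values 1 to 32 across a single warp with the __shfl_down_sync intrinsic (available since CUDA 9); warpSum and warp_sum_demo are hypothetical names, and the launch assumes exactly one full warp.

```cuda
#include <cuda_runtime.h>

// Tree reduction inside one warp using register shuffles.
__global__ void warpSum(int* out) {
    int lane = threadIdx.x;  // launched with exactly 32 threads, so this is the lane id
    int val = lane + 1;      // lanes contribute 1, 2, ..., 32
    for (int offset = 16; offset > 0; offset >>= 1) {
        // Each lane adds the value held by the lane `offset` positions above it.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    if (lane == 0) *out = val;  // lane 0 ends up holding the full sum
}

// Hypothetical host helper; returns the sum computed on the device.
int warp_sum_demo() {
    int* d;
    int h = 0;
    cudaMalloc(&d, sizeof(int));
    warpSum<<<1, 32>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;  // 1 + 2 + ... + 32 = 528
}
```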

Shared Memory

Fast, small-sized memory shared by all threads of the same block, enabling efficient communication between threads. Shared memory is much faster than global memory but limited in size per block.
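
A small sketch of block-level cooperation through shared memory: one block reverses an array by staging it in a __shared__ buffer. The names reverseBlock, kTile, and shared_reverse_demo are hypothetical, and the example assumes the array fits in a single block.

```cuda
#include <cuda_runtime.h>

constexpr int kTile = 64;  // array length and block size for this sketch

__global__ void reverseBlock(int* data) {
    __shared__ int tile[kTile];  // visible to every thread of this block
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();             // all writes to tile must finish before any read
    data[t] = tile[kTile - 1 - t];
}

// Hypothetical host helper; returns the number of wrong entries.
int shared_reverse_demo() {
    int h[kTile];
    for (int i = 0; i < kTile; ++i) h[i] = i;

    int* d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseBlock<<<1, kTile>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (int i = 0; i < kTile; ++i) if (h[i] != kTile - 1 - i) ++bad;
    return bad;
}
```

Without the __syncthreads() barrier, a thread could read a slot of tile that its neighbor has not written yet.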

Global Memory

Main device memory, accessible by all threads and reachable from the CPU through runtime calls such as cudaMemcpy, with large capacity but high latency. Data in global memory persists between kernel launches until it is freed, which makes it the main data storage area.
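
The persistence of global memory across launches can be sketched with two kernels that operate on the same buffer one after the other. The names fillValue, addOne, and global_persist_demo are hypothetical; both launches use the default stream, so they run in order.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void fillValue(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // reads what the previous kernel left in global memory
}

// Hypothetical host helper; returns the number of wrong entries.
int global_persist_demo() {
    const int n = 512;
    int* d;
    cudaMalloc(&d, n * sizeof(int));

    fillValue<<<2, 256>>>(d, n, 41);  // first launch writes the buffer
    addOne<<<2, 256>>>(d, n);         // second launch sees the same data

    std::vector<int> h(n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (int v : h) if (v != 42) ++bad;
    return bad;
}
```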

CUDA Runtime API

High-level programming interface that simplifies CUDA application development by handling initialization, context management, and module loading implicitly. It provides functions such as cudaMalloc, cudaMemcpy, and cudaLaunchKernel.
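
The three functions named above can be combined as in the following sketch, which launches a kernel through cudaLaunchKernel instead of the <<<...>>> syntax and checks each returned cudaError_t. The helper failed and the names scale and runtime_api_demo are hypothetical; for brevity the early-return paths do not free the device buffer.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: reports a runtime error and returns true if one occurred.
static bool failed(cudaError_t err, const char* what) {
    if (err == cudaSuccess) return false;
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return true;
}

// Hypothetical host helper; returns wrong-entry count, or -1 on a runtime error.
int runtime_api_demo() {
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    if (failed(cudaMalloc(&d, sizeof(h)), "cudaMalloc")) return -1;
    if (failed(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice), "copy to device")) return -1;

    // Kernel arguments are passed as an array of pointers to each argument.
    float factor = 3.0f;
    int count = n;
    void* args[] = { &d, &factor, &count };
    if (failed(cudaLaunchKernel(reinterpret_cast<const void*>(scale),
                                dim3(1), dim3(n), args, 0, nullptr),
               "cudaLaunchKernel")) return -1;

    if (failed(cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost), "copy to host")) return -1;
    cudaFree(d);

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 3.0f) ++bad;
    return bad;
}
```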

Stream

Sequence of operations that execute on the GPU in the order they were issued. Operations in different streams may run concurrently, which allows kernels to execute side by side and memory transfers to overlap with computation.
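
As a sketch of how two streams might be used, the example below splits an array in half and gives each half its own copy-in, kernel, and copy-out sequence. The names addScalar and stream_demo are hypothetical; whether the two streams actually overlap depends on the hardware and the workload. Pinned host memory (cudaMallocHost) is used because asynchronous copies generally need it to be truly asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void addScalar(float* data, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += v;
}

// Hypothetical host helper; returns the number of wrong entries.
int stream_demo() {
    const int n = 1 << 16, half = n / 2;
    const size_t halfBytes = half * sizeof(float);

    float* h;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    for (int k = 0; k < 2; ++k) {
        float* hp = h + k * half;
        float* dp = d + k * half;
        // Within stream k these three operations run in issue order.
        cudaMemcpyAsync(dp, hp, halfBytes, cudaMemcpyHostToDevice, s[k]);
        addScalar<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half, float(k + 1));
        cudaMemcpyAsync(hp, dp, halfBytes, cudaMemcpyDeviceToHost, s[k]);
    }
    for (int k = 0; k < 2; ++k) cudaStreamSynchronize(s[k]);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float expected = (i < half) ? 1.0f : 2.0f;
        if (h[i] != expected) ++bad;
    }

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}
```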

Asynchronous Execution

CUDA execution mode where operations return immediately to the CPU without waiting for their completion on the GPU. Asynchronous execution allows overlapping computations and transfers to maximize GPU utilization.
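
One way to observe asynchronous execution is to record an event behind a kernel and poll it from the host without blocking. The sketch below does this with cudaEventRecord and cudaEventQuery; iota and async_demo are hypothetical names, and the polling loop stands in for useful CPU work.

```cuda
#include <cuda_runtime.h>

__global__ void iota(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical host helper; returns the last element written by the kernel.
int async_demo() {
    const int n = 1 << 20;
    int* d;
    cudaMalloc(&d, n * sizeof(int));

    cudaEvent_t done;
    cudaEventCreate(&done);

    iota<<<(n + 255) / 256, 256>>>(d, n);  // control returns to the host right away
    cudaEventRecord(done);                 // marker queued behind the kernel

    // The host could do unrelated work here; this loop only counts polls.
    long polls = 0;
    while (cudaEventQuery(done) == cudaErrorNotReady) ++polls;
    (void)polls;

    int last = -1;
    cudaMemcpy(&last, d + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(done);
    cudaFree(d);
    return last;  // expected to be n - 1
}
```

The number of polls varies from run to run and may be zero if the kernel finishes before the first query.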

Texture Memory

Read-only memory path optimized for accesses with 2D or 3D spatial locality, with automatic data caching. Texture memory is particularly efficient for image processing and for access patterns that are spatially local but do not coalesce well.
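
A minimal sketch of reading a 2D image through the texture object API follows. It assumes point sampling with unnormalized coordinates, so fetching at (x + 0.5, y + 0.5) returns texel (x, y); sampleTexture and texture_demo are hypothetical names and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

__global__ void sampleTexture(cudaTextureObject_t tex, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

// Hypothetical host helper; returns the number of wrong entries.
int texture_demo() {
    const int w = 16, h = 8;
    std::vector<float> in(w * h), out(w * h, -1.0f);
    for (int i = 0; i < w * h; ++i) in[i] = float(i);

    // Textures are commonly backed by a CUDA array.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, in.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;    // no interpolation
    td.readMode = cudaReadModeElementType;  // return the stored float unchanged
    td.normalizedCoords = 0;                // coordinates are in texels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dOut;
    cudaMalloc(&dOut, out.size() * sizeof(float));
    dim3 block(8, 8);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    sampleTexture<<<grid, block>>>(tex, dOut, w, h);
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);

    int bad = 0;
    for (int i = 0; i < w * h; ++i) if (out[i] != in[i]) ++bad;
    return bad;
}
```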

Constant Memory

Read-only memory (64 KB per device) served through a dedicated cache and optimized for broadcast accesses. It is most efficient when all threads in a warp read the same address; reads of different addresses within a warp are serialized.
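
The broadcast pattern can be sketched with polynomial coefficients stored in __constant__ memory: every thread reads the same three coefficients while working on its own input. The names kCoeffs, evalPoly, and constant_demo are hypothetical.

```cuda
#include <cuda_runtime.h>

__constant__ float kCoeffs[3];  // c0, c1, c2, read by every thread

// Evaluates c0 + c1*x + c2*x^2 for each input element.
__global__ void evalPoly(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = kCoeffs[0] + x[i] * (kCoeffs[1] + x[i] * kCoeffs[2]);
}

// Hypothetical host helper; returns the number of wrong entries.
int constant_demo() {
    const int n = 8;
    const float coeffs[3] = {1.0f, 2.0f, 3.0f};
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) hx[i] = float(i);

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(kCoeffs, coeffs, sizeof(coeffs));

    float *dx, *dy;
    cudaMalloc(&dx, sizeof(hx));
    cudaMalloc(&dy, sizeof(hy));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    evalPoly<<<1, n>>>(dx, dy, n);
    cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float expected = 1.0f + 2.0f * i + 3.0f * i * i;  // small integers, exact in float
        if (hy[i] != expected) ++bad;
    }
    return bad;
}
```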

Occupancy

Measure of the ratio between the number of active warps and the maximum number of warps that can be resident on a Streaming Multiprocessor. High occupancy does not necessarily guarantee better performance but helps hide latency.
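
The ratio described above can be estimated at run time with cudaOccupancyMaxActiveBlocksPerMultiprocessor. The sketch below is one way to do it; doubleValues and occupancy_demo are hypothetical names, and the result is a theoretical figure for the given block size, not a measurement.

```cuda
#include <cuda_runtime.h>

__global__ void doubleValues(float* data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Hypothetical helper: theoretical occupancy of doubleValues for a block size.
float occupancy_demo(int blockSize) {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // How many blocks of this kernel can be resident on one SM at once.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, doubleValues,
                                                  blockSize, 0 /* dynamic shared memory */);

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int activeWarps = blocksPerSM * warpsPerBlock;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return float(activeWarps) / float(maxWarps);
}
```

Calling occupancy_demo(256), for example, returns a value between 0 and 1 that depends on the device and on the kernel's register and shared memory usage.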

Atomic Operations

Read-modify-write operations on global or shared memory that execute as a single indivisible step, so that concurrent updates from different threads to the same address are not lost. They are essential for reductions and concurrent data updates.
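
A short sketch of a concurrent update: many threads increment one shared counter with atomicAdd. The names countEven and atomic_demo are hypothetical; a plain `*counter += 1` in the same place would lose updates.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Counts the even values in data; all threads update the same counter.
__global__ void countEven(const int* data, int n, int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] % 2 == 0) atomicAdd(counter, 1);
}

// Hypothetical host helper; returns the count computed on the device.
int atomic_demo() {
    const int n = 1000;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;  // values 0..999, half of them even

    int *dData, *dCounter;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dCounter, sizeof(int));
    cudaMemcpy(dData, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCounter, 0, sizeof(int));

    countEven<<<(n + 255) / 256, 256>>>(dData, n, dCounter);

    int count = 0;
    cudaMemcpy(&count, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dCounter);
    return count;  // 500 for the input above
}
```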

cuBLAS

CUDA Basic Linear Algebra Subroutines library providing optimized GPU implementations for basic linear algebra operations. cuBLAS significantly accelerates matrix and vector computations on NVIDIA architectures.
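
As a small illustration, the sketch below computes y = alpha * x + y with cublasSaxpy. The name saxpy_demo is hypothetical, the vectors are tiny on purpose, and the program would need to be linked against cuBLAS (for example with -lcublas).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical host helper; returns wrong-entry count, or -1 if cuBLAS fails to start.
int saxpy_demo() {
    const int n = 4;
    const float alpha = 2.0f;
    float hx[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hy[n] = {10.0f, 20.0f, 30.0f, 40.0f};

    float *dx, *dy;
    cudaMalloc(&dx, sizeof(hx));
    cudaMalloc(&dy, sizeof(hy));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, sizeof(hy), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(dx);
        cudaFree(dy);
        return -1;
    }

    // y = alpha * x + y, both vectors with stride 1.
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    const float expected[n] = {12.0f, 24.0f, 36.0f, 48.0f};
    int bad = 0;
    for (int i = 0; i < n; ++i) if (hy[i] != expected[i]) ++bad;
    return bad;
}
```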

cuFFT

CUDA Fast Fourier Transform library offering high-performance GPU implementations of discrete Fourier transforms. cuFFT supports 1D, 2D, and 3D transforms in several precisions and for a wide range of sizes.
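
A minimal sketch of a 1D complex-to-complex transform follows. The input is a constant signal, so the unnormalized forward transform should place the whole sum in bin 0 and leave the other bins near zero. The name fft_demo is hypothetical, and the program would need to be linked against cuFFT (for example with -lcufft).

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <cmath>

// Hypothetical host helper; returns wrong-bin count, or -1 if plan creation fails.
int fft_demo() {
    const int n = 8;
    cufftComplex h[n];
    for (int i = 0; i < n; ++i) { h[i].x = 1.0f; h[i].y = 0.0f; }  // constant signal

    cufftComplex* d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(d);
        return -1;
    }
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // in-place forward transform

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d);

    const float tol = 1e-4f;
    int bad = 0;
    // Bin 0 holds the sum of the inputs (n); cuFFT does not normalize.
    if (std::fabs(h[0].x - float(n)) > tol || std::fabs(h[0].y) > tol) ++bad;
    for (int k = 1; k < n; ++k) {
        if (std::fabs(h[k].x) > tol || std::fabs(h[k].y) > tol) ++bad;
    }
    return bad;
}
```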
