GPU Architecture: CUDA Cores, Tensor Cores, and Parallel Execution
- Yatin Taneja

- Mar 9
Graphics processing units are specialized electronic circuits originally designed for the rapid manipulation and alteration of memory to accelerate the creation of images in a frame buffer intended for output to a display device, though their architectural focus has shifted dramatically toward high-throughput parallel computation, particularly in workloads with regular data parallelism such as neural network training. Central processing units optimize for low-latency sequential tasks, relying on complex branch prediction, large caches, and out-of-order execution to minimize the time taken to complete a single thread of execution. Graphics processing units, by contrast, prioritize executing thousands of threads simultaneously through massive hardware parallelism, sacrificing single-thread performance and complex control logic in favor of raw computational throughput and high memory bandwidth. The performance gap between the two processor types for neural network training can reach one to two orders of magnitude due to differences in core count, memory bandwidth, and execution model, making GPUs the de facto standard for deep learning computations.

CUDA cores are the basic processing units in NVIDIA GPUs: each core performs a floating-point or integer operation per clock cycle, effectively acting as a scalar arithmetic logic unit. These cores are organized into Streaming Multiprocessors (SMs), which handle execution, thread scheduling, and instruction dispatch for groups of cores, allowing the GPU to manage the massive number of threads required by modern parallel algorithms.
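To make the throughput gap concrete, a rough peak-FLOPS comparison can be sketched in Python. The core counts and clock speeds below are illustrative round numbers, not the spec sheet of any particular part, and a fused multiply-add is counted as two FLOPs:

```python
def peak_flops(num_cores: int, clock_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Peak throughput in FLOPS, counting a fused multiply-add as 2 FLOPs."""
    return num_cores * clock_ghz * 1e9 * flops_per_core_per_cycle

# Hypothetical GPU: 16,384 CUDA cores at 2.0 GHz (illustrative figures only)
gpu = peak_flops(16_384, 2.0)
# Hypothetical CPU: 32 cores x 16 FP32 SIMD lanes at 3.5 GHz (also illustrative)
cpu = peak_flops(32 * 16, 3.5)
print(f"GPU ~{gpu / 1e12:.1f} TFLOPS, CPU ~{cpu / 1e12:.1f} TFLOPS, ratio {gpu / cpu:.0f}x")
```

Even this crude model, which ignores memory bandwidth entirely, shows a double-digit throughput ratio; real training workloads widen the gap further because GPUs also have far higher memory bandwidth.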

Tensor Cores are specialized units introduced in the Volta architecture to accelerate matrix multiply-accumulate operations, the dominant computation in deep learning layers such as fully connected layers and convolutions. These cores operate on small matrix tiles such as 4x4 blocks and support mixed-precision arithmetic such as FP16 inputs with FP32 accumulation, significantly boosting throughput for linear algebra by performing an entire small matrix multiply-accumulate per clock cycle rather than breaking it down into individual scalar operations.

GPUs execute code using the Single Instruction, Multiple Threads (SIMT) model, in which a single instruction is broadcast to multiple threads operating on different data elements, a design that maximizes computational density by minimizing instruction fetch and decode overhead per thread. Threads are grouped into warps of thirty-two threads on NVIDIA hardware, and all threads in a warp execute the same instruction in lockstep: at any given cycle, every active thread in a warp performs the same operation on potentially different data. Warp divergence occurs when threads within a warp take different control paths due to conditional branching, forcing the hardware to serialize the execution of the different paths and reducing efficiency, because threads following one path sit idle while threads following the other path execute. Warp schedulers dispatch instructions from ready warps to execution units, hiding memory latency by switching to other warps while waiting for data to return from memory, a technique known as latency hiding that is critical for maintaining high utilization despite the high latency of off-chip memory accesses.
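The mixed-precision tile operation can be modeled in plain Python, using the `struct` module's IEEE 754 half-precision format (`'e'`) to mimic FP16 rounding of the inputs while the accumulator stays in higher precision. This is a simplified software model of what a Tensor Core does in hardware, not real GPU code:

```python
import struct

def to_fp16(x: float) -> float:
    # Round to the nearest IEEE 754 half-precision value via struct format 'e'
    return struct.unpack('e', struct.pack('e', x))[0]

def mma_tile(A, B, C, n=4):
    # D = A @ B + C on an n x n tile: inputs rounded to FP16, products
    # accumulated in full precision, mimicking a mixed-precision MMA.
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                        # high-precision accumulator
            for k in range(n):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z4 = [[0.0] * 4 for _ in range(4)]
A = [[0.1 * (i + j + 1) for j in range(4)] for i in range(4)]
D = mma_tile(A, I4, Z4)
# Inputs lose precision to FP16 rounding (0.1 is not exactly representable
# in half precision), but the accumulation itself introduces no extra rounding.
print(D[0][0], "vs", A[0][0])
```

Multiplying by the identity exposes the input rounding: the result equals the FP16-rounded input, which is why accumulating in FP32 matters for training stability.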
Efficient warp scheduling requires enough active warps to keep execution units busy, a property captured by occupancy: the number of warps resident on a Streaming Multiprocessor relative to the maximum it supports, limited by resource constraints such as registers and shared memory per thread block. Low occupancy or a poor instruction mix leads to underutilization of CUDA and Tensor Cores because there are not enough eligible warps to cover pipeline stalls caused by memory dependencies or instruction latency. The GPU memory hierarchy includes global memory, with high capacity and high latency; shared memory, a low-latency user-managed cache; local memory for register spills; constant memory for read-only data; and texture memory for spatially local access patterns. Shared memory is partitioned into banks, and simultaneous accesses to the same bank cause bank conflicts that serialize otherwise parallel memory operations, because the hardware can service only one request per bank per cycle. Proper memory access patterns, such as coalesced global memory reads and conflict-free shared memory usage, remain critical for achieving peak bandwidth: uncoalesced accesses turn a single warp request into multiple memory transactions, drastically reducing effective bandwidth. CUDA cooperative groups extend traditional thread-block synchronization by allowing flexible grouping of threads across blocks or the entire grid, letting programmers define synchronization granularities that fit the algorithm rather than being restricted to the block level.
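The resource-limited occupancy calculation can be estimated with a simple model. The per-SM limits below are illustrative Hopper-like assumptions, not exact specs, and the model ignores the register allocation granularity that real occupancy calculators account for:

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              regs_per_sm=65536, smem_per_sm=100 * 1024,
              max_warps_per_sm=64, max_blocks_per_sm=32, warp_size=32):
    """Estimate occupancy (resident warps / max warps) on one SM.

    Simplified model with hypothetical per-SM limits; real calculators
    round register and shared-memory allocations to hardware granularity.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceil division
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    blocks = min(by_regs, by_smem, max_blocks_per_sm,
                 max_warps_per_sm // warps_per_block)
    return blocks * warps_per_block / max_warps_per_sm

# 256 threads/block, 64 registers/thread, 16 KiB shared memory/block:
# registers become the limiting resource in this model.
print(f"{occupancy(64, 16 * 1024, 256):.0%}")
```

The useful insight is that whichever resource runs out first caps the resident warp count, so halving register usage can double occupancy even when shared memory is plentiful.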
The flexibility of cooperative groups enables more complex communication patterns and adaptive load balancing, improving performance for irregular algorithms where the workload distribution may not be uniform across threads. Cooperative groups require explicit programmer management but reduce reliance on global synchronization barriers, allowing more efficient fine-grained synchronization between smaller subsets of threads. Neural network training involves repeated matrix multiplications, convolutions, and activation functions, all highly parallelizable and amenable to GPU acceleration because these operations map naturally onto the SIMT execution model and the data-parallel architecture of the GPU. CPUs lack the parallel execution units and memory bandwidth needed to sustain the computational intensity of large-scale training, forcing them to spend far more time processing the billions of floating-point operations required to train modern models. GPUs achieve higher arithmetic intensity by combining wide SIMT execution with high-bandwidth memory such as HBM3e, which provides the data throughput required to keep thousands of compute units fed with operands. NVIDIA dominates the GPU market for AI workloads with its CUDA ecosystem, including libraries like cuBLAS and cuDNN that provide highly optimized implementations of standard linear algebra and deep learning routines, compilers like nvcc, and profiling tools that let developers extract maximum performance from the hardware.
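The relationship between arithmetic intensity and memory bandwidth is captured by the roofline model: attainable throughput is the lesser of peak compute and bandwidth times intensity. A minimal sketch with hypothetical figures (1,000 TFLOPS peak, 3.35 TB/s of HBM bandwidth, both made up for illustration):

```python
def attainable_tflops(arith_intensity, peak_tflops, mem_bw_tb_s):
    """Roofline model: attainable = min(peak, bandwidth * intensity).

    arith_intensity is FLOPs per byte moved; bandwidth is in TB/s, so
    bandwidth * intensity comes out directly in TFLOPS.
    """
    return min(peak_tflops, mem_bw_tb_s * arith_intensity)

# Hypothetical accelerator: 1000 TFLOPS peak, 3.35 TB/s memory bandwidth
for op, ai in [("elementwise add (FP16)", 0.25), ("large GEMM", 300.0)]:
    t = attainable_tflops(ai, 1000.0, 3.35)
    bound = "memory-bound" if t < 1000.0 else "compute-bound"
    print(f"{op}: {t:.2f} TFLOPS attainable ({bound})")
```

Elementwise operations, with a fraction of a FLOP per byte, cannot exceed a tiny slice of peak no matter how many cores are available, which is why kernel fusion and matrix-heavy workloads matter so much on GPUs.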
AMD offers competing GPUs with the ROCm software stack, which aims to provide an open-source alternative to CUDA, while Google's TPUs and other domain-specific accelerators target similar workloads using systolic arrays that trade programmability for extreme efficiency on matrix multiplication. Current commercial deployments include cloud-based GPU instances such as AWS p5 and Azure NDv4, which provide on-demand access to high-performance compute; on-premise AI clusters built by large technology companies for proprietary model training; and edge inference devices that bring GPU acceleration to power-constrained environments. Performance benchmarks show NVIDIA H100 GPUs delivering nearly two petaflops of FP16 throughput when using Tensor Cores with sparsity enabled, a figure that highlights the computational capability concentrated in a single device. Training a model like GPT-3 would take months on CPU clusters, whereas modern GPU clusters complete the task in weeks, demonstrating the practical impact of architectural specialization on research and development cycles in artificial intelligence. Dominant GPU architectures like NVIDIA Hopper integrate thousands of CUDA cores, hundreds of Tensor Cores, and high-bandwidth memory interfaces into a single package, connected through advanced interconnects like NVLink that allow multiple GPUs to function as a single coherent entity. Emerging challengers include open-source RISC-V based accelerators that leverage open instruction sets to reduce licensing costs, photonic computing prototypes that use light instead of electricity to cut power consumption and heat, and analog in-memory computing chips that perform computation directly within the memory array to eliminate the von Neumann bottleneck.

Most alternatives remain in research or niche deployment because they lack software ecosystems comparable to CUDA, the manufacturing maturity required to produce chips at the scale of TSMC, or the flexibility to support the wide range of neural network architectures currently in use. GPU production depends on advanced semiconductor nodes like TSMC 4N, which allow billions of transistors to be fabricated in a small area; rare materials such as high-purity silicon and specialty lithography chemicals; and precision EUV lithography equipment to etch nanometer-scale features onto wafers. Supply chain concentration in East Asia creates geopolitical risk, as the majority of advanced semiconductor fabrication capacity is located in Taiwan and South Korea, leaving the global supply chain vulnerable to regional disruptions. Export controls affect global access to high-end GPUs by restricting the sale of advanced components to certain nations, reflecting the strategic importance of compute power in national security and economic competitiveness. Foundry capacity constraints limit rapid scaling of GPU supply despite surging demand from hyperscalers and enterprises building out AI infrastructure, leading to long lead times and high costs for high-performance accelerators. NVIDIA holds over ninety percent market share in data center GPUs, reinforced by CUDA's entrenched developer base and compatibility across hardware generations, which creates a high barrier to entry for competitors attempting to displace the established standard.
AMD and Intel invest heavily in AI accelerators like MI300 and Gaudi, yet they face challenges in software maturity and ecosystem integration relative to NVIDIA's decades-long head start in building developer tools and libraries. Cloud providers develop custom silicon such as AWS Trainium and Google TPU to reduce reliance on external vendors and improve cost-performance for their specific internal workloads, which often differ from the general-purpose workloads targeted by merchant silicon. Export controls on high-performance GPUs reflect strategic competition in AI capabilities between major economic powers seeking technological advantage while preventing adversaries from accessing tools that could be used for military or surveillance applications. Academic research informs GPU architecture improvements such as sparsity support and new data types like FP8, often in collaboration with industry labs. Industry in turn provides real-world workloads and funding, accelerating the translation of academic ideas into commercial products by validating architectural choices against the demands of large-scale model training. Open standards like SYCL and oneAPI aim to reduce vendor lock-in by providing a unified programming model across accelerator architectures, but have yet to displace CUDA in practice due to inertia in the developer community and the maturity of NVIDIA's proprietary tooling.
Software stacks must evolve to exploit Tensor Cores via automatic mixed-precision training and to manage memory hierarchies efficiently, maximizing hardware utilization without requiring developers to hand-tune every kernel for every GPU architecture. Frameworks like PyTorch and TensorFlow abstract low-level details yet still require tuning for optimal GPU utilization, particularly around data loading strategies, kernel fusion to reduce memory traffic, and efficient use of Tensor Cores for specific layer types. Regulatory scrutiny over AI compute concentration may lead to requirements for transparency, auditability, or equitable access to high-performance computing resources to prevent monopolistic control over the infrastructure needed for AI advancement. GPU-driven AI automation displaces jobs in routine cognitive tasks while creating demand for AI engineers, data curators, and infrastructure specialists capable of designing, deploying, and maintaining these complex systems at scale. New business models are forming around GPU-as-a-service, model marketplaces where pre-trained models are commoditized, and automated machine learning platforms that abstract away the complexity of training on distributed GPU clusters. Energy consumption of large GPU clusters raises sustainability concerns, driving interest in efficiency metrics and green AI practices that seek to minimize the carbon footprint of training ever-larger neural networks.
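Why automatic mixed-precision training needs loss scaling can be seen with a pure-Python model of FP16 rounding: tiny gradient values underflow to zero in half precision unless scaled up before the cast and unscaled afterward in full precision. The gradient magnitude and scale factor below are illustrative:

```python
import struct

def to_fp16(x: float) -> float:
    # Round to the nearest IEEE 754 half-precision value via struct format 'e'
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                 # a tiny gradient, below FP16's subnormal range
scale = 2.0 ** 16           # power-of-two loss scale, so unscaling is exact

lost = to_fp16(grad)                  # underflows to 0.0 in half precision
kept = to_fp16(grad * scale) / scale  # scaled value survives; unscale in FP32
print(lost, kept)
```

Frameworks automate exactly this pattern (scaling the loss before the backward pass, then unscaling gradients before the optimizer step), typically with dynamic adjustment of the scale to avoid overflow as well.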
Traditional key performance indicators like floating-point operations per second (FLOPS) and latency prove insufficient for characterizing system performance in real-world scenarios, leading to new metrics, including energy per training step and memory bandwidth utilization, that better reflect the constraints of large-scale deployment. Model convergence time and cost-per-inference become critical for economic viability in commercial applications where profit margins depend heavily on the operational expenditure of running inference services at scale. Benchmark suites like MLPerf standardize evaluation across hardware platforms but may not capture real-world deployment constraints such as network latency in distributed training or data preprocessing overheads that can dominate total runtime. Future innovations include 3D-stacked memory like HBM4, which brings memory closer to the compute dies to increase bandwidth and reduce power consumption; chiplet-based GPU designs that allow modular scaling of compute and memory resources; support for lower-precision formats like FP8 to increase throughput further; and hardware-aware neural architecture search that designs models tailored to the strengths of the underlying hardware. Combining sparsity exploitation, activation rematerialization (recomputing values instead of storing them to reduce memory footprint), and in-network computing (processing data as it moves between nodes) may further improve efficiency by reducing the volume of data movement across the system. Optical interconnects and near-memory processing could alleviate memory limitations at scale by replacing electrical signals with light for data transfer and performing computation directly within or adjacent to memory arrays to minimize data-movement energy costs.
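Deployment-oriented metrics such as energy per training step and cost-per-inference are straightforward to compute; a sketch with hypothetical figures (the cluster power, step time, GPU-hour price, and throughput below are invented for illustration):

```python
def energy_per_step_j(cluster_power_kw: float, step_time_s: float) -> float:
    """Energy consumed per training step, in joules (power x time)."""
    return cluster_power_kw * 1000.0 * step_time_s

def cost_per_inference(gpu_hour_usd: float, inferences_per_sec: float) -> float:
    """Amortized dollar cost of one inference on a single GPU."""
    return gpu_hour_usd / (inferences_per_sec * 3600.0)

# Hypothetical: 700 kW cluster at 2 s/step; $2/GPU-hour at 50 inferences/s
print(f"{energy_per_step_j(700, 2.0):.0f} J per training step")
print(f"${cost_per_inference(2.0, 50):.6f} per inference")
```

Tracking these alongside raw FLOPS makes the trade-offs visible: a faster but hungrier configuration can lose on energy per step, and inference economics hinge on throughput far more than on peak compute.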

Physical limits include transistor scaling slowdowns as atomic dimensions are approached, thermal density challenges that make removing heat from a concentrated area physically difficult, and signal integrity at high frequencies, which limits the speed at which data can be transmitted across a chip or board. Workarounds involve architectural specialization, where general-purpose logic is replaced by fixed-function units for common tasks; advanced packaging like CoWoS, which tightly interconnects different dies on a silicon interposer; and algorithmic co-design, where algorithms are modified to fit hardware constraints rather than only optimizing hardware for algorithms. Power delivery and cooling become dominant constraints in large-scale deployments as the power density of GPU clusters increases, necessitating liquid cooling and sophisticated power management firmware to ensure stable operation within thermal design power limits. GPU architecture exemplifies the shift from general-purpose computing to domain-specific acceleration, driven by application demands in artificial intelligence where general-purpose processors cannot deliver the necessary efficiency for the dominant workloads. The success of CUDA and Tensor Cores demonstrates that hardware-software co-design is essential for breakthrough performance in AI, as raw transistor counts matter less than the ability to effectively utilize those transistors for the specific mathematical operations required by neural networks.
Superintelligence systems will require zettascale training infrastructure, likely built from millions of interconnected GPU-like accelerators working in concert to train models orders of magnitude larger than current state-of-the-art systems. Such systems may demand novel execution models beyond SIMT, such as adaptive threading, where hardware dynamically adjusts granularity based on workload characteristics, or symbolic-neural hybrid computation that combines the pattern recognition strengths of neural networks with the reasoning capabilities of symbolic logic. Memory and communication overheads could dominate unless architectures integrate reasoning, memory, and learning in unified substrates that treat computation and memory as a single entity rather than separate components connected by buses. Superintelligence may utilize GPU-derived architectures not just for training but for real-time inference, for world modeling that requires continuous updates to an internal representation of reality, and for recursive self-improvement cycles in which the system modifies its own architecture or codebase. Hardware must support extreme reliability, fault tolerance, and interpretability, features not currently prioritized in standard GPU designs, which focus primarily on throughput and assume occasional errors can be corrected or tolerated without catastrophic failure. Co-design between AI algorithms and hardware will become mandatory, with architectures evolving in lockstep with cognitive capabilities to support superintelligent systems whose computational substrates may differ fundamentally from the GPUs used for graphics or early deep learning research.



