Tensor Processing Units: Google's Custom AI Accelerators

Yatin Taneja
Mar 9

The rapid expansion of deep learning workloads in the early 2010s exposed the limits of general-purpose processors when faced with the computational intensity of modern neural networks. General-purpose CPUs excelled at sequential logic and complex control flow, yet lacked the raw arithmetic throughput needed for the billions of multiply-accumulate operations inherent in training deep models. Graphics processing units offered higher throughput through massive parallelism, yet their architecture remained optimized for rasterization and texture mapping, leading to inefficiencies when executing the dense linear algebra kernels that dominate machine learning algorithms. Google's internal analysis showed that training large neural networks required orders of magnitude more compute than CPUs or GPUs could efficiently deliver, creating a significant economic and operational barrier to scaling models to the sizes necessary for human-level language understanding and image recognition. This computational gap drove the investigation into domain-specific architectures capable of delivering better performance per watt on tensor operations than existing off-the-shelf components. Academic research on domain-specific architectures had demonstrated the potential of specialized hardware for neural network inference and training, suggesting that removing logic dedicated to general-purpose tasks could free up area and power for mathematical computation.

Google initiated the TPU project in 2013 to design a custom accelerator optimized for TensorFlow workloads, aiming to address the specific needs of the company’s rapidly growing AI services. The design prioritizes throughput for dense matrix multiplication, the dominant operation in neural networks, rather than optimizing for the low latency or complex branching found in traditional computing workloads. Engineers minimized data movement by keeping operands close to compute units, recognizing that shuttling data between memory and arithmetic logic units consumes significantly more energy than performing the computation itself. The architecture accepts lower numerical precision to reduce hardware complexity and power consumption, since neural network training algorithms tolerate a high degree of numerical noise. Designers aimed for deterministic and predictable execution to simplify compiler and runtime control, so that performance characteristics remain consistent across executions, which aids debugging and system optimization. This focus on determinism contrasts with general-purpose computing environments, where thread scheduling and cache coherence introduce run-to-run variability.

The TPU utilizes a systolic array-based matrix multiply unit fed by a unified buffer and controlled by a host CPU, establishing a clear division of labor where the host manages the overall flow while the accelerator handles the heavy mathematical lifting. Data flows through the matrix multiply unit in a 2D grid where each processing element performs multiply-accumulate operations, passing results rhythmically to neighboring elements to minimize the need for immediate access to the main memory hierarchy. This rhythmic data flow mimics the pumping action of a heart, pushing data through the computation engine at a steady rate to maximize utilization. The weight-stationary data flow ensures weights are preloaded and reused across multiple input activations, drastically reducing the bandwidth requirements for fetching model parameters from external memory. By keeping the weights stationary within the processing elements and streaming the activation data through the array, the architecture minimizes the energy expended on weight retrieval, which constitutes a major portion of the energy consumption in deep learning inference and training. The host issues high-level instructions while the TPU executes them autonomously with minimal intervention, allowing the host CPU to manage other tasks or oversee multiple accelerators simultaneously.
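The weight-stationary data flow described above can be illustrated with a small cycle-by-cycle simulation. This is a minimal textbook-style sketch, not Google's actual hardware design: the function name `systolic_matmul`, the register layout, and the input skewing scheme are all assumptions made for illustration. Each processing element holds one weight, activations enter from the left, and partial sums flow downward.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N weight-stationary array.

    X is an M x K list of activation rows; W is a K x N list of weights.
    PE (k, n) permanently holds W[k][n]. Returns (Y, cycles).
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation register of each PE
    psum = [[0] * N for _ in range(K)]   # partial-sum register of each PE
    Y = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2               # cycles until the last output drains

    for t in range(cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Skewed input: row k receives X[m][k] at cycle m + k.
                    m = t - k
                    a = X[m][k] if 0 <= m < M else 0
                else:
                    a = act[k][n - 1]                   # from the left neighbor
                above = psum[k - 1][n] if k > 0 else 0  # from the PE above
                new_act[k][n] = a
                new_psum[k][n] = above + W[k][n] * a    # one MAC per PE per cycle
        act, psum = new_act, new_psum
        # The bottom PE of column n completes output row m = t - n - (K - 1).
        for n in range(N):
            m = t - n - (K - 1)
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y, cycles


X = [[1, 2], [3, 4], [5, 6]]   # three activation rows
W = [[1, 0, 2], [0, 1, 3]]     # weights stay fixed in the array
Y, cycles = systolic_matmul(X, W)
print(Y, cycles)  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]] 6
```

In this sketch the weights are read once and reused for every activation row, while each value only ever moves to an adjacent element, which is the property the text credits for the bandwidth and energy savings.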

A systolic array consists of a grid of simple processing elements that pass data rhythmically in fixed directions, creating a highly predictable pattern of data movement that allows for efficient pipelining and resource utilization. This structure enables high bandwidth and low latency for matrix operations because data moves between adjacent processing elements rather than traversing a global bus, which significantly reduces contention. The Matrix Multiply Unit is the dedicated hardware block that computes large matrix products, acting as the computational heart of the TPU. Later generations changed the size and number of these units per chip to raise the throughput per clock cycle. Reduced precision means using fewer bits per number to increase compute density and reduce memory footprint, allowing more mathematical operations to occur within the same silicon area and power budget. Bfloat16 is a 16-bit floating-point format with the same exponent range as FP32: it uses eight bits for the exponent and seven bits for the mantissa, compared to the 23 mantissa bits used in FP32.
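The bfloat16 layout can be made concrete with a short conversion routine. The sketch below uses only the standard library; the helper names (`fp32_to_bf16_bits`, `bf16_bits_to_float`, `bf16_fields`) are hypothetical, and round-to-nearest-even is one common rounding choice assumed here, not necessarily what any particular TPU generation implements.

```python
import math
import struct


def fp32_to_bf16_bits(x):
    """Round a float to bfloat16 and return the 16-bit pattern.

    Assumes x fits in FP32. Rounds to nearest-even on the 16 dropped bits.
    """
    if x != x:                       # NaN: return a canonical quiet NaN
        return 0x7FC0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    lsb = (bits >> 16) & 1           # lowest bit that survives the cut
    bits += 0x7FFF + lsb             # rounding bias for nearest-even
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Widen a bfloat16 pattern to FP32 by appending 16 zero bits (exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def bf16_fields(b):
    """Split a pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


pi_bits = fp32_to_bf16_bits(math.pi)
print(hex(pi_bits), bf16_fields(pi_bits), bf16_bits_to_float(pi_bits))
# 0x4049 (0, 128, 73) 3.140625
```

Because bfloat16 is simply the upper half of an FP32 word, widening back to FP32 is exact, and narrowing costs only mantissa precision (here, pi becomes 3.140625).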

This format enables direct casting without loss of dynamic range, ensuring that numbers do not overflow or underflow during the conversion process, which is critical for maintaining training stability in deep networks. Preserving the exponent range means that Bfloat16 can represent numbers as large and as small as FP32 can, while the reduced mantissa precision still provides sufficient granularity for gradient updates during backpropagation. TPU Pods consist of interconnected clusters of TPUs used for large-scale model training, allowing researchers to treat thousands of individual chips as a single coherent accelerator. A TPU v4 Pod consists of 4096 chips interconnected via a high-speed fabric, creating a system capable of exaflops of computing power dedicated to machine learning workloads. Google decided in 2014 to skip GPU compatibility and build a clean-slate ASIC for TensorFlow, freeing the design team from the constraints of legacy graphics pipelines or instruction sets. The adoption of bfloat16 in TPU v2 occurred after internal studies showed it preserved model accuracy while doubling throughput over FP32, validating the hypothesis that reduced numerical precision could accelerate training without sacrificing the quality of the resulting models.
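The range argument at the start of this passage can be checked by comparing bfloat16 with IEEE half precision (FP16), which spends its bits on mantissa rather than exponent. This is a small sketch under stated assumptions: `to_fp16` and `to_bf16` are hypothetical helpers, bfloat16 is modeled by plain truncation of an FP32 value, and the sample values are chosen purely for illustration.

```python
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:            # struct refuses values beyond FP16 range
        return float("inf")


def to_bf16(x):
    """Model bfloat16 by zeroing the low 16 bits of the FP32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny_gradient = 1e-10    # illustrative small gradient value
large_activation = 1e5   # illustrative large activation value

print(to_fp16(tiny_gradient), to_bf16(tiny_gradient))        # FP16 flushes to 0.0
print(to_fp16(large_activation), to_bf16(large_activation))  # FP16 overflows to inf
```

In this sketch both values survive in bfloat16 with under one percent relative error, while FP16 loses them entirely, which is why FP16 training typically needs loss scaling and bfloat16 usually does not.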

The architecture shifted from single-chip inference in TPU v1 to multi-chip training systems in TPU v2 and later versions, necessitating the development of sophisticated high-speed interconnects to synchronize gradients across multiple devices. TPU v4 introduced optical interconnects in pods to overcome electrical I/O constraints in large deployments, using light to transmit data between racks with lower latency and higher bandwidth than traditional copper cabling. Optical switching allows the fabric topology to be reconfigured on the fly, tailoring the communication patterns to different neural network architectures without requiring physical changes to the data center wiring. TPU v5 further improves interconnect bandwidth and memory capacity, continuing the trend of increasing data movement capabilities to keep pace with the growing size of neural network parameters. Power and thermal limits cap clock frequency and die area, imposing physical constraints on how fast the processing elements can operate and how many can fit on a single piece of silicon. As transistor sizes shrink, leakage current increases, making it difficult to raise clock speeds without exceeding the thermal design power limits of the package.

TPUs run at modest clock speeds to stay within thermal design power limits, prioritizing throughput through massive parallelism rather than single-thread performance. This approach contrasts with the race for higher gigahertz in general-purpose CPUs, acknowledging that Dennard scaling has ended and that performance gains must now come from architectural efficiency. Memory bandwidth remains a primary bottleneck in computing systems, often referred to as the memory wall, where the speed at which data can be supplied to the processor lags behind the speed at which the processor can perform operations. High Bandwidth Memory, adopted in v2 and later versions, keeps the Matrix Multiply Unit fed by placing stacks of memory dies next to the processor inside the same package, providing a wide data bus and high transfer rates. This in-package arrangement shortens the distance data must travel, increasing bandwidth and reducing latency compared to traditional off-package memory modules. Scaling beyond single pods requires low-latency and high-bandwidth inter-chip links to ensure that all processors remain synchronized during the training process.
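The memory wall is often reasoned about with a roofline estimate: attainable throughput is the smaller of the compute peak and memory bandwidth multiplied by arithmetic intensity. The sketch below is illustrative only. The 275 TFLOPS peak is the TPU v4 figure quoted in this article; the 1.2 TB/s bandwidth is an assumed round number, and the function names are hypothetical.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each matrix moved once."""
    flops = 2 * m * k * n                                   # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # inputs + output
    return flops / bytes_moved


def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


PEAK_TFLOPS = 275.0       # per-chip bfloat16 peak quoted in the article
HBM_TB_PER_S = 1.2        # assumed bandwidth, for illustration only

batch_1 = matmul_intensity(1, 4096, 4096)       # one input vector
batch_4096 = matmul_intensity(4096, 4096, 4096) # large batch

print(f"batch 1:    {attainable_tflops(PEAK_TFLOPS, HBM_TB_PER_S, batch_1):.1f} TFLOPS")
print(f"batch 4096: {attainable_tflops(PEAK_TFLOPS, HBM_TB_PER_S, batch_4096):.1f} TFLOPS")
```

Under these assumptions a batch-of-one matrix-vector product is limited to roughly the memory bandwidth figure, far below the compute peak, while a large batched matmul reuses each weight enough to become compute-bound.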

As models grow into the trillions of parameters, the communication overhead associated with synchronizing gradients across thousands of devices can dominate the training time if the interconnect fabric lacks sufficient bandwidth. Fabrication cost and yield constrain die size, making it economically infeasible to create a single monolithic chip large enough to hold the entire computational engine required for frontier AI models. Consequently, the system relies on advanced packaging techniques to combine multiple dies into a single functional unit, balancing the cost of manufacturing smaller chips against the performance benefits of a unified accelerator. The Matrix Multiply Unit occupied roughly one quarter of the die area in the first-generation TPU, highlighting the central importance of matrix multiplication in the overall architecture while leaving space for memory controllers, interconnects, and I/O logic. GPUs were rejected due to overhead from graphics pipelines and inefficient control logic for static graphs, as GPUs include substantial silicon area dedicated to texture units, geometry processing, and rasterization that provide no benefit for deep learning workloads. Additionally, GPUs rely on complex caching hierarchies and dynamic thread scheduling, which are less effective for the predictable memory access patterns of neural network computations than the explicit data management of a systolic array.
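The gradient synchronization cost raised at the start of this passage can be estimated with the standard bandwidth term of a ring all-reduce, in which each device sends and receives about 2(N-1)/N times the gradient size. This is a rough sketch, not a model of the actual TPU fabric: the function name, the pure data-parallel setup, and every number below are assumptions, and latency terms are ignored.

```python
def ring_allreduce_seconds(param_count, bytes_per_param, num_devices, link_gb_per_s):
    """Bandwidth-only time estimate for one ring all-reduce of the gradients."""
    if num_devices < 2:
        return 0.0                                   # nothing to synchronize
    payload_bytes = param_count * bytes_per_param
    # Reduce-scatter plus all-gather: each moves (N - 1) / N of the payload.
    traffic_bytes = 2 * (num_devices - 1) / num_devices * payload_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)


# Hypothetical scenario: 1e12 parameters, bfloat16 gradients,
# 4096 devices, 100 GB/s of usable bandwidth per device.
seconds = ring_allreduce_seconds(1e12, 2, 4096, 100)
print(f"{seconds:.1f} s per full gradient synchronization")
```

Under these made-up numbers a single unsharded synchronization would take tens of seconds, which suggests why large models are sharded across devices and why communication is overlapped with computation rather than run as a separate step.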

FPGAs were rejected for lower peak performance and higher development complexity compared to custom ASICs, as FPGAs offer programmability at the cost of lower clock speeds and higher resource utilization for implementing basic arithmetic operations. Custom vector processors were rejected because they require frequent data movement and lack systolic efficiency, relying on loading vectors into registers repeatedly, which increases energy consumption compared to keeping data stationary within the array. In-memory computing was rejected due to immaturity and limited programmability for general machine learning workloads, as manufacturing processes for analog memory arrays were not yet capable of producing devices with the precision and reliability required for training large-scale neural networks. While promising for edge inference due to extreme energy efficiency, analog computing techniques struggled with noise accumulation and device variation that made them unsuitable for the high-precision requirements of backpropagation. Training frontier models requires compute that scales superlinearly with model size, meaning that doubling the size of a model often requires more than double the computational resources to train within a reasonable timeframe. General-purpose hardware cannot keep pace economically with these demands, as the fixed costs of maintaining legacy instruction sets and general-purpose logic dilute the efficiency gains achievable through specialization.
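The superlinear scaling of training compute can be illustrated with a widely used rule of thumb that is not taken from this article: training cost is roughly 6 FLOPs per parameter per token, and the number of training tokens is often grown in proportion to the parameter count. Both the 6·N·D approximation and the 20 tokens-per-parameter ratio below are assumptions for illustration, as is the function name.

```python
def training_flops(params, tokens_per_param=20.0):
    """Approximate training compute as 6 * N * D with D = tokens_per_param * N."""
    tokens = tokens_per_param * params
    return 6.0 * params * tokens


small = training_flops(1e9)   # hypothetical 1B-parameter model
large = training_flops(2e9)   # same recipe with twice the parameters
print(f"compute ratio for 2x parameters: {large / small:.1f}x")
```

Under these assumptions, doubling the parameter count roughly quadruples the training compute, since both the model and its dataset grow together.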

Cloud providers monetize AI services by using custom accelerators to reduce operational costs, passing on some savings to customers while retaining margins through higher utilization rates. The energy efficiency of TPUs enables deployment in data centers with strict carbon budgets, allowing companies to expand their AI capabilities without proportionally increasing their carbon footprint. Google Cloud offers TPU v4 and v5 instances for public and enterprise use, providing researchers and businesses with access to the same hardware infrastructure used to train Google's internal models. This democratization of compute allows smaller organizations to experiment with large-scale models that would previously have been prohibitively expensive to train on general-purpose cloud instances. TPU v4 delivers a peak of approximately 275 TFLOPS per chip at bfloat16 precision, a substantial increase in throughput over previous generations of the hardware. Internal benchmarks show significant performance per watt advantages versus contemporary GPUs on transformer workloads, which form the backbone of modern large language models.
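The per-chip figure can be combined with the 4096-chip TPU v4 Pod size given earlier in the article to check the "exaflops" claim. This is simple peak arithmetic; real workloads achieve only a fraction of peak, and the variable names are illustrative.

```python
CHIPS_PER_POD = 4096            # TPU v4 Pod size stated in the article
PEAK_TFLOPS_PER_CHIP = 275.0    # approximate bfloat16 peak stated in the article

pod_peak_flops = CHIPS_PER_POD * PEAK_TFLOPS_PER_CHIP * 1e12
pod_peak_exaflops = pod_peak_flops / 1e18

print(f"peak pod throughput: {pod_peak_exaflops:.2f} exaFLOPS")  # 1.13 exaFLOPS
```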

These accelerators trained Google’s PaLM and Gemini models, demonstrating their capability to handle some of the most demanding machine learning tasks in existence. NVIDIA GPUs dominate the broad AI market with Tensor Cores and a strong software ecosystem, offering a versatile platform that supports a wide range of precision formats and programming models beyond just TensorFlow. Competitors include AMD MI300, Intel Gaudi, and Cerebras Wafer-Scale Engine, each pursuing a different architectural approach, such as chiplet designs or massive monolithic dies, to capture market share in the AI accelerator space. TPUs differentiate via tight software-hardware co-design and pod-scale interconnects, optimizing the entire stack from the compiler down to the physical layer specifically for tensor processing workloads. This vertical integration allows Google to introduce new features or numerical formats rapidly without relying on third-party vendors to update drivers or toolchains. TSMC fabricates these chips on advanced process nodes, and the foundry's leading-edge manufacturing capabilities provide high transistor density and energy efficiency.

The architecture relies on High Bandwidth Memory stacks from Samsung and SK Hynix, which makes a secure supply of these components critical, since they dictate the overall memory bandwidth available to the system. Packaging requires sophisticated 2.5D interposers to connect the logic dies with the memory stacks and optical I/O controllers, creating a single package that functions as a cohesive system. The supply chain depends on global semiconductor logistics and international trade regulations, exposing the infrastructure to geopolitical risks that can disrupt production schedules or limit access to advanced manufacturing technologies. Google maintains a vertically integrated stack combining hardware, software, and cloud services, allowing it to optimize every layer of the infrastructure for its specific objectives. NVIDIA focuses on the broad AI market with strong developer tools, catering to a wider audience of researchers and engineers who require flexibility across different frameworks and use cases. Amazon targets AWS customers with its Trainium and Inferentia chips, providing cost-effective alternatives optimized for training and inference, respectively, within the Amazon Web Services ecosystem.

Startups offer niche performance claims while often lacking scale and reliability, struggling to compete with established players who can amortize research and development costs over massive production volumes. Google publishes TPU architecture details to enable academic research and MLPerf benchmarks, fostering a degree of transparency that allows external researchers to validate performance claims and study novel computer architectures. Universities use Cloud TPUs for large-scale experiments to inform future designs, providing valuable feedback on how real-world workloads interact with specialized hardware. Open-source compilers like XLA allow external optimization and porting of models, lowering the barrier to entry for researchers who wish to use the hardware without adopting Google's proprietary software stack entirely. Software must be compiled via XLA, typically through a framework such as JAX or TensorFlow, to exploit TPU-specific features, requiring developers to adapt their code to take advantage of the systolic array and low-precision arithmetic. Data centers require higher power density and cooling capacity for TPU pods, necessitating upgrades to facility infrastructure to handle the concentrated heat output of thousands of accelerators running at full load.

Regulatory frameworks regarding cloud-based AI compute continue to evolve, potentially imposing restrictions on the export or use of advanced computing hardware for national security reasons. Reduced training costs enable smaller firms to compete in model development, lowering the barrier to entry for creating modern AI systems and encouraging innovation across the industry. Compute-as-a-service models allow access to TPUs without owning hardware, transforming capital expenditures into operational expenditures and making high-performance computing accessible to a broader range of entities. Traditional HPC workloads are shifting toward AI-optimized infrastructure, as scientists discover that machine learning techniques can accelerate simulations and data analysis in fields like genomics and materials science. New roles focus on compiler optimization and distributed training engineering, reflecting the growing complexity of managing large-scale accelerator clusters and optimizing software for heterogeneous hardware environments. Metrics will move beyond FLOPS to training time per dollar and energy per token, emphasizing practical efficiency rather than theoretical peak performance as the primary measure of value for AI hardware.

End-to-end system efficiency becomes critical as models grow, forcing engineers to look beyond individual component performance and optimize the interaction between storage, networking, computation, and software. Reliability and fault tolerance at pod scale require new monitoring KPIs, as the probability of a component failure increases significantly when thousands of chips operate in unison for extended periods. Future designs will integrate on-chip memory closer to the Matrix Multiply Unit, reducing the distance data must travel and further alleviating the memory bandwidth constraint through hierarchical memory schemes. Support for sparse computation will accelerate pruned or mixture-of-experts models, allowing hardware to skip zero-valued weights in large matrices and so raise effective throughput without increasing power consumption. Optical interconnects will scale to multi-pod systems for exascale training, enabling the creation of compute clusters that span multiple data centers while maintaining low-latency communication between nodes. Hardware will co-evolve with next-generation neural architectures like state-space models, moving beyond the transformer architecture that currently dominates the field toward structures that require different computational primitives.
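The reliability point can be quantified with a simple independence model: if each chip fails with probability p over some interval, the chance that at least one of N chips fails is 1 - (1 - p)^N. The sketch below assumes independent failures and uses an invented per-chip failure probability; neither comes from published TPU reliability data.

```python
def prob_any_failure(num_chips, per_chip_failure_prob):
    """Probability that at least one of num_chips fails, assuming independence."""
    return 1.0 - (1.0 - per_chip_failure_prob) ** num_chips


P_FAIL_PER_DAY = 1e-4   # hypothetical per-chip daily failure probability

for chips in (1, 64, 4096):
    print(f"{chips:5d} chips: {prob_any_failure(chips, P_FAIL_PER_DAY):.2%} "
          "chance of at least one failure per day")
```

With this assumed rate, a failure that is negligible for one chip becomes roughly a one-in-three daily event at 4096 chips, which is why checkpointing and failure monitoring matter at pod scale.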

TPUs will increasingly handle non-ML workloads like scientific simulation via compiler abstractions, using their massive parallelism for tasks that can be expressed as tensor operations even if they do not involve neural networks. Integration with quantum-classical hybrid systems will occur for specific subroutines, using TPUs to pre-process data or interpret results from quantum processors that solve specific optimization problems intractable for classical computers alone. Potential synergy with neuromorphic computing will enable ultra-low-power inference, combining the energy efficiency of spiking neural networks with the raw power of digital accelerators for edge applications. Power density and heat dissipation will limit further clock scaling, reinforcing the industry trend toward wider parallelism rather than faster single-thread performance as the primary path to higher computational capability. Focus will shift to parallelism and efficiency to overcome physical limits, driving innovation in interconnect technology and memory architecture to ensure that thousands of cores remain fed with data. The memory wall will persist as a challenge in computer architecture, necessitating continued innovation in data compression, on-chip caching strategies, and software algorithms that improve data locality.

Workarounds will include model parallelism and smarter data prefetching, hiding latency by overlapping communication with computation and predicting which data will be needed next before it is requested. Interconnect latency will limit pod flexibility until photonic solutions mature, restricting the ability to dynamically reconfigure clusters based on workload demands without introducing significant communication overheads. Superintelligence systems will require massive, reliable, and energy-efficient compute substrates to function continuously without interruption or excessive resource consumption that would render them economically unviable. TPUs will provide a template for scalable and deterministic execution, offering a blueprint for future systems designed specifically for the rigors of advanced artificial intelligence. Deterministic hardware will simplify formal verification and reduce risks in autonomous reasoning, ensuring that the behavior of complex systems can be predicted and verified mathematically rather than relying solely on empirical testing. Engineers will deploy TPU-like pods as foundational compute layers for recursive self-improvement cycles, providing the raw material necessary for an intelligence to analyze its own source code and generate improvements.

Bfloat16-native hardware will maintain numerical stability during long-horizon training of self-modifying agents, preventing rounding errors from accumulating catastrophically over extended periods of autonomous operation. Systolic arrays will facilitate efficient simulation of internal world models, allowing systems to predict the consequences of actions rapidly by running parallel simulations within the hardware matrix units. Integration with secure enclaves will enforce alignment constraints at the silicon level, embedding safety mechanisms directly into the hardware to prevent unauthorized modifications or deviations from established operational parameters.