
Processing-in-Memory: Computing Where Data Lives

  • Writer: Yatin Taneja
  • Mar 9
  • 9 min read

Processing-in-Memory (PIM) moves computation directly into memory units to eliminate data transfer between separate processor and memory components, fundamentally altering the data flow in modern computing systems. This architecture addresses limitations intrinsic to the von Neumann architecture, where data movement between the central processing unit and random access memory consumes excessive time and power. The memory wall phenomenon creates a widening gap between processor speed and memory bandwidth, limiting performance in data-intensive applications: the processor cannot exploit its full capability because it stalls on data fetch operations. The gap between processor performance and memory bandwidth doubles approximately every three years, exacerbating the energy cost of fetching bits from off-chip memory. PIM dissolves this separation by embedding computational logic within or adjacent to memory arrays, allowing data to stay in place while operations are performed on it. The primary motivation is reducing the energy and latency costs of moving large volumes of data, particularly in artificial intelligence workloads where the ratio of arithmetic operations to memory accesses is low.



Analog computing in memory arrays exploits the physical properties of memory cells to perform arithmetic without explicit instruction cycles, using the innate physics of the device to produce results directly. Memristive crossbar arrays use resistive memory elements arranged in a grid to perform matrix-vector multiplication via Ohm’s law and Kirchhoff’s current law, effectively doing the heavy lifting of neural network inference in a single step. Ohm’s law relates voltage, current, and conductance, so each cell computes a weighted product: the input voltage encodes the activation and the cell’s conductance encodes the synaptic weight. Kirchhoff’s current law states that the algebraic sum of currents at a node is zero, enabling parallel summation at the crossbar outputs as currents from activated rows aggregate along the bit-lines. This mechanism enables massively parallel, low-precision analog matrix operations critical for neural network inference and training, bypassing the need to shuttle weight matrices back and forth through a limited bus width. Digital-to-analog converters transform digital inputs into analog voltages, while analog-to-digital converters convert the resulting currents back to digital outputs, serving as the essential interface between the digital world of controllers and the analog world of memory arrays.
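The physics described above reduces to a single linear-algebra identity. The sketch below simulates an idealized crossbar; the conductance and voltage values are illustrative assumptions, not parameters of any real device.

```python
import numpy as np

# Idealized memristive crossbar: the conductance matrix G stores weights,
# input voltages v drive the rows. Ohm's law gives each cell's current
# G[i, j] * v[i]; Kirchhoff's current law sums those currents along each
# column (bit-line), so the vector of bit-line currents is exactly a
# matrix-vector product, read out in one step.

def crossbar_mvm(G, v):
    """Bit-line currents I[j] = sum_i G[i, j] * v[i] = (G.T @ v)[j]."""
    return G.T @ v

G = np.array([[1e-6, 2e-6],       # conductances in siemens (rows = inputs)
              [3e-6, 4e-6]])
v = np.array([0.5, 1.0])          # row voltages in volts

I = crossbar_mvm(G, v)            # column currents in amperes
```

In hardware, no loop or multiply instruction ever executes; the physics performs every multiplication and addition simultaneously, which is the source of the parallelism claimed above.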


DAC and ADC units introduce latency, area overhead, and power consumption, forming a key challenge in analog PIM systems because the efficiency gained by the analog array can be lost at the periphery. These interface units often consume more than fifty percent of total system power in high-precision analog PIM designs, creating a trade-off where increased precision requires significantly more energy for conversion. PIM architectures fall into three broad categories: logic-in-memory, compute-in-memory, and near-memory computing, each offering distinct trade-offs regarding granularity and programmability. Compute-in-memory variants using resistive RAM (ReRAM), phase-change memory (PCM), or ferroelectric FETs (FeFETs) enable in-situ analog dot-product computation by utilizing the memory cell itself as the computing element. Near-memory computing places compute units close to memory dies to reduce interconnect length compared to traditional setups, while logic-in-memory adds simple logic gates near memory cells to perform basic operations like Boolean logic directly on the bit-line. A memristor acts as a two-terminal non-volatile memory device whose resistance depends on the history of applied voltage or current, functioning as a variable resistor that remembers its state even when power is removed.
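The precision-versus-energy trade-off at the converters can be made concrete with a toy ADC model. Everything here (full-scale current, bit widths) is an assumption for illustration, not taken from any of the chips discussed.

```python
import numpy as np

# Sketch of the digital/analog boundary in an analog PIM tile: the array
# computes in analog, and an n-bit ADC quantizes the resulting bit-line
# currents into digital codes.

def adc(current, full_scale, bits):
    """Quantize analog currents to n-bit digital codes."""
    levels = 2 ** bits - 1
    code = np.clip(np.round(current / full_scale * levels), 0, levels)
    return code.astype(int)

currents = np.array([0.0, 0.5e-6, 1.0e-6])      # amperes
codes_4bit = adc(currents, full_scale=1.0e-6, bits=4)
# Each extra bit doubles the number of levels the converter must resolve,
# which is why conversion energy grows so steeply with precision.
codes_8bit = adc(currents, full_scale=1.0e-6, bits=8)
```

This is why high-precision analog PIM designs can spend over half their power at the periphery: the array itself is cheap, but resolving its output finely is not.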


Crossbar arrays consist of grids of nanowires with memory or switching elements at their intersections, a dense two-dimensional structure that scales efficiently compared to the transistor-heavy layouts of standard logic. Mid-1990s proposals for intelligent RAM (IRAM) explored merging processor and memory on-die, yet were limited by fabrication complexity and the diverging process requirements of logic transistors and the high-density capacitors needed for DRAM. The 2008 demonstration of practical memristor devices by HP Labs revived interest in non-von Neumann computing and analog in-memory processing by providing a tangible device with the necessary non-linear charge-flux characteristics. The 2010s surge in deep learning highlighted the inefficiency of conventional hardware for matrix-heavy workloads, accelerating PIM research as researchers realized that the multiply-accumulate operations dominating these algorithms mapped naturally onto passive crossbar arrays. The period from 2017 to 2020 saw the first functional ReRAM-based PIM chips demonstrate energy efficiency gains for inference tasks, proving that analog computation could deliver reliable results despite device imperfections. Analog PIM devices suffer from device variability, conductance drift over time, and limited precision, typically four to eight bits, which necessitates robust algorithmic compensation strategies.


Conductance drift occurs because the physical state of the device changes over time, affecting the accuracy of stored weights as the atomic configuration of the filament or phase slowly relaxes towards a lower energy state. Manufacturing requires specialized materials such as hafnium oxide for ReRAM or germanium-antimony-tellurium for PCM, which are not universally available in standard complementary metal-oxide-semiconductor fabs. Supply chains for these specialized materials remain concentrated among a few global suppliers, creating potential constraints that could hinder mass adoption if geopolitical or logistical disruptions occur. Yield and reliability challenges increase with array density, necessitating error correction and calibration circuits that add area and complexity to the peripheral circuitry surrounding the core compute array. Economic viability hinges on volume production and compatibility with existing semiconductor supply chains, requiring manufacturers to integrate novel materials into established fabrication flows without destroying the yield of standard CMOS logic. Near-data processing in GPUs and CPUs reduces data movement yet remains bound by von Neumann constraints because the compute units are still physically distinct entities requiring data traversal over an interconnect.
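Drift of this kind is commonly modeled as a power law, especially for phase-change memory: G(t) = G0 · (t/t0)^(−ν), with a small device-dependent exponent ν. The values below (ν = 0.05, a one-day interval) are assumptions chosen only to show the shape of the effect.

```python
# Hedged sketch of the power-law conductance-drift model often used for
# phase-change memory. Not calibrated to any real device.

def drifted_conductance(g0, t, t0=1.0, nu=0.05):
    """Conductance after drifting for time t (same units as t0)."""
    return g0 * (t / t0) ** (-nu)

g0 = 10e-6                                     # programmed value, siemens
g_day = drifted_conductance(g0, t=86_400.0)    # one day later (seconds)
loss = 1.0 - g_day / g0                        # fractional conductance loss
```

Even a tiny exponent compounds over time, which is why periodic recalibration or drift-aware read schemes are needed to keep stored weights meaningful.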


Systolic arrays and tensor processing units improve data reuse but still require frequent memory accesses for large models, eventually hitting the same bandwidth limits that constrain general-purpose processors. Optical computing offers low-latency interconnects but lacks mature memory integration and suffers from the high static power required to maintain optical states or generate laser light. These alternatives address symptoms of the memory wall instead of the root cause of data movement at the architectural level, whereas PIM attacks the core separation of storage and processing. AI model sizes grow exponentially, with large language models containing hundreds of billions of parameters, demanding unprecedented memory bandwidth and energy efficiency that traditional architectures struggle to provide. Data center operators face escalating power costs where memory access can consume over sixty percent of total system energy in inference workloads, creating a strong financial imperative to adopt more efficient architectures. Edge AI applications require ultra-low-power inference, where PIM’s energy-per-operation advantage becomes decisive given the strict thermal budgets of mobile and IoT devices.
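A back-of-envelope calculation shows why memory access dominates the energy budget at low arithmetic intensity. The per-operation figures below are rough, widely cited orders of magnitude (hundreds of picojoules for an off-chip DRAM access versus a few picojoules for an on-chip arithmetic operation), not measurements from any system named above.

```python
# Order-of-magnitude assumptions, for illustration only.
DRAM_ACCESS_PJ = 640.0   # assumed energy per 32-bit off-chip DRAM access
FP_MAC_PJ = 4.0          # assumed energy per 32-bit multiply-accumulate

# Worst case for a von Neumann machine: one operand fetched from DRAM per
# MAC (low arithmetic intensity, typical of large-model inference).
movement_fraction = DRAM_ACCESS_PJ / (DRAM_ACCESS_PJ + FP_MAC_PJ)
```

Under these assumptions data movement accounts for the overwhelming majority of energy spent, which is the quantity PIM attacks by keeping the operand where it is stored.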



Societal demand for sustainable computing aligns with PIM’s potential to reduce the carbon footprint of large-scale AI deployments by minimizing the energy wasted on data transport. Samsung’s Aquabolt-XL HBM-PIM integrates DRAM with processing elements for AI workloads, showing a twofold performance improvement and roughly seventy percent energy reduction in recommender systems by performing partial sums inside the memory buffer. Mythic’s analog matrix processor uses flash-based PIM for edge inference, achieving tera-operations-per-second-per-watt efficiency competitive with digital application-specific integrated circuits by using the floating-gate transistor as an analog storage element. UPMEM offers DRAM-based PIM with programmable processors in each memory bank, used in genomics and database acceleration to offload filtering and scanning tasks from the host CPU. Benchmarks indicate two- to ten-times improvement in energy-delay product for specific workloads compared to standard GPUs, validating the processing-in-memory approach for data-parallel tasks. Dominant approaches include resistive crossbar arrays for analog compute and embedded processors in dynamic random-access memory for digital PIM, two distinct paths that trade precision against efficiency.
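Energy-delay product is simply energy multiplied by latency, so gains on both axes compound. The normalized numbers below are hypothetical, chosen only to show how a combined energy and speed improvement lands in the "two to ten times" range cited above.

```python
# EDP = energy * latency; improving both multiplies through.

def energy_delay_product(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s

gpu_edp = energy_delay_product(1.0, 1.0)    # normalized baseline
pim_edp = energy_delay_product(0.3, 0.5)    # 70% less energy, 2x faster
improvement = gpu_edp / pim_edp             # combined EDP gain
```

A design that cuts energy by 70% and latency by half improves EDP by a factor of about 6.7, even though neither individual gain looks that dramatic; this multiplicative behavior is why EDP is favored over raw throughput for PIM.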


Emerging challengers explore ferroelectric field-effect transistors for CMOS-compatible analog weights and spin-based devices for non-volatility, aiming to combine the best attributes of existing logic processes with novel storage mechanisms. Three-dimensional stacked PIM architectures use through-silicon vias to increase density and bandwidth while significantly shortening the distance data must travel between memory and logic layers. Analog PIM leads in energy efficiency for fixed-function tasks, while digital PIM offers programmability and higher precision for general-purpose algorithms that require strict numerical accuracy. Hybrid architectures combining both paradigms are under active investigation to balance efficiency and flexibility, allowing systems to route specific kernels to analog accelerators while keeping digital fallbacks for control logic. Advanced packaging depends on through-silicon via and interposer technologies dominated by major foundries like TSMC, Samsung, and Intel, which provide the infrastructure for bonding logic dies directly atop memory stacks. Samsung and SK Hynix lead in DRAM-integrated PIM, while Intel and Micron explore ReRAM and 3D XPoint derivatives, applying their respective expertise in memory manufacturing to drive architectural innovation.


Startups including Mythic, TetraMem, and NeuroBlade focus on analog PIM for edge AI, while Google and Meta invest in custom PIM research tailored to their internal recommendation engines and large-model training pipelines. NVIDIA and AMD remain committed to GPU-centric approaches while incorporating memory-centric optimizations such as high-bandwidth memory and deeper cache hierarchies to mitigate bandwidth constraints within existing designs. Academic labs partner with industry on device modeling, circuit design, and system integration to develop the theoretical underpinnings required to commercialize these emerging technologies. Standardization bodies like JEDEC and IEEE are defining interfaces and metrics for PIM-enabled systems to ensure interoperability between vendors and allow modular upgrades in data center servers. Compilers must support mixed-precision execution, analog noise modeling, and crossbar-aware operator fusion to map high-level neural network graphs onto the physical constraints of memory arrays. Operating systems need new memory management policies for heterogeneous PIM resources to schedule tasks across the host CPU, GPU accelerators, and the PIM units themselves without causing resource contention.
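One of the compiler tasks just described, crossbar-aware mapping, can be sketched simply: a weight matrix larger than the physical array is partitioned into tiles, and the per-tile partial results are accumulated digitally. The tile size and matrix shape below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of crossbar-aware tiling: each tile of W fits one physical
# array; partial outputs are summed in the digital domain.

def tiled_mvm(W, x, tile=128):
    """Matrix-vector product computed tile by tile, as a PIM compiler
    would map it onto fixed-size crossbar arrays."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            # Each (r, c) tile would be programmed into one crossbar;
            # its partial result is accumulated digitally.
            y[r:r + tile] += W[r:r + tile, c:c + tile] @ x[c:c + tile]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((300, 300))    # larger than one 128x128 array
x = rng.standard_normal(300)
y = tiled_mvm(W, x)
```

A real compiler additionally weighs analog noise per tile, reprogramming cost, and operator fusion, but the partition-and-accumulate structure is the core of the mapping problem.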


Data centers must adapt cooling and power delivery infrastructure to accommodate heterogeneous PIM nodes because the thermal profiles of dense analog compute arrays differ significantly from those of standard digital logic. Traditional memory vendors may shift from commodity DRAM to value-added PIM modules, altering profit structures by capturing a portion of the compute value chain previously dominated by processor manufacturers. New business models involve PIM-as-a-service, hardware-software co-design consultancies, and intellectual property licensing for crossbar architectures as companies seek to monetize their implementations of the technology. Job displacement in conventional accelerator design will likely be offset by growth in analog integrated circuit design, device physics, and calibration engineering as the industry demands skill sets tailored to non-von Neumann computing paradigms. Energy per inference and area efficiency replace raw tera-operations per second as the primary metrics for evaluating PIM hardware, because raw throughput is meaningless if the energy cost exceeds thermal limits or the device footprint prevents scaling. System-level key performance indicators include end-to-end latency, calibration overhead, and resilience to device drift, requiring a holistic view of performance that encompasses the entire software stack.


Benchmark suites like MLPerf Inference are adapting to include PIM-specific workloads and error profiles to provide fair comparisons between analog accelerators and traditional digital processors. Future applications include on-device learning without cloud retraining, using in-memory gradient computation to update weights locally based on incoming data streams. Photonic PIM combines optical interconnects with resistive memory for ultra-low-latency linear algebra, using light to transmit signals without resistive losses while using non-volatile elements for storage. Cryogenic PIM for quantum-classical hybrid systems will apply superconducting memory elements to operate at the temperatures required for quantum coherence while managing classical data preprocessing. PIM converges with neuromorphic computing through a shared emphasis on locality and sparsity, aiming to mimic the brain's efficiency by only activating neurons and synapses relevant to the current task. Connection with compute-in-memory for logic operations expands PIM beyond linear algebra into general-purpose processing by enabling bitwise operations directly within the array structure.



Synergy with sparsity-aware algorithms exploits zero-skipping in crossbar arrays to save power and time by skipping computations involving zero values, which are prevalent in large deep learning models. Fundamental limits include thermal noise in analog signals, quantum tunneling at sub-five-nanometer nodes, and conductance quantization in memristors, which restrict the minimum achievable resolution of stored weights. Engineering workarounds include differential pair sensing, periodic recalibration, stochastic computing, and error-resilient algorithm design to tolerate these physical imperfections without sacrificing system accuracy. Scaling beyond ten-nanometer feature sizes requires new device physics or architectural redundancy to maintain signal integrity as the variability between individual cells grows relative to the operating range. PIM is a necessary reinvention of the computing substrate for data-intensive eras, rather than an incremental improvement, because it fundamentally resolves the energy inefficiency of data movement. Success depends on system-level co-design that tolerates imperfection through algorithmic resilience rather than demanding individual device perfection, shifting part of the burden from hardware engineers to software developers who must create robust learning algorithms.
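Differential pair sensing, one of the workarounds named above, also solves a basic representational problem: conductances cannot be negative. Each signed weight w is stored as a pair with w ∝ g⁺ − g⁻, and the two bit-line currents are subtracted at the sense amplifier. The matrices below are invented for illustration.

```python
import numpy as np

# Sketch of differential-pair weight encoding for signed weights on
# non-negative conductances.

def encode_differential(w):
    """Split signed weights into two non-negative conductance arrays."""
    g_plus = np.maximum(w, 0.0)
    g_minus = np.maximum(-w, 0.0)
    return g_plus, g_minus

def differential_mvm(g_plus, g_minus, v):
    """Subtract the two bit-line sums to recover the signed dot product."""
    return g_plus.T @ v - g_minus.T @ v

w = np.array([[1.0, -2.0],
              [-0.5, 3.0]])
v = np.array([1.0, 2.0])
gp, gm = encode_differential(w)
y = differential_mvm(gp, gm, v)    # equals w.T @ v
```

A side benefit is noise rejection: disturbances common to both columns of a pair cancel in the subtraction, which is exactly why the technique is listed among the imperfection-tolerance strategies.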


The most viable path combines analog efficiency for bulk operations with digital fallback for precision-critical steps, maximizing throughput while maintaining accuracy for sensitive calculations such as gradient updates or attention mechanisms. Superintelligence systems will require exascale memory bandwidth and extreme energy efficiency, unattainable with von Neumann architectures because the energy cost of moving zettabytes of data would exceed feasible power generation capacities. PIM will enable localized, parallel knowledge retrieval and pattern completion at the memory level, mimicking biological neural efficiency, where synapses participate directly in computation without a central fetch-decode-execute cycle. In-memory associative memory and content-addressable operations will support efficient reasoning and hypothesis generation without centralized control by allowing the system to query data by similarity rather than by explicit address. Superintelligence may treat PIM arrays as distributed cognitive substrates where data and computation co-evolve in situ, enabling a form of emergent processing that blurs the line between storage and thinking.
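Content-addressable recall can be illustrated with a toy model: a query is compared against every stored pattern in parallel (as a crossbar similarity search would do) and the best match is completed from memory, with no address-based fetch. The ±1 patterns and the corrupted query are invented for illustration.

```python
import numpy as np

# Toy associative memory: rows are stored patterns; recall returns the
# row most similar to the query, computed as parallel dot products.

memory = np.array([[1, 1, 1, -1, -1, -1],
                   [-1, -1, 1, 1, -1, 1],
                   [1, -1, -1, 1, 1, -1]], dtype=float)

def associative_recall(query):
    """Return the stored pattern most similar to a (possibly noisy) query."""
    scores = memory @ query        # one dot product per row, in parallel
    return memory[np.argmax(scores)]

noisy = np.array([1, 1, 1, -1, -1, 1], dtype=float)  # last element flipped
completed = associative_recall(noisy)                 # recovers pattern 0
```

In a crossbar realization, the dot products are the bit-line currents themselves, so the similarity search costs one read cycle regardless of how many patterns are stored, which is the sense in which retrieval happens "at the memory level."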


© 2027 Yatin Taneja

South Delhi, Delhi, India
