
Ray: Distributed Computing for ML Workloads

  • Writer: Yatin Taneja
  • Mar 9
  • 14 min read

Ray Core forms the foundational layer of the distributed computing stack, providing low-level APIs for creating tasks and actors while managing the underlying object store and cross-node communication via gRPC and shared memory. This architecture was designed as a unified execution engine that abstracts away the complexities of distributed systems, allowing developers to treat a cluster of machines as a single, coherent computer capable of executing arbitrary Python functions concurrently. By using shared memory within individual nodes and high-performance gRPC between nodes, Ray Core minimizes the overhead typically associated with distributed computing frameworks, enabling the execution of fine-grained tasks with latencies low enough to support interactive and real-time machine learning workloads. The system handles the intricate details of process management, dispatching tasks to available resources efficiently while maintaining the isolation required for stability and security. The global control store acts as a highly available, centralized metadata service responsible for maintaining cluster state, which encompasses critical information such as actor locations, task lineage, and current resource availability across the entire system. This component ensures that all nodes within the cluster possess a consistent view of the system topology and the status of various computational entities without requiring the data plane to participate heavily in metadata synchronization, thus separating control traffic from data transfer to optimize performance.



By keeping the control plane distinct from the data plane, the global control store allows the system to scale to thousands of nodes without metadata management becoming a limiting factor on throughput or latency. It stores information about which node hosts a specific actor or where a particular object resides, enabling any node in the cluster to route requests or data queries to the correct destination efficiently. Ray provides sophisticated task and actor abstractions that enable fine-grained, dynamic computation across distributed systems, allowing users to express both stateless functions and stateful services within a single, unified programming model that simplifies the development of complex applications. Tasks represent stateless computations that execute remotely and return results, while actors are stateful workers that maintain their internal state across multiple method invocations, providing a mechanism for building long-running services and interactive simulations within the same framework. This dual abstraction allows developers to move seamlessly between parallel processing of static datasets and managing dynamic, interactive workloads without needing to switch between different software stacks or programming frameworks, effectively bridging the gap between traditional data processing and modern artificial intelligence applications. Ray actors maintain mutable state across method calls, making them suitable for long-running computations such as model training loops, simulation environments, or serving stateful inference logic where preserving context between operations is a core requirement.


An actor is essentially a dedicated process that can be scheduled on a specific node within the cluster, allowing it to store variables, model weights, or environment states in memory for extended periods while accepting asynchronous or synchronous calls from other tasks or actors to modify or query this state. This capability is crucial for reinforcement learning scenarios where an agent must learn from a continuous stream of experiences, or for serving models that require session-specific information to be retained throughout the duration of a user interaction, thereby providing a level of state management that is often cumbersome to implement in purely functional distributed systems. Ray’s object store utilizes shared memory within nodes and efficient network transfers across nodes to pass data between tasks and actors, significantly reducing serialization overhead for large tensors and complex data structures that are common in machine learning workloads. The object store is built upon Apache Arrow, a columnar in-memory format that allows for zero-copy data sharing between processes on the same machine, which means that large arrays can be transferred between tasks without the need to duplicate or serialize the data repeatedly. When data must move between nodes, the system automatically handles the fragmentation, transmission, and reassembly of objects, abstracting these network operations from the developer so they can focus on the logic of their algorithms rather than the mechanics of data movement. Distributed scheduling in Ray employs a decentralized, hierarchical architecture where local schedulers on each node handle placement decisions with minimal coordination, reducing points of congestion and improving fault tolerance in large-scale deployments.
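The zero-copy sharing described above can be illustrated with Python's own shared-memory primitive. This is a toy sketch of the mechanism, not Ray's or Arrow's actual implementation: a producer writes a buffer into a named shared-memory segment, and a reader attaches to the same segment by name and reads through a memoryview without copying the payload.

```python
from multiprocessing import shared_memory

# Producer: place a large buffer into a named shared-memory segment.
payload = bytes(range(256)) * 4096          # ~1 MB of data
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer: a separate process would attach by name; reading through the
# memoryview copies nothing, unlike pickling the data over a socket.
reader = shared_memory.SharedMemory(name=shm.name)
view = reader.buf[:len(payload)]
assert bytes(view[:8]) == payload[:8]       # same bytes, shared storage

# Cleanup (views must be released before the segment is closed).
del view
reader.close()
shm.close()
shm.unlink()
```

Ray's object store applies the same idea at scale: within a node, workers map the same Arrow-formatted buffers instead of serializing them, which is why passing a large array between co-located tasks costs far less than sending it over the network.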


Each node runs a local scheduler process known as a raylet that is responsible for managing the resources on that specific machine, including CPUs, GPUs, and memory, and making independent decisions about which tasks to execute based on resource availability and data locality requirements. This decentralized approach allows the system to scale horizontally because scheduling decisions are made in parallel across the cluster rather than being funneled through a single central scheduler that could become overwhelmed as the number of tasks increases. Ray’s scheduler supports placement groups to co-locate tasks or actors on specific hardware sets, improving data locality and reducing communication costs for workloads that require tight coupling between different components of a computation pipeline. A placement group is a collection of resources reserved as a unit, which can be used to ensure that a set of actors or tasks are scheduled together on the same node or across a specific set of nodes, minimizing the latency of inter-task communication. This feature is particularly important for distributed deep learning training where multiple processes need to communicate frequently via high-speed interconnects, as ensuring they are located physically close to one another reduces network latency and increases overall training efficiency. Resource management in Ray allows for the fine-grained specification of CPU, GPU, memory, and custom resources per task or actor, enabling efficient utilization of heterogeneous clusters that contain a mix of different hardware accelerators and compute capabilities.
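The placement decision a raylet faces can be modeled in a few lines. This is an illustrative sketch, loosely inspired by the considerations described above; the field names and the `choose_node` helper are hypothetical, not Ray's internals:

```python
# Toy locality-aware placement: a task goes to a node that satisfies its
# resource request, preferring the node that already holds its inputs.

def choose_node(task, nodes):
    def fits(node):
        return all(node["free"].get(r, 0) >= amt
                   for r, amt in task["resources"].items())

    candidates = [n for n in nodes if fits(n)]
    if not candidates:
        return None  # a real raylet would queue or spill to a peer
    # Prefer the candidate holding the most of the task's input objects.
    return max(candidates,
               key=lambda n: len(task["inputs"] & n["objects"]))["id"]

nodes = [
    {"id": "A", "free": {"CPU": 4},           "objects": {"x", "y"}},
    {"id": "B", "free": {"CPU": 8, "GPU": 1}, "objects": {"z"}},
]
task = {"resources": {"CPU": 2}, "inputs": {"x", "y"}}
print(choose_node(task, nodes))      # "A": fits and holds both inputs

gpu_task = {"resources": {"GPU": 1}, "inputs": {"x"}}
print(choose_node(gpu_task, nodes))  # "B": the only node with a GPU
```

The two cases show the trade-off in miniature: resource feasibility is a hard constraint, while data locality breaks ties among feasible nodes.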


Developers can request specific quantities of resources for their units of work, and the scheduler will only dispatch those tasks to nodes that possess the required available capacity, preventing resource contention and ensuring that critical workloads receive the necessary hardware support. This granular control extends to custom resources such as specialized accelerators like TPUs or FPGAs, allowing organizations to integrate tailored hardware into their Ray clusters seamlessly while maintaining a consistent interface for resource allocation and scheduling. Complex AI pipelines can be arranged using Ray’s DAG-based workflow API, enabling dependencies between tasks, actors, and external systems to be expressed declaratively and executed efficiently across the distributed infrastructure. This API allows users to construct directed acyclic graphs where nodes represent computational steps and edges represent dependencies, providing a clear visual and programmatic representation of the data flow and execution order within an application. The runtime handles the execution of these graphs by automatically resolving dependencies, parallelizing independent branches of computation, and managing the passing of objects between stages of the pipeline, thereby simplifying the orchestration of complex multi-stage machine learning workflows. Ray supports hyperparameter tuning for large workloads through libraries like Ray Tune, which parallelizes trials across heterogeneous hardware and integrates seamlessly with popular optimization algorithms to automate the search for optimal model configurations.
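Dependency-resolved execution of such a graph can be sketched with the standard library's topological sorter. This toy executor runs each node after its upstream dependencies, in a single thread; Ray's DAG runtime additionally parallelizes independent branches and ships objects between nodes, and the `run_dag` helper here is illustrative, not Ray's API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(funcs, deps):
    """funcs: node name -> callable taking its dependencies' results.
    deps: node name -> list of upstream node names."""
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        results[name] = funcs[name](*(results[d] for d in deps[name]))
    return results

funcs = {
    "load":   lambda: [1, 2, 3, 4],
    "double": lambda xs: [x * 2 for x in xs],
    "total":  lambda xs: sum(xs),
}
deps = {"load": [], "double": ["load"], "total": ["double"]}
print(run_dag(funcs, deps)["total"])  # 20
```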


Ray Tune manages the scheduling of hundreds or thousands of individual training trials, each running with a different set of hyperparameters, by using the underlying Ray cluster to distribute these trials efficiently while dynamically allocating resources based on the needs of each experiment. The library also supports advanced search algorithms such as Bayesian optimization, HyperBand, and population-based training, allowing researchers to converge on optimal model parameters faster than with traditional grid search methods. RLlib uses Ray’s distributed execution model to scale reinforcement learning workloads, supporting multi-agent training, distributed sampling, and integration with various simulation backends to handle the complexity of modern RL algorithms. By representing agents as actors and environment simulators as separate distributed components, RLlib can generate vast amounts of experience data in parallel across a cluster, significantly reducing the time required to train complex policies. The library abstracts away the engineering challenges of distributed reinforcement learning, such as synchronizing gradient updates across multiple learners or managing experience replay buffers, allowing researchers to focus on algorithm design and policy architecture. Ray Serve enables low-latency, high-throughput model serving by deploying models as scalable, versioned endpoints with support for request batching, autoscaling, and traffic splitting to handle production inference workloads.
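The core idea behind HyperBand-style schedulers mentioned above is successive halving: run many configurations on a small budget, keep the best half, and double the budget for the survivors. This is a toy sketch of that idea, not Tune's implementation; the `score` function is a hypothetical stand-in for validation accuracy after a given training budget:

```python
import random

def score(lr, budget):
    # Hypothetical objective: best near lr = 0.1, improves with budget.
    return (1 - abs(lr - 0.1)) * (budget / (budget + 1))

def successive_halving(configs, budget=1):
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda lr: score(lr, budget),
                        reverse=True)
        configs = ranked[: max(1, len(configs) // 2)]  # keep best half
        budget *= 2                                    # double the budget
    return configs[0]

random.seed(0)
candidates = [random.uniform(0.001, 1.0) for _ in range(8)]
best = successive_halving(candidates)
print(round(best, 3))  # the survivor is the candidate closest to lr = 0.1
```

The economy of the method is that poor configurations are eliminated after cheap, short runs, so the expensive long-budget evaluations are spent only on promising candidates — which is also why these schedulers parallelize so naturally across a Ray cluster.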


This framework allows machine learning engineers to deploy individual models or ensembles of models behind a consistent HTTP API, automatically handling the scaling of compute resources in response to incoming traffic patterns to ensure consistent performance. Ray Serve also facilitates blue-green deployments and A/B testing by allowing multiple versions of a model to be hosted simultaneously with configurable traffic routing policies, thereby streamlining the continuous integration and deployment process for machine learning models. Ray Data facilitates scalable data preprocessing and ingestion for ML pipelines, allowing transformations to be executed in parallel across the cluster to handle datasets that are too large to fit into the memory of a single machine. This component provides a familiar API, similar to pandas or PyTorch DataLoaders, but operates on a distributed backend that automatically partitions data and executes operations such as filtering, mapping, and shuffling in parallel across all available nodes. By working closely with the Ray object store, Ray Data minimizes data movement overhead during preprocessing, ensuring that large-scale data preparation pipelines can keep pace with the throughput requirements of modern deep learning training workflows. Ray was developed at UC Berkeley’s RISELab to address limitations in existing frameworks like Spark and MPI, which were ill-suited for low-latency, dynamic ML workloads that required both massive parallelism and fine-grained inter-task communication.
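Request batching, mentioned above as a Serve feature, amortizes per-request overhead by running the model once over several queued requests. This is a toy single-threaded sketch of the idea behind batched serving (Ray Serve exposes it via a decorator on async deployments); `collect_batch` and `model` are illustrative names:

```python
from queue import Queue, Empty

def collect_batch(q, max_batch=4, timeout_s=0.01):
    """Gather up to max_batch requests, or whatever arrives before the
    timeout, so one model invocation serves the whole batch."""
    batch = [q.get()]                    # block for the first request
    while len(batch) < max_batch:
        try:
            batch.append(q.get(timeout=timeout_s))
        except Empty:
            break                        # partial batch on timeout
    return batch

def model(batch):
    return [x * 10 for x in batch]       # vectorized "inference"

q = Queue()
for request in [1, 2, 3, 4, 5]:
    q.put(request)

print(model(collect_batch(q)))  # [10, 20, 30, 40] -- first full batch
print(model(collect_batch(q)))  # [50] -- leftover partial batch
```

The timeout bounds the latency a lone request can suffer while waiting for companions, which is the central tension batched serving systems tune.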


While Spark excelled at batch processing of static data using a map-reduce framework and MPI provided high performance for tightly coupled HPC simulations, neither offered a flexible programming model that could accommodate the diverse and dynamic requirements of emerging artificial intelligence applications. The researchers at RISELab recognized that a new system was needed to support the iterative nature of machine learning algorithms and the interactive requirements of real-time data processing. Early versions of Ray focused primarily on reinforcement learning, leading to the creation of RLlib as a first-party library before the project expanded its scope to encompass general-purpose distributed computing capabilities suitable for a wider range of applications. The initial design prioritized the ability to run millions of short tasks per second with minimal latency, a requirement driven by the needs of reinforcement learning algorithms that rely on frequent interaction with simulated environments. As the system matured, it became evident that the underlying architecture was highly effective for other domains such as hyperparameter tuning, data processing, and model serving, prompting the development of additional libraries and APIs to support these use cases. The rise of large-scale model training, real-time inference, and iterative ML workflows created demand for a system that could handle both batch and streaming patterns with low overhead while providing a unified interface for diverse computational tasks.


Traditional data processing engines often struggled with the iterative nature of deep learning training, where data is read multiple times, while streaming systems lacked the computational horsepower required for heavy numerical linear algebra. Ray addressed this gap by providing a flexible execution engine that could dynamically switch between batch-oriented processing and stream-oriented processing within the same application runtime. Ray’s design reflects a shift from monolithic frameworks to modular, composable systems where users can mix low-level control over task distribution with high-level libraries that provide pre-built solutions for common machine learning problems. This modularity allows practitioners to drop down to the core Ray API when they need custom logic or tight integration with specific hardware while leveraging libraries like Ray Train or Ray Data for standard operations. The architecture encourages an ecosystem approach where tools built on top of Ray can interoperate easily, sharing resources and cluster state without requiring complex configuration or setup efforts. Ray is deployed in production at major companies, including Ant Group, Uber, Amazon, and OpenAI, for use cases ranging from large-scale recommendation systems to autonomous vehicle simulation and the training of foundation models.



These organizations rely on Ray to manage some of their most compute-intensive workloads, taking advantage of its ability to scale elastically across thousands of nodes and its robust fault tolerance mechanisms to ensure that long-running training jobs complete successfully. The flexibility of the system allows it to be adapted to vastly different problem domains, from optimizing logistics networks in real time to training generative AI models that require massive amounts of computational power. Dominant deployment architectures include Ray-on-Kubernetes for cloud deployments and standalone clusters for on-premises or hybrid environments, with growing interest in serverless Ray backends that allow for even more elastic resource allocation. Running Ray on Kubernetes enables organizations to leverage existing container orchestration infrastructure for managing Ray clusters, providing benefits such as resource isolation, automated scaling, and integration with cloud-native monitoring tools. Standalone clusters are often preferred in high-performance computing environments where low-level hardware access and network tuning are critical for maximizing performance. Major cloud providers offer managed Ray services or integrations, positioning Ray as a neutral, vendor-agnostic platform in a competitive ecosystem where interoperability prevents lock-in to any specific provider's proprietary solution.


This neutrality ensures that applications developed using Ray can be ported between different cloud environments or run in hybrid configurations without significant code changes. Cloud providers benefit from offering managed Ray services because it attracts customers who require high-performance computing capabilities without wanting to manage the underlying infrastructure themselves. Fault tolerance is achieved through lineage-based recomputation, where failed tasks or actors are recreated from their parent tasks, with optional checkpointing for long-running actors to minimize the cost of recovery in the event of a failure. The system tracks the lineage of each object produced by a task, meaning that if a node fails and the data stored on it is lost, Ray can automatically re-execute the necessary tasks from the original inputs to reconstruct the lost data. For long-running actors that hold significant state, periodic checkpointing allows the system to restore the actor to its most recent state rather than restarting from the beginning. Ray’s software supply chain depends on robust open-source components such as gRPC for remote procedure calls, Apache Arrow for in-memory data representation, and Prometheus for metrics collection, ensuring there are no single-point proprietary dependencies that could limit adoption or maintainability.
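Lineage-based recomputation, described above, can be captured in a toy model: every object records the function and input objects that produced it, so a lost object can be rebuilt transitively from its ancestors. This sketch is illustrative only; the `submit`/`get` helpers and the in-memory tables are hypothetical simplifications, not Ray's data structures:

```python
lineage = {}   # object id -> (function, input object ids)
store = {}     # object id -> value (the "object store")

def submit(obj_id, fn, *input_ids):
    """Record how obj_id is produced, then compute and store it."""
    lineage[obj_id] = (fn, input_ids)
    store[obj_id] = fn(*(get(i) for i in input_ids))

def get(obj_id):
    """Fetch an object, recomputing it from lineage if it was lost."""
    if obj_id not in store:
        fn, input_ids = lineage[obj_id]
        store[obj_id] = fn(*(get(i) for i in input_ids))
    return store[obj_id]

submit("a", lambda: [1, 2, 3])
submit("b", lambda xs: sum(xs), "a")

# Simulate a node failure that drops both objects from the store.
store.clear()
print(get("b"))  # 6 -- recomputed transitively from the lineage table
```

The recursion in `get` is the key point: reconstructing "b" first reconstructs its lost input "a", mirroring how Ray walks lineage back to surviving inputs. Checkpointing exists precisely to bound the depth of this walk for long-lived actors.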


By relying on these widely adopted industry standards, Ray integrates well with the broader open-source ecosystem and benefits from continuous improvements in performance and security made by those communities. This approach reduces the risk of vendor lock-in at the component level and allows organizations to use their existing expertise with these underlying technologies. Adjacent systems must adapt to support Ray effectively; container runtimes need low-overhead startup times to facilitate rapid scaling of actors, networking stacks require high bandwidth and low latency to handle intensive inter-node communication, and monitoring tools must support fine-grained tracing of millions of concurrent tasks. Traditional container orchestration systems optimized for long-running services may struggle with the rapid churn of short-lived tasks typical in Ray workloads, requiring tuning or replacement of components like the container runtime interface drivers. Networking infrastructure must also be configured to handle high-throughput traffic patterns that differ significantly from standard web service traffic. Data residency requirements for distributed training and compliance challenges when models are served across jurisdictions influence deployment strategies, forcing organizations to implement strict controls over where data flows within a Ray cluster.


Regulations such as GDPR mandate that personal data remains within specific geographic boundaries, which complicates the default behavior of distributed systems that seek to replicate data globally for performance. Organizations deploying Ray must configure resource constraints and placement groups strategically to ensure that sensitive data never leaves approved regions while still maintaining acceptable levels of performance. Benchmarks show Ray achieving sub-millisecond task launch latency and linear scaling to hundreds of nodes for embarrassingly parallel workloads, outperforming Spark on iterative algorithms that require multiple passes over the same dataset. This performance advantage stems from Ray's execution engine, optimized for low-latency task scheduling, and its efficient handling of shared memory objects. The ability to launch millions of tasks per second allows Ray to exploit parallelism in algorithms that would be too granular to run efficiently on traditional big data platforms. Alternative approaches such as Kubernetes-native operators or serverless platforms were evaluated during the design phases but were rejected due to high latency overhead, lack of native stateful abstractions suitable for machine learning, and poor support for fine-grained parallelism required by AI workloads.


While Kubernetes provides excellent orchestration capabilities for containers, it was not designed to manage the lifecycle of millions of short-lived functions per second, nor does it provide built-in mechanisms for managing distributed shared memory state efficiently. Serverless platforms similarly impose cold start latencies and execution time limits that make them unsuitable for long-running training jobs or interactive simulations. Emerging challengers include Dask for data-centric workloads, Apache Flink for streaming data processing, and custom Kubernetes operators developed by individual organizations, though none currently offer Ray’s unified task-actor model that combines general-purpose computing with stateful services so effectively. Dask excels at scaling Python numerical computing but lacks the stateful actor model needed for complex simulation and serving workloads. Apache Flink provides excellent stream processing capabilities but is primarily focused on Java-based workloads and does not integrate as natively with the Python machine learning ecosystem. Scaling physics limits include network bandwidth saturation between nodes, memory coherence overhead inherent in managing shared object stores across a distributed cluster, and Amdahl’s law constraints, which dictate that the non-parallelizable portions of a workload will eventually limit speedup regardless of how many nodes are added.
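Amdahl's law, invoked above, is worth making concrete: with a parallelizable fraction p of the work spread over n nodes, the speedup is 1 / ((1 - p) + p / n), which saturates at 1 / (1 - p) no matter how many nodes are added.

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n)
# where p is the parallelizable fraction of the workload.

def speedup(p, n):
    return 1 / ((1 - p) + p / n)

# Even with 95% of the work parallelizable, returns diminish quickly:
for n in (10, 100, 1000):
    print(n, round(speedup(0.95, n), 1))
# The limit as n grows is 1 / (1 - 0.95) = 20x, regardless of cluster size.
```

With p = 0.95 the speedup on 1000 nodes is already within a few percent of the 20x ceiling, which is why reducing the serial fraction (synchronization points, single-node bottlenecks) matters more at scale than adding hardware.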


As clusters grow larger, the time required to synchronize state or transfer data across the network can become the dominant factor in total execution time. These physical limitations necessitate architectural decisions that prioritize data locality and minimize synchronization points to maintain scalability. Workarounds involve implementing hierarchical object stores that keep frequently accessed data local to computation nodes, applying compression techniques for tensor transfers to reduce network load, and employing speculative execution strategies to mitigate the impact of stragglers or slow nodes on overall job completion time. Hierarchical storage tiers allow the system to utilize fast local storage for hot data while relying on slower distributed storage for colder datasets. Compression reduces the volume of data traversing the network at the cost of CPU cycles for compression and decompression. Key performance indicators for Ray clusters extend beyond traditional throughput and uptime metrics to include task scheduling latency, actor churn rate, which reflects the stability of long-running services, object store hit ratio, which indicates how effectively zero-copy optimizations are being utilized, and end-to-end pipeline reproducibility, which is critical for scientific and machine learning workflows.
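The compression trade-off mentioned above can be measured directly with the standard library. This toy benchmark compresses a tensor-like byte buffer and times the CPU cost; the payload here is deliberately repetitive, so real model tensors would compress far less, and the numbers are illustrative only:

```python
import time
import zlib

payload = b"\x00\x01\x02\x03" * 1_000_000   # ~4 MB tensor-like buffer

start = time.perf_counter()
compressed = zlib.compress(payload, level=1)  # fast, low-ratio setting
elapsed = time.perf_counter() - start

ratio = len(payload) / len(compressed)
print(f"ratio: {ratio:.0f}x in {elapsed * 1000:.1f} ms")

# The bytes saved on the wire are paid for in CPU time on both ends.
assert zlib.decompress(compressed) == payload  # lossless round trip
```

Whether this trade pays off depends on the ratio of network bandwidth to spare CPU cycles on each node, which is why compression for tensor transfers is typically a tunable rather than a default.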


Monitoring these specific metrics provides operators with deeper insight into the health and efficiency of their distributed applications than generic infrastructure monitoring alone. Future innovations may include tighter integration with compilers like MLIR or XLA for automatic parallelization of code written in high-level languages, support for quantum-classical hybrid workloads where Ray coordinates classical pre-processing with quantum circuit execution, and energy-aware scheduling algorithms that optimize for power efficiency alongside performance. Automatic parallelization would lower the barrier to entry for distributed computing by allowing developers to write sequential code that the compiler transforms into efficient distributed execution graphs. Convergence points exist with vector databases used for retrieval-augmented generation, where Ray manages the ingestion and querying pipeline, edge computing platforms that utilize Ray for distributed inference closer to the source of data generation, and confidential computing technologies that leverage hardware enclaves to secure multi-party training computations. Integration with vector databases allows Ray to orchestrate pipelines that combine generative models with large-scale knowledge retrieval. Edge deployment brings Ray's distributed capabilities to IoT devices and localized processing units.


Second-order consequences of Ray's widespread adoption include the displacement of traditional batch processing jobs that were previously handled by older map-reduce frameworks, rapid growth in machine learning infrastructure startups building tools around Ray, and the creation of new specialized roles for MLOps engineers who manage distributed Ray clusters at scale. As organizations migrate away from legacy systems, the demand for expertise in modern distributed computing architectures increases. Geopolitical adoption varies significantly; Ray is widely used by United States and European technology firms, where open-source software development has a strong tradition, with growing uptake in China through major technology companies like Alibaba and Tencent, though hardware availability affected by trade policies can influence deployment flexibility in different regions. The global nature of the open-source community supports this broad adoption, yet local regulations and hardware supply chains play a significant role in how effectively organizations can implement large-scale Ray clusters. Academic collaborations include ongoing work with institutions such as Stanford University, the Massachusetts Institute of Technology, and ETH Zurich on advanced scheduling algorithms that predict resource requirements more accurately, novel fault tolerance mechanisms that reduce recovery overhead, and deeper integration with compilers like MLIR to optimize task execution at the hardware level. These partnerships ensure that Ray continues to incorporate new research in distributed systems and compilers into its production codebase.



For superintelligence development, Ray serves as a critical substrate for coordinating vast numbers of parallel experiments, simulations, and model evaluations across global compute resources that exceed the capacity of any single data center. The ability to manage millions of concurrent tasks and actors makes Ray an ideal candidate for orchestrating the massive computational efforts required to train and evaluate superintelligent models. The system's scalability ensures that as the demand for compute grows with model complexity, the infrastructure can expand to meet it. Superintelligence systems will use Ray to manage recursive self-improvement loops where actors represent evolving agents undergoing modification, and tasks perform fitness evaluations or environment rollouts to assess progress toward specified goals. In this paradigm, each iteration of improvement might spawn thousands of variations of an agent, each running as an independent actor within the cluster, collecting performance data that feeds back into the next generation of modifications. The elastic scheduling capabilities of Ray allow it to accommodate the fluctuating resource demands of these recursive loops efficiently.


Superintelligence will utilize Ray’s actor model to instantiate millions of autonomous agents interacting within complex virtual environments to test emergent behaviors and robustness before deployment in real-world scenarios. These simulations require massive parallelism to explore a sufficient number of environmental states and agent interactions to draw statistically significant conclusions about agent behavior. Ray's architecture allows these simulations to run continuously and at scale, providing the necessary feedback loops for superintelligence development. Alignment for superintelligence will require extending Ray with formal verification hooks that allow mathematical proofs of code correctness to be checked during execution, comprehensive audit trails for decision provenance to ensure transparency in automated reasoning processes, and mechanisms to strictly bound resource consumption to prevent uncontrolled exhaustion by runaway processes. These extensions transform Ray from a compute engine into a controlled environment capable of hosting potentially dangerous autonomous agents safely while researchers analyze their behavior and capabilities.


© 2027 Yatin Taneja

South Delhi, Delhi, India
