Rajeev Koodli
Following SoftBank Infrinia’s Public Preview at the Computer History Museum in Mountain View, CA in February, I am delighted to share this first Infrinia Blog!
Throughout my career leading engineering at Google, Intel, Cisco (Starent Networks), and Nokia, I have been fortunate to contribute to significant architectural shifts – from Internet Protocols to the (4G/5G) Mobile Internet to Network Function Virtualization to Distributed Cloud Systems. So, when we started SoftBank’s Silicon Valley site in early 2024 to tackle AI infrastructure and Large Models, we knew we were looking at a profound shift in technology and in society at large. In our first year, we introduced the SoftBank Large Telecom Model (LTM) and Transformer-based Wireless Signal Processing, our notable “Zero to One” milestones. See my previous blog: The Journey and Future Vision of our US Site.
Here, we will discuss the AI Infrastructure Software product we have been working on.
The industry is currently racing ahead with familiar architectures and toolsets for AI adoption. We are trying to run some of the most complex workloads in computing on top of a cloud architecture designed for stateless microservices, and we are already seeing the challenges. If we want to unlock the true potential of Large Language Models (LLMs) and Generative AI, we have to stop treating the AI data center like a traditional cloud and fundamentally rethink how we manage resources for AI workloads.
The New Reality of AI Workloads
To understand the AI Cloud vis-à-vis the classic cloud, let’s look at the architectural realities of LLM workloads.
First, let’s consider training. Traditional cloud web application traffic is well-understood: typically smooth, predictable, and North-South. LLM training, on the other hand, generates massive, synchronous “elephant flows” laterally across the network. When distributing a trillion-parameter model, thousands of GPUs must exchange gradient updates simultaneously. If the network drops packets or the GPU stack is flaky, the training loop stalls. We see the fallout of this in real-world environments: workload characterization studies show that up to 40% of large-scale LLM training jobs fail, due to factors like failures in high-bandwidth interconnects (such as NVLink) and CUDA errors under sustained peak load.
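To see why this synchrony makes training so fragile, consider a toy model of a synchronous gradient step: no worker can proceed until every worker has finished, so the effective step time is set by the slowest participant. The worker counts and timings below are illustrative assumptions, not measurements.

```python
import random

def sync_step_time(worker_times):
    # In synchronous data parallelism, the gradient exchange cannot
    # complete until the slowest worker has finished its step.
    return max(worker_times)

random.seed(0)
n_workers = 1024
# Most workers take ~100 ms; a small tail is slowed by transient hiccups.
times = [100 + random.expovariate(1 / 5) for _ in range(n_workers)]
times[7] = 450  # one straggler: a flaky link or a throttled GPU

print(f"median worker time:  {sorted(times)[n_workers // 2]:.0f} ms")
print(f"effective step time: {sync_step_time(times):.0f} ms")
# The entire 1024-GPU job runs at the pace of its single slowest worker.
```

The same logic explains why a single flaky NIC or GPU can stall thousands of healthy peers: the slowdown is not averaged out, it is inherited by the whole job, step after step.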
Then, there is the inference bottleneck. Serving inference places inherently conflicting demands on the infrastructure. It is partitioned into a “prefill” phase (ingesting the prompt and computing self-attention), which requires massive parallel floating-point operations, and an autoregressive “decode” phase (generating the output tokens), which is heavily memory-bandwidth bound. In standard cloud deployments, we force both of these conflicting phases onto the same GPUs. The result is that expensive GPU compute cores often sit idle during the decode phase, waiting for High-Bandwidth Memory (HBM) to feed them data.
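A back-of-the-envelope roofline sketch makes the contrast concrete. Prefill reuses the weights across all prompt tokens at once, so it is compute-bound; decode must stream the entire weight set from HBM for every single generated token, so it is bandwidth-bound. The peak-FLOPs, bandwidth, and model-size numbers here are illustrative assumptions, not any particular GPU or model.

```python
# Roofline sketch with illustrative, assumed numbers.
PEAK_TFLOPS = 1000       # assumed peak compute, TFLOP/s
HBM_TBPS = 3             # assumed HBM bandwidth, TB/s
PARAMS = 70e9            # assumed model size (70B parameters)
BYTES_PER_PARAM = 2      # fp16/bf16 weights

def decode_token_time_s():
    # Decode emits one token at a time: each step streams the full
    # weight set from HBM, so memory traffic dominates.
    mem_s = PARAMS * BYTES_PER_PARAM / (HBM_TBPS * 1e12)
    flop_s = 2 * PARAMS / (PEAK_TFLOPS * 1e12)  # ~2 FLOPs per parameter
    return max(mem_s, flop_s)

def prefill_batch_time_s(prompt_tokens):
    # Prefill processes the whole prompt in parallel: weights are loaded
    # once and reused across tokens, so compute dominates.
    mem_s = PARAMS * BYTES_PER_PARAM / (HBM_TBPS * 1e12)
    flop_s = 2 * PARAMS * prompt_tokens / (PEAK_TFLOPS * 1e12)
    return max(mem_s, flop_s)

compute_s = 2 * PARAMS / (PEAK_TFLOPS * 1e12)
print(f"decode: compute busy {compute_s / decode_token_time_s():.1%} of each step")
```

Under these assumed numbers, the compute units are busy for well under one percent of each decode step; the rest is spent waiting on HBM, which is exactly the idle-GPU symptom described above.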
These challenges posed by AI workloads, coupled with other classic cloud concerns such as virtualization costs and significant data egress fees, are giving decision makers an opportunity to rethink their AI Cloud strategy. According to Barclays CIO survey data1, over 80% of enterprise CIOs are planning to repatriate relevant workloads. Today, the intensity of AI workloads has become a prime reason for this pivot from “cloud-first” to “cloud-appropriate,” where enterprises reclaim cost and certainty through local and sovereign control.
Disaggregation & Constraint Graph
We are tackling two major constraints: one is the architectural pivot, and the other is AI-native design itself.
Today, the hyperscaler stack for supporting AI workloads is vertical and tightly integrated. The other option is vendor-specific solutions for managing their GPU platforms. We think AI infrastructure software should be multi-platform, abstracting away the underlying GPU platforms. For customers buying GPU platforms, we believe in providing a software solution that turns those platforms into their own AI Cloud – software disaggregated from the vertical, vendor-specific stack and offered as an independent Software Platform.
The second concerns the design of the Software Platform itself. As an example, we didn’t want to simply retrofit open-source K8s, whose default scheduling is topology-blind when it comes to workload placement: “Does Node A have enough available CPU and RAM? If yes, place the pod.”
But AI workload allocation is not a bin-packing problem; it is a complex constraint graph. To schedule an AI training job efficiently, the system must evaluate a host of variables such as node affinity, strict latency budgets, and the physical reality of the underlying fleet of GPU racks and the network. If the scheduler ignores this constraint graph and places highly synchronous training pods across a multi-hop network instead of a localized high-bandwidth rack, the latency spikes and the job fails.
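As a minimal sketch of the difference, contrast the bin-packing check with a placement score that also charges for network hops between a job's synchronous pods. The node names, rack layout, and hop-cost model here are hypothetical, chosen only to illustrate the idea.

```python
from itertools import combinations

# Hypothetical fleet: node -> (rack, free GPUs).
nodes = {"n1": ("rack-a", 8), "n2": ("rack-a", 8), "n3": ("rack-b", 8)}

def fits(placement, gpus_per_pod=8):
    # Classic bin-packing view: does each node have enough free capacity?
    return all(nodes[n][1] >= gpus_per_pod for n in placement)

def hop_cost(placement):
    # Constraint-graph view: every cross-rack pair of synchronous pods
    # pays a communication penalty on the gradient-exchange path.
    return sum(1 for a, b in combinations(placement, 2)
               if nodes[a][0] != nodes[b][0])

candidates = [("n1", "n2"), ("n1", "n3")]
# Bin packing alone accepts both candidates; the hop-aware score
# prefers the single-rack placement.
best = min((p for p in candidates if fits(p)), key=hop_cost)
print(best)
```

A real scheduler would fold in far more variables (affinity, latency budgets, fabric state), but even this toy version shows why two placements that are identical to a bin packer can differ sharply for a synchronous training job.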
These constraints and the nature of AI workloads drove our team to develop the Infrinia AI Cloud OS. We didn’t want to build a virtualization layer; we needed a Disaggregated Operating System that understands the underlying structure and operations of the AI data center. So, we focused on dueling design constraints (often seen in Pareto Frontier graphs), two of which we outline below, that underpin our offerings of Managed K8s and Inference-aaS:
- Simultaneously Meeting Customer and Operator SLAs: To solve the constraint graph, we have to move beyond static, heuristic scheduling. When provisioning node instances and endpoints, the conflicting demands of maximizing cluster utilization and minimizing the latency offered to customer AI workloads pose non-trivial challenges. In addition, meeting this demand requires dynamically reconfiguring the network fabric (such as NVLink) and distributed memory access (such as inter-node memory exchange). We believe learning-based approaches are the way to meet such dueling constraints.
- Simultaneously Improving Throughput and Latency: Another case of dueling constraints is maximizing user concurrency while minimizing per-user latency. In inference, this is often seen as the tradeoff between supporting the maximum number of concurrent users and the Time To First Token (TTFT) each user experiences. Disaggregated serving is generally accepted as the current best practice here, yet doing so while scaling across models and in production remains a challenging engineering problem. We believe much remains to be done to reach the (Pareto) sweet spot. As we build our OS to support AI workloads, we look to AI itself to better support them.
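The throughput/latency tension in that second bullet shows up in even the simplest batching model: larger decode batches raise aggregate token throughput, but each step takes longer, delaying every user's first token. The step-time model and all numbers below are illustrative assumptions, not production measurements.

```python
def serve(batch_size, step_ms=50.0, batch_gain=0.9):
    # Toy model: a decode step slows as the batch grows, but produces
    # one token per request in the batch.
    step = step_ms * (1 + batch_gain * (batch_size - 1) / batch_size)
    throughput = batch_size / step * 1000   # tokens/s across all users
    ttft = step * 2                         # assumed: queue wait + first decode step
    return throughput, ttft

for b in (1, 8, 64):
    tp, ttft = serve(b)
    print(f"batch={b:3d}  throughput={tp:7.1f} tok/s  TTFT={ttft:6.1f} ms")
```

In this toy model both throughput and TTFT rise with batch size, which is the Pareto frontier in miniature: no single batch size wins on both axes, so the serving layer has to pick an operating point, which is exactly the tradeoff disaggregated prefill/decode serving tries to relax.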
Looking Ahead
The “cloud-first” default for every single workload needs to be reconsidered. As enterprises and sovereign nations build out their own AI factories, the operational burden of managing complex GPU clusters is becoming the primary barrier to address.
If we want to build the next generation of trustworthy, autonomous AI systems, we have to bridge the gap between system software design and physical infrastructure. The GPU compute, network, the memory pools, and the OS must operate as a single, cohesive organism. That is the “Zero to GA” journey we are on, and we’re just getting started.
Rajeev Koodli leads the Infrinia Team and the Silicon Valley site of SoftBank Corp.’s US Subsidiary (SB Telecom America) in Sunnyvale, California.
1 https://a.storyblok.com/f/148396/59dbc1e91f/barclays_cio_survey_2024.pdf