Welcome to my new series on the backbone of modern artificial intelligence: AI infrastructure. I’ll begin with a high-level look at the infrastructure powering today’s biggest AI models, from Meta’s GPU-packed clusters to Google’s custom TPUs.
AI infrastructure is a whole different beast compared to the systems we’re used to building. It challenges assumptions around networking, scheduling, storage, reliability, and cost-efficiency.
Follow along with Pippo, a curious engineer, as he asks all the right questions, and Ulysses, a seasoned infra expert, who’s got the answers. Together, they’ll guide us through the complex systems that make AI run at scale and explore how the GenAI era is redefining what it means to build for the future.
Pippo: Yo, every week it’s something new. GPT-5 rumors, Claude getting smarter, Gemini scaling like crazy. These models just keep leveling up. But what I really want to know is: how the hell are they even running all this under the hood?
Ulysses (smirking): Glad you asked. I just read a couple of solid pieces this week: one from Meta on the infrastructure behind their GenAI workloads, and one from Google about Ironwood, their latest TPU architecture. Wild stuff.
Pippo: Figures. So where does it all start? What’s actually different from the kind of infra we manage day to day?
Ulysses: Let’s start from the beginning with the hardware. In traditional systems, most workloads run on general-purpose CPUs. They’re flexible but not built for the highly parallel computations AI needs. That’s where GPUs and TPUs come in. GPUs, like the ones Meta uses in their GenAI clusters, are optimized for parallel processing and are perfect for training large AI models.
Pippo: Right, GPUs I get. But TPUs, that’s Google’s thing, right?
Ulysses: Yep. CPUs have been around since the 1950s, GPUs since the late ’90s. But as AI models got bigger, Google needed more performance per watt and per dollar, so they developed their own custom chips: TPUs, short for Tensor Processing Units. The first version was deployed internally in 2015 and made a huge impact across Google products. Oh, and “tensors” are just the fancy name for the data structures AI models use: think multi-dimensional arrays. A lot of math happens behind the scenes to make things like image generation or language models actually work.
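A quick aside to make “tensor” concrete. Here’s a toy NumPy sketch (the shapes and sizes are my own illustrative picks, not anything from the Meta or Google posts) showing a batch of activations as a 3-D array and the kind of matrix math accelerators are built to chew through:

```python
import numpy as np

# A "tensor" is just a multi-dimensional array.
# Toy example: 32 sequences, 128 tokens each, every token a 512-dim embedding.
activations = np.random.rand(32, 128, 512)   # shape: (batch, tokens, features)

# One layer's weights: a 512x512 matrix (purely illustrative size).
weights = np.random.rand(512, 512)

# The core workload accelerators are optimized for: big batched matrix multiplies.
outputs = activations @ weights              # shape: (32, 128, 512)
print(outputs.shape)
```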
Pippo: Got it. So what’s the deal with these new TPUs? Are they actually that much better?
Ulysses: Yeah, Google’s Ironwood TPUs are next-level. They’re designed specifically for AI inference workloads, and apparently each pod can scale up to 9,216 chips, delivering a staggering 42.5 exaFLOPS of peak compute. Yep, that’s 42.5 billion gigaFLOPS. Compared to the prior generation, Trillium, Ironwood offers 5 times the peak compute capacity and 6 times the high-bandwidth memory. Plus, it’s twice as power-efficient, which is obviously crucial for large-scale deployments.
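If those units feel dizzying, the conversion is just arithmetic. A quick sanity check in Python, using only the figures quoted above:

```python
# Unit sanity check for the Ironwood pod figure quoted above.
GIGA = 10**9
PETA = 10**15
EXA = 10**18

pod_peak_flops = 42.5 * EXA      # 42.5 exaFLOPS for a full pod
chips_per_pod = 9_216

print(pod_peak_flops / GIGA)                   # 4.25e+10 -> 42.5 billion gigaFLOPS
print(pod_peak_flops / chips_per_pod / PETA)   # ~4.6 petaFLOPS per chip, on average
```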
Pippo: That’s a huge leap. But is it just about more compute?
Ulysses: Not exactly. It’s the way compute is combined with a tightly integrated system design. Google built Ironwood by co-designing the TPU hardware, the compiler stack (like XLA), and the datacenter architecture as one unified system. In other words, Google’s basically abstracting away traditional infrastructure limitations. They’ve redesigned how data moves through the system, how scheduling happens, and even how the software stack interacts with hardware. It’s a vertical integration play: from chip design all the way to model execution. So, instead of you worrying about things like “How do I shard this model across hundreds of devices?” or “How do I minimize cross-node communication?”, Google’s stack, from TPUs to compilers to orchestration, handles it for you.
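To give a feel for what “the stack handles it for you” means from the developer’s side, here’s a minimal JAX sketch. JAX/XLA is just the example I’m reaching for because Ulysses mentioned XLA; the array shapes and the mesh axis name are made up. You declare how a tensor should be split across devices, and the compiler plans the per-device work and the cross-device communication:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a 1-D device mesh from whatever accelerators are visible
# (TPU cores, GPUs, or just the CPU when running locally).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Declare *how* the array should be split: first dimension across "data".
sharding = NamedSharding(mesh, PartitionSpec("data"))

# Place a toy activation tensor according to that sharding.
# (Toy shape: the leading dim must divide evenly across the devices.)
x = jax.device_put(jnp.ones((8, 512)), sharding)

# jit hands the computation to XLA, which plans the per-device work
# and any cross-device communication; we never write that by hand.
@jax.jit
def layer(x):
    return jnp.tanh(x @ jnp.ones((512, 512)))

print(layer(x).shape)   # (8, 512), still sharded across the mesh
```

The point isn’t this particular API; it’s that the sharding decision is declarative, and the compiler stack, not the engineer, works out the data movement.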
Pippo: Damn. I’m guessing Meta and the rest are doing something similar?
Ulysses: The end goal is the same, but their approaches differ quite a bit. As you can guess, it’s not just about stacking more GPUs; the real challenge is getting them to talk efficiently. Meta, for instance, uses RoCE-based networking, where RoCE stands for RDMA over Converged Ethernet. Alright, alright, I see those furrowed eyebrows, so let’s break it down step by step. RDMA stands for Remote Direct Memory Access. It allows a computer to access the memory of another machine directly, without involving the CPU or the usual networking layers like TCP/IP. That means you get super low latency and very efficient data transfers.
Pippo: So, no more “slow” trips through the CPU and kernel?
Ulysses: Right. In traditional setups, data transfer might involve latencies of tens of microseconds, sometimes more, because every packet has to go through the operating system and CPU. But with RDMA, especially over RoCE or InfiniBand, you’re looking at single-digit microsecond latencies, sometimes even under 1µs in optimized InfiniBand setups. That’s 10x to 50x faster, which makes a massive difference when GPUs are exchanging gigabytes of tensors thousands of times per second.
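To see why a few microseconds per hop matter, here’s a rough back-of-the-envelope in Python. The cluster size, collective counts, and step counts are invented for illustration; real systems also overlap communication with compute and use smarter collectives, so treat this purely as intuition:

```python
# Back-of-the-envelope: how per-hop latency compounds in a ring all-reduce.
# Every number here is invented purely for illustration.
gpus = 1_024                           # GPUs in one data-parallel group
hops_per_allreduce = 2 * (gpus - 1)    # sequential hops in a classic ring all-reduce
allreduces_per_step = 100              # say, one gradient exchange per layer
steps = 100_000                        # optimizer steps in the training run

def latency_overhead_hours(per_hop_us: float) -> float:
    total_us = steps * allreduces_per_step * hops_per_allreduce * per_hop_us
    return total_us / 1e6 / 3600       # microseconds -> hours

print(latency_overhead_hours(30.0))    # ~170 hours of accumulated latency at ~30 µs per hop
print(latency_overhead_hours(2.0))     # ~11 hours at ~2 µs RDMA-style hops
```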
Pippo: InfiniBand? I’ve heard the name, but what’s the deal with it?
Ulysses: Think of InfiniBand as a high-performance networking tech built specifically for extreme-speed data transfer, like the kind you need in supercomputers or AI clusters. The downside? It’s more specialized and expensive to deploy.
Pippo: Gotcha. Is Meta mixing both then?
Ulysses: Exactly. And when you’re scaling to thousands of GPUs, those latency savings compound fast. That’s why Meta’s infrastructure leans on both RoCE and InfiniBand depending on the workload: RoCE gives them scalability over Ethernet, which is more flexible and cost-effective. InfiniBand, on the other hand, is there when they need that raw, deterministic speed for the most demanding training jobs. Smart trade-off, honestly.
Pippo: Got it. So they’re basically designing networks that can move huge chunks of data between GPUs at lightning speed?
Ulysses: Spot on. It’s not about replacing the whole networking stack, but optimizing key data paths inside the data center for high-performance workloads.
Pippo: So, is it all about speed?
Ulysses: Nope, moving data fast is obviously part of the puzzle, but doing it at scale, with consistency, is the real challenge. That’s why Meta uses a CLOS topology in their GenAI infrastructure. Think of it like a highway grid where every GPU can communicate with every other GPU quickly and reliably. No bottlenecks, no traffic jams, just a smooth, scalable backbone that makes massive parallel training actually possible.
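For some intuition on what that “highway grid” buys you, here’s a toy calculator for a two-tier leaf-spine (CLOS) fabric. The port and switch counts are arbitrary illustrative values, not Meta’s actual topology:

```python
# Toy sizing of a two-tier CLOS (leaf-spine) fabric. Illustrative values only.
leaf_ports = 64                    # ports per leaf switch
downlinks = leaf_ports // 2        # half the ports face servers/GPUs...
uplinks = leaf_ports - downlinks   # ...half face the spine, for full bisection bandwidth

spines = uplinks                   # one uplink from each leaf to each spine
leaves = 32

hosts = leaves * downlinks
equal_cost_paths = spines          # one path per spine between any two leaves

print(f"{hosts} hosts, {equal_cost_paths} equal-cost paths between any pair of leaves")
```

Because every leaf connects to every spine, any two hosts are at most leaf -> spine -> leaf apart, and traffic spreads across all those equal-cost paths instead of piling onto a single link.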
Pippo: Ahhh, so instead of one big central switch, they have a scalable mesh?
Ulysses: Exactly. And it’s not just about raw performance. It’s about ensuring predictable, consistent communication. Meta calls these whole environments AI Compute Clusters (AICCs), which tightly integrate compute, storage, and networking, almost like a mini data center optimized purely for AI.
Pippo: Speaking of storage, how do they keep up with all that data flying around?
Ulysses: Good point. They use a tiered storage architecture. Hot data, which needs fast access, lives on NVMe drives near the compute. Cold data is stored further out in larger systems. The goal is to feed GPUs as quickly as possible without making them wait for data.
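Here’s the napkin math behind that tiering, with placeholder numbers I picked purely for illustration: the aggregate read bandwidth needed just to keep the GPUs from starving adds up fast, which is why the hot tier sits on NVMe close to the compute.

```python
# Napkin math for the "feed the GPUs" problem. Placeholder numbers only.
gpus = 16_384                 # GPUs in the cluster
per_gpu_read_GBps = 2.0       # GB/s of training data each GPU consumes (illustrative)
nvme_read_GBps = 7.0          # rough sequential read speed of one NVMe drive

required_read_GBps = gpus * per_gpu_read_GBps
print(f"{required_read_GBps:,.0f} GB/s total, "
      f"about {required_read_GBps / nvme_read_GBps:,.0f} NVMe drives' worth of read bandwidth")
```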
Pippo: That makes total sense. GPUs are expensive, you don’t want them idle.
Ulysses: Exactly. Idle GPUs = wasted money. That’s why Meta’s GenAI infrastructure is designed to move data quickly and efficiently from storage to compute, with ultra-low latency networking between them. Every part of the system is tuned for large-scale AI training and inference.
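And the “idle GPUs = wasted money” point in a couple of lines of arithmetic, again with made-up placeholder figures:

```python
# Cost of idle accelerators, with made-up placeholder numbers.
gpus = 16_384
hourly_cost_per_gpu = 3.0     # $/GPU-hour (placeholder; real costs vary wildly)
idle_fraction = 0.10          # 10% of the time GPUs sit waiting on data or network

wasted_per_day = gpus * hourly_cost_per_gpu * 24 * idle_fraction
print(f"${wasted_per_day:,.0f} burned per day at just 10% idle time")
```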
Pippo (nodding): Wild. It’s like traditional infra on steroids, and with different trade-offs.
Ulysses: Yup. AI infra is evolving fast. For cloud providers like Google, it’s a shift away from general-purpose hardware. For Meta, it’s more like a next phase, from web-scale infra to GenAI-scale infra. They’re co-designing compute, storage, and networking even more tightly now, all with one goal: scale models faster, make them smarter, and keep the costs from exploding.
Pippo (taking the last sip of coffee): Alright, back to the grind. But next time, I want to hear more about how they orchestrate all this. I mean, how do you even schedule training jobs at that scale?
Ulysses: Deal. Let’s keep peeling back the layers next time, while grabbing another amazing coffee, obviously.
References:
- https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/
- https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
- https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/
- https://engineering.fb.com/wp-content/uploads/2024/08/sigcomm24-final246.pdf
- https://nebius.com/blog/posts/what-is-infiniband