Justin Selig

AI Infrastructure Talk

I was recently invited to present at the 114th SF Hardware Meetup on challenges and opportunities in AI infrastructure. A number of you reached out afterwards lamenting the two-hour queue (!!) of attendees waiting to ask me questions. I’m sorry to those who couldn’t wait. Please connect with me and let’s chat!

Transcript:

Slide 1: Hi everyone, my name is Justin and I’m an investor over at Eclipse Ventures. We’re a deep-tech VC firm investing across all stages of companies; I focus specifically on Pre-Seed through Series A. But I have to say, I haven’t spent most of my career as an investor. I actually spent most of it as an embedded software engineer. Before joining Eclipse, I was one of the first kernel engineering hires over at Cerebras Systems. So for the first half of my talk, I’ll cover my experience at Cerebras, then use those insights as an investor to talk about where I see the challenges and obstacles facing us in building up all this new AI infrastructure.

Slide 2: So to start, if I were to sum up Cerebras in one slide, this would be it. We did what the industry had been trying and failing to do for many, many decades, which was to build the largest computer chip allowed by fab processes. We cut the largest piece of silicon we could out of a wafer and made that our chip. As you can imagine, this is a really, really complex and difficult engineering challenge.

Slide 3: But even more so than that was building the systems to power, cool, and feed I/O to this chip. A lot of people don’t realize it, but this is actually the harder part, because the chip at peak draws 30 kilowatts – the equivalent of a small vehicle. On the bottom right you can see what it looks like in person: a small mini fridge, almost the size of a human being, about a third the size of a data center rack. My job, given this hardware, was to answer the question: how do we make it useful? What that meant in practice was writing lots and lots of assembly code, or “kernels,” to implement neural networks for this system. I became responsible for onboarding new kernel engineers, and it was taking something on the order of 6 to 12 months before someone could produce a functional, performant kernel implementing a useful neural network operation. It came to a point where we were asking ourselves: at this rate, how in the world are we going to compete against NVIDIA? We needed to write all these kernels to implement all the neural networks that users wanted to run on our system.

Slide 4: NVIDIA at this point in time, around 2018 or 2019, had a 20-year lead with CUDA. The only way we could compete was to build a competitor or to build a better compiler – so we tried to do both, and I made it my personal mission to try to build a direct CUDA competitor… which was truly naive. But as a result of that, we produced what’s now known as the Cerebras SDK. The jury’s out as to whether it’s actually a good enough competitor to CUDA. But as a result of this work, we found a real killer app for Cerebras systems, which is high-performance computing.

Slide 5: These two papers showed some really interesting landmark work that came out of Cerebras. On the left is work we did showing that, with the SDK and about 32 systems, we could achieve higher bandwidth than any other supercomputer out there on seismic simulation workloads. For that we placed as a finalist for the Gordon Bell Prize. On the right is work we did in conjunction with TotalEnergies showing that, with a single wafer, we could run CFD simulations 200 times faster than a supercomputer of any size, no matter how many nodes in the data center you have. And for that we made a very stern VP over at Total a very happy buyer.

Slide 6: So that brings me to the topic of this talk, which is: what are the challenges and obstacles to scaling compute for AI? When Michael sent me this prompt, I got really, really excited, because if there’s anything I’ve been thinking about for the past year as an investor, it’s: what are all the things in front of us preventing us from getting more compute into the hands of users? The things you hear about are, for instance, availability: last year, if you wanted to get an H100, you’d need to put in a two-to-three-year reservation just to access that compute. Reliability, up and down the hardware-software stack: the mean time between failures for one GPU may be really long, but multiply that across thousands of GPUs and you’ll get a failure every hour. At the software level there are all these hallucinations, and people are trying to build better eval systems so that these models become more useful to end users. Ease of use: you don’t want to have to change your code to run on a new AI chip, so someone has to build a new compiler to make that possible. TSMC’s chip-on-wafer-on-substrate (CoWoS) packaging, as we know, is bottlenecking production of accelerators built around HBM, a key piece of technology for AMD’s and NVIDIA’s GPUs. Yada yada yada… all these issues. But I realized that I actually don’t need to tell you this, because you can read it in any New York Times article or hear it on your favorite All In podcast.
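
To put that reliability point in perspective, here’s a quick back-of-the-envelope sketch in Python. The MTBF and fleet-size numbers are illustrative assumptions on my part, not measured figures:

```python
# Rough sketch: how per-GPU reliability compounds at cluster scale.
# Both numbers below are illustrative assumptions, not measured figures.

single_gpu_mtbf_hours = 10_000   # assume one GPU fails roughly every ~14 months
num_gpus = 10_000                # a cluster of thousands of GPUs

# Treating failures as independent, fleet-level MTBF shrinks roughly linearly:
fleet_mtbf_hours = single_gpu_mtbf_hours / num_gpus

print(f"Single-GPU MTBF: {single_gpu_mtbf_hours:,} hours")
print(f"Fleet of {num_gpus:,} GPUs: a failure roughly every {fleet_mtbf_hours:.1f} hour(s)")
# => about one failure per hour, which is why checkpointing and
#    fault-tolerant scheduling matter so much at this scale.
```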

Slide 7: But what I can tell you is: if I were to do it all again, what would I do? And implicit in that question is this question of: do we actually need more compute?

Slide 8: If anyone knows what this is, shout it out. Yeah, that’s an optical transceiver. This is right now the backbone of the highest-performance networking in the data center. If you look at GPU workloads in practice, you’re actually getting at most maybe 50% hardware utilization. The reason these devices are useful is that they help us feed our really hungry compute with more data at the higher bandwidths we need. If you were to, say, saturate a DGX H100 at peak speeds with traditional passive copper cabling and its associated networking equipment, you’d end up expending a lot of energy per bit to transfer that data. If you move to active copper and eventually to optical fiber, those are the best technologies we can use today to transfer information at lower energy per bit. There’s a whole bunch of tradeoffs that come with doing that, but in my view this is going to be a huge cost center for AI data center buildouts – not just compute but networking as well.
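
To show why energy per bit matters at this scale, here’s a rough sketch. The aggregate bandwidth and picojoule-per-bit figures are assumptions I’m picking purely for illustration, not vendor specs:

```python
# Rough sketch: why picojoules-per-bit matter once you aggregate a whole fabric.
# The bandwidth and energy-per-bit figures are illustrative assumptions only.

def interconnect_power_watts(total_bandwidth_tbps: float, energy_pj_per_bit: float) -> float:
    """Power (W) = bits moved per second * energy per bit (J)."""
    bits_per_second = total_bandwidth_tbps * 1e12
    joules_per_bit = energy_pj_per_bit * 1e-12
    return bits_per_second * joules_per_bit

# Suppose a cluster's fabric moves an aggregate 10,000 Tb/s (10 Pb/s).
aggregate_tbps = 10_000

for label, pj_per_bit in [("higher-energy electrical link", 15.0),
                          ("lower-energy optical link", 5.0)]:
    kilowatts = interconnect_power_watts(aggregate_tbps, pj_per_bit) / 1e3
    print(f"{label:>30}: ~{kilowatts:,.0f} kW just to move bits")
```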

Slide 9: Another large cost center will be this. This is an example of a next-gen direct-to-chip smart cooling plate. It takes a very targeted approach, feeding liquid coolant directly to the areas of a chip that dissipate the most heat. The alternative, which is what’s currently widely done, is to use large heat sinks that cover big areas of a chip and then pass air or liquid coolant through them, resulting in a less efficient system. I put this here because it’s just one example of the sorts of creative solutions people are applying to maximize compute performance. Without effective cooling, you end up with more hardware reliability and performance issues. Another thing we hear about a lot is immersion cooling, where you dip your whole system into a dielectric fluid, which is great for dissipating heat but is a mess when it comes to serviceability and maintenance. All of this cooling infrastructure will also be a huge cost center for data centers.
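
For a sense of the numbers involved, here’s a simple heat-balance sketch using the standard Q = ṁ·c_p·ΔT relation. The 30 kW figure is the chip power I mentioned earlier; the coolant temperature rise is an assumption I’m making for illustration:

```python
# Rough sketch: the heat-balance arithmetic behind direct-to-chip liquid cooling.
# Uses Q = m_dot * c_p * delta_T; the delta_T value is an assumption for illustration.

chip_power_w = 30_000      # ~30 kW peak draw, the figure mentioned earlier for the wafer-scale chip
c_p_water = 4186           # specific heat of water, J/(kg*K)
delta_t_k = 10             # assumed coolant temperature rise across the cold plate, in kelvin

mass_flow_kg_per_s = chip_power_w / (c_p_water * delta_t_k)
liters_per_min = mass_flow_kg_per_s * 60   # ~1 kg of water is ~1 liter

print(f"Coolant flow needed: ~{mass_flow_kg_per_s:.2f} kg/s (~{liters_per_min:.0f} L/min)")
# Targeted cold plates help because they spend that flow where the heat actually is,
# instead of spreading it across a large, mostly cooler heat-sink area.
```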

Slide 10: So just to bring this point home: computing is really only one part of the story. On the bottom left-hand side here is one Cerebras compute system, and around it is just a sample of all the different infrastructure you need to get that compute working. You need all these external servers, external memory, smart NICs and cables, chillers and pumps. You may need heterogeneous compute to serve different workloads. The chip itself is just the beginning.

Slide 11: So you might ask: okay, well, we can scale compute as long as we throw enough money at the problem, so why can’t we just build more data centers? What we’re hearing now is that the answer is actually no… and good luck finding supply. I was talking to a number of data center sales executives over the past week in preparation for this talk, and I learned some really staggering statistics: something like 95% of data center capacity coming online in the United States over the next few years is already spoken for, and it takes only about six days from the time a data center comes online for someone to buy out the lease. These data centers are being consumed at an enormous clip. The reason, as we all hear, is that there’s a limitation on the availability of power to data centers. But what most people don’t realize is that it’s not actually about power generation, it’s about power distribution – about utilities being able to upgrade distribution lines and the time that takes. So earlier this year, Amazon bought a data center campus sited next to a nuclear power plant in Pennsylvania – almost like having a private utility – in order to get the power it needed.

Slide 12: And without being prescriptive about a solution, the big opportunity area I’d suggest entrepreneurs think about is the question of: how can we do more with less? For instance, instead of adding bandaids to supply enough power to our compute, can we come at it from the other side and make the compute itself much more efficient? Importantly, can we do this without compromising performance or making it much harder for engineers to use?

Slide 13: Then you might say: okay, we get it, hardware is hard, is that it? Well, you also have to consider the software story. A big challenge we faced at Cerebras was building a compiler that could compete on par with CUDA. What I suggest to entrepreneurs trying to navigate this space is: don’t try to compete directly with NVIDIA. Can you instead find a domain-specific application in which you can vertically integrate and perform several orders of magnitude better than CUDA, which has to serve as a general platform? Technologists don’t like thinking about that, because it means trading off a technology dream and starting to think about markets. And maybe they can’t become the coveted platform that would garner 50x revenue multiples. To that I’ll say: okay, that’s fine, but if you want to take the horizontal platform approach, you’re going to have to recognize that compiler generalizability is a huge obstacle to overcome.

Slide 14: So they’ll say: fine, I’ll hire great compiler engineers, is that it? Well, probably the biggest regret I have from my time at Cerebras, the thing I truly believe would’ve helped us scale more quickly, was not figuring out the debugging story earlier. We actually started our debug team about a year into building a functional software stack. For a new chip with a new ISA and hundreds of thousands of cores, that was a mistake. In retrospect, I would start that team right alongside the compiler and systems software teams.

Slide 15: Okay, so you might say: fine, we’ll build a compiler and our debug tools, we’ll build our hardware and data centers… is that it? There can’t be anything else blocking us from putting more compute out into the world, right?? Sorry to say that right now, the number one problem facing AI chip companies trying to get into the market is software distribution. Why is that? Any enterprise that starts an AI project goes straight to what they know: AWS, Azure, GCP. Their data is already there. They’re already getting free credits. And if they want specialized compute, AWS will say: we can give you specialized compute, we have Inferentia. So users can scratch that itch that way.

Slide 16: So there’s a question of: as we move into a future of more and more heterogeneous compute, how can we ease this problem of distribution? That’s another great opportunity for all of us to think about.

Slide 17: Now, going back to this question of: if I could do it all over again, what would I do? Implicit in that question was this other question: do we really need more compute? Having looked deeply at this problem, I’d say there’s a small, emerging (and valid) camp of people – let’s call them the decentralized folks – who would say the answer is no. My steelman argument for that side goes something like this: if you look at how much enterprise-grade compute was procured over the past several years, a back-of-the-envelope calculation yields capacity numbers in the zettaflops. A huge fraction of that capacity, assuming at best 50% hardware utilization, is simply going unused. So there are companies like Together AI that originally tried to make compute available by letting the public access enterprise data centers through a value-added software layer on top. Stability AI, for instance, before that whole thing happened, could’ve resold their compute back to the market and earned another $140M in revenue. I call this the problem of Stranded Compute.
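
For what it’s worth, here’s roughly what that back-of-the-envelope calculation looks like. The installed-capacity figure is an assumption I’m using for illustration; only the “at best 50% utilization” number comes from the argument above:

```python
# Rough sketch of the "stranded compute" back-of-the-envelope argument.
# The installed-capacity number is an illustrative assumption, not sourced data.

installed_capacity_zflops = 1.0   # assume ~1 zettaFLOP/s of enterprise accelerators installed
average_utilization = 0.5         # the "at best 50% hardware utilization" from above

idle_zflops = installed_capacity_zflops * (1 - average_utilization)
idle_eflops = idle_zflops * 1_000   # 1 zettaFLOP/s = 1,000 exaFLOP/s

print(f"Idle capacity: ~{idle_zflops:.1f} ZFLOP/s (~{idle_eflops:,.0f} EFLOP/s)")
# Even with hand-wavy inputs, the idle fraction is enormous, which is the gap
# that compute marketplaces and decentralized-compute players want to fill.
```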

Slide 18: I just wanted to end with some concluding thoughts: it feels to me like we’re at the precipice of an explosion of heterogeneous multiprocessor compute. Large companies are really betting on scaling laws playing out. They’re buying up data center capacity at an enormous clip. But most customers using this enterprise compute are still in the beginning phases; they’re still prototyping. To me, as an investor, that feels scary but it also feels exciting. It feels scary because there’s this question of: what if we’re in an environment of artificially scarce compute where the only people who can actually afford it are the big guys, so they just keep getting bigger? But it’s exciting because if this demand continues, then there’s huge promise for us as entrepreneurs and investors to build what we think the future of AI infrastructure should look like.

Brief Q&A recorded following the talk (warning: poor audio quality)
