Justin Selig

Long Live High-Performance Computing

Note: This is a cross-post between this site and Eclipse’s blog.


High Performance Computing (HPC) — also known as “supercomputing” — was once synonymous with any system consisting of tens to hundreds of CPU cores tied together in a datacenter. Today, most “HPC” headlines are focused on large-scale AI/ML systems, with Graphics Processing Units (GPUs) stealing the spotlight. Nowadays, young developers choose to use GPU-compatible AI frameworks like PyTorch as opposed to supercomputing programming languages like C/C++, reflecting the diminishing interest in traditional HPC. At first, the HPC world was skeptical: how could so much investment be funneled into a narrow set of tools and algorithms (attention-backed neural nets implemented with autodiff APIs) and see meaningful ROI? But this shift in preferences was reaffirmed when Cray, an iconic HPC hardware company, was subsumed by Hewlett Packard Enterprise in a 2019 acquisition on the optimistic basis that Cray’s technology would serve future AI applications. Now, there’s a surge in spending on GPU infrastructure to meet growing demand in AI. The reason: GPUs offer higher performance on dense linear algebra — the type of math powering neural networks — than traditional HPC hardware, yielding lower total costs for running these systems.

Taken together, the success of GPUs may seem as if they’re all that matters for AI, but they’re only one part of the equation. GPUs still rely on traditional HPC software and hardware infrastructure (CPUs, networking gear, storage, etc.) to deliver AI services. Originally designed for graphics workloads, GPUs are a relatively new layer of the AI stack, repurposed for AI because of their greater parallel computing capabilities. GPUs were simply in the right place at the right time — offering better compute performance where this was the bottleneck to results. GPUs continued to be used for AI not just because of software lock-in as many would expect, but because they represented the best available compute substrate at a time when NVIDIA’s promises of marginal gains in price/performance enabled downstream AI service providers to control costs without a lift-and-shift in infrastructure.

But now, GPUs have reached their limits, hitting memory-bandwidth and scalability walls, especially when working on large, complex applications like computational fluid dynamics, molecular simulations, or radar signal-processing. In scale applications where GPUs are architecturally constrained, performance and cost don’t always tip in favor of these devices. That’s why — while the world is fixated on the generative AI boom and GPU infrastructure — the broader field of HPC is quietly continuing to push the boundaries in pursuit of more powerful supercomputers, catalyzing the era of “exascale computing”. So, why do we still care about traditional HPC? Well, we find ourselves in a time when many companies are looking to AI to deliver future enterprise value. But companies –– especially those operating in complex physical domains –– need not look further than the past to find answers. By re-examining problems underserved by traditional HPC methods, new AI approaches find fertile ground. And with the leverage possible via new AI techniques and associated specialized hardware (eg. physics-informed ML, AI for Science), traditional HPC can continue to thrive. The combination of AI and HPC will enable engineers to tackle previously intractable problems that will transform our physical world.

The Future of AI Hinges on Programmable Hardware Platforms

To understand the potential impact of AI in physical industries, consider the previous recipients of the Gordon Bell Prize, dubbed the “Nobel Prize of Supercomputing.” Many of these groundbreaking achievements now employ heterogeneous (ie. variably specialized) hardware, focused on applications in the physical sciences with their results fed into tangible experiments or engineering work. The first prize was awarded in 1987 to a team of researchers who developed new parallel-computing algorithms for accelerating solutions to physical-world problems in wave mechanics, fluid dynamics, and structural analysis. The most recent prize in 2022 was awarded to a team that invented an extremely efficient method for using top-tier supercomputers to simulate laser-matter interactions, addressing critical challenges in high-energy physics and radiotherapy for tumor treatment. This year, my former team at Cerebras was chosen as a finalist for this award. Using Cerebras’ specialized AI hardware, Cerebras achieved record-breaking speeds on seismic algorithms used widely in the energy industry, at a fraction of the cost, complexity, and scale of the largest modern supercomputers.

As an early Cerebras employee, I feel immense pride when considering this achievement. Much like every other AI chip startup, Cerebras once excelled primarily on hardware, but overlooked the resource demands of software. The culprit: CUDA, NVIDIA’s flagship software platform for AI kernel development. NVIDIA’s success could be largely attributed to this almost 20-year old product. To this day, new AI models are built using CUDA by default in the community. As the goliath, NVIDIA had an outsized advantage, albeit rightfully earned. In the face of a giant, few dared to compete head-to-head with CUDA for AI use cases — and those who did failed miserably.

So, in 2019, I saw an opportunity to define a new approach to an open SDK for Cerebras. It was clear to me that developers might find ways to leverage Cerebras hardware in ways we simply couldn’t predict. And that exposing previously-hidden architectural elements would be net accretive to our business: allowing a broad developer community to contribute to our kernels, whilst enabling new fundamental HPC research that would characterize a perennial moat. The journey wasn’t without uncertainties, and it took a year to refine the vision and convince leadership to resource the project. But today, the team we assembled and the suite of products we built, collectively known as the Cerebras SDK, are finalists for the 2023 Gordon Bell Prize. The achievement: 48 Cerebras machines using the SDK performed faster than Frontier (the fastest supercomputer in the world) and were on par with Fugaku (which cost $1B to build). This was made possible not in spite of, but because it was implemented on the back of specialized AI hardware. It’s a clear example of the leverage afforded by AI innovation in traditional HPC.

Old Algorithms, New Tricks

Technology has always been cyclical, and as the saying goes, “What’s old is new again.” Algorithms and problems traditionally associated with HPC are now being recast within the framework of “AI.” What were once computationally-intensive problems in protein folding, are now simple inference problems for AlphaFold. Simulations that used to take months to study atomic interactions or yield novel materials can now be completed in minutes using physically-informed surrogate AI models. Similarly, what were once error-prone optimization methods for automatically routing electrical traces on circuit boards are now robust solutions addressed via reinforcement learning.

Still, AI represents the most exciting technology platform of our generation, and innovations in hardware — such as those found in Cerebras’ chip — act as catalysts to accelerate progress in what were once challenging HPC-exclusive problems.

At Eclipse, we work with companies at the intersection of AI and the physical world. We believe fundamentally that the impact of AI extends far beyond the world of bits. These are broad terms, so we often labor to identify the how and why, using our diverse experiences as operators to, in a sense, predict the future. Most people struggle to see where we go beyond higher-quality consumer-grade video generation models in the digital domain. However, I see a clear picture of the future in the world of atoms. It’s a future grounded in faster, better, cheaper solutions to longstanding problems addressed in traditional HPC. To gain inspiration for the future impact of AI, look to history. The list of past Gordon Bell Prize winners is a good place to start.

In short, the world depends on HPC and HPC is here to stay… although it wears a new face in the form of Python-backed linear algebra.

If you are a founder building at the crossroads of software and hardware, working on “AI for Science”, HPC, or developing physical-world applications with algorithmic innovations at their core, please connect and reach out to me.

· publication