Justin Selig

Top 10 Challenges Facing Chip Startups

Note: I wrote this for Eclipse’s blog.

Every so often, Cerebras CEO and Co-Founder, Andrew Feldman, rallies the team with a powerful reminder: “Once in a generation, a new workload emerges that drives the creation of a new wave of infrastructure companies. Today, that workload is AI.” Previously, cloud computing and storage drove data center innovation. Before that, it was personal and edge computing. And even earlier, the telecom industry prospered during the Internet scale-out era. Each wave reshaped an associated ecosystem of software and hardware, and now, AI stands as the catalyst for the next wave.

However, unlike previous workloads, AI is much more general, conceptually distinct from the infrastructure that composes it. AI is more than just ML — it encompasses a broad range of concepts, including classical algorithms, synthetic data generation, world modeling, agentic co-pilots, and computational creativity. With such a blanket phrase, one might as well describe “AI” as “fancy algorithms” — or if we want to get more specific, “fancy algorithms requiring lots of matrix multiplications that happen to work well on GPUs.” When looked at from this mundane perspective, of course AI is poised to be the “next big workload.” Humans will always want — and expect — their devices to be smarter.

However, infrastructure lags behind applications. And it’s important to understand specifically what underlies an application because value accrues to the infrastructure providers that prove themselves the best on these workloads in perpetuity. Diffusion models, for instance, have dominated public attention in recent years for their stunning creativity in image generation. But AI research is extremely dynamic. It’s not clear what exactly will drive sustaining innovation in all branches of “Visual AI” in years to come. Not to mention, current AI infrastructure falls short in performance on applications like video generation compared to language modeling. The world is much bigger than just the linear algebra at the heart of LLMs, and there’s a much bigger need for generality in compute. Only solutions that support the full breadth of generality in this nascent and ever-growing market will truly matter.

Startups are Critical to Chip Innovation

To make meaningful progress, the world needs new and better infrastructure. Now, it’s the responsibility of startups to make bold bets – to build the future they envision. It’s this level of risk that large companies are rarely willing to take on. Large chip companies will always promise incremental improvements along their dimensions of competency. But we need new chip companies to: (1) look at problems differently and from first principles to correct the fundamental incompatibilities of incumbents with current and future market demands; (2) take risks on ideas that demonstrate orders-of-magnitude improvements, not just lower-risk incremental improvements.

At Eclipse, we’ve backed three such startups: Cerebras, Tenstorrent, and Efficient Computer.

Building a new chip company today is difficult because it requires navigating a litany of hard technical challenges and fierce competition. Here’s a look at those challenges and how to navigate them:

1. When Selling Hardware, Solve for Total Cost of Ownership

In 2023, the demand for high-performance GPU reservations far exceeded supply to such an extent that, if you asked anyone building an AI software product what their main problem was, they would’ve told you “compute availability”. Now, as supply catches up, the issue has shifted to compute availability at the right price point. In this market, CTOs, CIOs, and MLEs/SWEs — often the same person in smaller organizations — closely scrutinize price-to-performance. While important, startups must consider that this is just one component of the calculus for total cost of ownership (TCO).

Buyers of compute hardware assess TCO by examining capital and operating expenses, including the computer system, datacenter infrastructure, electricity, and personnel costs for operating and building software. Since TCO varies by customer, companies must engage early with clients to understand their specific considerations.
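To make the components concrete, here is a minimal sketch of a TCO calculation. The field names and every dollar figure are hypothetical placeholders chosen for illustration, not numbers from any real customer; an actual model would add whatever line items a given buyer cares about.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760


@dataclass
class TcoInputs:
    hardware_cost: float              # purchase price of the compute system (capex)
    datacenter_cost: float            # racks, cooling, networking (capex)
    avg_power_kw: float               # average electrical draw while deployed
    electricity_price_per_kwh: float  # local electricity price
    annual_personnel_cost: float      # people operating the system and building software
    years: float                      # deployment lifetime being evaluated


def total_cost_of_ownership(x: TcoInputs) -> float:
    """Sum capital expenses and operating expenses over the deployment lifetime."""
    capex = x.hardware_cost + x.datacenter_cost
    energy = x.avg_power_kw * HOURS_PER_YEAR * x.years * x.electricity_price_per_kwh
    personnel = x.annual_personnel_cost * x.years
    return capex + energy + personnel


# Hypothetical example: the purchase price is only about a third of the total.
example = TcoInputs(
    hardware_cost=100_000,
    datacenter_cost=20_000,
    avg_power_kw=10,
    electricity_price_per_kwh=0.10,
    annual_personnel_cost=50_000,
    years=3,
)
print(f"3-year TCO: ${total_cost_of_ownership(example):,.0f}")
```

Even in this toy case, personnel and electricity outweigh the sticker price, which is why price-to-performance alone can mislead.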

It’s crucial to account for implicit costs, especially when your product provides enormous value, but not in the dimensions that are historically tracked in the market. Take Efficient’s chip – it consolidates previously disparate chip functions onto a single SoC, reducing motherboard component counts. While energy efficiency translates directly to electricity savings, parts consolidation cuts down on design and engineering efforts, which can be significant, but are often undervalued when measured in human labor hours alone. Reducing energy consumption also decreases the frequency with which users have to replace device batteries. Decreasing battery maintenance costs decreases OpEx for the customer, providing additional value. Traditional TCO metrics that only capture explicit costs would fail to capture these broader savings in supply-chain complexity, troubleshooting, maintenance, and serviceability.

2. Design for Debuggability Across the Whole Stack

Building a new chip architecture comes with many unexpected challenges, like having to reinvent all the tools users take for granted. Chief among these are debug tools. The reality is that most software engineers avoid reading documentation, opting for a trial-and-error approach to development. Modern programming language toolchains spoil users with nicely formatted error messages and recommendations on how to fix bugs. This allows users to experiment without necessarily understanding specific coding nuances. The term for this trial-and-error approach to programming is “compiler-guided development.”

For a company primarily innovating in hardware, compounding engineering risk with unproven software is a mistake. When it comes to software, stick to established software behavior patterns. Not building a comprehensive compiler today is like creating a smartphone without a touchscreen: it just won’t meet expectations. At Tenstorrent, Cerebras, and Efficient, debugging is categorized into compile-time and runtime. Each company meets users where they are by building on standard compiler infrastructure like LLVM, Clang, or MLIR. Runtime debugging is further divided into functional and performance debugging. Functional debugging during runtime is enabled with white-box methods like printf or gdb. Performance debugging tools are often bespoke and require careful product considerations.

As a general rule, debug tooling is a much more significant product endeavor than chip companies anticipate. It’s crucial to start these efforts early, designing hardware with robust debug capabilities from the outset.

3. Don’t Compete on Benchmarks, It’s a Losing Strategy

In 2018, I joined the first working group at MLPerf, now MLCommons, to standardize performance benchmarks for AI chip companies, anticipating a surge in AI hardware. Soon after, all the large chip companies dedicated teams of engineers to optimize and outcompete competitors on these benchmarks. It was a game fueled by resources from companies with the money and bandwidth to play it. Our strategy at Cerebras differed: set realistic expectations with customers and point them to examples of how our systems performed with real customer workloads. This approach is crucial for any chip startup competing in a crowded sector against resource-rich giants.

4. Anticipate Distribution Being Harder Than You Think

The biggest hurdle for chip companies with functional products is distribution. For AI chip companies, breaking into the public cloud has historically been a non-starter. Public cloud providers impose hefty software requirements, like multi-tenancy support, which take years for a chip company to develop. Plus, large cloud providers often build their own chip hardware to keep costs down, making it even tougher for competitive startups to gain entry.

Datacenter compute startups must now find alternative distribution methods. Many AI chip companies are creating dedicated clouds or their own service offerings. For instance, Cerebras built Condor Galaxy in conjunction with G42 to enable users globally to access their specialized hardware for AI training.

5. Get Customer Success Stories Quickly

Evidence of customer success creates a flywheel effect that generates more success. Buying new hardware is a significant capital expense, and risk-averse buyers often prefer legacy companies offering reliable, lower-performance alternatives. Many companies avoid buying from young startups because their procurement guidelines require evidence of significant market traction and funding.

To overcome this, chip startups need strong word-of-mouth from initial happy customers to create market pull. Solving for this “cold start” problem might mean offering discounts or contractual clauses like a full IP transfer in case of bankruptcy. One such tactic that we leveraged at Cerebras was to work with researchers at government labs who had budgets allocated specifically for novel hardware in our category.

6. Invest Initially in Software Over Hardware Manufacturing

Hardware manufacturing is tremendously expensive, especially in the world of semiconductors. There will be one, maybe two, shots at creating a physical design, after which it becomes prohibitively costly to tape out new silicon masks. This is why most hardware platform companies that manufacture chips eventually evolve into primarily software companies. Software becomes the bottleneck to improving existing hardware performance and generalizability.

When we made our Seed investment in Efficient Computer, an undervalued aspect of the company was that they had spent years building a fully functional compiler before spending a dime on hardware manufacturing. I believe initial investment in software should be the new standard for chip startups. EDA software tools are so advanced that it’s hard to justify the risk of taping out silicon without comprehensive functional and performance tests in simulation. Similarly, given the choice, companies should prioritize developing a functional compiler before investing heavily in hardware production. It’s a better use of resources and de-risks what will become the company’s long-pole software project.

7. Prioritize Energy Efficiency Over Power

Historically, compute users primarily focused on device performance, measured as device “operations” per unit of time (for example TOPS, tera-operations per second). This metric translates to mathematical operations executed on the computer, which amounts to useful work performed by the device. However, performance is no longer enough. Energy input now bottlenecks all compute modalities, making energy efficiency a primary consideration. It is measured as performance per watt (TOPS/Watt); a more energy-efficient device lasts longer and does more useful work for the same amount of energy. Most “low power” computers don’t optimize for this, simply reducing overall power consumption. It’s akin to having a long-range car that sacrifices fuel efficiency or speed.
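The difference between low power and energy efficiency is easy to see with a little arithmetic. The sketch below compares two made-up devices; the TOPS and wattage figures are illustrative assumptions, not measurements of any real chip.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Energy efficiency: tera-operations per second delivered per watt drawn."""
    return tops / watts


def energy_joules(workload_tera_ops: float, tops: float, watts: float) -> float:
    """Energy needed to finish a fixed workload: power multiplied by run time."""
    seconds = workload_tera_ops / tops
    return watts * seconds


# Hypothetical devices running the same 3,600 tera-operation workload.
fast_chip = {"tops": 100.0, "watts": 50.0}      # higher power draw, higher efficiency
low_power_chip = {"tops": 10.0, "watts": 10.0}  # lower power draw, lower efficiency

for name, chip in (("fast", fast_chip), ("low-power", low_power_chip)):
    efficiency = tops_per_watt(chip["tops"], chip["watts"])
    energy = energy_joules(3600.0, chip["tops"], chip["watts"])
    print(f"{name}: {efficiency:.1f} TOPS/W, {energy:.0f} J for the workload")
```

In this example the "low-power" chip draws a fifth of the power but spends twice the energy to finish the same job, because it runs ten times longer.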

This issue is particularly important for datacenter chips, where power availability has become the largest constraint to bringing new compute capacity online. Over 95% of U.S. data center capacity is already reserved for the next few years and originating new datacenters is too difficult. The problem lies not in power generation, but power distribution: utilities can’t build power lines fast enough to meet demand. In this environment, compute startups will need to rethink assumptions around deployment conditions.

Building a resilient product might mean adapting to power-constrained environments and focusing on energy-efficient consumption. Build your device such that it requires less energy to operate at peak performance, still outperforming GPUs. For instance, Efficient created an entirely new category of hardware architecture for general-purpose computing to bring to bear this level of energy-efficiency.

8. Understand Reliability is a Multi-faceted Beast

What matters to users of compute is straightforward, but hard to accomplish — compute should just work (functionality) in a reasonable amount of time (performance). When a user kicks off a job to train a model, they want their code to compile and run to completion overnight without interruptions.

For GPUs, the mean-time-between-failures (MTBF) is about one million hours. This sounds impressive, but in a datacenter with 20,000 GPUs, it works out to roughly one failure somewhere in the fleet every 50 hours, or about once every two days. In practice, failures happen even more often, disrupting user jobs. Data centers must factor these failure rates into their operating models and develop strategies for remediation and workload migration. This is a symptom of a significant issue: low system reliability.
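The fleet-level figure follows from simple division. Here is a minimal sketch, assuming failures are independent and occur at a constant rate (an exponential failure model, which is a simplifying assumption on my part); the function names are just for illustration.

```python
import math


def fleet_hours_between_failures(device_mtbf_hours: float, num_devices: int) -> float:
    """Expected hours between failures anywhere in a fleet of identical devices."""
    return device_mtbf_hours / num_devices


def prob_job_survives(device_mtbf_hours: float, num_devices: int, job_hours: float) -> float:
    """Probability that no device fails during a job that uses every device."""
    return math.exp(-num_devices * job_hours / device_mtbf_hours)


hours = fleet_hours_between_failures(1_000_000, 20_000)
print(f"One failure every {hours:.0f} hours (~{hours / 24:.1f} days)")

# Chance that a 12-hour overnight job spanning the whole fleet sees no failure.
print(f"Overnight job survives: {prob_job_survives(1_000_000, 20_000, 12):.0%}")
```

Under these assumptions, an overnight job spanning the whole fleet is interrupted roughly one night in five, even though each individual GPU looks extremely reliable on paper.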

Incumbents have been slow to provide the tools that predict and enhance device reliability. This gap creates a prime opportunity for new companies. By prioritizing reliability as a core design principle, new companies can improve user experience and deploy hardware that scales more effectively.

9. Weigh Specialization Against Generalizability

For AI chip companies, building a functional compiler is a complex challenge due to the vast number of kernels (mathematical operation primitives) required to support all customer use cases. Companies face a crucial decision: should they hand-write kernels, auto-generate them, or use a combination of both? Hand-writing kernels for a complex architecture is labor-intensive and can limit software generalizability. On the other hand, auto-generating kernels is highly complex, especially for atypical microarchitectures.

My advice to founders: focus on a domain-specific application where you can vertically integrate your software and hardware to achieve 100-1000x performance gains over incumbents. Find product-market fit first, then consider allocating resources towards a more general, horizontal platform. Capturing a market segment is much more challenging and nonlinear than the predictable engineering task of refactoring a codebase to platform-ize.

10. Specialization Goes Beyond Hardware (think: Markets)

Market specialization in the right niche can be powerful. For example, can you vertically integrate to make an optical metamaterials-based AI computer excel in endoscope applications? Or, like Cerebras, can you cater to unique customers, such as governments that prioritize diversifying away from the public cloud? Speaking of market specialization, Tenstorrent is one of a few companies today offering a solution that encompasses both AI hardware and open software to leverage it. They couple AI chip designs with conventional compute capabilities, such as RISC-V cores, to meet customer needs outside of AI. By licensing IP, customers can embed Tenstorrent chips into their own products. However, Tenstorrent also sells their own chips, systems, and even data center racks to address all potential market needs.

When hardware inevitably fails to meet all customer needs, software should be employed to address its shortcomings. For instance, memory-bandwidth bottlenecks increase energy expenditure and costs when performing high volumes of inference on GPUs. Techniques like prompt caching (keeping frequently used prompts, and the state already computed for them, in memory) help reduce repeated data fetches and recomputation. Training algorithms can use latency hiding to transfer data to a chip while another operation is being performed, so that communication overlaps with computation. Algorithms exploiting structured data can replace slower GEMM kernels to speed up execution. Such software techniques can mask the limitations of hardware or augment its capabilities.
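As one illustration of latency hiding, the sketch below prefetches the next batch on a background thread while the current batch is being processed. It is a toy model under stated assumptions: load_batch and compute are hypothetical stand-ins that sleep to simulate a data transfer and a kernel, not calls into any real accelerator runtime.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_batch(index: int) -> list[int]:
    """Hypothetical stand-in for moving one batch of data onto the chip."""
    time.sleep(0.02)  # simulated transfer latency
    return list(range(index * 4, index * 4 + 4))


def compute(batch: list[int]) -> int:
    """Hypothetical stand-in for the kernel that consumes a batch."""
    time.sleep(0.02)  # simulated compute time
    return sum(batch)


def run_overlapped(num_batches: int) -> list[int]:
    """Process batches while the next one is already being transferred."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_batch, 0)
        for i in range(num_batches):
            batch = pending.result()  # wait for the transfer issued earlier
            if i + 1 < num_batches:
                # Start the next transfer before computing on the current batch.
                pending = pool.submit(load_batch, i + 1)
            results.append(compute(batch))
    return results


print(run_overlapped(3))
```

The results are identical to loading and computing strictly in sequence; the only change is that each transfer after the first runs concurrently with the previous batch's compute, so its latency is hidden rather than eliminated.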

Final Thoughts

It’s an exciting time to build in the semiconductor sector. The barriers to entry for new chip companies have never been lower. However, emerging market obstacles present significant challenges for those aiming to innovate.

If you’re navigating the chip industry and have insights or questions, reach out to me on LinkedIn.
