Tensordyne Claims Massive Speed and Power Improvement Over Nvidia

If simulations are to be believed, startup Tensordyne’s new AI chip could crush market leader Nvidia on energy efficiency and latency for inference. The company just sent the plans for its first chip to be manufactured, with commercial sales of a 72-chip system scheduled for the second half of 2027. Tensordyne claims its 72-chip system can run large language models four times as fast while using one-fifth the power of a 72-GPU Nvidia GB300 system. However, real systems won’t be around to back these figures up until the end of the year.

The not-so-secret sauce behind the outsized efficiency of Tensordyne’s new chip, Napier, is how it does matrix multiplication, the main math of AI. It takes advantage of the fact that the logarithm of A times B equals the logarithm of A plus the logarithm of B.
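As a rough illustration of that identity, and not a description of Napier’s actual circuits, the Python sketch below stores each number as a sign plus a base-2 logarithm, so that a multiplication reduces to a single addition. The helper names to_log, from_log, and log_multiply are hypothetical and chosen only for this example.

```python
import math

def to_log(x):
    """Encode a nonzero real number as (sign, log2 of its magnitude)."""
    if x == 0:
        raise ValueError("zero has no logarithm; a real format needs a special case")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def from_log(encoded):
    """Convert a (sign, log2 magnitude) pair back to an ordinary float."""
    sign, exponent = encoded
    return sign * 2.0 ** exponent

def log_multiply(a, b):
    """Multiply two log-encoded numbers.

    The magnitudes are combined with one addition. The sign product below
    stands in for what would be a single XOR gate in hardware.
    """
    return (a[0] * b[0], a[1] + b[1])

# 6.0 * -0.5, computed with an addition instead of a multiplication
product = from_log(log_multiply(to_log(6.0), to_log(-0.5)))
print(product)
```

The result matches ordinary multiplication up to floating-point rounding. The work that remains is in to_log and from_log, the conversions into and out of the logarithmic domain.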

“We’ve turned multipliers into adders,” explains Gilles Backhus, a Tensordyne founder and vice president of AI. Adders are smaller and more energy efficient logic circuits than those that do multiplication, he says. So Napier can pack more compute into a smaller area and still save on power.

New Kinds of Numbers

That such a thing is possible has long been known, but there wasn’t a good way to use it. Converting back and forth between logarithmic numbers and the floating-point numbers that describe neural networks took too much time and energy and introduced too many inaccuracies. Not anymore, according to Backhus.
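To see where the conversions come in, consider a dot product, the building block of matrix multiplication. Each product can be formed by adding logarithms, but summing the products means going back to linear values, and rounding the logarithms to a limited number of bits introduces error. The Python sketch below assumes a simple fixed-point logarithm with frac_bits fractional bits and nonzero inputs. It is a generic logarithmic-number-system illustration with hypothetical function names, not Tensordyne’s conversion method.

```python
import math

def quantize_log2(x, frac_bits):
    """Return (sign, log2|x| rounded to `frac_bits` fractional bits).

    Assumes x is nonzero; a real format would reserve a code for zero.
    """
    scale = 1 << frac_bits
    sign = -1 if x < 0 else 1
    return sign, round(math.log2(abs(x)) * scale) / scale

def log_dot(xs, ws, frac_bits):
    """Dot product whose multiplications are done as additions of logarithms."""
    total = 0.0
    for x, w in zip(xs, ws):
        sign_x, log_x = quantize_log2(x, frac_bits)  # linear -> log conversion
        sign_w, log_w = quantize_log2(w, frac_bits)
        log_product = log_x + log_w                  # the "multiply" is an add
        # log -> linear conversion, needed before the products can be summed
        total += sign_x * sign_w * 2.0 ** log_product
    return total

xs = [0.3, -1.7, 2.2, 0.9]
ws = [1.1, 0.4, -0.6, 2.5]
exact = sum(x * w for x, w in zip(xs, ws))

for bits in (2, 4, 8, 16):
    approx = log_dot(xs, ws, bits)
    print(f"{bits:2d} fractional bits: {approx:.6f} (error {abs(approx - exact):.2e})")
```

With only a few fractional bits the result drifts visibly from the exact value, and with more bits it generally closes in. Every element passes through a conversion in each direction, which is why the cost and accuracy of that step decide whether the approach pays off.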

“So far no one has figured out how to do the linear to logarithm and logarithm to linear conversion as we have,” he says. “And that’s actually the crux of that whole thing. Our engineers have figured out ways to do this very elegantly and very very accurately and cheaply on silicon.”

The importance of number formats hasn’t been lost on the AI industry. Speaking at IEEE Hot Chips in 2023, Nvidia chief scientist Bill Dally attributed the majority of the improvement in the company’s GPUs at the time to the use of shorter number formats and the smaller circuits they require.

Researchers have also worked on circuits to compute with alternative formats, such as the logarithm-like posit and, more recently, its scientific-computing counterpart, the takum. However, these formats have not reached mainstream adoption, mostly because their hardware implementation is so different from that of traditional floating point.

Inference Demands Influence Architecture

Market trends, including the rise of AI agents, mean that inference (the execution of trained neural network models) is becoming more important than training new large language models. Factors such as cost and the speed at which answers are delivered are starting to dominate, and that has led AI companies to look for system architectures better suited to inference.

Tensordyne executives say they saw this coming and engineered their computers to meet it.

Tensordyne’s Napier AI chip includes 144 gigabytes of HBM, but the real power comes from its unusual math. Image: Tensordyne

There are two main parts to executing an LLM: prefill and decode. In the prefill stage, the model takes in the input text and turns it into tokens, the basic units it can work with, and builds a kind of working memory about the input, called the key-value cache. It’s a computationally heavy task.

Decode is where the LLM generates its output tokens, the answer or response to your input. Each new token is predicted using the previous token and the key-value cache. This sequential nature can make decode a slower process, and it’s more dependent on memory and network latency than computing power.
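The split between the two stages can be sketched in a few lines of Python. This is a toy, single-head attention loop with made-up embeddings and no learned weights; the names embed, prefill, decode_step, and generate, and the sizes DIM and VOCAB_SIZE, are assumptions for illustration and do not describe any vendor’s software. It shows only the shape of the work: prefill handles the whole prompt and fills the key-value cache, while decode must run one token at a time, rereading the growing cache at every step.

```python
import math

DIM = 4          # toy embedding width (assumption)
VOCAB_SIZE = 16  # toy vocabulary size (assumption)

def embed(token_id):
    """Deterministic stand-in for a learned embedding table."""
    return [math.sin((token_id + 1) * (i + 1)) for i in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, cache):
    """Softmax attention of one query over every cached key and value."""
    scores = [dot(query, key) / math.sqrt(DIM) for key in cache["keys"]]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * value[i] for w, value in zip(weights, cache["values"])) / total
            for i in range(DIM)]

def pick_token(query, cache):
    """Choose the vocabulary entry whose embedding best matches the context."""
    context = attend(query, cache)
    return max(range(VOCAB_SIZE), key=lambda t: dot(context, embed(t)))

def prefill(prompt_ids):
    """Process the whole prompt and build the key-value cache.

    Real hardware handles all prompt positions in parallel, which is why
    this stage is limited mainly by compute.
    """
    cache = {"keys": [], "values": []}
    for token_id in prompt_ids:
        vector = embed(token_id)
        cache["keys"].append(vector)
        cache["values"].append(vector)
    first_token = pick_token(embed(prompt_ids[-1]), cache)
    return cache, first_token

def decode_step(previous_token, cache):
    """Generate one token from the previous token and the cache."""
    vector = embed(previous_token)
    cache["keys"].append(vector)
    cache["values"].append(vector)
    return pick_token(vector, cache)  # reads the entire cache every step

def generate(prompt_ids, n_new):
    cache, token = prefill(prompt_ids)
    output = [token]
    for _ in range(n_new - 1):  # strictly sequential: each step needs the last
        token = decode_step(token, cache)
        output.append(token)
    return output, cache

tokens, cache = generate([3, 1, 4], n_new=5)
print(tokens, len(cache["keys"]))
```

Each decode step does little arithmetic but touches every cached entry, which is the pattern that makes memory bandwidth and latency the limiting factors.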

So, AI chip makers are starting to build systems with those two different demands in mind. Nvidia is touting a system where a server rack full of Rubin GPUs handles prefill and several racks of its Groq 3 processors do the decode. Amazon Web Services is combining a rack of its Trainium AI chips for prefill with several racks of Cerebras’s wafer-scale computers for decode.

Tensordyne says its system can handle both jobs. “We’re optimizing for two hard challenges here at the same time,” says R.K. Anand, chief product officer and co-founder of Tensordyne. “We’re the first company proving that you can do both without going to multiple vendors and multiple racks.”

The dense compute needed for prefill comes from the logarithmic math. The needs of decode are met by 144 gigabytes of high-bandwidth memory on each chip and a custom 1-microsecond-latency network called Tensordyne Napier Link.

In a “pod” system that fits in one quarter of a standard rack, Tensordyne packs in 72 Napier chips, 8 Intel Xeon CPUs, and 64 terabytes of solid-state storage. A 4-pod rack working on a 2-trillion-parameter LLM would deliver 1,300 tokens per second per user at a cost of $11 per 1 million tokens while consuming 120 kilowatts of power, the company claims, with one pod crunching out prefill and three working on decode. To get similar tokens-per-second-per-user numbers, a 9-rack Rubin and Groq 3 system would likely consume 1.5 megawatts, according to Tensordyne.
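A little arithmetic puts those claims in proportion. The Python sketch below uses only the figures Tensordyne has quoted and assumes every Napier chip carries the full 144 gigabytes of HBM. The per-chip power figure is simply rack power divided by chip count, so it also includes the CPUs, storage, and networking. None of these derived numbers has been measured on real hardware.

```python
# Figures as claimed by Tensordyne (simulated, not yet measured)
CHIPS_PER_POD = 72
PODS_PER_RACK = 4
HBM_PER_CHIP_GB = 144
TENSORDYNE_RACK_KW = 120
NVIDIA_SYSTEM_KW = 1500  # 1.5 megawatts for the 9-rack comparison system

chips_per_rack = CHIPS_PER_POD * PODS_PER_RACK

# Assumes every chip carries the full 144 GB of HBM
hbm_per_rack_tb = chips_per_rack * HBM_PER_CHIP_GB / 1000

# Rack power divided by chip count; includes CPUs, storage, and networking
watts_per_chip = TENSORDYNE_RACK_KW * 1000 / chips_per_rack

power_ratio = NVIDIA_SYSTEM_KW / TENSORDYNE_RACK_KW

print(f"Napier chips per rack: {chips_per_rack}")
print(f"HBM per rack: {hbm_per_rack_tb:.1f} TB")
print(f"Rack power per chip: {watts_per_chip:.0f} W")
print(f"Claimed power advantage: {power_ratio:.1f}x")
```

On those numbers, a rack holds 288 chips and roughly 41 terabytes of HBM, and the claimed power gap at similar per-user speed is about 12.5 to 1.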

Whether these numbers hold up won’t be known until later in the year, when Tensordyne plans to have a beta version available through the cloud for customers to work with. It expects to begin shipping systems to customers about a year from now.
