Memory is arguably the most serious constraint on modern large language models (LLMs). According to one influential paper, LLM token generation is an inherently memory-bound task, meaning the rate at which models output text is limited by how quickly data can be read from memory. The severity of this bottleneck grows with model size. This creates a “memory wall” that holds back LLM inference performance.
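A rough calculation shows why token generation tends to be memory-bound. When a model generates one token at a time, every weight in a layer is read from memory but used in only one multiply-add per sequence. The minimal sketch below assumes 16-bit weights and a hypothetical 4,096 × 4,096 linear layer, and computes how many floating-point operations are performed per byte read. The function name and layer size are illustrative assumptions, not taken from the paper.

```python
def decode_arithmetic_intensity(batch: int, d_in: int, d_out: int,
                                bytes_per_value: int = 2) -> float:
    """FLOPs per byte read for one generation step of a single linear layer.

    Hypothetical model: a d_out x d_in weight matrix stored at 16 bits
    (2 bytes) per value; each weight contributes one multiply and one add
    (2 FLOPs) per sequence in the batch.
    """
    flops = 2 * batch * d_in * d_out
    weight_bytes = bytes_per_value * d_in * d_out             # read once per step
    activation_bytes = bytes_per_value * batch * (d_in + d_out)  # input and output vectors
    return flops / (weight_bytes + activation_bytes)

# Intensity grows roughly in proportion to batch size.
for batch in (1, 8, 64):
    print(batch, round(decode_arithmetic_intensity(batch, 4096, 4096), 2))
```

At batch size 1 the ratio is roughly 1 FLOP per byte, while current accelerators can sustain hundreds of FLOPs per byte of memory bandwidth, so their arithmetic units spend most of their time waiting on memory. Batching more sequences raises the ratio, but each sequence also carries its own key-value cache that must be read from memory.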
AI hardware startup Majestic Labs is taking a direct and comprehensive approach to solving this problem. It’s developing a new AI server, Prometheus, with up to 128 terabytes of memory. That’s over 60 times more than Nvidia’s DGX B300, a cutting-edge AI server.
Sha Rabii, co-founder and president of Majestic Labs, believes that this drastic increase in memory will provide his company an edge. While he acknowledges that “Nvidia’s done a phenomenal job creating a system that can scale out,” he argues that it becomes less economical as models grow and “ends up greatly over-provisioning on compute and starving on memory.”
DRAM-Centric Architecture for LLM Memory
Majestic Labs plans to surmount the “memory wall” with an architecture that fundamentally differs from competitors’.
Nvidia’s current servers have fast high-bandwidth memory (HBM), which typically holds an LLM’s model weights. In addition, there’s an often larger but slower pool of conventional dynamic random-access memory (DRAM), which handles LLM and server overhead. Majestic instead goes all in on DRAM (specifically LPDDR6) in a unified architecture.
Rabii says that most memory interfaces are designed to operate over a short physical distance, sometimes only a few millimeters. That limits how much memory can be placed near the processor. “You get this shoreline at the compute die where you can put your HBM. If you wanted to put more, you can’t,” Rabii explains.
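The shoreline limit can be pictured as a simple geometric bound: if each memory stack must sit within a few millimeters of the compute die, the number of stacks is capped by how many fit side by side along the die’s usable edges. The sketch below is a toy model under that assumption; every dimension and capacity in it is a hypothetical placeholder chosen for illustration, not a figure for any Nvidia or Majestic part.

```python
import math

def shoreline_capacity_gb(edge_length_mm: float, usable_edges: int,
                          stack_width_mm: float, gb_per_stack: float) -> float:
    """Upper bound on memory that can sit directly against a die's edges.

    Toy model: stacks are placed side by side along each usable edge, and
    nothing can sit farther away because the memory interface only reaches
    a few millimeters.
    """
    stacks_per_edge = math.floor(edge_length_mm / stack_width_mm)
    return stacks_per_edge * usable_edges * gb_per_stack

# Hypothetical die: 25 mm edges, two edges free for memory, 11 mm-wide stacks of 36 GB each.
print(shoreline_capacity_gb(25, 2, 11, 36))
```

In this model, the only ways to add memory are denser stacks or a bigger die; the number of stacks does not grow. An interface that reaches much farther than a few millimeters removes the edge constraint altogether.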
To get around this shoreline limit, Majestic uses a proprietary memory interface built from miniature copper cables that remains effective at distances of up to a meter. This is paired with custom memory aggregation chips that sit physically next to the memory modules and coordinate memory across the server.
“It’s an endpoint for that high-speed interface and fans out to many, many commodity DRAM chips,” explains Rabii. In addition to addressing large pools of memory, Majestic says this design offers memory bandwidth up to 25.6 terabytes per second.
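The memory-bound argument also gives a ceiling on generation speed: if every weight must be streamed from memory for each new token, a single sequence cannot generate tokens faster than bandwidth divided by bytes read per token. The sketch below plugs in the 25.6 TB/s figure, assuming it is aggregate bandwidth available to one model, and uses a hypothetical 1-trillion-parameter model stored at 1 byte per parameter. The model size and helper name are illustrative assumptions, and compute and interconnect limits are ignored.

```python
def max_decode_tokens_per_second(bandwidth_bytes_per_s: float, num_params: float,
                                 bytes_per_param: float,
                                 kv_cache_bytes: float = 0.0) -> float:
    """Bandwidth-bound ceiling on tokens/s for one sequence at batch size 1.

    Assumes every weight (plus the sequence's KV cache, if given) is read
    from memory once per generated token; real throughput will be lower.
    """
    bytes_per_token = num_params * bytes_per_param + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token

PROMETHEUS_BANDWIDTH = 25.6e12  # bytes/s, Majestic's stated peak figure

# Hypothetical model: 1 trillion parameters at 8 bits (1 byte) each.
print(max_decode_tokens_per_second(PROMETHEUS_BANDWIDTH, 1e12, 1))  # prints 25.6
```

Because all sequences in a batch share the same weight reads, aggregate throughput can be much higher than this single-sequence bound, until per-sequence KV-cache traffic or compute becomes the limit.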
Ignite AI Processor for LLM Acceleration
More memory is good, but it needs to be paired with AI acceleration, the role Nvidia’s GPUs play in its systems. Majestic’s solution is Ignite, a custom AI processing unit that serves as the server’s compute engine. The Prometheus server contains 12 Ignite chips.
Ignite combines data-center-class ARM application cores with RISC-V vector and tensor cores on a single die, all sharing the same memory space. The ARM cores act as an on-chip host processor that orchestrates the AI model, while the RISC-V cores carry out the actual LLM computation. The result is a single chip that handles both orchestration and computation for LLM inference without handing work off between separate processors. Majestic Labs has yet to reveal specific metrics for Prometheus’ compute performance.
Rabii acknowledges that software is important as well, given that many AI frameworks are already entrenched. “We’re trying to reduce friction as much as possible in every aspect of our customer adoption, whether it’s physical or software,” he says. Prometheus will support PyTorch, the vLLM inference engine, and OpenAI’s Triton kernel language without requiring code modifications. According to Majestic, that means existing models built on these tools should run as-is.
Prometheus Server Design and Pricing
All of this comes together in the server itself, which complies with Open Compute Project (OCP) specifications. Up to four servers fit in a rack, power draw is expected to total up to 120 kilowatts per rack, and heat will be managed with cold-plate liquid cooling. The server’s memory design is modular, so servers purchased with less than the maximum of 128 TB of memory can be upgraded later.
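The rack-level figures follow from simple arithmetic. The sketch below assumes a rack holds four fully populated servers and that the 120-kilowatt budget is split evenly among them; both are simplifying assumptions, since only the per-rack power ceiling and per-server maximum memory are given.

```python
SERVERS_PER_RACK = 4
MAX_MEMORY_TB_PER_SERVER = 128
MAX_RACK_POWER_KW = 120

def rack_totals(servers: int = SERVERS_PER_RACK,
                memory_tb_per_server: int = MAX_MEMORY_TB_PER_SERVER,
                rack_power_kw: float = MAX_RACK_POWER_KW) -> tuple[int, float]:
    """Return (total rack memory in TB, power per server in kW).

    Assumes every server carries the maximum memory and the rack's power
    budget is divided evenly across servers.
    """
    total_memory_tb = servers * memory_tb_per_server
    power_per_server_kw = rack_power_kw / servers
    return total_memory_tb, power_per_server_kw

print(rack_totals())  # prints (512, 30.0)
```

Under these assumptions a full rack would hold about half a petabyte of memory at roughly 30 kW per server; actual per-server draw may differ, since 120 kW is a rack-level maximum.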
Despite the project’s breadth, Majestic also wants to compete on price, which may be surprising given how much memory each server can hold. The company argues this is possible because it uses commodity DRAM, which costs far less per gigabyte than HBM. Pricing has not yet been announced; Prometheus is expected to ship in 2027.
“Our customers’ capital expenditure will come down by, depending on the workload, 10 to 50 times, and the power consumption comes down by a similar amount,” Rabii claims.