The adoption of AI technologies is expanding so rapidly that the total available market for AI processors is expected to exceed $100 billion by 2030, Synopsys chief executive Aart de Geus said on the company’s latest earnings call, citing various market intelligence firms. AI is being adopted so swiftly, by so many devices and applications, that it is becoming pervasive, which means the AI hardware market is poised to diversify.
In fact, the market is already fairly diversified. Heavy-duty compute GPUs like Nvidia’s H100 reside in cloud data centers, serving virtually every type of AI and high-performance computing (HPC) workload imaginable. Alongside them sit special-purpose AI processors from Amazon Web Services (Trainium and Inferentia), Google (TPU), Graphcore and Intel (Gaudi for training and inference, Greco for inference), as well as edge-optimized AI processors like Apple’s NPU and Google’s Edge TPU.
Currently, only a few architectures can serve a variety of AI deployments, from the edge to the data center. One of them is d-Matrix’s digital in-memory compute (DIMC) engine architecture, which can power AI accelerators in a variety of form factors, from an M.2 module to a full-height, full-length (FHFL) card or even an OAM module, and a variety of applications, from an edge server or even a PC to a server rack, thanks to its inherent scalability and built-in SRAM.
While tech giants like Nvidia, Intel and AMD are making headlines amid a generative AI frenzy—seemingly poised to control the market of hardware for training and inference going forward—startups like d-Matrix actually have a good chance if they offer the right hardware and software tailored for specific workloads.
“If they focus on a specific workload and have the software and models to make it easy to use, a startup like d-Matrix can carve out a niche,” said Karl Freund, founder and principal analyst of Cambrian AI Research.
D-Matrix inference platform
The startup says its hardware was designed from the ground up for natural-language–processing transformer models (BERT, GPT, T5, etc.) used in applications such as machine translation, text generation and sentiment analysis.
“We took a bet in 2020 and said, ‘Look, we will build the entire computing platform, the hardware and the software, transformer acceleration platform, and focus on inference,’” said Sid Sheth, CEO and co-founder of d-Matrix. “[In] late 2022, when the generative AI explosion happened, d-Matrix emerged as one of a few companies that had a computing platform for generative AI inference. So we kind of organically grew into that opportunity over a period of three years. All our hardware and software has been foundationally built to accelerate transformers and generative AI.”
Unlike Nvidia’s and Intel’s Gaudi platforms, d-Matrix’s hardware and software are tailored specifically for inference. Models destined for d-Matrix’s processors can be trained on different platforms and with different data types; the d-Matrix Aviator software stack lets users select the appropriate data format for the best performance.
“The Aviator ML toolchain allows users to deploy their model in a pushbutton fashion in which Aviator selects the appropriate data format for best performance,” Sheth said. “Alternatively, users can simulate performance with different d-Matrix formats and choose the preferred format based on specific constraints like accuracy degradation. Regardless, no retraining is needed, and models can always be run in their natively trained format if desired.”
This approach makes a lot of sense, according to Karl Freund.
“This approach makes it easy to try a model, optimize the model and deploy a solution,” he said. “It is a very nice approach.”
Hardware and scalability
The first products to feature d-Matrix’s DIMC architecture will be based on the recently announced Jayhawk II processor, a chiplet containing about 16.5 billion transistors (slightly more than Apple’s M1 SoC) and designed to scale up to eight chiplets per card and up to 16 cards per node.
With its architecture, d-Matrix took a page from AMD’s book and relied on chiplets rather than on a big monolithic die. This provides flexibility when it comes to costs and the ability to address lower-power applications.
“[Multi-chiplet designs] should be a cost advantage and a power advantage as well,” Freund said.
Each Jayhawk II chiplet packs a RISC-V core to manage it, 32 Apollo cores (with eight DIMC units per core that operate in parallel), 256 MB of SRAM featuring bandwidth of 150 TB/s, two 32-bit LPDDR channels and 16 PCIe Gen5 lanes. The cores are connected using a special network-on-chip with 84-TB/s bandwidth. Each chiplet with 32 Apollo cores/256 DIMC units and 256 MB of SRAM can be clocked at over 1 GHz.
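As a quick sanity check, the per-chiplet counts above are internally consistent. The sketch below simply re-derives the DIMC-unit total from the quoted figures; all numbers come from the article, and the arithmetic is illustrative only:

```python
# Per-chiplet resource counts for Jayhawk II, as quoted in the article.
APOLLO_CORES_PER_CHIPLET = 32
DIMC_UNITS_PER_CORE = 8
SRAM_MB_PER_CHIPLET = 256

dimc_units = APOLLO_CORES_PER_CHIPLET * DIMC_UNITS_PER_CORE
print(dimc_units)        # 256, matching the "256 DIMC units" figure
print(SRAM_MB_PER_CHIPLET / 1024)  # 0.25 GB of SRAM per chiplet
```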
Each DIMC core can execute 2,048 INT8 multiply-accumulate (MAC) operations per cycle, according to TechInsights. Each core can also process 64 × 64 matrix multiplications using both industry-standard (INT8, INT32, FP16, FP32) and emerging proprietary formats (block floating-point 12 [BFP12], BFP16, SBFP12).
“While they may want to add INT4 in the future, it is not yet mature enough for the general use cases,” Freund said.
The main idea behind d-Matrix’s platform is scalability. Each Jayhawk II has die-to-die interfaces offering 2 Tb/s (250 GB/s) of bandwidth with 3-mm, 15-mm and 25-mm reach on organic substrate, based on the Open Domain-Specific Architecture (ODSA) standard running at 16 Gb/s per wire. Organic substrates are cheap and widespread, so d-Matrix won’t have to spend money on advanced packaging.
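The quoted link figures are easy to verify with a unit conversion; note that the wire count below is an inference from the disclosed per-wire rate, not a figure d-Matrix has published:

```python
# Die-to-die link: 2 Tb/s quoted, 16 Gb/s per wire (ODSA-based PHY).
link_gbps = 2_000      # 2 Tb/s expressed in Gb/s (decimal units)
per_wire_gbps = 16

print(link_gbps / 8)              # 250.0 GB/s, matching the quoted figure
print(link_gbps / per_wire_gbps)  # 125.0 -> implies roughly 125 wires per link
```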
The current design allows d-Matrix to build system-in-packages (SiPs) with four Jayhawk II chiplets that boast 8 Tb/s (1 TB/s) of aggregate die-to-die bandwidth. Meanwhile, an image provided by the company indicates that d-Matrix uses a conventional PCIe interface for SiP-to-SiP interconnection.
For now, d-Matrix has a reference design for its FHFL Corsair card that carries two SiPs (i.e., eight chiplets) with 2 GB of SRAM and 256 GB of LPDDR5 memory onboard (32 GB per Jayhawk II) and delivers a performance of 2,400–9,600 TFLOPS depending on the data type at 350 W. The peak performance can be reached with a BFP12 data format, which makes it fairly hard to compare directly with compute GPUs from Nvidia.
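The card-level capacities follow directly from the per-chiplet figures quoted earlier; a quick illustrative check (all inputs are from the article):

```python
# Corsair card composition, per the article.
CHIPLETS_PER_SIP = 4
SIPS_PER_CARD = 2
SRAM_MB_PER_CHIPLET = 256
LPDDR5_GB_PER_CHIPLET = 32

chiplets = CHIPLETS_PER_SIP * SIPS_PER_CARD      # 8 chiplets per card
sram_gb = chiplets * SRAM_MB_PER_CHIPLET / 1024  # 2.0 GB of SRAM
lpddr_gb = chiplets * LPDDR5_GB_PER_CHIPLET      # 256 GB of LPDDR5
print(chiplets, sram_gb, lpddr_gb)  # 8 2.0 256
```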
But assuming that Corsair’s INT8 performance is 2,400 TOPS, it’s very close to that of Nvidia’s H100 PCIe (3,026 TOPS at up to 350 W). The startup says that 16 Corsair cards can be installed into an inference server.
In addition, the company mentioned that its 16-chiplet OAM module with four SiPs, 4 GB of SRAM and 512 GB of LPDDR5 DRAM is set to compete against AMD’s upcoming Instinct MI300X and Nvidia’s H100 SXM. The module will consume about 600 W, but for now, d-Matrix won’t disclose its exact performance.
On the other side of the spectrum, d-Matrix has an M.2 version of its Jayhawk II with only one chiplet. Because the unit consumes 30–40 W, it uses two M.2 slots—one for the module and one for the power supply, the company said. At this point, one can only wonder which form factors will become popular among d-Matrix’s clients. Yet it’s evident that the company wants to address all applications it possibly can.
“I think the company is fishing, trying to find where they can gain first traction and expand from there,” Freund said.
The scalable nature of d-Matrix’s architecture and its accompanying software allows the integrated SRAM to be aggregated into a unified memory pool with very high bandwidth. For example, a machine with 16 Corsair cards has 32 GB of SRAM and 2 TB of LPDDR5, which is enough to run many AI models. The company doesn’t, however, disclose chiplet-to-chiplet and SiP-to-SiP latencies.
“Chiplets are building blocks to the Corsair card solution [8× chiplets per card], which are building blocks to an inference node—16 cards per server,” Sheth said. “An inference node will have 32 GB of SRAM storage [256 MB × eight chiplets × 16 cards], which is enough to hold many models in SRAM. In this case, [2 TB] of LPDDR is used for prompt cache. LPDDR can also be used as coverage for cases in which key-value cache or weights need to spill to DRAM.”
Such a server can handle a transformer model with 20 billion to 30 billion parameters and could go toe to toe with Nvidia’s machines based on A100 and H100 compute GPUs, d-Matrix claims. In fact, the company says its platform offers a 10× to 20× lower total cost of ownership for generative inference compared with “GPU-based solutions.” Those GPU-based solutions, however, are available and being deployed now, whereas d-Matrix’s hardware will only arrive next year, when it will face the successors of today’s compute GPUs.
“[Our architecture] does put a little bit of a constraint in terms of how big a model we can fit into SRAM,” Sheth said. “But if you are doing a single-node 32-GB version of SRAM, we can fit 20 [billion] to 30 billion parameter models, which are quite popular these days. And we can be blazing fast on that 20 [billion] to 30 billion parameter category compared with Nvidia.”
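Sheth’s 20-billion-to-30-billion-parameter figure is consistent with the node’s 32 GB of aggregate SRAM if weights occupy roughly one to 1.5 bytes each, as the INT8 and BFP12 formats mentioned earlier would allow. The bytes-per-parameter values below are assumptions for illustration, not disclosed figures:

```python
# Aggregate SRAM in a 16-card inference node, per the article.
sram_bytes = 32 * 1024**3  # 32 GB

# Assumed storage cost per weight for two plausible inference formats.
for name, bytes_per_param in [("INT8", 1.0), ("BFP12", 1.5)]:
    params_billion = sram_bytes / bytes_per_param / 1e9
    print(f"{name}: ~{params_billion:.0f}B parameters fit in SRAM")
# -> INT8: ~34B, BFP12: ~23B, bracketing the claimed 20B-30B range
```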
One of the strongest aspects of Nvidia’s AI and HPC platforms is the CUDA software stack and its numerous libraries optimized for specific workloads and use cases. This greatly simplifies software development for Nvidia hardware and is one of the reasons the company dominates the AI hardware landscape. That advantage forces other players to put considerable effort into their own software.
The d-Matrix Aviator software stack encompasses a range of software elements for deploying models in production.
“The d-Matrix Aviator software stack includes various software components like an ML toolchain, system software for workload distribution, compilers, runtime, inference server software for production deployment, etc.,” Sheth said. “Much of the software stack leverages broadly adopted open-source software.”
Most importantly, there’s no need to retrain models trained on other platforms—d-Matrix’s clients can just deploy them in an “it just works” manner. Also, d-Matrix allows customers to program its hardware at a low level using an actual instruction set to get higher performance.
“Retraining is never needed,” Sheth said. “Models can be ingested into the d-Matrix platform in a ‘pushbutton, zero-touch’ manner. Alternatively, more hands-on–oriented users will have the freedom to program close to metal using a detailed instruction set.”
Jayhawk II is now sampling with interested parties and is expected to be commercially available in 2024.
“With the announcement of Jayhawk II, our customers are a step closer to serving generative AI and LLM applications with much better economics and a higher-quality user experience than ever before,” Sheth said. “Today, we are working with a range of companies large and small to evaluate the Jayhawk II silicon in real-world scenarios, and the results are very promising.”