PEAK:AIO claims it is solving the GPU memory limitations of AI inference models with CXL memory rather than by offloading KVCache contents to NVMe flash drives.
The UK-based AI and GPU data infrastructure specialist says AI workloads are evolving “beyond static prompts into dynamic context streams, model creation pipelines, and long-running agents,” and growing larger, stressing GPUs’ limited high-bandwidth memory (HBM) capacity and making AI jobs memory-bound.
This causes a job’s working memory contents, its KVCache, to overflow HBM capacity, meaning cached tokens get evicted and must be recomputed when needed again, lengthening job run-time. Various suppliers have tried to augment HBM capacity with, in effect, an HBM partition on external flash storage, similar to a virtual memory swap space; they include VAST Data with VUA, WEKA with its Augmented Memory Grid, and Pliops with its XDP LightningAI PCIe add-in card front-ending NVMe SSDs.
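To see why eviction hurts, here is a minimal toy sketch of a bounded KV cache that drops its oldest entries when simulated HBM capacity is exceeded, forcing recomputation on the next access. The class, names, and numbers are illustrative assumptions, not any vendor’s implementation.

```python
# Toy bounded KV cache: evicts oldest entries when "HBM" capacity is
# exceeded, so revisiting long context forces recomputes. Purely
# illustrative; not PEAK:AIO's or any other vendor's implementation.
from collections import OrderedDict

class BoundedKVCache:
    def __init__(self, capacity_entries: int):
        self.capacity = capacity_entries
        self.entries = OrderedDict()  # token position -> (key, value) tensors
        self.recomputes = 0

    def put(self, position: int, kv) -> None:
        self.entries[position] = kv
        self.entries.move_to_end(position)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest token's KV pair

    def get(self, position: int):
        if position in self.entries:
            self.entries.move_to_end(position)
            return self.entries[position]
        # Cache miss: a real engine would re-run the attention projections
        # for the evicted token here, which is what lengthens job run-time.
        self.recomputes += 1
        kv = ("k", "v")  # stand-in for the recomputed tensors
        self.put(position, kv)
        return kv

cache = BoundedKVCache(capacity_entries=4)
for pos in range(8):   # "prefill" eight tokens into a four-entry cache
    cache.put(pos, ("k", "v"))
for pos in range(8):   # revisit the full history, as a long context would
    cache.get(pos)
print(f"recomputes forced by eviction: {cache.recomputes}")  # prints 8
```

Because the working set is twice the cache size, every revisit misses: the cache thrashes, and all eight tokens are recomputed.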
PEAK:AIO is developing a 1RU token memory product using CXL memory, PCIe Gen 5, NVMe, and GPUDirect with RDMA.

Eyal Lemberger, Chief AI Strategist and Co-Founder of PEAK:AIO, said in a statement: “Whether you are deploying agents that think across sessions or scaling toward million-token context windows, where memory demands can exceed 500GB per model, this appliance makes it possible by treating token history as memory, not storage. It is time for memory to scale like compute has.”
PEAK:AIO says its appliance enables:
- KVCache reuse across sessions, models, and nodes (sketched in the example after this list)
- Context-window expansion for longer LLM history
- GPU memory offload via CXL tiering
- Ultra-low latency access using RDMA over NVMe-oF
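The cross-session reuse idea in the first item can be sketched simply: sessions that share a prompt prefix fetch its precomputed KV tensors from a shared tier instead of repeating the prefill work. The `TokenMemoryTier` class and every name in this Python sketch are hypothetical illustrations, not PEAK:AIO’s software.

```python
# Hedged sketch of cross-session KVCache reuse via a shared token-memory
# pool. All names here are hypothetical, standing in for a CXL-backed tier.
import hashlib

class TokenMemoryTier:
    """Stand-in for a shared, CXL-backed token memory pool."""
    def __init__(self):
        self.pool = {}

    @staticmethod
    def key(model: str, prefix_tokens: tuple) -> str:
        # Key KV state by (model, prompt prefix) so only exact matches reuse it.
        return hashlib.sha256(repr((model, prefix_tokens)).encode()).hexdigest()

    def get(self, model, prefix_tokens):
        return self.pool.get(self.key(model, prefix_tokens))

    def put(self, model, prefix_tokens, kv_tensors):
        self.pool[self.key(model, prefix_tokens)] = kv_tensors

tier = TokenMemoryTier()

def run_session(model: str, prompt_tokens: tuple) -> str:
    cached = tier.get(model, prompt_tokens)
    if cached is not None:
        return f"reused KV for {len(prompt_tokens)} tokens"   # prefill skipped
    kv_tensors = [("k", "v")] * len(prompt_tokens)            # simulate prefill
    tier.put(model, prompt_tokens, kv_tensors)
    return f"computed KV for {len(prompt_tokens)} tokens"

system_prompt = tuple(range(100))             # shared prefix across sessions
print(run_session("llm-70b", system_prompt))  # first session: computes
print(run_session("llm-70b", system_prompt))  # second session: reuses
```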
The company claims that, by harnessing CXL memory-class performance, its appliance delivers token memory that behaves like RAM, not files. The other suppliers listed above, Pliops, VAST Data, and WEKA, cannot do this, it says. Mark Klarzynski, Co-Founder and Chief Strategy Officer at PEAK:AIO, said: “This is the token memory fabric modern AI has been waiting for.”
We’re told the tech gives AI workload developers the ability to build a system that can cache token history, attention maps, and streaming data at memory-class latency. PEAK:AIO says it “aligns directly with Nvidia’s KVCache reuse and memory reclaim models” and “provides plug-in support for teams building on TensorRT-LLM or Triton, accelerating inference with minimal integration effort.”
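What “plug-in support” could look like from an inference engine’s side is a lookup into an external token-memory tier before any prefill recompute. This is a hedged, hypothetical sketch; `TokenMemoryClient`, `fetch_kv`, and `store_kv` are invented for illustration and are not a PEAK:AIO, TensorRT-LLM, or Triton API.

```python
# Hypothetical connector shape: an engine checks an external token-memory
# tier for cached KV blocks before recomputing them. Names are illustrative.
from typing import Optional, Protocol

class TokenMemoryClient(Protocol):
    def fetch_kv(self, cache_key: str) -> Optional[bytes]:
        """Return serialized KV tensors for cache_key, or None on a miss."""
        ...

    def store_kv(self, cache_key: str, kv_blob: bytes) -> None:
        """Persist serialized KV tensors so later sessions can reuse them."""
        ...

def prefill_with_offload(client: TokenMemoryClient, cache_key: str,
                         compute_kv) -> bytes:
    """Reuse remote KV state when present; otherwise compute and publish it."""
    kv_blob = client.fetch_kv(cache_key)
    if kv_blob is None:                      # miss: do the prefill work once...
        kv_blob = compute_kv()
        client.store_kv(cache_key, kv_blob)  # ...then share it with the tier
    return kv_blob
```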
In theory, a PCIe Gen 5 CXL controller’s latency can be around 200 nanoseconds, while GPUDirect-accessed NVMe SSD latency can be around 1.2 ms (1,200,000 ns), some 6,000 times longer than a CXL memory access. PEAK:AIO’s token memory appliance can provide up to 150 GB/sec of sustained throughput at under 5 microseconds of latency.
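The arithmetic behind those figures is simple unit conversion, using the numbers quoted above:

```python
# Latency comparison using the article's figures; only unit conversion here.
cxl_ns = 200                 # claimed PCIe Gen 5 CXL controller latency, ns
nvme_ns = 1.2e-3 / 1e-9      # 1.2 ms GPUDirect NVMe access, in nanoseconds
appliance_ns = 5e-6 / 1e-9   # <5 microsecond appliance latency, in nanoseconds

print(f"NVMe vs CXL:       {nvme_ns / cxl_ns:,.0f}x slower")   # 6,000x
print(f"appliance vs CXL:  {appliance_ns / cxl_ns:,.0f}x")     # 25x
print(f"NVMe vs appliance: {nvme_ns / appliance_ns:,.0f}x")    # 240x
```

So the appliance’s claimed worst-case latency sits roughly 25 times above raw CXL controller latency but about 240 times below the quoted GPUDirect NVMe figure.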
Lemberger claimed: “While others are bending file systems to act like memory, we built infrastructure that behaves like memory, because that is what modern AI needs. At scale, it is not about saving files; it is about keeping every token accessible in microseconds. That is a memory problem, and we solved it [by] embracing the latest silicon layer.”
PEAK:AIO’s token memory appliance is software-defined, using off-the-shelf servers, and is expected to enter production by the third quarter.