WEKA has announced NeuralMesh, a transformation of its parallel file system software intended to accelerate AI at scale.
The premise is that traditional storage architectures were not designed to handle the scale, latency sensitivity, and concurrency demands of the new distributed AI agent training and inference environments. NeuralMesh is composed of microservices operating in a dynamically connected mesh of nodes that links data, storage, compute, and AI. It can provide guaranteed microsecond-latency SLAs and becomes more resilient as it scales, with more nodes available to rebuild the widely distributed data stripes of failed nodes.
WEKA says: “When hardware fails, the system rebuilds in minutes, not hours. As data grows to exabytes, performance improves rather than degrades.”

Liran Zvibel, cofounder and CEO at WEKA, stated: “AI innovation continues to evolve at a blistering pace. The age of reasoning is upon us. The data solutions and architectures we relied on to navigate past technology paradigm shifts cannot support the immense performance density and scale required to support agentic AI and reasoning workloads. Across our customer base, we are seeing petascale customer environments growing to exabyte scale at an incomprehensible rate. The future is exascale.”
NeuralMesh runs on-premises, in datacenters and edge sites, on bare metal or virtual machines, and in public clouds and neo-clouds – the GPU server farms such as Nebius – all under a universal namespace. It can start small and grow in capacity from TB to PB to EB, gaining overall performance and resiliency en route.
Zvibel told B&F: “We have been containerized from very early on, but now we’re making it more formal. We’re making it more visible to the outside world. We’re adding a lot more container and service types. We’re also making deployment a lot more flexible, and we run some of our containers on the clients as well. So essentially we’re providing a complete end-to-end solution, and we’ve started doing it with the current implementations too. For many of the large neo-clouds, we have developed a Kubernetes operator; you can integrate our operator with their Kubernetes, and then we just run as part of their infrastructure.”
He said of the neo-clouds: “So many of them use us for their infrastructure. So their customers may not be aware that it’s WEKA and some of them make it public. We’ve just announced Nebius, but we probably have more Nvidia NCP clouds than anyone else.” NCP stands for Nvidia Cloud Partner.
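For a sense of what that operator-based integration can look like from a tenant’s side, here is a minimal, hypothetical sketch using the standard Kubernetes Python client: a training workload claims a shared volume from a WEKA-backed storage class. The class name “weka-fs” and the capacity are assumptions for illustration, not WEKA’s documented interface.

```python
# Hypothetical: claim a shared volume from a WEKA-backed storage class on a
# cluster where the operator is installed. "weka-fs" and the 10Ti size are
# illustrative assumptions, not WEKA's documented interface.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],  # shared across many GPU pods
        "storageClassName": "weka-fs",     # assumed operator-served class
        "resources": {"requests": {"storage": "10Ti"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```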
NeuralMesh distributes data and metadata across all nodes, balancing I/O dynamically with built-in auto-healing, auto-scaling, and fast rebuild capabilities. WEKA says a genomics research institution scaled from 2 PB to 12 PB without downtime or rebalancing, achieving consistent I/O latency and eliminating weekend maintenance windows.
The software supports concurrent model training and inference, and “eliminates performance bottlenecks from traditional storage layers.” It provides real-time, petabyte-scale observability across data paths, with insights into performance metrics and infrastructure health, integrated with dashboards, alerts, and telemetry APIs.
It also supports tiering between TLC/QLC NVMe SSDs and object stores, container storage integration, encryption in flight and at rest, snapshots, Snap-to-Object, and role-based access control (RBAC).
NeuralMesh supports Nvidia, AMD, and other suppliers’ GPUs and accelerator hardware. Chad Wood, HPC Engineering Lead at Stability AI, said: “With WEKA, we now achieve 93 percent GPU utilization during AI model training and have increased our cloud storage capacity by 1.5x at 80 percent of the previous cost.”
Mesh details
Blocks & Files: A mesh implies connected nodes to me. What nodes running WEKA software or controlled by WEKA software comprise this mesh?
WEKA: There are two parts to the mesh architecture in NeuralMesh – microservices and nodes. These two concepts work together to deliver our mesh, which is essentially a software-defined fabric that interconnects data, compute, and AI services in a modular and composable way.
Each node in the system runs one or, more typically, several microservices, and each microservice handles a specific set of functions such as data access, metadata, auditing, protocol communication, or observability. These services communicate with each other through well-defined APIs, enabling dynamic orchestration across the entire infrastructure.
Unlike traditional storage systems tied to rigid hardware architectures, NeuralMesh’s fully containerized, service-oriented design allows every capability to scale independently. This provides:
- Elastic scalability to exabytes and beyond without performance loss
- Fine-grained resource isolation ideal for secure multitenant environments
- Cloud-native flexibility across bare metal, cloud, and hybrid deployments
In short, the “mesh” refers to the distributed, interconnected microservices running across WEKA nodes that collectively deliver high-performance, resilient, and AI-native storage infrastructure.
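To make that two-level picture concrete, here is a purely conceptual Python sketch – the class and service names are illustrative assumptions, not WEKA internals – of a mesh in which every node hosts several microservices and any single service type can be scaled independently of the others:

```python
# Conceptual model of the mesh: nodes host microservices, and each
# service type scales independently. All names are illustrative.
from collections import Counter
from dataclasses import dataclass, field

SERVICE_TYPES = {"data", "metadata", "protocol", "observability", "audit"}

@dataclass
class Node:
    name: str
    services: list[str] = field(default_factory=list)

@dataclass
class Mesh:
    nodes: list[Node]

    def scale_service(self, kind: str, count: int) -> None:
        """Add `count` instances of one service type, spread across the
        least-loaded nodes - growing that capability on its own."""
        assert kind in SERVICE_TYPES
        for _ in range(count):
            target = min(self.nodes, key=lambda n: len(n.services))
            target.services.append(kind)

    def inventory(self) -> Counter:
        return Counter(s for n in self.nodes for s in n.services)

mesh = Mesh([Node(f"node-{i}", ["data", "metadata"]) for i in range(4)])
mesh.scale_service("protocol", 6)  # only protocol handling grows
print(mesh.inventory())
# Counter({'protocol': 6, 'data': 4, 'metadata': 4})
```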
Blocks & Files: NeuralMesh becomes more powerful and resilient as it scales and it can recover from node failure. Are there numbers to indicate how it becomes more powerful as it scales?
WEKA: NeuralMesh becomes more powerful and resilient as it scales because:
- Distributed Striping Across Failure Domains – Data is striped in small blocks across all failure domains, so losing one node only affects a tiny portion of each stripe and the system keeps running without performance impact. The larger the cluster, the more broadly the stripes are spread, decreasing the exposure to failures. Example: for a stripe size of 18 (16+2) and a cluster size of 20, the number of possible stripe combinations is 190; adding one more server to bring the cluster size to 21 increases that to 1,330; and as the cluster size grows to 25, the number of possible combinations reaches 480,700 (these counts are reproduced in the sketch after this list).
- Massively Parallel Rebuilds – Every available compute core helps with the erasure coding calculations for rebuilds, even if it doesn’t own the data. For example, in a 50-node cluster with 1 node failure, the cores of the other 49 nodes participate in recovery. In a 100-node cluster, the cores of 99 nodes help – effectively doubling the rebuild speed. More nodes = more cores = faster recovery.
- Prioritized Recovery – NeuralMesh intelligently rebuilds the most at-risk data first – the stripes impacted by multiple failures – restoring full protection quickly even during multiple simultaneous failures.
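The stripe-placement counts above are binomial coefficients – the number of ways to choose the 18 members of a stripe from the cluster’s nodes – and the rebuild claim is a core-counting argument. A short Python check reproduces both; the per-node core count is an illustrative assumption, not a WEKA figure:

```python
# Reproduce the stripe-placement and rebuild-scaling arithmetic above.
from math import comb

# Ways to place an 18-wide (16+2) stripe across a cluster of N nodes:
for nodes in (20, 21, 25):
    print(nodes, comb(nodes, 18))  # -> 190, 1330, 480700

# Rebuild parallelism: every surviving node contributes its cores, so the
# aggregate rebuild rate grows with cluster size. 32 cores per node is an
# assumed figure purely for illustration.
CORES_PER_NODE = 32
for nodes in (50, 100):
    print(nodes, (nodes - 1) * CORES_PER_NODE, "cores rebuilding")
# 49 x 32 = 1,568 cores vs. 99 x 32 = 3,168 -- roughly double the speed.
```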
***
NeuralMesh is the new way to obtain WEKA software. WARRP, the WEKA AI RAG Reference Platform, is included in NeuralMesh, as is the Augmented Memory Grid (AMG). About this, Zvibel tells us: “When we are running with these workloads and when you connect WEKA on the backend network, we actually have access to eight NICs for the AMG use. It’s a total of 128 PCIe lanes. This is actually more PCIe lanes than what the CPU has.” Eight NICs at x16 apiece account for those 128 lanes.
The NeuralMesh software is available in limited release for enterprise and large-scale AI deployments, with general availability scheduled for fall 2025. Get a NeuralMesh datasheet here and read a NeuralMesh blog here.