Nvidia unveils Vera, an 88-core Arm CPU for AI and analytics racks

The liquid-cooled Vera rack packs 256 CPUs and 300 TB/s of memory bandwidth

By Skye Jacobs March 17, 2026, 16:02 Add TechSpot on Google

Nvidia unveils Vera, an 88-core Arm CPU for AI and analytics racks

Serving tech enthusiasts for over 25 years.
TechSpot means tech analysis and advice you can trust.

Forward-looking: Nvidia used its GTC 2026 developer conference in San Jose to unveil new details about its Vera data center CPU line, highlighting an 88-core Arm-based design and a dense rack architecture that positions the company directly in the mainstream CPU market. Alongside the chip, Nvidia is introducing a liquid-cooled rack that aggregates 256 Vera CPUs, signaling its intent to compete for CPU sockets in large-scale AI and data analytics deployments.

Unlike Nvidia's earlier Grace processors, which were primarily sold as companions to GPUs, Vera is positioned as a general-purpose data center CPU with a strong focus on AI-centric workloads such as agentic frameworks, scripting-heavy pipelines, analytics, and code compilation. The chip is built on 88 Nvidia-designed Arm v9.2-A "Olympus" cores, up from Grace's 72 Arm Neoverse cores. Nvidia claims a 1.5× increase in instructions per cycle over the previous generation, which it translates into roughly 50% higher performance versus "standard" CPUs in its internal comparisons.

Vera targets both training-adjacent CPU tasks and pure CPU workloads, aiming to provide a platform that can efficiently run Python-heavy agent logic, SQL queries, and compilation pipelines while keeping GPUs fully saturated.

At the microarchitectural level, Nvidia's most distinctive change is the introduction of what it calls spatial multi-threading in the Olympus cores. Unlike the usual simultaneous multi-threading model, where two threads time-slice shared resources, Vera's design physically partitions key structures – such as execution units, caches, and register files – so that each hardware thread can make forward progress without waiting for access to the same datapath.

The goal is to increase instruction-level parallelism and throughput by allowing one thread to opportunistically use idle resources while the other continues running. Nvidia argues that this approach delivers more predictable performance in multi-tenant environments. In practice, both threads are intended to run in a genuinely concurrent manner on a single core.

The core complex is organized as a single coherent domain rather than multiple NUMA regions, marking a notable departure from the topology used in contemporary high-core-count x86 designs. Nvidia is leveraging a new generation of its Scalable Coherency Fabric, built on Arm's CMN lineage, to tie the 88 cores together. The company has not disclosed exactly how it manages intra-chip latency at this scale but appears to rely on an updated mesh network.

Memory bandwidth and capacity are a major focus for Nvidia. Grace delivered 546 GB/s to its mesh and about 7.6 GB/s per core; Vera more than doubles aggregate bandwidth to 1.2 TB/s and triples capacity to up to 1.5 TB of SOCAMM LPDDR5 or LPDDR5X, providing roughly 13.6 GB/s per core under full load.

The fabric can also deliver up to 80 GB/s per core when other cores are not fully saturated – an important feature for bandwidth-hungry threads pinned to a subset of cores, such as those used in graph analytics or certain parts of LLM runtimes.

The front-end and execution pipeline are explicitly designed for AI and data workloads. Vera's execution pathway includes a 10-wide instruction decode block, aggressive for a general-purpose server CPU, intended to keep the widened back end fully fed. A neural branch predictor capable of handling two branch predictions per cycle, along with a custom prefetch engine optimized for graph analytics, aims to reduce stalls in irregular control-flow and pointer-chasing code, including graph traversal and complex analytics kernels.

Nvidia also highlights a PyTorch-optimized instruction buffer, suggesting that common instruction sequences in AI frameworks have been treated as first-class optimization targets.

On the platform side, Vera meets the expected standards – support for PCIe 6.0, CXL 3.1, and two-socket configurations – while adding features aimed at secure, tightly coupled CPU – GPU systems. Confidential computing is supported across both CPU and GPU domains, enabling encrypted execution and isolation that extend into GPU memory and multi-socket nodes, a capability Nvidia did not offer with Grace.

A second-generation NVLink-C2C interface provides up to 1.8 TB/s of die-to-die bandwidth, doubling Grace's 900 GB/s link and significantly exceeding what PCIe 6.0 can deliver.

The Vera CPU rack integrates up to 256 liquid-cooled Vera CPUs alongside BlueField-4 DPUs and ConnectX-class SuperNICs, creating a CPU-centric system designed to host CPU-only AI and analytics workloads at high density. The rack supports up to 400 TB of LPDDR5-class memory and around 300 TB/s of aggregate memory bandwidth, exposing more than 22,500 CPU environments backed by over 22,000 Olympus cores and 45,000 threads.

Nvidia positions this configuration as capable of delivering up to 6× higher CPU throughput and roughly 2× better performance in agentic AI workloads versus traditional CPU racks, according to its internal benchmarks. In its testing, the company reports that Vera achieves roughly 1.5× better performance per sandbox compared with x86 competitors, with about 3× the memory bandwidth per core and twice the efficiency, and that it outpaces Grace by 1.8× to 2.2× across scripting, compilation, analytics, and HPC workloads.

These claims have not yet been independently verified, but they align with Vera's architectural focus on bandwidth, instruction-level parallelism, and multi-threading. The chips are now in full production and are expected to ship to partners in the second half of the year, with systems coming from major OEMs and ODMs, and integration into Nvidia's Rubin-generation HGX NVL8 and broader Vera Rubin platform.

For data center operators already standardizing on Nvidia's GPU platforms, Vera converts the CPU from an external dependency into a controllable part of the stack, with a design tuned for AI-heavy, multi-tenant environments rather than traditional general-purpose server loads. How much that matters in practice will depend on whether Nvidia's performance and efficiency claims hold up once third-party benchmarks arrive later this year.

See more TechSpot in Google Add us as a preferred source and our reporting shows up first when you search.

Add TechSpot

// Related Stories

Featured on TechSpot