What to do: Professional Nvidia graphics card users concerned about security should strongly consider enabling error-correcting code if it isn't engaged by default. The GPU manufacturer advises that the security feature guards against a newly demonstrated type of Rowhammer attack affecting GDDR6 RAM.
Researchers at the University of Toronto recently demonstrated a new method for executing Rowhammer attacks on Nvidia A6000 workstation graphics cards with GDDR6 RAM. Although the vulnerability isn't being actively exploited, a mitigation is already available from Nvidia.
The company advises users to ensure that system-level error-correcting code (ECC) is enabled on Blackwell, Ada, Hopper, Ampere, Jetson, Turing, and Volta workstation and data center GPUs. Blackwell and Hopper graphics cards that support on-die ECC engage the feature automatically.
Users can confirm the status of system-level ECC (SYS-ECC) on earlier models in one of two ways: out-of-band (OOB), through the system's baseboard management controller (BMC) to the GPU, or in-band (InB), from the CPU to the GPU. The procedure might require logging in to NVOnline through partners.nvidia.com.
Nvidia's security advisory describes the command for checking ECC status over the OOB path without NVOnline. To set the ECC mode over that path, consult the NSM Type 3 OOB document. Nvidia's SMBPBI OOB document covers setting product reconfiguration permissions, and the nvidia-smi documentation covers setting the ECC configuration over the InB path.
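For the in-band path, a minimal Python sketch of an ECC status check might look like the following. It assumes the nvidia-smi utility is installed and on the PATH, and it relies on the tool's `ecc.mode.current` and `ecc.mode.pending` query fields. The helper names `parse_ecc_report` and `query_ecc` are made up for this illustration and are not part of any Nvidia tooling.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the ecc.mode.* fields report
# "Enabled", "Disabled", or "[N/A]" on GPUs without ECC support.
QUERY = "index,name,ecc.mode.current,ecc.mode.pending"


def parse_ecc_report(text):
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    report = []
    for row in rows:
        if len(row) != 4:
            continue  # skip blank or unexpected lines
        index, name, current, pending = (field.strip() for field in row)
        report.append(
            {"index": index, "name": name, "current": current, "pending": pending}
        )
    return report


def query_ecc():
    """Run nvidia-smi and return the parsed ECC status for every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_report(result.stdout)


# Example use on a machine with an Nvidia driver installed:
#     for gpu in query_ecc():
#         print(gpu["index"], gpu["name"], gpu["current"], gpu["pending"])
```

If a card reports ECC as disabled, running `nvidia-smi -e 1` with administrator rights requests that it be enabled. The change typically shows up as the pending mode first and takes effect only after a reboot or GPU reset.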
Rowhammer attacks involve rapidly accessing (or hammering) memory cells to exploit hardware-level vulnerabilities and cause bit-flips in neighboring cells. Bit-flipping, reversing individual ones and zeros within DRAM, can potentially cause substantial memory corruption.
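To see why a single flipped bit can matter, consider a number stored as a 32-bit float. The sketch below is purely illustrative and does not perform any hammering; it simply toggles one bit in the stored representation and shows the result. The function name `flip_bit` is made up for this example.

```python
import struct


def flip_bit(value, bit):
    """Return the float32 obtained by toggling one bit (0 = least significant)."""
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit  # toggle the chosen bit
    # Reinterpret the modified integer as a float again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


low = flip_bit(0.5, 0)    # lowest mantissa bit: a tiny change
high = flip_bit(0.5, 30)  # top exponent bit: 0.5 becomes 2**127
print(low, high)
```

A flip in the low mantissa bits barely changes the value, while a flip in the top exponent bit turns 0.5 into 2**127, roughly 1.7 × 10^38.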
In 2015, Google researchers discovered that the Rowhammer vulnerability could allow attackers to gain kernel-level privileges on Linux systems with DDR3 RAM. DDR4 was demonstrated to be vulnerable the following year.
The new research, dubbed GPUHammer, represents the first successful Rowhammer attack on a GPU's GDDR RAM. The method can severely degrade machine learning models, decreasing accuracy by up to 80 percent. The study only examined an Ampere chip with GDDR6 memory. Newer models with GDDR7 or HBM2 RAM could also be vulnerable, but attacking them is likely more difficult.
ECC can detect and correct single-bit flip errors by storing redundant check bits alongside the data, and it can detect (but not correct) double-bit errors. However, one of the researchers told Ars Technica that the security feature can degrade RTX GPU performance by around 10 percent.
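The correct-one, detect-two behavior can be sketched with a toy extended Hamming code that protects 4 data bits with 4 check bits. This is an assumption-laden illustration of the general technique, not Nvidia's implementation; real memory ECC protects much wider words, and the function names here are invented for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    word = [0] * 8
    # Data bits go in positions 3, 5, 6, 7; positions 1, 2, 4 hold parity.
    word[3] = nibble & 1
    word[5] = (nibble >> 1) & 1
    word[6] = (nibble >> 2) & 1
    word[7] = (nibble >> 3) & 1
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    # Position 0 is an overall parity bit covering positions 1..7.
    for bit in word[1:]:
        word[0] ^= bit
    return word


def decode(word):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    word = list(word)
    syndrome = 0  # XOR of the positions of all set bits in 1..7
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall = 0  # parity across all 8 bits
    for bit in word:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detect, don't correct.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


codeword = encode(0b1011)
codeword[5] ^= 1                 # one flipped bit
print(decode(codeword))          # corrected back to 0b1011
codeword[2] ^= 1                 # a second flipped bit
print(decode(codeword))          # reported as uncorrectable
```

The extra overall parity bit is what separates the two cases: an odd number of flips points to a single correctable position, while an even number with a nonzero syndrome can only be flagged.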