It could well be a combination of both factors, but given that other RDNA 3 GPUs seem to have this issue as well, the problem is one that AMD really needs to address somehow. Take the 7900 XT as an example: compared to the 6800 XT, it's 32 to 35% faster on average, with some titles being around 60% faster. That sounds absolutely fine until you look at the theoretical performance metrics.

I'm going to admit I don't know the answer to this. Is that something that could be fixed with driver updates, or is it baked in, with no fix possible short of a hardware revision?
Compared to the 6800 XT, the 7900 XT has a 148% higher peak FP32 throughput, a 24% higher texel fill rate, a 60% higher pixel fill rate, and 56% higher memory bandwidth. The CUs just don't seem to be properly utilized at the moment. Unfortunately, I don't have any Radeon GPUs to do ALU profiling to see what's actually going on.
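For anyone who wants to check those percentages, here is a rough sketch of the arithmetic. The shader, TMU, and ROP counts, boost clocks, and bandwidth figures are assumptions taken from public spec listings rather than from this thread, and the dictionary and function names are just made up for illustration. Peak FP32 is counted as 2 FLOPs per shader per clock (one FMA), doubled for RDNA 3's dual issue.

```python
# Spec figures are assumptions from public spec listings, not from this thread.
# dual_issue = 2 models RDNA 3's ability to issue two FP32 ops per lane per clock.
SPECS = {
    "6800 XT": {"shaders": 4608, "tmus": 288, "rops": 128,
                "boost_ghz": 2.25, "bandwidth_gbs": 512.0, "dual_issue": 1},
    "7900 XT": {"shaders": 5376, "tmus": 336, "rops": 192,
                "boost_ghz": 2.40, "bandwidth_gbs": 800.0, "dual_issue": 2},
}


def peak_metrics(spec):
    """Return theoretical peak figures for one GPU spec entry."""
    return {
        # 2 FLOPs per FMA, times the dual-issue factor, per shader per clock.
        "fp32_tflops": spec["shaders"] * 2 * spec["dual_issue"] * spec["boost_ghz"] / 1000,
        "texel_gtexels": spec["tmus"] * spec["boost_ghz"],
        "pixel_gpixels": spec["rops"] * spec["boost_ghz"],
        "bandwidth_gbs": spec["bandwidth_gbs"],
    }


def uplift_percent(new, old):
    """Percentage by which each metric in `new` exceeds the one in `old`."""
    return {key: (new[key] / old[key] - 1) * 100 for key in new}


old = peak_metrics(SPECS["6800 XT"])
new = peak_metrics(SPECS["7900 XT"])
uplift = uplift_percent(new, old)

for name, pct in uplift.items():
    print(f"{name}: {pct:.1f}% higher")

# Same comparison if the 7900 XT never dual-issued a single instruction.
single = peak_metrics({**SPECS["7900 XT"], "dual_issue": 1})
single_issue_fp32_uplift = (single["fp32_tflops"] / old["fp32_tflops"] - 1) * 100
print(f"fp32 without dual issue: {single_issue_fp32_uplift:.1f}% higher")
```

Under these assumed specs, the single-issue FP32 uplift comes out in the mid-20% range, which sits much closer to the observed 32 to 35% average than the headline figure does. That is only a back-of-the-envelope comparison, not a substitute for actual ALU profiling.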
Edit: Chips and Cheese have already examined this, though purely from a microbenchmarking aspect:
We only see convincing dual issue behavior with FP32 adds, where the compiler emitted v_dual_add_f32 instructions. The mixed INT32 and FP32 addition test saw some benefit because the FP32 adds were dual issued, but could not generate VOPD (vector operation, dual issue) instructions for INT32 due to a lack of VOPD instructions for INT32 operations. Fused multiply add, which is used to calculate a GPU’s headline TFLOPs number, saw very few dual issue instructions emitted.
So AMD's compiler isn't super sophisticated (which may well be the norm for GPUs), and that probably isn't going to change much. On the other hand, AMD can offer shader replacements via driver updates that bypass the problem and emit dual-issue instructions for shaders that would otherwise bog down the CUs with single-issued instructions.