Even in that case, "you shouldn't look at core count at all" is still correct. For example, the Ryzen 7600 gives you the same multicore performance as the 5800X despite having 2 fewer cores. Even in a game that scales perfectly with core count, a 5800X still won't outperform a 7600.
Core count is a meaningless spec in the sense that you should only look at the performance you get, not how many cores the chip has. And, especially when comparing chips of different architectures, core count doesn't necessarily correlate with the performance you get. We've seen that multiple times, like the 12400F outperforming older eight-cores, or the Ryzen 5600/7600 outperforming their higher-core-count predecessors. You can technically make that argument for something like the 7950X today, because it's a current-gen chip and nothing with a lower core count outperforms it (in MT), but once Zen 5/Arrow Lake launch later this year, there will be.
There is no such thing as "software that needs X number of cores". If you say "you need a 16-core 7950X for optimal performance in Cities Skylines 2", but later a 12-core Zen 5 chip outperforms the 7950X in Cities Skylines 2, then the core count wasn't the part that actually mattered.
You're mostly right. It's a very easy claim to make in non-enterprise settings and when comparing across architectures. Within the consumer space, and within a single architecture, core count is a simple enough proxy for that performance, which is why it can still be valid to look at it. Saying that another product will come along in a few years and do the same or more with fewer cores is great, but I can't buy that product today since it doesn't exist. Today, it could be a perfectly valid claim that x cores perform better than x-2 cores. Even if a newer product comes along with x-2 cores that outperforms my CPU with x cores, that doesn't really matter to me unless I upgrade to it.
At the end of the day you are right: you should only look at the performance and capabilities that you get, and that's true for any hardware. The specs can be a simple proxy for those benchmarks, though, and that's what's being lost in the back and forth here: we all agree that more performance is better, we are just using different methods to look up what that performance is. One of those methods (looking up core count) comes with the same caveats that looking up gigahertz and cache size does: it only really works within a generation, within an architecture, and within a vendor.
All that said, in the enterprise space, core count does imply capability, regardless of actual performance, for the simple reason that cores can be isolated across virtual machines. But that capability usually doesn't matter for consumers, and certainly not for gaming (except from a cloud gaming service provider's standpoint). It really only matters to consumers when multitasking (such as gaming and doing something else at the same time), where the context-switch overhead is large enough that the extra cores bring more performance than a lower-core part that would otherwise equal or surpass them (more cores can also be useful for power saving with hybrid architectures, but that's a different topic).
Surely the best way to define a CPU's comparative speed is by total instructions per second that can be processed?
If all instructions could be executed in the same amount of time, regardless of which CPU architecture they are running on, you would be correct. But they can't, not even within the same architecture, which means the workload matters. The workload controls the proportion of time you spend executing longer-running instructions versus shorter-running ones. And it isn't just about proportion; the order matters, too. Branching instructions (if..else) are very fast when the processor correctly predicts which branch it will execute (speculative execution), but when it guesses wrong it has to backtrack, and that takes extra cycles (speculative execution is also how a lot of the recent CPU vulnerabilities happen, like Spectre and Meltdown for Intel, and GoFetch for the Mac M-series, which targets the data memory-dependent prefetcher).
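If you want to see branch prediction with your own eyes, the classic experiment is timing the same branchy loop over random data and then over sorted data. Here's a rough C sketch (the names are mine, not from any particular benchmark); note that an aggressive optimizer may turn the branch into a conditional move or vectorize the loop, which hides the effect, so check the generated code or compile at a lower optimization level:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sum only the elements >= 128. With random bytes the branch is taken
 * about half the time, unpredictably; after sorting, the predictor is
 * nearly always right. */
static long long sum_big(const int *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)
            s += a[i];
    return s;
}

static int cmp_int(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = 0;
    for (int r = 0; r < 100; r++) s1 += sum_big(a, N);
    double unsorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(a, N, sizeof a[0], cmp_int); /* same data, now predictable */

    t0 = clock();
    long long s2 = 0;
    for (int r = 0; r < 100; r++) s2 += sum_big(a, N);
    double sorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Same work, same sums; only the branch predictability changed. */
    printf("unsorted: %.2fs  sorted: %.2fs  (sums %lld %lld)\n",
           unsorted, sorted, s1, s2);
    return 0;
}
```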
For example, a 64-bit CPU can add two 64-bit numbers in constant time. That means that regardless of the values of those numbers, the addition takes the same amount of time. And by time, I mean clock cycles, which is where the gigahertz number becomes important in determining how fast a CPU is (keeping all else constant). Multiplication and division, on the other hand, take many clock cycles, with division usually taking longer (which is why optimizing compilers will translate division into multiplication when they can).
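To make that last point concrete, here's a hand-rolled C sketch of the kind of multiply-and-shift sequence compilers typically emit for unsigned division by a constant (here 7, using a Hacker's Delight style magic constant; the exact constants and steps differ per compiler and divisor, so treat this as illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Divide a 32-bit unsigned integer by 7 without a divide instruction,
 * using a multiply-high-and-shift sequence like the ones compilers
 * generate for constant divisors. 613566757 is the precomputed "magic"
 * reciprocal for 7. */
static uint32_t div7(uint32_t n) {
    uint32_t q = (uint32_t)(((uint64_t)n * 613566757u) >> 32); /* mulhi */
    uint32_t t = ((n - q) >> 1) + q; /* correction step: the exact
                                        reciprocal needs 33 bits */
    return t >> 2;
}

int main(void) {
    for (uint32_t n = 0; n < 1000000u; n++)
        if (div7(n) != n / 7) { printf("mismatch at %u\n", n); return 1; }
    if (div7(0xFFFFFFFFu) != 0xFFFFFFFFu / 7) { puts("mismatch at max"); return 1; }
    puts("div7 agrees with n / 7 on all tested inputs");
    return 0;
}
```

A multiply plus a couple of shifts is far cheaper than a hardware divide, which is the whole reason compilers bother with this transformation.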
You may hear about AI processors being capable of some number of TOPS (trillions of operations per second), but it's important to note that this is usually at a given precision, such as 4-bit or 8-bit numbers. They can't do nearly that many operations when the numbers involved are bigger, because bigger numbers physically take up more space (a 64-bit-wide register can only hold 64 bits, so while that register might be able to do 8 additions across 8 8-bit numbers at once, it can only do 2 additions across 2 32-bit numbers in the same time frame, giving a totally different rate of operations per second).
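The arithmetic in that parenthetical is easy to play with; this toy C snippet just prints how many element-wise adds fit into one register operation at each element width (64 bits here to match the example above; SIMD registers on real chips are 128, 256, or 512 bits wide):

```c
#include <stdio.h>

int main(void) {
    int reg_bits = 64;                /* register width from the example */
    int widths[] = { 8, 16, 32, 64 }; /* element sizes in bits */

    /* Throughput per register op scales inversely with element width;
     * real-world TOPS figures also depend on clock speed and how many
     * of these operations can issue per cycle. */
    for (int i = 0; i < 4; i++)
        printf("%2d-bit elements: %d adds per register operation\n",
               widths[i], reg_bits / widths[i]);
    return 0;
}
```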
This also brings up the important point of what kinds of data and instructions we are talking about. Single instruction, single data (SISD) execution is quite different from SIMD (single instruction, multiple data). The former is the main way CPUs do their calculations, but when they need to operate on vectors of data (as if they were a mini-GPU), they can perform a single instruction (such as addition) across several pieces of data (such as 8 numbers that are each 8 bits in size) at once. But you only get a benefit from that if your application needs to do that kind of thing. So again, performance depends on the workload.
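Here's what that looks like in practice with x86 AVX2 intrinsics, assuming a CPU that supports them (compile with -mavx2): a single instruction adds 32 pairs of 8-bit numbers held in one 256-bit register.

```c
#include <immintrin.h> /* AVX2 intrinsics; x86 only */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[32], b[32], out[32];
    for (int i = 0; i < 32; i++) { a[i] = (uint8_t)i; b[i] = 100; }

    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi8(va, vb); /* 32 8-bit adds in one go */
    _mm256_storeu_si256((__m256i *)out, vc);

    printf("out[0]=%u out[31]=%u\n", (unsigned)out[0], (unsigned)out[31]);
    return 0; /* prints out[0]=100 out[31]=131 */
}
```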
To make matters more complicated, even a single core can do more than one thing at once: modern CPUs are superscalar and can execute multiple independent instructions per cycle. Which things can happen in parallel is determined by the architecture, and how fast software runs can depend on how well that software was optimized for that CPU architecture.
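You can observe that instruction-level parallelism with a toy experiment: summing an array with one accumulator chains every add onto the previous result, while four independent accumulators let the core keep several adds in flight at once. A rough C sketch (compile without -ffast-math, since that flag would let the compiler reorder the floating-point adds itself; exact timings will vary by CPU and compiler):

```c
#include <stdio.h>
#include <time.h>

#define N    (1 << 16) /* small enough to stay in cache */
#define REPS 10000

static float data[N];

/* One accumulator: every add depends on the previous result, so the
 * loop runs at the latency of a floating-point add per element. */
static float sum1(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += data[i];
    return s;
}

/* Four independent accumulators: the dependency chains don't touch
 * each other, so the core can execute several adds in parallel. */
static float sum4(void) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < N; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0f;
    double a = 0.0, b = 0.0;

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) {
        data[0] = 1.0f; /* touch memory so the call isn't hoisted */
        a += sum1();
    }
    double t1 = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int r = 0; r < REPS; r++) {
        data[0] = 1.0f;
        b += sum4();
    }
    double t2 = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("1 accumulator: %.2fs  4 accumulators: %.2fs  (sums %.0f %.0f)\n",
           t1, t2, a, b);
    return 0;
}
```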