AI flunks logic test: Multiple studies reveal illusion of reasoning

Cal Jeffrey

Bottom line: More and more AI companies say their models can reason. Two recent studies say otherwise. When asked to show their logic, most models flub the task – proving they're not reasoning so much as rehashing patterns. The result: confident answers, but not intelligent ones.

Apple researchers have uncovered a key weakness in today's most hyped AI systems – they falter at solving puzzles that require step-by-step reasoning. In a new paper, the team tested several leading models on the Tower of Hanoi, an age-old logic puzzle, and found that performance collapsed as complexity increased.

The Tower of Hanoi puzzle is simple: move a stack of disks from one peg to another while following rules about order and disk size. For humans, it's a classic test of planning and recursive logic. For language models trained to predict the next token, the challenge lies in applying fixed constraints across multiple steps without losing track of the goal.
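For scale, the canonical solution is a short recursive procedure, and the minimum number of moves grows as 2^n - 1 for n disks: 7 moves for three disks, but 1,023 for ten. A minimal Python sketch (illustrative only, not code from the Apple study) shows how compact the logic is:

# Minimal recursive Tower of Hanoi solver (illustrative sketch, not from the Apple paper).
# Solving n disks takes 2**n - 1 moves: 7 for three disks, 63 for six, 1,023 for ten.
def hanoi(n, source, target, spare, moves):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the smaller disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # re-stack the smaller disks on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 [('A', 'C'), ('A', 'B'), ('C', 'B'), ...]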

Apple's researchers didn't just ask the models to solve the puzzle – they asked them to explain their steps. While most handled two or three disks, their logic unraveled as the disk count rose. Models misstated rules, contradicted earlier steps, or confidently made invalid moves – even with chain-of-thought prompts. In short, they weren't reasoning – they were guessing.

These findings echo a study from April, when researchers at ETH Zurich and INSAIT tested top AI models on problems from the 2025 USA Mathematical Olympiad – a competition requiring full written proofs. Out of nearly 200 attempts, none produced a perfect solution. One of the stronger performers, Google's Gemini 2.5 Pro, earned 24 percent of the total points – not by solving 24 percent of the problems, but through partial credit on each attempt. OpenAI's o3-mini barely cleared 2 percent.

The models didn't just miss answers – they made basic errors, skipped steps, and contradicted themselves while sounding confident. In one problem, a model started strong but excluded valid cases without explanation. Others invented constraints based on training quirks, such as always boxing final answers – even when it didn't fit the context.

Gary Marcus, a longtime critic of AI hype, called Apple's findings "pretty devastating to large language models."

"It is truly embarrassing that LLMs cannot reliably solve Hanoi," he wrote. "If you can't use a billion dollar AI system to solve a problem that Herb Simon one of the actual 'godfathers of AI,' solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote."

Even when given explicit algorithms, model performance didn't improve. The study's co-lead Iman Mirzadeh put it bluntly:

"Their process is not logical and intelligent."

The results suggest what looks like reasoning is often just pattern matching – statistically fluent but not grounded in logic.

Not all experts were dismissive. Sean Goedecke, a software engineer specializing in AI systems, saw the failure as revealing.

"The model immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails," he wrote in his analysis of the Apple study. "The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts. So past eight or nine disks, the skill being investigated silently changes from 'can the model reason through the Tower of Hanoi sequence?' to 'can the model come up with a generalized Tower of Hanoi solution that skips having to reason through the sequence?'"

Rather than proving models are hopeless at reasoning, Goedecke suggested the findings highlight how AI systems adapt their behavior under pressure – sometimes cleverly, sometimes not. The failure isn't just in step-by-step reasoning but in abandoning the task when it becomes too unwieldy.

Tech companies often highlight simulated reasoning as a breakthrough. The Apple paper confirms that even models fine-tuned for chain-of-thought reasoning tend to hit a wall once cognitive load grows – for example, when tracking moves beyond six disks in Tower of Hanoi. The models' internal logic unravels, with some only managing partial success by mimicking rational explanations. Few display a consistent grasp of cause and effect or goal-directed behavior.

The results of the Apple and ETH Zurich studies stand in stark contrast to how companies market these models – as capable reasoners able to handle complex, multi-step tasks. In practice, what passes for reasoning is often just advanced autocomplete with extra steps. The illusion of intelligence arises from fluency and formatting, not true insight.

The Apple paper stops short of proposing sweeping fixes. However, it aligns with growing calls for hybrid approaches that combine large language models with symbolic logic, verifiers, or task-specific constraints. These methods may not make AI truly intelligent, but they could help prevent confidently wrong answers from being presented as facts.

Until such advances materialize, simulated reasoning is likely to remain what the name implies: simulated. It is useful – sometimes impressive – but far from genuine intelligence.


 
When we don't really understand the human mind all that well, it's going to be a while before we figure out algorithms, circuits, and programs that are actually able to reason and imagine.

Meanwhile, we've already discovered that with shoddy algorithms and LLMs, we can create machine learning that can make some amazing errors, insist they are fact, and lie about it.

"AI" is an extremely useful tool that can sort through massive volumes of information and calculate millions of permutations per second. .With human oversight and realistic expectations, it can do a lot of good. However both oversight and realistic expectations are in short supply during a gold rush.
 
The current brute-force approach of just throwing exabytes of data at the algorithm is doomed to failure. It burns gargantuan amounts of energy on models that still show the same basic flaws after years and, in many cases, get worse with more training.

Neuromorphic computing is the only way to go IMO.
 
Ah, the world is waking up to what AI really is. A steaming pile of :poop: that Nvidia and others think they are going to use to fleece everyone of their money by trying to convince everyone they cannot live without it.

IMO, it's about time everyone wakes up to the fad of AI and grows out of it.
 
It’s honestly refreshing to see studies calling out the reasoning gap so clearly. Just because a model can sound like a genius doesn’t mean it thinks like one — and too many companies lean hard on that illusion.

If a model can’t consistently follow an algorithmic process with clear rules, it’s not reasoning — it’s throwing statistically likely words at the problem.
 
"The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts.'"

It's not clever if the shortcut is just a trial-and-error fluke. It's only a clever shortcut when it's logically reasoned.
 
Having used GitHub Copilot, I can tell that it absolutely cannot work outside its training data. I'm working on quite niche industrial software, and it has no clue what we are doing, why, or where it's going. It's a simple program, and a human could map it and grasp its basic structure fairly easily.

But if I ask it to write a function to check whether a vector is inside a triangle, it gets it right instantly and even cites the authors of a particular algorithm.

I do like it a lot; it has completely removed any need for me to Google functions or sample code. It's just not going to write the program for me anytime soon.

Clearly, these AIs are not built to construct, track, and manipulate mental objects. The reasoning steps are just iterations of the output; the model is strictly tied to its text input and output.
 
AI just needs the intelligence to recognise when it is bluffing (or 'intellectually' guessing) and tell us that is the case. Here it should have said it had abandoned all 'reasoning/training' because it would take too long, but here is my random guess – take it or leave it.
 
While none of the models performed better than an average child, it is interesting to note that "Claude + Thinking" measurably outperformed the non-reasoning model. Meaning it's doing something. We're still in an age of rapid development; specialized models could stack rings better than humans within a decade, even if it's not technically logic.
 
Maybe the problem is that we use human terms to describe certain processes (intelligence, reasoning) and that creates false expectations. LLMs are not intelligent in the human way, but they are a very powerful tool that can solve Hanoi with ease. You only need to prompt it appropriately. It took Claude a few seconds to write and run the tests. So, I'm sorry to say this, but, dear Apple researchers, you are holding it wrong :D.

"write a simple program in python that solves the "Tower of Hanoi" problem. It should take as input the initial state as a matrix where the column represents the peg and the data 1,2,3, etc is the disk size. 0 represents empty. the program should show each move. as an intermediary state. also create a few test cases with up to 4 discs to see if it works."
 
Maybe the problem is that we use human terms to describe certain processes (intelligence, reasoning) and that creates false expectations. LLMs are not intelligent in the human way, but they are a very powerful tool that can solve Hanoi with ease. You only need to prompt it appropriately. It took Claude a few seconds to write and run the tests. So, I'm sorry to say this, but, dear Apple researchers, you are holding it wrong :D.

"write a simple program in python that solves the "Tower of Hanoi" problem. It should take as input the initial state as a matrix where the column represents the peg and the data 1,2,3, etc is the disk size. 0 represents empty. the program should show each move. as an intermediary state. also create a few test cases with up to 4 discs to see if it works."
Since Apple wasn't first to LLM-based AI and is now playing catch-up – an embarrassing state of affairs for a company that is no longer a trendsetter – this may just be sour grapes.
 
AI is "lazy" to think? Really? Who would have thought AI will mimick the perspective and behaviour of its developers?
The ones who are so damn lazy in thinking for themselve that they potently create a tool, essentially to eliminate their existence in the field, by their very own tool.
 
AI is "lazy" to think? Really? Who would have thought AI will mimick the perspective and behaviour of its developers?
The ones who are so damn lazy in thinking for themselve that they potently create a tool, essentially to eliminate their existence in the field, by their very own tool.
Except it isn't lazy... that's just humans pushing their own perspectives on a machine. The machine was coded to work "efficiently" - so when it's poorly coded/executed, this comes off as "laziness".

In 10 years (or less), no one will be making this mistake...
 
"The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts.'"

It's not clever if the shortcut is just a trial-and-error fluke. It's only a clever shortcut when it's logically reasoned.
Except it isn't lazy... that's just humans pushing their own perspectives on a machine. The machine was coded to work "efficiently" - so when it's poorly coded/executed, this comes off as "laziness".

In 10 years (or less), no one will be making this mistake...

But in a very real sense, humans behave the same way – given a tedious problem (increasing complexity, in this paper's example), we often try to find shortcuts, and that is reflected in the training data these models are built on. Of course, the shortcut might be developing an algorithm, which the models were in fact provided with.

Here's the thing: models aren't really code execution engines in their own right, so solving a complex problem that is better solved as an algorithm seems like a poor way to judge the model. The models weren't given the ability to do tool calling (the word "tools" only shows up once in the paper, in the context of the benchmarks being tools for measurement, so the researchers didn't even consider tool calling). Tool calling would allow the models to write and then execute some code, such as sending the code to an interpreter and getting the output back. If given the tool to execute arbitrary code, these reasoning models might perform just fine on the high-complexity puzzles presented in this paper (e.g., Tower of Hanoi with 6+ disks).
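To make the tool-calling idea concrete, here is a rough sketch of the kind of code-execution tool a harness could expose to a model (hypothetical wiring, not any vendor's actual tool-calling API):

# Hypothetical code-execution tool: run model-generated Python in a subprocess
# and hand the output back to the model, instead of making it enumerate
# every one of the 2**n - 1 moves token by token.
import os
import subprocess
import sys
import tempfile

def run_python(code: str, timeout: int = 10) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout or result.stderr
    finally:
        os.unlink(path)

# A tool-calling model would emit something like
# {"tool": "run_python", "code": "<its Hanoi solver>"}
# and the harness would return run_python(code) as the tool result.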

Even if that is the case, this doesn't take the models completely off the hook, but it does suggest a path for mitigating these models' weaknesses.
 
AI is literally in its infancy… that it can’t perform up to the standards we’ve set from SF novels shouldn’t shock anyone…
Let’s see how they do in 10 years… then 20… then 30….
In 30 years' time, there won't be enough energy on the entire planet to satisfy AI's computational requirements. Although quantum computing MIGHT mitigate some of it.
 
In 30 years' time, there won't be enough energy on the entire planet to satisfy AI's computational requirements. Although quantum computing MIGHT mitigate some of it.
There will be plenty… we’ll have fusion and other nuclear energy stations which can provide more than enough energy for centuries to come.
 
Maybe the problem is that we use human terms to describe certain processes (intelligence, reasoning) and that creates false expectations. LLMs are not intelligent in the human way, but they are a very powerful tool that can solve Hanoi with ease. You only need to prompt it appropriately. It took Claude a few seconds to write and run the tests. So, I'm sorry to say this, but, dear Apple researchers, you are holding it wrong :D.

"write a simple program in python that solves the "Tower of Hanoi" problem. It should take as input the initial state as a matrix where the column represents the peg and the data 1,2,3, etc is the disk size. 0 represents empty. the program should show each move. as an intermediary state. also create a few test cases with up to 4 discs to see if it works."
ChatGPT can do the same thing. Writing code is actually one thing that LLMs have gotten fairly decent at, especially simple algorithms like Hanoi, but that is not what they were testing. They were testing the reasoning of the models – whether they could accurately lay out the steps to solve the problems and show their work. I have not tested ChatGPT on that because I have no motivation to go through an exponentially increasing number of steps to see if it makes any errors.
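Checking a model's move list doesn't have to be done by hand, though. A short verifier can replay the moves and catch the first illegal one – a minimal sketch (illustrative, not from either study):

# Minimal Tower of Hanoi move verifier: moves is a list of (source, target) peg pairs.
def verify(n_disks, moves):
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for i, (src, dst) in enumerate(moves, 1):
        if not pegs[src]:
            return f"move {i}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return f"move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return "solved" if len(pegs["C"]) == n_disks else "legal moves, but puzzle not solved"

print(verify(3, [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
                 ("B", "A"), ("B", "C"), ("A", "C")]))  # -> solved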
 
They were testing the reasoning of the models – whether they could accurately lay out the steps to solve the problems and show their work.
There are two steps:
- generating the plan - I do not know how we can measure this in a benchmark
- executing the plan - this is already measured by various benchmarks (HELMeBench, FLAN, MT-Bench, AlpacaEval, BIG-Bench)

I do not believe these findings are new or "devastating" in any way. The "reasoning" approach has improved some of the metrics; they are far from flawless, and on some metrics the scores are well below 50%.
 
Ah, the world is waking up to what AI really is. A steaming pile of :poop: that Nvidia and others think they are going to use to fleece everyone of their money by trying to convince everyone they cannot live without it.

IMO, it's about time everyone wakes up to the fad of AI and grows out of it.

I wouldn't go so far as to say that. AI is a tool, and like any tool it has a limited use case. Can you use a hammer to screw in a screw? Of course not. The big problem isn't that AI's limitations make it a bunch of BS; it's that the hype has grown to the point where it's being touted as a do-anything device.
 
Goedecke apparently thinks about this the way I would: not in terms of pass and fail, but of function and result. Also, it seems most of you think these instances weren't planned, or at least planned for...


I wouldn't go so far as to say that. AI is a tool, and like any tool it has a limited use case. Can you use a hammer to screw in a screw? Of course not.

The claw of a hammer might turn some screws. Needs must, you know...
 
AI just needs the intelligence to recognise when it is bluffing (or 'intellectually' guessing) and tell us that is the case. Here it should have said it had abandoned all 'reasoning/training' because it would take too long, but here is my random guess – take it or leave it.
Yep. If it only had a brain!
 
I wouldn't go so far as to say that. AI is a tool, and like any tool it has a limited use case. Can you use a hammer to screw in a screw? Of course not. The big problem isn't that AI's limitations make it a bunch of BS; it's that the hype has grown to the point where it's being touted as a do-anything device.
Why screw in a screw when you can just hammer it in? Especially if a hammer is the only tool available.
 