AI flunks logic test: Multiple studies reveal illusion of reasoning

Cal Jeffrey

Bottom line: More and more AI companies say their models can reason. Two recent studies say otherwise. When asked to show their logic, most models flub the task – proving they're not reasoning so much as rehashing patterns. The result: confident answers, but not intelligent ones.

Apple researchers have uncovered a key weakness in today's most hyped AI systems – they falter at solving puzzles that require step-by-step reasoning. In a new paper, the team tested several leading models on the Tower of Hanoi, an age-old logic puzzle, and found that performance collapsed as complexity increased.

The Tower of Hanoi puzzle is simple: move a stack of disks from one peg to another while following rules about order and disk size. For humans, it's a classic test of planning and recursive logic. For language models trained to predict the next token, the challenge lies in applying fixed constraints across multiple steps without losing track of the goal.
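To make concrete what the models were being asked to track, here is a minimal recursive solver – an illustrative Python sketch, not code from the Apple paper:

def hanoi(n, source, target, spare, moves):
    # Append the moves needed to shift n disks from source to target.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller disks on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))   # 7 – a stack of n disks needs 2**n - 1 moves

The rules never change as disks are added, but the number of steps a solver must track roughly doubles with each extra disk, which is what makes the puzzle a useful stress test for multi-step reasoning.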

Apple's researchers didn't just ask the models to solve the puzzle – they asked them to explain their steps. While most handled two or three disks, their logic unraveled as the disk count rose. Models misstated rules, contradicted earlier steps, or confidently made invalid moves – even with chain-of-thought prompts. In short, they weren't reasoning – they were guessing.

These findings echo a study from April, when researchers at ETH Zurich and INSAIT tested top AI models on problems from the 2025 USA Mathematical Olympiad – a competition requiring full written proofs. Out of nearly 200 attempts, none produced a perfect solution. One of the stronger performers, Google's Gemini 2.5 Pro, earned 24 percent of the total points – not by solving 24 percent of the problems, but through partial credit on each attempt. OpenAI's o3-mini barely cleared 2 percent.

The models didn't just miss answers – they made basic errors, skipped steps, and contradicted themselves while sounding confident. In one problem, a model started strong but excluded valid cases without explanation. Others invented constraints based on training quirks, such as always boxing final answers – even when the format didn't fit the context.

Gary Marcus, a longtime critic of AI hype, called Apple's findings "pretty devastating to large language models."

"It is truly embarrassing that LLMs cannot reliably solve Hanoi," he wrote. "If you can't use a billion dollar AI system to solve a problem that Herb Simon one of the actual 'godfathers of AI,' solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote."

Even when given explicit algorithms, model performance didn't improve. The study's co-lead Iman Mirzadeh put it bluntly:

"Their process is not logical and intelligent."

The results suggest what looks like reasoning is often just pattern matching – statistically fluent but not grounded in logic.

Not all experts were dismissive. Sean Goedecke, a software engineer specializing in AI systems, saw the failure as revealing.

"The model immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails," he wrote in his analysis of the Apple study. "The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts. So past eight or nine disks, the skill being investigated silently changes from 'can the model reason through the Tower of Hanoi sequence?' to 'can the model come up with a generalized Tower of Hanoi solution that skips having to reason through the sequence?'"

Rather than proving models are hopeless at reasoning, Goedecke suggested the findings highlight how AI systems adapt their behavior under pressure – sometimes cleverly, sometimes not. The failure isn't just in step-by-step reasoning but in abandoning the task when it becomes too unwieldy.

Tech companies often highlight simulated reasoning as a breakthrough. The Apple paper confirms that even models fine-tuned for chain-of-thought reasoning tend to hit a wall once cognitive load grows – for example, when tracking moves beyond six disks in Tower of Hanoi. The models' internal logic unravels, with some only managing partial success by mimicking rational explanations. Few display a consistent grasp of cause and effect or goal-directed behavior.

The results of the Apple and ETH Zurich studies stand in stark contrast to how companies market these models – as capable reasoners able to handle complex, multi-step tasks. In practice, what passes for reasoning is often just advanced autocomplete with extra steps. The illusion of intelligence arises from fluency and formatting, not true insight.

The Apple paper stops short of proposing sweeping fixes. However, it aligns with growing calls for hybrid approaches that combine large language models with symbolic logic, verifiers, or task-specific constraints. These methods may not make AI truly intelligent, but they could help prevent confidently wrong answers from being presented as facts.

Until such advances materialize, simulated reasoning is likely to remain what the name implies: simulated. It is useful – sometimes impressive – but far from genuine intelligence.

 
AI is literally in its infancy… that it can't perform up to the standards we've set based on SF novels shouldn't shock anyone…
Let's see how they do in 10 years… then 20… then 30…
 
When we don't really understand the human mind all that well, it's going to be a while before we figure out algorithms, circuits, and programs that can actually reason and imagine.

Meanwhile, we've already discovered that with shoddy algorithms and LLMs, we can create machine learning that can make some amazing errors, insist they are fact, and lie about it.

"AI" is an extremely useful tool that can sort through massive volumes of information and calculate millions of permutations per second. .With human oversight and realistic expectations, it can do a lot of good. However both oversight and realistic expectations are in short supply during a gold rush.
 
The current brute-force approach of just throwing exabytes of data at the algorithm is doomed to failure: gargantuan amounts of energy, and models that still show the same basic flaws after years – and in many cases get worse with more training.

Neuromorphic computing is the only way to go IMO.
 
Ah, the world is waking up to what AI really is: a steaming pile of :poop: that Nvidia and others think they can use to fleece everyone of their money by convincing them they cannot live without it.

IMO, it's about time everyone wakes up to the fad of AI and grows out of it.
 
It’s honestly refreshing to see studies calling out the reasoning gap so clearly. Just because a model can sound like a genius doesn’t mean it thinks like one — and too many companies lean hard on that illusion.

If a model can’t consistently follow an algorithmic process with clear rules, it’s not reasoning — it’s throwing statistically likely words at the problem.
 
"The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts.'"

It's not clever if the shortcut is just a trial'n'error fluke. It's a clever shortcut when logically reasoned.
 
Having used GitHub Copilot, I can tell that it absolutely cannot work outside its training data. I'm working on quite niche industrial software, and it has no clue what we are doing, why, or where it's going. It's a simple program, and a human could map it and grasp its basic structure fairly easily.

But if I ask it to write a function to check whether a vector is inside a triangle, it gets it right instantly and even cites the authors of a particular algorithm.

I do like it a lot; it has completely removed any need for me to google for functions or sample code. It's just not going to write the program for me anytime soon.

Clearly these AIs are not built to construct, track, and manipulate mental objects. The "reasoning" steps are just iterations of the output; it is strictly tied to the text input and output.
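For contrast, the point-in-triangle routine the commenter mentions is exactly the kind of well-documented textbook function a model can reproduce from its training data. A common version checks the signs of three cross products – an illustrative sketch, not the commenter's or Copilot's actual code:

def point_in_triangle(p, a, b, c):
    # True if point p lies inside (or on the edge of) triangle abc.
    def cross(o, u, v):
        # z-component of the cross product (u - o) x (v - o)
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)   # all three signs agree => inside

print(point_in_triangle((0.25, 0.25), (0, 0), (1, 0), (0, 1)))   # True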
 
AI just needs the intelligence to recognise when it is bluffing (or "intellectually" guessing) and to tell us that's the case. Here it should have said it had abandoned all "reasoning/training" because it would take too long, but here is my random guess – take it or leave it.
 
While none of the models performed better than an average child, it is interesting to note that "Claude + Thinking" measurably outperformed the non-reasoning model. Meaning it's doing something. We're still in an age of rapid development; specialized models could stack rings better than humans within a decade, even if it's not technically logic.
 
Maybe the problem is that we use human terms to describe certain processes (intelligence, reasoning) and that creates false expectations. LLMs are not intelligent in the human way, but are a very powerful tool that can solve hanoi with ease. You only need to prompt it appropriately. It took claude a few seconds to write and run the tests. So, I'm sorry to say this, but, dear apple researchers, you are holding it wrong :D.

"write a simple program in python that solves the "Tower of Hanoi" problem. It should take as input the initial state as a matrix where the column represents the peg and the data 1,2,3, etc is the disk size. 0 represents empty. the program should show each move. as an intermediary state. also create a few test cases with up to 4 discs to see if it works."
 
Since Apple was not first to LLM AI and is now playing catch-up – an embarrassing state of affairs for a company that is no longer a trendsetter – this may just be sour grapes.
 
AI is "lazy" to think? Really? Who would have thought AI will mimick the perspective and behaviour of its developers?
The ones who are so damn lazy in thinking for themselve that they potently create a tool, essentially to eliminate their existence in the field, by their very own tool.
 
AI is "lazy" to think? Really? Who would have thought AI will mimick the perspective and behaviour of its developers?
The ones who are so damn lazy in thinking for themselve that they potently create a tool, essentially to eliminate their existence in the field, by their very own tool.
Except it isn't lazy... that's just humans projecting their own perspectives onto a machine. The machine was coded to work "efficiently" – so when it's poorly coded or executed, that comes off as "laziness".

In 10 years (or less), no one will be making this mistake...
 