AI search engines fail accuracy test, study finds 60% error rate


Cal Jeffrey

Staff member
In context: It is a foregone conclusion that AI models can lack accuracy. Hallucinations and doubling down on wrong information have been an ongoing struggle for developers. Usage varies so much across individual use cases that it's hard to nail down quantifiable accuracy figures. A research team claims it now has those numbers.

The Tow Center for Digital Journalism recently studied eight AI search engines: ChatGPT Search, Perplexity, Perplexity Pro, Gemini, DeepSeek Search, Grok-2 Search, Grok-3 Search, and Copilot. The researchers tested each for accuracy and recorded how frequently the tools refused to answer.

The researchers randomly chose 200 news articles from 20 news publishers (10 each). They ensured each story returned within the top three results in a Google search when using a quoted excerpt from the article. Then, they performed the same query within each AI search tool and graded accuracy based on whether the search correctly cited A) the article, B) the news organization, and C) the URL.

The researchers then labeled each search based on degrees of accuracy from "completely correct" to "completely incorrect." As you can see from the diagram below, other than both versions of Perplexity, the AIs did not perform well. Collectively, AI search engines are inaccurate 60 percent of the time. Furthermore, these wrong results were reinforced by the AI's "confidence" in them.
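The grading rubric described above can be sketched in a few lines. This is a hypothetical reconstruction, not the Tow Center's actual code: the function name and the exact mapping from matched fields to labels are assumptions, based on the study's stated criteria of citing (A) the article, (B) the news organization, and (C) the URL.

```python
def grade_citation(response: dict, expected: dict) -> str:
    """Compare a search tool's cited article, publisher, and URL
    against the known source, and label the result on the study's
    scale from "completely correct" to "completely incorrect"."""
    checks = [
        response.get("article") == expected["article"],
        response.get("publisher") == expected["publisher"],
        response.get("url") == expected["url"],
    ]
    matched = sum(checks)
    if matched == 3:
        return "completely correct"
    if matched == 0:
        return "completely incorrect"
    return "partially correct"

# Example: right article and outlet, but a fabricated URL
expected = {"article": "Example Story", "publisher": "Example News",
            "url": "https://example.com/story"}
response = {"article": "Example Story", "publisher": "Example News",
            "url": "https://example.com/wrong"}
print(grade_citation(response, expected))  # partially correct
```

The interesting failure mode the study highlights is exactly the middle case: a confident answer that gets the outlet right but invents the URL.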

The study is fascinating because it quantifiably confirms what we have known for a few years – that LLMs are "the slickest con artists of all time." They report with complete authority that what they say is true even when it is not, sometimes to the point of argument or making up other false assertions when confronted.

In a 2023 anecdotal article, Ted Gioia (The Honest Broker) pointed out dozens of ChatGPT responses, showing that the bot confidently "lies" when responding to numerous queries. While some examples were adversarial queries, many were just general questions.

"If I believed half of what I heard about ChatGPT, I could let it take over The Honest Broker while I sit on the beach drinking margaritas and searching for my lost shaker of salt," Gioia flippantly noted.

Even when admitting it was wrong, ChatGPT would follow up that admission with more fabricated information. The LLM is seemingly programmed to answer every user input at all costs. The researchers' data confirms this hypothesis, noting that ChatGPT Search was the only AI tool that answered all 200 article queries. However, it achieved a completely accurate rating only 28 percent of the time and was completely inaccurate 57 percent of the time.

ChatGPT isn't even the worst of the bunch. Both versions of X's Grok AI performed poorly, with Grok-3 Search being 94 percent inaccurate. Microsoft's Copilot was not that much better when you consider that it declined to answer 104 queries out of 200. Of the remaining 96, only 16 were "completely correct," 14 were "partially correct," and 66 were "completely incorrect," making it roughly 70 percent inaccurate.
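The Copilot figures above check out. A quick calculation (using the numbers quoted from the study) shows where "roughly 70 percent inaccurate" comes from:

```python
# Copilot's results on the 200 test queries, per the study's figures.
declined = 104
completely_correct = 16
partially_correct = 14
completely_incorrect = 66

answered = completely_correct + partially_correct + completely_incorrect
assert declined + answered == 200  # all 200 queries accounted for

# Error rate among the queries it actually answered: 66 / 96
error_rate = completely_incorrect / answered
print(f"{error_rate:.0%}")  # 69%, i.e. "roughly 70 percent inaccurate"
```

Note this counts only the queries Copilot chose to answer; folding the 104 refusals into the denominator would paint a different (arguably worse) picture.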

Arguably, the craziest thing about all this is that the companies making these tools are not transparent about this lack of accuracy while charging the public $20 to $200 per month to access their latest AI models. Moreover, Perplexity Pro ($20/month) and Grok-3 Search ($40/month) answered slightly more queries correctly than their free versions (Perplexity and Grok-2 Search) but had significantly higher error rates (above). Talk about a con.

However, not everyone agrees. TechRadar's Lance Ulanoff said he might never use Google again after trying ChatGPT Search. He describes the tool as fast, aware, and accurate, with a clean, ad-free interface.

Feel free to read all the details in the Tow Center's paper published in the Columbia Journalism Review, and let us know what you think.

 
It's funny. I was tinkering around with ChatGPT many moons ago and asked it about someone who was dead. I asked it to write a eulogy. The dead person was famous and had recently died, so I wasn't trying to trick it. It wrote something very elegant. It seemed pretty good until I started fact-checking it. The main thing that clued me in was that it quoted a friend speaking of the man's passing. So I googled the remark. The friend was real, but nowhere on the internet did he ever make that comment. So I started looking at other "facts" about his life and stuff. Totally made up. All of it. ChatGPT got his and his friend's names right, and his occupation. But that was pretty much it. All the rest was made-up filler.
 
Enterprise deployments of LLM tech are far from accurate but often lead to breakthroughs. Sure, it can replace some people doing menial task execution, but I have learned via daily real-world experience that LLMs are neither reliably accurate nor mathematically consistent. Every deployment requires oversight, review, maintenance, and sometimes cleanup... just like any other software.
 
I hate the auto AI answers in search engines these days. They're almost always wrong, hallucinating things that don't exist.


Definitely room for improvement

I think with these studies we need the real info in the details, i.e., what counts as a pass or a fail - is just one minor thing wrong a fail? I would really like a better breakdown.

The big trumprecession in the room is...

Vs what??? Google - the first 4 lines may be promoted crap.
Google search is anecdotally deteriorating, and you also needed some smarts to get the best out of it.

I wonder what you'd get if you just asked: what are the latest articles on blahblah?

LLMs have been considered quite accurate when there are 10,000s of papers on something.

I think another problem is that news sites are often just aggregators - even the outlet may not quote a source; sometimes there seems to be no source at all and they each reference each other - "drink X cups of water a day."
At least TS normally posts the link, if this really were a discussion site :)
 
The moral of the story is, AI is still far from being useful. It's good at some things, but there are severe shortcomings in the form of hallucinations. To me, it does not matter whether it's 60 or 100% inaccurate. If it's 60% inaccurate, for example, it just means you have to review every single word or line of code, which kind of means redoing the entire work while trying to identify which parts are factual.
 
As if anyone here could possibly be surprised that "AI" is a fraud and a money-laundering scam.

There is no "AI" yet. It's just algorithms, the same as we've always had in computing, rebranded to milk sucker investors and dupe everyone into thinking the tech is even remotely real.
 
Quick! Do I press red or green button to defuse this bomb?
Green, accuracy is approximately 66%
 
As if anyone here could possibly be surprised that "AI" is a fraud and a money-laundering scam.

There is no "AI" yet. It's just algorithms, the same as we've always had in computing, rebranded to milk sucker investors and dupe everyone into thinking the tech is even remotely real.

Ah yes, but to paraphrase a famous quote, "Ok, prove *you're* not just an algorithm."
 
I guess AI is just getting in on the current post-truth bandwagon.

It makes you wonder what the human average for accuracy might be.
 
That's a truly weird way to perform an "accuracy test."
Such a study should evaluate the accuracy of the information provided in response to a query, not the ability to pinpoint a particular source containing an exact quote.
 
Doesn't anyone remember the rules of the internet?
Always assume it is a lie?
Things are fake?
or
The truth is never on the internet, and if it is, it's been downplayed or censored?

The only thing worse than a political id10T and their ideology is an AI program that searches the internet for answers. Then, of course, these "prove it" people think everything on the internet is true.
 
That's not what you use AI for. The researchers were using it to generate, essentially, footnotes. AI projects seem to be capable of taking a complex query that requires amalgamating a number of "established" facts and forming one coherent answer, thus saving you a lot of time. I've found it to be quite good at that -- amazing, actually.

Knowing when and how to use AI is the key to deriving benefit from it. Knowing how to design a research study that reflects the realities of a given situation is the key to producing an actually useful research study. The study being written about here fails.
 
Color me absolutely not surprised by this result since those who have been jumping on the AI bandwagon are only interested in one thing - beating everyone else to the market to reap "profits."

This is the latest Tech Fad. IMO, in some respects, it's similar to how many jumped on the streaming bandwagon without understanding why people dumped subscription TV in favor of streaming. There, also, they thought that they were going to reap untold profits.
I guess AI is just getting in on the current post-truth bandwagon.

It makes you wonder what the human average for accuracy might be.
More like the "Alternative Facts" Bandwagon
 
That's a truly weird way to perform an "accuracy test."
Such a study should evaluate the accuracy of the information provided in response to a query, not the ability to pinpoint a particular source containing an exact quote.
In a scientific context, the source is just as important as the information.
 
What is the accuracy rate? And what is the accuracy rate of non-AI search? It seems somewhat subjective at any rate. I heard Grok on a podcast, and it was really bad.
 
Doesn't anyone remember the rules of the internet?
Always assume it is a lie?
Things are fake?
or
The truth is never on the internet, and if it is, it's been downplayed or censored?

The only thing worse than a political id10T and their ideology is an AI program that searches the internet for answers. Then, of course, these "prove it" people think everything on the internet is true.
If it's on the internet, it's true! That's the only rule most abide by. Thinking is for the birds.
 
It's a running joke, man - have you never heard of it before?
So was mine. Like being host of the Oscars, but for the Darwin Awards. I'll rephrase it better since I'm not sleepy anymore. EDIT: I guess you can't edit after someone posts.
 
My experience is that AI is garbage.

I downloaded DeepSeek and played with it for a while. What irritated me was the inability to tailor the responses. I asked for a list of ALL configuration options, including programming options. It listed 10 vague options. It was not possible to restrict the responses to just Yes/No.

A good tool should be configurable.
 