The humble PDF is becoming a problem for AI

Skye Jacobs · Feb 26, 2026

Looking ahead: Three decades after Adobe introduced the Portable Document Format – a design intended to preserve the appearance of printed pages across devices – PDFs are facing pressure from a completely different kind of reader: artificial intelligence. The same fixed layouts that made PDFs indispensable to human users now make them difficult for large language models to interpret. Unlike web pages or plain-text files, PDFs carry columns, embedded graphics, and hidden metadata that often confuse machine parsing systems trained to process linear text.

Researchers and developers working with large language models say these structural quirks introduce subtle but significant errors. An AI that reads lines strictly from left to right may stumble over multi-column scientific papers or misinterpret footers as part of the main text. These parsing issues can cascade into so-called "hallucinations," where a model produces inaccurate summaries or fabricates details.

Unlike basic text formats, PDFs are not built around logical document objects but around graphical coordinates – every letter is placed precisely where it should appear on a page. This design, ideal for visual consistency, means that extracting meaning requires recognizing text order, hierarchy, and context that are not explicitly represented in the file. Accessibility software for visually impaired users faces similar barriers, as do data-analysis tools that attempt to scrape tables or figures from reports.
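To make the reading-order problem concrete, here is a minimal sketch in Python. It does not parse a real PDF: the `Word` records, the coordinates, and the `gutter_x` value are all hypothetical stand-ins for what a text extractor might hand back from a two-column page. It only illustrates why sorting positioned text line by line across the full page width scrambles the columns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    x: float   # left edge of the word, in points from the left of the page
    y: float   # line position, in points from the top of the page
    text: str


def naive_order(words):
    """Read top-to-bottom, left-to-right across the full page width."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_order(words, gutter_x):
    """Read the left column top-to-bottom, then the right column.

    gutter_x is an assumed, known x position separating the two columns;
    a real tool would have to infer it from the layout.
    """
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A made-up two-column page: two lines per column.
PAGE = [
    Word(50, 100, "PDFs"), Word(90, 100, "store"),
    Word(50, 115, "glyph"), Word(95, 115, "positions."),
    Word(320, 100, "Reading"), Word(375, 100, "order"),
    Word(320, 115, "is"), Word(335, 115, "implicit."),
]

print(naive_order(PAGE))
# PDFs store Reading order glyph positions. is implicit.
print(column_order(PAGE, gutter_x=300))
# PDFs store glyph positions. Reading order is implicit.
```

The second function only works because the gutter position is given. Recovering that kind of structure from coordinates alone is the part the file does not explicitly represent.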

Security adds another layer of complexity. Cybersecurity firm Check Point reports that roughly one in five email-based attacks uses infected PDFs, exploiting the format's ability to embed scripts and links. The PDF's popularity ensures it remains both a universal exchange medium and a favored vector for malware.

Some entrepreneurs see the AI stumbling block as an opportunity to rebuild the infrastructure of digital documents. Factify, an Israeli startup led by Matan Gavish, is developing a format designed from the ground up to interact smoothly with large language models. Of the PDF, Gavish told Globes: "In the end, it's a closed and inefficient object, and one that's not suitable for the era of AI automation."

"We are building a new system, a new format, a data layer, and user experience interface applications. In order to build a connected, smart document that can support changes, it is necessary to build a lot of things from scratch. I don't know anyone else who has thought about and gone big on this."

Others argue the problem lies with AI systems rather than the format itself. Duff Johnson, who heads the nonprofit PDF Association, says developers can build models and tools that better interpret the specification rather than discard it.

Adobe has already embedded an AI assistant into Acrobat, its long-running reader application, designed to summarize, query, and extract information from documents. Google has introduced similar support within its Gemini developer tools, offering methods to convert PDFs into model-friendly text structures.

Despite its critics – and the format's occasional clumsiness on smartphones – the PDF has shown remarkable endurance. Adobe relinquished full control of the format when it was standardized in 2008, and roughly 2.5 trillion PDFs are now estimated to be in circulation worldwide, spanning tax filings, research papers, government forms, and more. Each document reflects a promise that its visual appearance can outlast any single device.

Whether that promise remains viable in the AI era depends on how quickly the format and surrounding tools evolve. For now, even as startups experiment with alternatives, the PDF remains the common language of digital paperwork – one that machines, despite their growing intelligence, are still learning to read.

kingmustard · Feb 26, 2026

Good.

BlackyNoir · Feb 26, 2026

It's not like we've been ranting about PDF usability and readability as anything but an intermediate format for printers for... decades now. And how it's used, as the default document format, is horrendous for any humans. Oh wait, we did.

But if it can screw AI a bit, at least there's a bit of light in that dark hole of a shitty document format.

As to a better format, no need to reinvent the wheel. EPUB exists, and works perfectly fine for documents, including semantics (if it's properly edited, which is not that common unfortunately). OpenDocument works very well too. And they aren't proprietary bullshit.

Underdog · Feb 26, 2026

Gosh! Now we have to dumb down the ubiquitous document format so that the great AI can cope with it.

LordVile95 · Feb 26, 2026

BlackyNoir said:
It's not like we've been ranting about PDF usability and readability as anything but an intermediate format for printers for... decades now. And how it's used, as the default document format, is horrendous for any humans. Oh wait, we did.

But if it can screw AI a bit, at least there's a bit of light in that dark hole of a shitty document format.

As to a better format, no need to reinvent the wheel. EPUB exists, and works perfectly fine for documents, including semantics (if it's properly edited, which is not that common unfortunately). OpenDocument works very well too. And they aren't proprietary bullshit.

I personally like PDFs, I can make a document, convert it to PDF and then no one can change or edit it

kingmustard · Feb 26, 2026

LordVile95 said:
I personally like PDFs, I can make a document, convert it to PDF and then no one can change or edit it

Course they can. Plenty of online editors that bypass any protections.

Anton Longshot · Feb 26, 2026

Those poor LLMs!
We must save them. Outlaw PDFs before it's too late!
Force Adobe to convert all PDFs into Word files, PowerPoint or both.
MANUALLY. That will teach them not to mess with AI!

Yeah OK I'll admit I'm getting tired of the current #1 subject matter. Apologies.

ScottSoapbox · Feb 26, 2026

So have AI vibe code an agentic script to make PDFs readable for AI's too-narrow text processing.

You're welcome tech bros! I'm here to help.

bviktor · Feb 26, 2026

BlackyNoir said:
It's not like we've been ranting about PDF usability and readability as anything but an intermediate format for printers for... decades now. And how it's used, as the default document format, is horrendous for any humans. Oh wait, we did.

But if it can screw AI a bit, at least there's a bit of light in that dark hole of a shitty document format.

As to a better format, no need to reinvent the wheel. EPUB exists, and works perfectly fine for documents, including semantics (if it's properly edited, which is not that common unfortunately). OpenDocument works very well too. And they aren't proprietary bullshit.

PDF was never intended to be an "intermediate" format for printers.

Its purpose is to be immutable and retain the exact same design and format, always, on all devices, at all times. Hence the term "portable document format".

Epub is pretty much the exact opposite of this. It's designed to dynamically adjust and reformat the document according to your device and preferences. OpenDocument is for editing documents, which is, again, the opposite of what PDF does. You're confused.

And for the record, PDF came WELL before any of those two. Not that this fact had any relevance.

Hotdog1n · Feb 26, 2026

Go get em PDFs

unoficialoficial · Feb 26, 2026

The ability of AI models to retrieve accurate information from PDF documents is critical, especially for scientific documents. For RAG, LM Studio uses the "nomic-embed-text-v1.5.Q4_K_M.gguf" model, which is fast and good enough. It saves this model inside the folder C:\Users\username\AppData\Local\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF.

So, if you want to dramatically improve RAG accuracy (not only over PDFs), you can go to huggingface.co/mykor/bge-m3.gguf and download Bge-M3-567M-Q6_K.gguf. Keep it in a folder of your own and rename it to "nomic-embed-text-v1.5.Q4_K_M.gguf". Every time you update LM Studio, copy the renamed Bge model from where you saved it into that folder in the AppData path, overwriting the bundled file. It is a little slower and requires a little more VRAM, but it is a lot more intelligent at retrieving information from documents.

In short, if you are not satisfied with the RAG accuracy for PDFs, replace the 137M parameter Nomic Q4 model that LM Studio uses with the 567M parameter Bge Q6 model. Both models support up to 8192 tokens.
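The copy-and-overwrite step described above could be scripted so it is easy to repeat after each update. This is only a sketch of that manual step: `install_replacement` is a hypothetical helper, not part of LM Studio, and the location of the downloaded file is an assumption you would need to adjust.

```python
import shutil
from pathlib import Path

# File name the bundled embedding model uses, as given in the post above.
BUNDLED_NAME = "nomic-embed-text-v1.5.Q4_K_M.gguf"


def install_replacement(replacement: Path, bundled_dir: Path) -> Path:
    """Copy `replacement` into `bundled_dir` under the bundled model's name.

    Any existing file with that name is overwritten, mirroring the manual
    "copy with overwrite" step. Returns the path that was written.
    """
    target = bundled_dir / BUNDLED_NAME
    shutil.copy2(replacement, target)
    return target


# Example call (placeholders: change "username" and the saved-model folder):
#
# install_replacement(
#     Path(r"C:\models\Bge-M3-567M-Q6_K.gguf"),  # hypothetical save location
#     Path(r"C:\Users\username\AppData\Local\LM Studio\resources\app"
#          r"\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF"),
# )
```

Keeping the download under its original name and renaming only at copy time means the saved file stays identifiable as the Bge model.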

p51d007 · Feb 26, 2026

I know it adds to the size of a pdf, but if you have a really complex PDF with lots of photos and what not, then DON'T flatten the document. That would probably give the AI system a heart attack with that many layers.

Squid Surprise · Feb 26, 2026

LordVile95 said:
I personally like PDFs, I can make a document, convert it to PDF and then no one can change or edit it

Unless they print to Microsoft PDF and then open in Word… or about 10,000 other ways to get around it…

LordVile95 · Feb 27, 2026

kingmustard said:
Course they can. Plenty of online editors that bypass any protections.

Depends on the formatting used and the original program

LordVile95 · Feb 27, 2026

Squid Surprise said:
Unless they print to Microsoft PDF and then open in Word… or about 10,000 other ways to get around it…

Depends on the formatting, and on whether said person is computer literate enough to do much more than use basic Word functionality or doesn't really know what a PDF actually is.

mbrowne5061 · Feb 27, 2026

BlackyNoir said:
...As to a better format, no need to reinvent the wheel. EPUB exists, and works perfectly fine for documents, including semantics (if it's properly edited, which is not that common unfortunately). OpenDocument works very well too. And they aren't proprietary bullshit.

Using EPUB for technical schematics and drawings sounds like a terrible idea.

"We have two EPUB-format drawings that both say 'Rev C' on them, but they have multiple differences between the two because someone edited at least one of these. Which doc is correct? Is either one correct? Quick, someone go get the original CAD file and output a new Rev C drawing - as a PDF this time"

I would expect similar issues when it comes to legal or medical forms, too. The point of PDF is that it doesn't change, no matter what device you open it on. Especially if your organization's IT department is even halfway competent and implements cryptographic signatures that either prevent editing altogether (while still allowing notation/comments/feedback) or at least alert a reader if a doc was changed after being signed.

Would it make more sense to digitally publish things like textbooks as EPUB? Sure, probably. Academic or industry papers, or government reports? Maybe. Legal documents, ranging from contracts, to releases, to technical data? Absolutely not.

user556 · Feb 27, 2026

I doubt the reason is as stated. It's much more likely from preexisting attempts to make copy'n'paste a painful experience. Obfuscation of text's linear ordering will screw with any automated reader that uses the embedded text rather than graphical look of the text.

DuffJohnson · Feb 27, 2026

BlackyNoir said:
As to a better format, no need to reinvent the wheel. EPUB exists, and works perfectly fine for documents, including semantics (if it's properly edited, which is not that common unfortunately). OpenDocument works very well too. And they aren't proprietary bullshit.

EPUB's great, but it's not a general-purpose document format, and lacks many PDF features that are critical to general-purpose applications.

PDF isn't proprietary - it became an ISO standard in 2008.

Alpine7995 · Feb 27, 2026

PDFs were basically invented to freeze a page in time, and now we’re surprised that models trained on flowing text struggle with something that’s literally a bunch of coordinates on a canvas. It’s like blaming a self-driving car for not understanding a painting of a road.

techstrike · Feb 27, 2026

That's great, let me continue to convert the rest of my emails to PDFs.
Had no idea that would be my best backup solution.

Coisa · Feb 27, 2026

Hooray! Can we develop more formats that are currently beyond AI's scraping, and make them all impossible to scrape forever? I absolutely believe that AI violates copyright, whether it is common copyright protected or legally protected. It is enraging that AI misinterprets the content and purpose of my personal website. AI summarizes it in a way that is insulting and there is no way I know of changing that.

Eflow · Feb 27, 2026

kingmustard said:
Course they can. Plenty of online editors that bypass any protections.

Not sure of the original poster's context, but I use pdfs all the time at work when I want to "publish" something. The goal isn't to stop a malicious actor, but simply to make sure that someone doesn't accidentally blow up an easily editable file.

daffy duck · Feb 28, 2026

Sorry, what's the problem? Is this a joke?

Jack77 · Feb 28, 2026

WWW 4.0 --> all webpages are PDFs with links between them

Iconscious · Feb 28, 2026

ScottSoapbox said:
So have AI vibe code an agentic script to make PDFs readable for AI's too-narrow text processing.

You're welcome tech bros! I'm here to help.

I ran into this exact problem with a 35 MB PDF. It equates to around 500k tokens to ingest. Sure, you can run a local LLM, but that doesn't scale easily and you need a pretty powerful computer to keep all the data in the same context window.
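For a sense of the arithmetic: at a rough four characters per token (a rule-of-thumb assumption, not a measured figure for any particular model), 500k tokens is about two million characters of extracted text. One common workaround, which the post above does not describe, is to split the text into chunks that each fit a smaller token budget instead of holding everything in one context window. A minimal sketch, with hypothetical helper names:

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def split_by_budget(paragraphs, max_tokens, chars_per_token=4.0):
    """Greedily pack paragraphs into chunks of at most max_tokens (estimated).

    A single paragraph larger than the budget still becomes its own chunk;
    this sketch does not split inside paragraphs.
    """
    chunks, current, used = [], [], 0
    for paragraph in paragraphs:
        cost = estimate_tokens(paragraph, chars_per_token)
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Three 40-character paragraphs, about 10 estimated tokens each, budget of 20:
parts = split_by_budget(["a" * 40, "b" * 40, "c" * 40], max_tokens=20)
print(len(parts))  # 2
```

Chunking trades away the ability to reason over the whole document at once, which is exactly the limitation the post is pointing at.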
