The humble PDF is becoming a problem for AI

Skye Jacobs

Looking ahead: Three decades after Adobe introduced the Portable Document Format – a design intended to preserve the appearance of printed pages across devices – PDFs are facing pressure from a completely different kind of reader: artificial intelligence. The same fixed layouts that made PDFs indispensable to human users now make them difficult for large language models to interpret. Unlike web pages or plain-text files, PDFs contain columns, embedded graphics, and hidden metadata that often confuse machine parsing systems trained to process linear text.

Researchers and developers working with large language models say these structural quirks introduce subtle but significant errors. An AI that reads lines strictly from left to right may stumble over multi-column scientific papers or misinterpret footers as part of the main text. These parsing issues can cascade into so-called "hallucinations," where a model produces inaccurate summaries or fabricates details.

Unlike basic text formats, PDFs are not built around logical document objects but around graphical coordinates – every letter is placed precisely where it should appear on a page. This design, ideal for visual consistency, means that extracting meaning requires recognizing text order, hierarchy, and context that are not explicitly represented in the file. Accessibility software for visually impaired users faces similar barriers, as do data-analysis tools that attempt to scrape tables or figures from reports.
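
To make the reading-order problem concrete, here is a minimal sketch in Python. It assumes hypothetical text fragments that carry only a page position, roughly as a PDF would store them. The `Fragment` type, the sample page, and the `column_gap` threshold are all illustrative and not taken from any real parser.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A run of text with the page position of its start (hypothetical data)."""
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page
    text: str


# A two-column page: the file stores positions, not reading order.
PAGE = [
    Fragment(300, 10, "Right column, line 1."),
    Fragment(20, 10, "Left column, line 1."),
    Fragment(20, 30, "Left column, line 2."),
    Fragment(300, 30, "Right column, line 2."),
]


def naive_order(fragments):
    """Read top to bottom, then left to right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, column_gap=100.0):
    """Group fragments into columns by x position, then read each column down.

    A new column starts whenever the next x position is more than
    `column_gap` units to the right of the previous one (assumed threshold).
    """
    columns = []
    for frag in sorted(fragments, key=lambda f: f.x):
        if columns and frag.x - columns[-1][-1].x <= column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])
    return [f.text for col in columns for f in sorted(col, key=lambda f: f.y)]
```

On this toy page, `naive_order(PAGE)` interleaves the two columns line by line, while `column_aware_order(PAGE)` reads the left column in full before the right one. Real documents, with footers, tables, and irregular layouts, need far more robust layout analysis than a single gap threshold.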

Security adds another layer of complexity. Cybersecurity firm Check Point reports that roughly one in five email-based attacks uses infected PDFs, exploiting the format's ability to embed scripts and links. The PDF's popularity ensures it remains both a universal exchange medium and a favored vector for malware.

Some entrepreneurs see the AI stumbling block as an opportunity to rebuild the infrastructure of digital documents. Factify, an Israeli startup led by Matan Gavish, is developing a format designed from the ground up to interact smoothly with large language models. "In the end, it's a closed and inefficient object, and one that's not suitable for the era of AI automation," he told Globes.

"We are building a new system, a new format, a data layer, and user experience interface applications. In order to build a connected, smart document that can support changes, it is necessary to build a lot of things from scratch. I don't know anyone else who has thought about and gone big on this."

Others argue the problem lies with AI systems rather than the format itself. Duff Johnson, who heads the nonprofit PDF Association, says developers can build models and tools that better interpret the specification rather than discard it.

Adobe has already embedded an AI assistant into Acrobat, its long-running reader application, designed to summarize, query, and extract information from documents. Google has introduced similar support within its Gemini developer tools, offering methods to convert PDFs into model-friendly text structures.

Despite its critics – and the format's occasional clumsiness on smartphones – the PDF has shown remarkable endurance. The format was standardized in 2008, when Adobe relinquished full control, and roughly 2.5 trillion PDFs are now estimated to be in circulation worldwide, spanning tax filings, research papers, government forms, and more. Each document reflects a promise that its visual appearance can outlast any single device.

Whether that promise remains viable in the AI era depends on how quickly the format and surrounding tools evolve. For now, even as startups experiment with alternatives, the PDF remains the common language of digital paperwork – one that machines, despite their growing intelligence, are still learning to read.

 
It's not like we've been ranting about PDF usability and readability as anything but an intermediate format for printers for... decades now. And how it's used, as the default document format, is horrendous for any humans. Oh wait, we did.

But if it can screw AI a bit, at least there's a bit of light in that dark hole of a shitty document format.

As for a better format, there's no need to reinvent the wheel. EPUB exists and works perfectly fine for documents, including semantics (if it's properly edited, which is unfortunately not that common). OpenDocument works very well too. And they aren't proprietary bullshit.
 
I personally like PDFs. I can make a document, convert it to PDF, and then no one can change or edit it.
 
PDF was never intended to be an "intermediate" format for printers.

Its purpose is to be immutable and retain the exact same design and format, always, on all devices, at all times. Hence the term "portable document format".

Epub is pretty much the exact opposite of this. It's designed to dynamically adjust and reformat the document according to your device and preferences. OpenDocument is for editing documents, which is, again, the opposite of what PDF does. You're confused.

And for the record, PDF came WELL before either of those two. Not that this fact has any relevance.
 
The ability of AI models to retrieve accurate information from PDF documents is critical, especially for scientific documents. LMStudio for RAG uses the "nomic-embed-text-v1.5.Q4_K_M.gguf" model, which is fast and good enough. It saves this model inside the folder C:\Users\username\AppData\Local\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF.

So, if you want to dramatically improve RAG accuracy (not only over PDFs), you can go to huggingface.co/mykor/bge-m3.gguf and download Bge-M3-567M-Q6_K.gguf. Keep it in a folder of your own and rename it to "nomic-embed-text-v1.5.Q4_K_M.gguf". Every time you update LMStudio, copy the renamed Bge model from where you saved it into that folder in the AppData path, overwriting the existing file. It is a little slower and requires a little more VRAM, but it is a lot more intelligent at information retrieval from documents.

In short, if you are not satisfied with the RAG accuracy for PDFs, replace the 137M parameter Nomic Q4 model that LMStudio uses with the 567M parameter Bge Q6 model. Both models support up to 8192 tokens.
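
For anyone who would rather not redo the copy step by hand after every update, here is a rough Python sketch of it. The bundled folder and the expected file name come from the post above; the function names are made up for illustration, and where you keep the downloaded file is up to you. It assumes LM Studio is closed while the file is replaced.

```python
import shutil
from pathlib import Path

# File name LM Studio expects, as given in the post above.
BUNDLED_NAME = "nomic-embed-text-v1.5.Q4_K_M.gguf"


def default_bundled_dir() -> Path:
    """Folder from the post above, resolved for the current Windows user."""
    return (Path.home() / "AppData" / "Local" / "LM Studio" / "resources"
            / "app" / ".webpack" / "bin" / "bundled-models" / "nomic-ai"
            / "nomic-embed-text-v1.5-GGUF")


def restore_embedding_model(replacement: Path, bundled_dir: Path) -> Path:
    """Copy `replacement` over the bundled embedding model; return the target."""
    if not replacement.is_file():
        raise FileNotFoundError(f"replacement model not found: {replacement}")
    if not bundled_dir.is_dir():
        raise FileNotFoundError(f"bundled model folder not found: {bundled_dir}")
    target = bundled_dir / BUNDLED_NAME
    shutil.copy2(replacement, target)  # overwrites the existing file
    return target
```

After an update you would call `restore_embedding_model(path_to_your_saved_file, default_bundled_dir())`. The source file can keep its original name, since the copy is written under the name LM Studio looks for.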
 
Unless they print to Microsoft PDF and then open in Word… or about 10,000 other ways to get around it…
Depends on the formatting, and on whether said person is computer literate enough to do much more than use basic Word functionality or doesn't really know what a PDF actually is.
 
Using EPUB for technical schematics and drawings sounds like a terrible idea.

"We have two EPUB-format drawings that both say 'Rev C' on them, but they have multiple differences between the two because someone edited at least one of these. Which doc is correct? Is either one correct? Quick, someone go get the original CAD file and output a new Rev C drawing - as a PDF this time"

I would expect similar issues when it comes to legal or medical forms, too. The point of PDF is that it doesn't change, no matter what device you open it on. Especially if your organization's IT department is even halfway competent and implements cryptographic signatures that either prevent editing altogether (while still allowing notation/comments/feedback) or at least alert a reader if a doc was changed after being signed.

Would it make more sense to digitally publish things like textbooks as EPUB? Sure, probably. Academic or industry papers, or government reports? Maybe. Legal documents, ranging from contracts, to releases, to technical data? Absolutely not.
 
I doubt the reason is as stated. It's much more likely a side effect of preexisting attempts to make copy'n'paste a painful experience. Obfuscating the text's linear ordering will screw with any automated reader that uses the embedded text rather than the graphical look of the text.
 
EPUB's great, but it's not a general-purpose document format, and lacks many PDF features that are critical to general-purpose applications.

PDF isn't proprietary - it became an ISO standard in 2008.
 
PDFs were basically invented to freeze a page in time, and now we’re surprised that models trained on flowing text struggle with something that’s literally a bunch of coordinates on a canvas. It’s like blaming a self-driving car for not understanding a painting of a road.
 
Hooray! Can we develop more formats that are currently beyond AI's scraping, and make them all impossible to scrape forever? I absolutely believe that AI violates copyright, whether it is common copyright protected or legally protected. It is enraging that AI misinterprets the content and purpose of my personal website. AI summarizes it in a way that is insulting, and there is no way I know of to change that.
 
Course they can. Plenty of online editors that bypass any protections.

Not sure of the original poster's context, but I use PDFs all the time at work when I want to "publish" something. The goal isn't to stop a malicious actor, but simply to make sure that someone doesn't accidentally blow up an easily editable file.
 
So have AI vibe code an agentic script to make PDFs readable for AI's too-narrow text processing.

You're welcome tech bros! I'm here to help.
I ran into this exact problem with a 35 MB PDF. It equates to around 500k tokens to ingest. Sure, you can run a local LLM, but that doesn't scale easily and you need a pretty powerful computer to keep all the data in the same context window.
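
One common workaround when a document won't fit in a single context window is to split the extracted text into pieces. A rough Python sketch follows, with the caveat that the 4-characters-per-token ratio is only an assumed rule of thumb (real tokenizers vary by model and language) and both function names are made up for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


def split_by_budget(text: str, max_tokens: int,
                    chars_per_token: float = 4.0) -> list[str]:
    """Split text into consecutive chunks that each fit the estimated budget."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    max_chars = max(1, int(max_tokens * chars_per_token))
    # Plain character slicing; a real pipeline would cut at paragraph
    # or section boundaries so that sentences are not split in half.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Under that assumed ratio, roughly 2 million characters of extracted text come to about 500k tokens, and a hypothetical 128k-token budget would split them into four chunks. Chunking trades away the ability to see the whole document at once, which is exactly the limitation described above.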
 