The humble PDF is becoming a problem for AI

PDFs are structurally hostile to large language models

By Skye Jacobs, February 26, 2026, 12:36


Looking ahead: Three decades after Adobe introduced the Portable Document Format – a design intended to preserve the appearance of printed pages across devices – PDFs are facing pressure from a completely different kind of reader: artificial intelligence. The same fixed layouts that made PDFs indispensable to human users now make them difficult for large language models to interpret. Unlike web pages or plain-text files, PDFs often contain columns, embedded graphics, and hidden metadata that confuse machine parsing systems trained to process linear text.

Researchers and developers working with large language models say these structural quirks introduce subtle but significant errors. An AI that reads lines strictly from left to right may stumble over multi-column scientific papers or misinterpret footers as part of the main text. These parsing issues can cascade into so-called "hallucinations," where a model produces inaccurate summaries or fabricates details.

Unlike basic text formats, PDFs are not built around logical document objects but around graphical coordinates – every letter is placed precisely where it should appear on a page. This design, ideal for visual consistency, means that extracting meaning requires recognizing text order, hierarchy, and context that are not explicitly represented in the file. Accessibility software for visually impaired users faces similar barriers, as do data-analysis tools that attempt to scrape tables or figures from reports.
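The reading-order problem can be illustrated with a minimal Python sketch. It is not how any particular PDF parser works: the `Fragment` structure, the function names, and the fixed two-column layout are all assumptions made for illustration. It shows only why text sorted purely by page position interleaves columns, and how a simple column heuristic restores the intended order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A piece of text positioned on a page (hypothetical structure)."""
    text: str
    x: float  # left edge, in page units
    y: float  # distance from the top of the page


def naive_order(fragments):
    """Read top-to-bottom, then left-to-right across the whole page."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, page_width, columns=2):
    """Assign each fragment to a column by x position, then read column by column.

    Assumes equal-width columns, which real documents do not guarantee.
    """
    column_width = page_width / columns

    def key(f):
        column = min(int(f.x // column_width), columns - 1)
        return (column, f.y, f.x)

    return [f.text for f in sorted(fragments, key=key)]


# A two-column page, 600 units wide, with fragments stored in arbitrary order.
page = [
    Fragment("right column, line 1", 320, 100),
    Fragment("left column, line 1", 40, 100),
    Fragment("left column, line 2", 40, 120),
    Fragment("right column, line 2", 320, 120),
]

print(naive_order(page))                         # columns interleaved
print(column_aware_order(page, page_width=600))  # left column first, then right
```

The naive ordering alternates between the two columns on every line, which is the kind of scrambled input that can mislead a model. The column-aware version depends on knowing the layout in advance, which a PDF does not state explicitly.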

Security adds another layer of complexity. Cybersecurity firm Check Point reports that roughly one in five email-based attacks uses infected PDFs, exploiting the format's ability to embed scripts and links. The PDF's popularity ensures it remains both a universal exchange medium and a favored vector for malware.

Some entrepreneurs see the AI stumbling block as an opportunity to rebuild the infrastructure of digital documents. Factify, an Israeli startup led by Matan Gavish, is developing a format designed from the ground up to interact smoothly with large language models. Speaking about the PDF, Gavish told Globes: "In the end, it's a closed and inefficient object, and one that's not suitable for the era of AI automation."

"We are building a new system, a new format, a data layer, and user experience interface applications. In order to build a connected, smart document that can support changes, it is necessary to build a lot of things from scratch. I don't know anyone else who has thought about and gone big on this."

Others argue the problem lies with AI systems rather than the format itself. Duff Johnson, who heads the nonprofit PDF Association, says developers can build models and tools that better interpret the specification rather than discard it.

Adobe has already embedded an AI assistant into Acrobat, its long-running reader application, to summarize, query, and extract information from documents. Google has introduced similar support within its Gemini developer tools, offering methods to convert PDFs into model-friendly text structures.

Despite its critics – and the format's occasional clumsiness on smartphones – the PDF has shown remarkable endurance. The format was standardized in 2008, when Adobe relinquished full control, and roughly 2.5 trillion PDFs are now estimated to be in circulation worldwide, spanning tax filings, research papers, government forms, and more. Each document reflects a promise that its visual appearance can outlast any single device.

Whether that promise remains viable in the AI era depends on how quickly the format and surrounding tools evolve. For now, even as startups experiment with alternatives, the PDF remains the common language of digital paperwork – one that machines, despite their growing intelligence, are still learning to read.
