DocLang: A New AI-Friendly Document Format for Enterprise AI Systems

by Anika Shah - Technology

DocLang: A New Open Standard to Optimize Enterprise Documents for AI

The LF AI & Data Foundation, a project under the Linux Foundation, has established a working group to develop DocLang, an open-source, AI-native document format designed to improve how large language models (LLMs) ingest and process enterprise data. Founded by a coalition including IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, the initiative addresses the technical limitations of traditional formats like PDF, HTML, and Markdown, which were created for human consumption rather than machine parsing.

Why current document formats fail AI systems

Traditional file formats often lose vital semantic and structural information when converted into tokens for AI models. According to ABBYY, a document automation company involved in the project, formats like PDF prioritize visual rendering. When an LLM processes these files, the lack of inherent structure forces the model to perform “guesswork” to understand layouts, tables, and formulas. This ambiguity increases the risk of hallucinations (cases where an AI generates inaccurate information) and consumes more compute resources than a structured input would require.

The DocLang specification aims to solve this by providing a standardized, minimal XML-based markup language. By mapping document elements directly to LLM tokens on a 1-to-1 basis, the format ensures that structural relationships and metadata remain intact during the ingestion process. This approach is designed to provide a more deterministic foundation for enterprise AI pipelines, reducing the need for the brittle, custom parsers that many organizations currently build to handle varied document types.
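To make the idea of structure-preserving markup concrete, here is a minimal Python sketch. The element and attribute names (doc, heading, paragraph, table, row, cell, level) are invented for illustration and are not taken from the DocLang specification, which is still being drafted. The sketch only shows why explicit structure removes the need for layout guesswork: headings and table cells can be read directly with a standard XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag and attribute names are assumptions made
# for this example and do NOT come from the DocLang specification.
SAMPLE = """
<doc>
  <heading level="1">Annual Report</heading>
  <paragraph>Revenue grew year over year.</paragraph>
  <table>
    <row><cell>Segment</cell><cell>Revenue</cell></row>
    <row><cell>Software</cell><cell>100</cell></row>
    <row><cell>Consulting</cell><cell>80</cell></row>
  </table>
</doc>
""".strip()


def outline(xml_text):
    """Return (level, title) pairs for every heading, in document order."""
    root = ET.fromstring(xml_text)
    return [(int(h.get("level", "1")), h.text or "") for h in root.iter("heading")]


def extract_tables(xml_text):
    """Return each table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = [
            [cell.text or "" for cell in row.findall("cell")]
            for row in table.findall("row")
        ]
        tables.append(rows)
    return tables


headings = outline(SAMPLE)
tables = extract_tables(SAMPLE)
print(headings)
for row in tables[0]:
    print(row)
```

Because row and cell boundaries are stated in the markup, no heuristics about column positions or font sizes are needed, which is the kind of custom parsing logic the format is meant to replace.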

How DocLang impacts AI performance and costs

Adopting a standardized format can lead to measurable improvements in both performance and cost-efficiency. Testing conducted by ABBYY using an IBM 2025 annual report found that a DocLang-formatted version required approximately 37% fewer input tokens than the standard PDF. The DocLang version also cut processing latency from 4.2 seconds to 2.7 seconds, a reduction of roughly 36%, while improving data extraction accuracy.

The financial impact of these efficiencies is significant at scale. Because enterprise AI costs are often tied to the number of tokens processed, inefficient file parsing creates a hidden “token tax.” By minimizing the tokens required to describe a document’s layout and content, organizations can reduce the overhead associated with using frontier models for document-heavy workflows. The LF AI & Data Foundation emphasizes that this format is open and free, encouraging broader adoption to replace the fragmented landscape of proprietary parsing tools.
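As a rough illustration of the “token tax,” the following Python sketch applies the 37% input-token reduction reported in the ABBYY test to a monthly workload. The workload size and the price per million tokens are hypothetical placeholders, not figures from the article or from any vendor's price list, and the single test result may not generalize to other documents.

```python
def token_savings(baseline_tokens, reduction_pct=37, usd_per_million_tokens=3.0):
    """Estimate input-token cost before and after a percentage reduction.

    reduction_pct defaults to the 37% figure from the ABBYY test;
    usd_per_million_tokens is a hypothetical price chosen for illustration.
    """
    # Integer arithmetic keeps the token counts exact.
    reduced_tokens = baseline_tokens * (100 - reduction_pct) // 100
    baseline_cost = baseline_tokens / 1_000_000 * usd_per_million_tokens
    reduced_cost = reduced_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "baseline_tokens": baseline_tokens,
        "reduced_tokens": reduced_tokens,
        "baseline_cost_usd": baseline_cost,
        "reduced_cost_usd": reduced_cost,
        "saved_usd": baseline_cost - reduced_cost,
    }


# Hypothetical workload: one billion input tokens per month.
estimate = token_savings(1_000_000_000)
for key, value in estimate.items():
    print(f"{key}: {value:,.2f}")
```

The saving scales linearly with volume under these assumptions, so the same percentage applies whether the workload is a pilot or a production pipeline.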

Comparison of document parsing approaches

Feature             PDF / Traditional Formats              DocLang
Primary Goal        Visual rendering for humans            Machine-readable semantic structure
Token Efficiency    Lower (requires layout inference)      Higher (1-to-1 token mapping)
Data Loss           Frequent (layout/metadata stripping)   Minimal (lossless preservation)
System Requirement  Custom, brittle parsing scripts        Standardized, interoperable schema

What happens next for the DocLang standard

The DocLang working group is currently in the early stages of building out the specification and is actively inviting technology providers and enterprises to contribute. The effort builds upon earlier open-source projects like IBM’s Docling, a toolkit released in late 2024 designed to simplify document conversion into structured formats. While Docling provides the conversion mechanism, DocLang serves as the standardized output format for exchanging that data across different AI systems.

Industry observers note that while adoption remains in its infancy, the push for standardized data formats is a natural evolution of the enterprise AI market. As companies shift from experimental AI pilots to large-scale production, the demand for reliable, cost-effective data ingestion is expected to favor open standards that mitigate the limitations of legacy office document formats.
