PDFs for LLM

Extracting content from PDFs while retaining all the semantic information carried by text, tables, charts, graphs, and other visual elements, in a form suited to consumption by LLMs (especially when used with RAG), is still a hard problem.

There are many companies, research teams, and open source projects attempting to solve this, both at the document level and as a pipeline.

Some that are at the top of my head right now (2025-08):

Open Source Projects

Docling by IBM Research Zurich, and SmolDocling; see (Team, 2024)
Dolphin by ByteDance; see (Feng et al., 2025)
roe-ai/vectorless by ^e9eac9 — “PDF chatbot that uses no vector embeddings or traditional RAG. Instead, it leverages Large Language Models for intelligent document selection and page relevance detection, providing a completely stateless and privacy-first experience.”
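
The "page relevance detection" idea in the vectorless quote above can be sketched roughly as below. This is a minimal illustration of the general idea, not roe-ai's implementation: `select_relevant_pages`, `ask_llm`, and `fake_llm` are hypothetical names I made up, and `fake_llm` is a keyword stub standing in for a real model call so the snippet runs offline.

```python
from typing import Callable, List

# Hypothetical type: any function that takes a prompt and returns the model's text reply.
# A real version would wrap an LLM API call; that part is assumed, not shown.
AskLLM = Callable[[str], str]


def select_relevant_pages(question: str, pages: List[str], ask_llm: AskLLM) -> List[int]:
    """Return 1-based page numbers that the model judges relevant to the question.

    No embeddings or vector index: each page's text is shown to the model directly.
    """
    relevant = []
    for number, text in enumerate(pages, start=1):
        prompt = (
            "Answer YES or NO. Does this page help answer the question?\n"
            f"Question: {question}\n"
            f"Page {number}:\n{text}"
        )
        if ask_llm(prompt).strip().upper().startswith("YES"):
            relevant.append(number)
    return relevant


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: says YES when the page text mentions 'revenue'."""
    # The prompt has three header lines; everything after them is the page text.
    page_text = prompt.split("\n", 3)[3]
    return "YES" if "revenue" in page_text.lower() else "NO"


pages = [
    "Intro and table of contents",
    "Revenue grew 12% year over year",
    "Appendix",
]
selected = select_relevant_pages("What was revenue growth?", pages, fake_llm)
```

One model call per page is simple but expensive; a real system would presumably batch pages or pick documents first, which is the "document selection" half of the quote.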

Supporting Libraries

PyPDF2
pypdfium2-team/pypdfium2: Python bindings to PDFium, reasonably cross-platform (mentioned in the Docling paper, and used in datalab-to/marker)
todo (one used in tensorlake)

Companies

See ^fb946e

Datalab; see datalab-to/marker: convert PDF to Markdown + JSON quickly with high accuracy

References

Feng, H., Wei, S., Fei, X., Shi, W., Han, Y., Liao, L., Lu, J., Wu, B., Liu, Q., Lin, C., et al. (2025). Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting. arXiv preprint arXiv:2505.14059.

Team, D. S. (2024). Docling Technical Report (1.0.0) [Techreport]. https://doi.org/10.48550/arXiv.2408.09869

btbytes.com
