Extracting content from documents while retaining all the semantic information carried by text, tables, charts, graphs, and other visual elements, in a form best suited for consumption by LLMs (especially in conjunction with RAG), is still a hard problem.
Many companies, research teams, and open source projects are attempting to solve this, both at the document level and as a pipeline.
Some that are off the top of my head right now (2025-08):
Open Source Projects
- Docling by IBM Zurich Research, and SmolDocling; see (Team, 2024)
- Dolphin by ByteDance; see (Feng et al., 2025)
- roe-ai/vectorless by ^e9eac9 — “PDF chatbot that uses no vector embeddings or traditional RAG. Instead, it leverages Large Language Models for intelligent document selection and page relevance detection, providing a completely stateless and privacy-first experience.”
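
To make the "no embeddings, LLM picks the pages" idea from the last item concrete, here is a minimal sketch of page relevance detection. It is not roe-ai/vectorless's actual implementation: the function names, the prompt wording, and the `ask_llm` callable are all assumptions made for illustration, and the model is replaced by a toy keyword stub so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that takes a prompt and returns the model's reply.
AskLLM = Callable[[str], str]


def select_relevant_pages(question: str, pages: Dict[int, str], ask_llm: AskLLM) -> List[int]:
    """Ask the model, page by page, whether the page helps answer the question."""
    relevant = []
    for number, text in sorted(pages.items()):
        prompt = (
            "Answer YES or NO. Does this page help answer the question?\n"
            f"Question: {question}\n"
            f"Page {number}:\n{text}"
        )
        if ask_llm(prompt).strip().upper().startswith("YES"):
            relevant.append(number)
    return relevant


def keyword_stub(prompt: str) -> str:
    """Toy stand-in for a real LLM: YES if the page shares a longer word with the question."""
    _header, question_line, _page_label, page_text = prompt.split("\n", 3)
    question = question_line[len("Question: "):]
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 4}
    page_words = {w.strip("?.,").lower() for w in page_text.split()}
    return "YES" if keywords & page_words else "NO"


pages = {
    1: "Quarterly revenue grew by ten percent.",
    2: "The office moved to a new building.",
    3: "Revenue table by region.",
}
print(select_relevant_pages("What was the revenue growth?", pages, keyword_stub))  # [1, 3]
```

Nothing is stored between calls, which is what makes this kind of approach stateless; the trade-off is one model call per page per question, where a vector index would do a single lookup.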
Supporting Libraries
- PyPDF2 (development has since moved back to pypdf)
- pypdfium2-team/pypdfium2: Python bindings to PDFium, reasonably cross-platform (mentioned in the Docling paper and used in datalab-to/marker)
- todo (one used in tensorlake)
Companies
See ^fb946e
References
Feng, H., Wei, S., Fei, X., Shi, W., Han, Y., Liao, L., Lu, J., Wu, B., Liu, Q., Lin, C., et al. (2025). Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting. arXiv preprint arXiv:2505.14059.
Team, D. S. (2024). Docling Technical Report (1.0.0) [Techreport]. https://doi.org/10.48550/arXiv.2408.09869