Extracting content while retaining all the semantic information presented in text, tables, charts, graphs, and other visual elements, in a form best suited for consumption by LLMs (especially in conjunction with RAG), is still a hard problem.

Many companies, research teams, and open source projects are attempting to solve this, both at the document level and as a pipeline.

Some of the ones at the top of my head right now (2025-08) are:

Open Source Projects

  • Docling by IBM Research Zurich, and SmolDocling; see (Team, 2024)
  • Dolphin by ByteDance; see (Feng et al., 2025)
  • roe-ai/vectorless by ^e9eac9 — “PDF chatbot that uses no vector embeddings or traditional RAG. Instead, it leverages Large Language Models for intelligent document selection and page relevance detection, providing a completely stateless and privacy-first experience.”
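
To make the "no vector embeddings" idea in the last item concrete, here is a minimal sketch of LLM-based page relevance detection. It is not the project's actual code: the function names, the prompt wording, and the `ask_llm` callable are all assumptions for illustration, and the keyword stub only stands in for a real model so the example runs offline.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: takes a prompt string, returns the model's text reply.
AskLLM = Callable[[str], str]


def select_relevant_pages(
    question: str, pages: Sequence[str], ask_llm: AskLLM
) -> List[int]:
    """Return the indices of pages the model judges relevant to the question.

    No embeddings or vector index are involved: every page is shown to the
    model together with the question, and the model answers YES or NO.
    """
    relevant: List[int] = []
    for index, page_text in enumerate(pages):
        prompt = (
            "Answer YES or NO. Does the page below help answer the question?\n"
            f"Question: {question}\n"
            f"Page:\n{page_text}"
        )
        reply = ask_llm(prompt).strip().upper()
        if reply.startswith("YES"):
            relevant.append(index)
    return relevant


def keyword_stub(prompt: str) -> str:
    """Offline stand-in for a real model (assumption, for demonstration only)."""
    page = prompt.split("Page:\n", 1)[1]
    return "YES" if "revenue" in page.lower() else "NO"


pages = [
    "Revenue grew 10% in 2024.",
    "Our office moved to a new building.",
    "Table 3: revenue by region.",
]
selected = select_relevant_pages("What was the revenue?", pages, keyword_stub)
print(selected)
```

In a real setup, `ask_llm` would wrap a call to whichever model is in use. The trade-off is one model call per page instead of an index lookup, which keeps the process stateless but costs more per query.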

Supporting Libraries

Companies

See ^fb946e

References

Feng, H., Wei, S., Fei, X., Shi, W., Han, Y., Liao, L., Lu, J., Wu, B., Liu, Q., Lin, C., et al. (2025). Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting. arXiv preprint arXiv:2505.14059.
Team, D. S. (2024). Docling Technical Report (1.0.0) [Techreport]. https://doi.org/10.48550/arXiv.2408.09869