Vision RAG: Enabling Search on Any Document
Information comes in many shapes and forms. While retrieval-augmented generation (RAG) primarily focuses on plain text, it overlooks vast amounts of data along the way. Most enterprise knowledge resides in complex documents, slides, graphics, and other multimodal sources. Yet, extracting useful information from these formats using optical character recognition (OCR) or other parsing techniques is often low-fidelity, brittle, and expensive.
Vision RAG makes complex documents—including their figures and tables—searchable by using multimodal embeddings, eliminating the need for complex and costly text extraction. This guide explores how Voyage AI’s latest model powers this capability and provides a step-by-step implementation walkthrough.
Vision RAG: Building upon text RAG
Vision RAG is an evolution of traditional RAG built on the same two components: retrieval and generation.
In traditional RAG, unstructured text data is indexed for semantic search. At query time, the system retrieves relevant documents or chunks and appends them to the user’s prompt so the large language model (LLM) can produce more grounded, context-aware answers.
Figure 1. Text RAG with Voyage AI and MongoDB.
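As a rough sketch, the query-time half of that flow could look like the code below. The database, collection, index name, and the voyage-3 text embedding model are illustrative assumptions rather than details taken from the diagram.

```python
# Query-time sketch of traditional text RAG (all names below are hypothetical).
import os

import voyageai
from pymongo import MongoClient

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
collection = MongoClient(os.environ["MONGODB_URI"])["rag_db"]["text_chunks"]

question = "What were the key findings on AI adoption?"  # sample user question

# 1. Embed the user's question with a Voyage AI text embedding model.
query_vector = vo.embed([question], model="voyage-3", input_type="query").embeddings[0]

# 2. Retrieve the most similar chunks with Atlas Vector Search.
chunks = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",   # name of the Atlas Vector Search index
            "path": "embedding",       # field holding the stored chunk embeddings
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    {"$project": {"_id": 0, "text": 1}},
])

# 3. Append the retrieved chunks to the prompt before calling the LLM.
context = "\n\n".join(doc["text"] for doc in chunks)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```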
Enterprise data, however, is rarely just clean plain text. Critical information often lives in PDFs, slides, diagrams, dashboards, and other visual formats. Today, this is typically handled by parsing tools and OCR services. Those approaches create several problems:
Significant engineering effort to handle many file types, layouts, and edge cases
Accuracy issues across different OCR or parsing setups
High costs when scaled across large document collections
Next-generation multimodal embedding models provide a simpler and more cost-effective alternative. They can ingest not only text but also images or screenshots of complex document layouts, and generate vector representations that capture the meaning and structure of that content.
Vision RAG uses these multimodal embeddings to index entire documents, slides, and images directly, even when they contain interleaved text and images, making them searchable via vector search without heavy parsing or OCR. At query time, the system retrieves the most relevant visual assets and feeds them, along with the text prompt, into a vision-capable LLM to inform its answer.
Figure 2. Vision RAG with Voyage AI and MongoDB.
As a result, vision RAG gives LLM-based systems native access to rich, multimodal enterprise data, while reducing engineering complexity and avoiding the performance and cost pitfalls of traditional text-focused preprocessing pipelines.
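To make this concrete, the indexing half of such a pipeline might look like the following sketch. The file names, database, and collection are hypothetical, and the call assumes the Voyage AI Python client's multimodal embedding endpoint, which accepts lists that mix text strings and PIL images.

```python
# Indexing sketch: embed page screenshots with voyage-multimodal-3 and store them
# in MongoDB for vector search (file, database, and collection names are hypothetical).
import os

import voyageai
from PIL import Image
from pymongo import MongoClient

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
collection = MongoClient(os.environ["MONGODB_URI"])["rag_db"]["pages"]

page_files = ["page_01.png", "page_02.png", "page_03.png"]
images = [Image.open(path) for path in page_files]

# Each input is a list that may interleave text strings and PIL images;
# here each input is a single page screenshot.
result = vo.multimodal_embed(
    inputs=[[img] for img in images],
    model="voyage-multimodal-3",
    input_type="document",
)

# Store one document per page, with its embedding, for Atlas Vector Search.
collection.insert_many([
    {"page": path, "embedding": emb}
    for path, emb in zip(page_files, result.embeddings)
])
```

Because whole page screenshots are embedded, there is no chunking or layout-specific parsing step; retrieval quality comes from the embedding model itself.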
Voyage AI’s latest multimodal embedding model
The multimodal embedding model is where the magic happens. Historically, building such a system was challenging due to the modality gap. Early multimodal embedding models, such as contrastive language-image pretraining (CLIP)-based models, processed text and images using separate encoders. Because the outputs were generated independently, results were often biased toward one modality, making retrieval across mixed content unreliable. These models also struggled to handle interleaved text and images, a critical limitation for vision RAG in real-world environments.
Voyage-multimodal-3 adopts an architecture similar to modern vision-capable LLMs. It uses a single encoder for both text and visual inputs, closing the modality gap and producing unified representations. This ensures that textual and visual features are treated consistently and accurately within the same vector space.
Figure 3. CLIP-based architecture vs. voyage-multimodal-3’s architecture.
This architectural shift enables true multimodal retrieval, making vision RAG a viable and efficient solution. For more details, refer to the voyage-multimodal-3 blog announcement.
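A quick way to see what this shared vector space buys you is to embed a text query and a chart screenshot with the same model and compare them directly. The file name and query in the sketch below are placeholders.

```python
# Sketch: a text query and an image land in the same vector space,
# so they can be compared directly (chart.png and the query are placeholders).
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

text_vec, image_vec = vo.multimodal_embed(
    inputs=[["Which language grew fastest this year?"], [Image.open("chart.png")]],
    model="voyage-multimodal-3",
).embeddings

text_vec, image_vec = np.array(text_vec), np.array(image_vec)
similarity = float(text_vec @ image_vec / (np.linalg.norm(text_vec) * np.linalg.norm(image_vec)))
print(f"Cosine similarity between query and chart: {similarity:.3f}")
```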
Implementation of vision RAG
Let’s walk through a simple example of implementing vision RAG. Traditional text-based RAG often struggles with complex documents, such as slide decks, financial reports, or technical papers, where critical information is locked inside charts, diagrams, and figures.
By using Voyage AI’s multimodal embedding models alongside Anthropic’s vision-capable LLMs, we can bridge this gap. We will treat images (or screenshots of document pages) as first-class citizens, retrieving them directly based on their visual and semantic content and passing them to a vision-capable LLM for reasoning.
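Putting the two halves together, the generation step could look like the sketch below: the top retrieved page image is base64-encoded and sent to Anthropic’s Messages API together with the question. The model ID and file path are illustrative, not taken from the tutorial notebook.

```python
# Generation sketch: send the top retrieved page image plus the question
# to a vision-capable Claude model (model ID and file path are illustrative).
import base64
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

question = "What does the chart say about AI-related repository growth?"
top_page = "page_02.png"  # e.g., the best hit returned by vector search

with open(top_page, "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": f"Using the page above, answer: {question}"},
        ],
    }],
)
print(response.content[0].text)
```

The same content list can carry several retrieved pages if the vector search returns more than one relevant hit.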
To demonstrate this, we will build a pipeline that extracts insights from the charts and figures of the GitHub Octoverse 2025 survey, which stands in for the kind of visually rich information typically found in enterprise data.
The Jupyter Notebook for this tutorial is available on GitHub in our GenAI Showcase repository. To follow along, run the notebook in Google Colab (or similar), and refer to this tutorial for explanations of key code blocks.
Step 1: Install necessary libraries
First, we need to set up our Python environment. We will install the voyageai client for generating embeddings and the anthropic client for our generative model.
….
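For reference, a minimal install cell for this kind of notebook might look like the following; pymongo and Pillow are assumptions based on the rest of the pipeline rather than a list copied from the notebook.

```python
# Minimal dependency install for this walkthrough (run inside the notebook).
# pymongo and Pillow are assumed for the MongoDB and image-handling steps.
!pip install --quiet voyageai anthropic pymongo pillow
```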