Smarter Caching in AI Apps: Building Semantic Caching with Spring Boot and Ollama

As developers, we’re constantly looking for ways to make our applications faster and more efficient. When it comes to AI-powered apps, one of the biggest bottlenecks is the response time and cost of LLM queries. Every time a user asks a question, the LLM has to generate a response from scratch. That might be fine for one or two users — but scale it up, and you’re looking at serious latency and infrastructure load.

The natural solution is caching. But there’s a catch: in AI apps, users rarely phrase their questions the same way twice. Traditional caches only recognize exact matches — which makes them pretty useless when someone rewords their query.

That’s where semantic caching comes in. Instead of checking for identical strings, semantic caching checks for similar meaning. This opens the door to much smarter reuse of answers, saving both time and compute.

Traditional Caching vs Semantic Caching

Let’s take a simple example:

  • Query 1: “What is Spring Boot?”
  • Query 2: “Can you explain Spring Boot?”

A traditional cache would treat these as two completely different keys and miss the cache. That means two separate (and costly) LLM calls.

Semantic caching, on the other hand, looks at embeddings — vector representations of text — and recognizes that both queries are essentially asking the same thing. Instead of running the model again, it can serve the cached response.

The result?

  • Faster responses.
  • Lower compute costs.
  • A better user experience.
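
To see why exact matching breaks down, here is a toy illustration (not from the project code) of a traditional key-based cache missing a reworded question:

import java.util.HashMap;
import java.util.Map;

public class ExactMatchCacheDemo {
    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("What is Spring Boot?", "Spring Boot is a framework that simplifies Spring setup.");

        // The reworded question is a different string key, so the lookup misses
        // even though the intent is identical, and another LLM call is needed.
        System.out.println(cache.get("Can you explain Spring Boot?")); // prints: null
    }
}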

What Are Embeddings?

Embeddings are a way of turning text into numbers that represent meaning.

Think of it like plotting sentences on a huge invisible map. Two queries with similar meaning will land close to each other, even if the words are different.

For example:

  • “What’s the capital of France?”
  • “Paris is the capital of which country?”

Different words, same intent. Their embeddings would be neighbors on this semantic map.

This is the key to semantic caching: instead of comparing raw strings, we compare distances in vector space. If two embeddings are close enough, we assume the meaning is the same and reuse the cached result.
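
Under the hood, “close enough” is usually measured with cosine similarity between the two embedding vectors (the pgvector query later in this article computes exactly that, as 1 minus the cosine distance). A tiny, self-contained illustration with toy vectors:

public class CosineSimilarityDemo {

    // Cosine similarity = dot(a, b) / (|a| * |b|); values close to 1.0
    // mean the two embeddings point in nearly the same direction.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings" (real ones have 768+ dimensions).
        float[] q1 = {0.9f, 0.1f, 0.2f};
        float[] q2 = {0.85f, 0.15f, 0.25f};
        System.out.println(cosine(q1, q2)); // ~0.99, so we would treat these as the same question
    }
}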

Choosing an Embedding Model

Not all embeddings are created equal. The quality of your cache depends heavily on the model you use to generate embeddings.

In our project, we used nomic-embed-text — a lightweight embedding model supported in Ollama. It outputs 768-dimensional vectors and is fast enough to run locally.

Other options worth knowing about:

  • mxbai-embed-large: Higher dimensionality (1024) and generally more accurate semantic similarity. Great for when precision matters more than speed.
  • OpenAI’s text-embedding-3-large: External API, very strong benchmark performance, but comes with cost and network latency.

Choosing an embedding model is all about trade-offs:

  • Do you need speed (faster embeddings, smaller vectors)?
  • Or accuracy (better similarity detection, larger vectors)?

For most local experiments and smaller apps, nomic-embed-text is a solid starting point.

Setting Up the Spring Boot + Ollama Project

The stack we used for this project:

  • Spring Boot — for wiring the application together.
  • Postgres + pgvector — for storing embeddings and running similarity search.
  • Ollama — to serve both the embedding model (nomic-embed-text) and the LLM (llama3.1:8b).
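
Both models can be pulled ahead of time with ollama pull nomic-embed-text and ollama pull llama3.1:8b. The exact configuration isn’t shown here, but a minimal application.properties for the datasource might look like this (the values are placeholders for a local setup):

# Postgres with the pgvector extension installed (adjust URL and credentials for your setup)
spring.datasource.url=jdbc:postgresql://localhost:5432/semantic_cache
spring.datasource.username=postgres
spring.datasource.password=postgres

# Ollama is assumed to be running locally on its default port (11434)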

The database schema is simple. Our table, llm_cache, stores:

  • question – the user query.
  • embedding – the vector representation.
  • answer – the cached response.

Dependencies:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-jdbc</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>

    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
    </dependency>

    <dependency>
        <groupId>com.pgvector</groupId>
        <artifactId>pgvector</artifactId>
        <version>0.1.4</version>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-validation</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

Schema:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS llm_cache (
    id SERIAL PRIMARY KEY,
    question TEXT,
    embedding vector(768), -- 768 dimensions; change this to match the embedding model you choose
    answer TEXT
);
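
A full-table scan is fine while the cache is small, but as it grows you may want an approximate nearest-neighbor index. With pgvector 0.5 or newer, an HNSW index over cosine distance looks like this (optional, and not part of the minimal setup above):

-- Speeds up ORDER BY embedding <=> ... queries on larger tables
CREATE INDEX IF NOT EXISTS llm_cache_embedding_idx
    ON llm_cache USING hnsw (embedding vector_cosine_ops);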

Building the Semantic Cache Service

The heart of the project is the SemanticCacheService. Its job is to:

  1. Convert an incoming query into an embedding (via Ollama).
  2. Check llm_cache for the nearest neighbor embedding.
  3. If similarity ≥ threshold, return the cached answer.
  4. If not, query the LLM, save the new embedding and answer, then return it.

This balance of reuse vs generation makes the system efficient without sacrificing accuracy.

@Service
public class SemanticCacheService {

    private final OllamaClient ollama;
    private final VectorStoreRepository repo;
    private static final double THRESHOLD = 0.80; // similarity threshold

    public SemanticCacheService(OllamaClient ollama, VectorStoreRepository repo) {
        this.ollama = ollama;
        this.repo = repo;
    }

    public String getAnswer(String question) {
        float[] emb = ollama.getEmbedding(question);
        Optional<Map<String, Object>> nearest = repo.findNearest(emb);
        if (nearest.isPresent()) {
            double sim = ((Number) nearest.get().get("similarity")).doubleValue();
            if (sim >= THRESHOLD) {
                return "(cached) " + nearest.get().get("answer").toString();
            }
        }
        String answer = ollama.generateAnswer(question);
        repo.save(question, emb, answer);
        return answer;
    }
}
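
The OllamaClient used above isn’t shown in the original write-up. A minimal sketch against Ollama’s REST API (POST /api/embeddings for vectors, POST /api/generate for completions, both served on the default http://localhost:11434) could look roughly like this:

import java.util.List;
import java.util.Map;

import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class OllamaClient {

    private final RestTemplate rest = new RestTemplate();
    private final String baseUrl = "http://localhost:11434"; // default Ollama endpoint

    @SuppressWarnings("unchecked")
    public float[] getEmbedding(String text) {
        Map<String, Object> request = Map.of("model", "nomic-embed-text", "prompt", text);
        Map<String, Object> response = rest.postForObject(baseUrl + "/api/embeddings", request, Map.class);
        List<Double> values = (List<Double>) response.get("embedding");
        float[] embedding = new float[values.size()];
        for (int i = 0; i < values.size(); i++) {
            embedding[i] = values.get(i).floatValue();
        }
        return embedding;
    }

    @SuppressWarnings("unchecked")
    public String generateAnswer(String question) {
        // stream=false returns the whole completion as a single JSON response
        Map<String, Object> request = Map.of("model", "llama3.1:8b", "prompt", question, "stream", false);
        Map<String, Object> response = rest.postForObject(baseUrl + "/api/generate", request, Map.class);
        return (String) response.get("response");
    }
}

The actual client may well look different (it could, for instance, use Spring AI’s Ollama support instead of raw HTTP calls); the key point is that the service only needs getEmbedding and generateAnswer.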

VectorStoreRepository

@Repository
public class VectorStoreRepository {

    private final JdbcTemplate jdbc;

    public VectorStoreRepository(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public Optional<Map<String, Object>> findNearest(float[] embedding) {
        String emb = VectorUtils.toPgVectorLiteral(embedding);
        String sql = "SELECT id, question, answer, 1 - (embedding <=> ?::vector) AS similarity FROM llm_cache ORDER BY embedding <=> ?::vector LIMIT 1";
        try {
            Map<String, Object> row = jdbc.queryForMap(sql, emb, emb);
            return Optional.of(row);
        } catch (org.springframework.dao.EmptyResultDataAccessException ex) {
            return Optional.empty();
        }
    }

    public void save(String question, float[] embedding, String answer) {
        String emb = VectorUtils.toPgVectorLiteral(embedding);
        String sql = "INSERT INTO llm_cache (question, embedding, answer) VALUES (?, ?::vector, ?)";
        jdbc.update(sql, question, emb, answer);
    }
}
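
VectorUtils.toPgVectorLiteral, referenced above but not shown in the original, only needs to render a float[] as pgvector’s text literal (e.g. [0.12,0.34,0.56]), which the ?::vector cast then parses. A minimal version:

public final class VectorUtils {

    private VectorUtils() {
    }

    // Renders a float[] as a pgvector text literal such as "[0.1,0.2,0.3]".
    public static String toPgVectorLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }
}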

Putting It All Together: Caching in Action

The application exposes a simple /api/ask endpoint via a Spring controller.

The flow looks like this:

  1. User asks: “What is Spring Boot?” → goes to LLM → result cached in llm_cache.
  2. User asks: “Explain Spring Boot” → semantic cache finds it close enough → returns cached answer instantly.

This demonstrates the real-world payoff of semantic caching: reducing repeated LLM calls for semantically similar queries.

@RestController
@RequestMapping("/api")
public class FaqController {

    private final SemanticCacheService service;

    public FaqController(SemanticCacheService service) {
        this.service = service;
    }

    @PostMapping("/ask")
    public ResponseEntity<Map<String, String>> ask(@RequestBody Map<String, String> body) {
        String q = body.get("question");
        if (q == null || q.isBlank()) {
            return ResponseEntity.badRequest().body(Map.of("error", "question required"));
        }
        String ans = service.getAnswer(q);
        return ResponseEntity.ok(Map.of("answer", ans));
    }
}
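
With the application running on Spring Boot’s default port, the endpoint can be exercised with a quick curl call; ask the same question twice, reworded the second time, and the second response comes back with the “(cached) ” prefix added by the service:

curl -X POST http://localhost:8080/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Spring Boot?"}'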

Lessons Learned and Best Practices

Some key takeaways from building this:

  • Threshold tuning matters — set it too high, and you’ll miss reuse opportunities. Too low, and you’ll risk serving wrong answers.
  • Embeddings aren’t one-size-fits-all — try a few models and compare. nomic-embed-text was good enough for local tests, while mxbai-embed-large offered better accuracy.
  • Database choice impacts scale — pgvector works great locally, but for larger apps, specialized vector databases like Pinecone, Weaviate, or Milvus might be a better fit.

All of the code shown above can be found on GitHub.
👉 If you found this article useful, make sure to follow us — we’ll be sharing more deep-dive tutorials, experiments, and engineering stories around LLMs, embeddings, and scalable AI system design.

