Why We Removed All Reactive Code From Our Spring Boot Services (And Throughput Went Up)

WebFlux promised non-blocking glory. It delivered six months of stack traces nobody on the team could read. We went back to blocking IO. Latency dropped. Sanity returned.

This is the decision log. Not a narrative. The actual sequence of decisions, with dates, that took us from “rewrite everything in WebFlux” to “remove all of it.” I am publishing it because every reactive-rewrite post I read before we started was a success story, and none of them were true for us.

March 2024 — Decision: adopt WebFlux for all new services

Reasoning at the time: our services were IO-bound. They mostly called other services and a database, then returned. The reactive model promised we could handle the same load with far fewer threads, because threads would not block on IO. The benchmark in the conference talk showed 5x throughput. We had a scaling problem. This looked like the answer.

What we did not write down, and should have: we had not measured whether threads were actually our bottleneck. We assumed it. The benchmark in the talk used a synthetic workload. We did not check whether ours resembled it.

Decision made by: me, plus the architect, in a 45-minute meeting. We were confident. That confidence is the first thing I would change.

April 2024 — Decision: rewrite the order service first

Reasoning: the order service was our highest-traffic service. If reactive helped anywhere, it would help here. We converted controllers to return Mono and Flux. We swapped RestTemplate for WebClient. We swapped the JDBC repository for R2DBC, because a reactive controller calling a blocking JDBC driver blocks the event loop, which defeats the entire point, and we were thorough.

First problem, week two: R2DBC did not support a query we needed. A correlated subquery with a window function. The JDBC driver handled it in one line. R2DBC required us to pull data into the application and process it in memory. We wrote 60 lines to replace 1. We told ourselves this was an edge case.

It was not an edge case. It was the first instance of a pattern: every reactive library we adopted was a less mature, less complete version of the blocking library it replaced. We met this pattern eleven more times over the next six months.

May 2024 — Decision: continue the rewrite despite friction

Reasoning: sunk cost, although we did not call it that. We had converted the order service. Reverting felt like admitting the March decision was wrong. We pressed on. We converted the payment service and the inventory service.

What broke: the team’s ability to debug. A stack trace in reactive code does not show you the logical call path. It shows you the internals of Reactor's operators and schedulers. reactor.core.publisher.FluxFlatMap.subscribe, reactor.core.publisher.MonoFlatMap, forty frames of operator plumbing, and somewhere in there, one frame of our code, if you are lucky. A NullPointerException that would have taken thirty seconds to locate in blocking code took an afternoon. We told ourselves we would get better at reading reactive stack traces. We did, slightly. It never got good.
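The missing call path is not unique to Reactor, and it can be reproduced with the JDK alone. The sketch below is an analogue, not code from the services described here: it uses CompletableFuture in place of Mono, and the names controllerBlocking, serviceAsync and parseQuantity are invented for illustration. It throws the same NullPointerException in both styles and checks whether the controller method still appears in the trace.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncStackTraceSketch {

    // Hypothetical business logic that fails the same way in both styles.
    static int parseQuantity(String raw) {
        return raw.length(); // raw is null below, so this throws NullPointerException
    }

    // Blocking style: controller -> service -> parseQuantity, all on one thread.
    static int controllerBlocking() {
        return serviceBlocking();
    }

    static int serviceBlocking() {
        return parseQuantity(null);
    }

    // Asynchronous style: controller and service only assemble the work.
    // The work itself runs later, on a pool thread.
    static CompletableFuture<Integer> controllerAsync(ExecutorService pool) {
        return serviceAsync(pool);
    }

    static CompletableFuture<Integer> serviceAsync(ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> parseQuantity(null), pool);
    }

    // True if the throwable's stack trace contains a frame for exactly this method name.
    static boolean hasFrame(Throwable t, String methodName) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    static Throwable failureFromBlocking() {
        try {
            controllerBlocking();
            throw new IllegalStateException("expected a failure");
        } catch (NullPointerException e) {
            return e;
        }
    }

    static Throwable failureFromAsync() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            controllerAsync(pool).join();
            throw new IllegalStateException("expected a failure");
        } catch (CompletionException e) {
            return e.getCause(); // the NullPointerException raised on the pool thread
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking trace shows controller: "
                + hasFrame(failureFromBlocking(), "controllerBlocking"));
        System.out.println("async trace shows controller: "
                + hasFrame(failureFromAsync(), "controllerAsync"));
    }
}
```

In the asynchronous case the trace starts at the pool thread's run loop, so the method that handled the request is gone from it. A real Reactor pipeline adds operator frames on top of that, which this sketch does not attempt to reproduce.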

June 2024 — Decision: add reactor-tools and BlockHound to find blocking calls

Reasoning: we discovered, in production, that one of our “reactive” services was blocking the event loop. Someone had called a blocking library inside a map() operator. Under load, the entire service stalled, because the event loop has very few threads and one blocked thread is catastrophic in a way it never is in a thread-per-request model.
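A minimal sketch of that failure mode, using only the JDK. It is a simplification under stated assumptions: a fixed pool of one thread stands in for the event loop, a cached pool stands in for thread-per-request, and a latch stands in for the blocking library call. None of it is Netty or Reactor code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class EventLoopStallSketch {

    // Submits one task that blocks, then some trivial tasks, and counts how many
    // trivial tasks complete while the blocking task is still stuck.
    static int quickTasksFinishedWhileOneBlocks(ExecutorService pool, int quickTasks) {
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stand-in for a blocking call made inside an operator.
            pool.submit(() -> {
                release.await();
                return null;
            });
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < quickTasks; i++) {
                final int id = i;
                pending.add(pool.submit(() -> id));
            }
            int finished = 0;
            for (Future<Integer> f : pending) {
                try {
                    f.get(250, TimeUnit.MILLISECONDS);
                    finished++;
                } catch (TimeoutException stalled) {
                    // Still queued behind the blocked thread.
                }
            }
            return finished;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // unblock the stuck task so the pool can shut down
            pool.shutdown();
        }
    }

    // One thread shared by every request, as a stand-in for an event loop.
    static int finishedOnEventLoop(int quickTasks) {
        return quickTasksFinishedWhileOneBlocks(Executors.newFixedThreadPool(1), quickTasks);
    }

    // A fresh thread whenever none is idle, as a stand-in for thread-per-request.
    static int finishedOnThreadPerRequest(int quickTasks) {
        return quickTasksFinishedWhileOneBlocks(Executors.newCachedThreadPool(), quickTasks);
    }

    public static void main(String[] args) {
        System.out.println("event loop, finished while blocked: " + finishedOnEventLoop(4));
        System.out.println("thread-per-request, finished while blocked: " + finishedOnThreadPerRequest(4));
    }
}
```

On the single-thread pool, no quick task can finish until the blocked one is released. On the cached pool the blocked task costs one thread and nothing else waits for it.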

BlockHound is a tool whose entire reason for existing is to detect, at runtime, when you accidentally block an event loop thread. The fact that this tool needs to exist is the warning. In the blocking model, blocking is the model. In the reactive model, a single accidental blocking call anywhere in a chain, in any library, in any transitive dependency, silently destroys your throughput, and you need a dedicated instrumentation agent to find it.

We added BlockHound. It found four blocking calls in our own code and two inside a third-party library we could not change. We worked around the two. We were now maintaining workarounds for a problem that did not exist before the rewrite.

July 2024 — Decision: keep going, but stop converting new services

Reasoning: the cost was now visible enough that we stopped expanding the blast radius, but not visible enough that we reverted. This was the worst decision in the log. It left us with a split codebase: three reactive services, fourteen blocking services, and a team that had to context-switch between two fundamentally different programming models depending on which service they opened.

New engineers had to learn both. Code review required reviewers fluent in both. Shared libraries had to expose both a blocking and a reactive API or pick one and frustrate half the team. The split was more expensive than either model would have been alone.

August 2024 — Decision: measure, finally

Reasoning: a new engineer, three weeks in, asked the question none of us had asked since March. “What was the throughput before the rewrite, and what is it now?”

Nobody knew. We had never measured the baseline. We had rewritten three services across five months to solve a scaling problem we had never quantified.

We ran the measurement. Here is what we found.

The reactive order service handled roughly the same requests per second as the old blocking version. Within noise. The theoretical thread savings were real, but threads had never been our bottleneck. Our bottleneck was the database and a downstream fraud-check API. Reactive code does not make the database faster. It just lets you wait for the slow database with fewer threads. We were not thread-starved. We were waiting on IO that was slow regardless of how many threads waited on it.
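The arithmetic behind that paragraph can be sketched with Little's Law. Every number below is hypothetical and chosen only to make the calculation easy to follow; none of them are measurements from this log. The sketch also assumes one database query per request.

```java
final class ThroughputCeilingSketch {

    // Most requests per second a dependency can serve:
    // concurrent slots (for example, pooled connections) divided by time per call.
    static double dependencyCeilingRps(int concurrentSlots, long millisPerCall) {
        return concurrentSlots * 1000.0 / millisPerCall;
    }

    // Little's Law: requests in flight = arrival rate * time each request spends in the service.
    // In a thread-per-request model, requests in flight equals threads in use.
    static double threadsNeeded(double requestsPerSecond, long millisPerRequest) {
        return requestsPerSecond * millisPerRequest / 1000.0;
    }

    // Threads are the constraint only if the load needs more of them than the pool has.
    static boolean threadsAreBottleneck(double requestsPerSecond, long millisPerRequest, int maxThreads) {
        return threadsNeeded(requestsPerSecond, millisPerRequest) > maxThreads;
    }

    public static void main(String[] args) {
        int dbConnections = 20;      // hypothetical connection pool size
        long millisPerQuery = 50;    // hypothetical query time
        long millisPerRequest = 80;  // hypothetical end-to-end request time
        int maxThreads = 200;        // hypothetical request thread pool size

        double ceiling = dependencyCeilingRps(dbConnections, millisPerQuery);
        System.out.println("database ceiling (req/s): " + ceiling);                  // 400.0
        System.out.println("threads needed at that rate: "
                + threadsNeeded(ceiling, millisPerRequest));                          // 32.0
        System.out.println("threads are the bottleneck: "
                + threadsAreBottleneck(ceiling, millisPerRequest, maxThreads));       // false
    }
}
```

With these made-up numbers the database caps the service at 400 requests per second, and serving that rate takes 32 threads out of 200. Freeing threads changes nothing in that situation, because the ceiling is set by the dependency.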

Worse: p99 latency was higher in the reactive version. Not lower. Higher, by 30 to 45 milliseconds. The reactive operator overhead, the scheduler hops, the additional allocation, all of it added latency on every request. The throughput was the same. The latency was worse. The complexity was dramatically higher. We had paid a large cost for a negative return.
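For readers who have not computed one by hand, a p99 is just the latency that 99 percent of requests come in at or under. The sketch below uses the nearest-rank method, which is one common definition and an assumption here; monitoring tools often use histograms or interpolation instead. The sample data is invented.

```java
import java.util.Arrays;

final class PercentileSketch {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percent` percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMillis, int percent) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 1 and 100");
        }
        long[] sorted = latenciesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percent * (long) sorted.length / 100.0);
        return sorted[rank - 1];
    }

    // Hypothetical sample: 98 fast requests and 2 slow ones.
    static long[] sampleWithSlowTail() {
        long[] samples = new long[100];
        Arrays.fill(samples, 20L);
        samples[10] = 500L;
        samples[70] = 500L;
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = sampleWithSlowTail();
        System.out.println("p50 = " + percentile(samples, 50) + " ms"); // 20 ms
        System.out.println("p99 = " + percentile(samples, 99) + " ms"); // 500 ms
    }
}
```

The median of that sample says nothing is wrong, while the p99 shows the slow tail. That is why a before-and-after comparison needs the same percentile, computed the same way, on both versions.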

If you are about to rewrite something for performance and you have not measured the baseline, the Production Latency Debugger is the document I wish the August version of me had read in March. It covers the method for finding your actual bottleneck before you rewrite anything, the measurements that tell you whether threads or IO is your real constraint, and how to read the difference in minutes instead of five months. $25. It is the single cheapest insurance against the mistake this entire log documents.

September 2024 — Decision: pilot a revert on one service

Reasoning: the measurement removed the sunk-cost emotion. The data said reactive was costing us latency and complexity for no throughput gain. We picked the inventory service, the least risky of the three, and reverted it to blocking. Mono/Flux controllers back to plain return types. WebClient back to RestTemplate. R2DBC back to JDBC.

Result after two weeks in production: p99 latency dropped 38 ms. Throughput unchanged. The number of frames in a typical stack trace dropped from 40+ to 6. The number of “I cannot figure out where this error is coming from” Slack messages about the inventory service dropped to zero.

The revert was less work than the original conversion, because blocking code is simpler and the libraries are more complete. The thing we had been afraid of, reverting, was easier than the thing we had been proud of, converting.

October 2024 — Decision: revert the remaining two services

Reasoning: the pilot worked. The data was unambiguous. We reverted the payment service and the order service over the following two months, service by service, never breaking production, canarying each change.

To be honest about it, we documented one genuine exception. There was a single endpoint, a server-sent-events stream for a live dashboard, where the reactive model was the right tool. Streaming a long-lived response to thousands of concurrent clients is exactly the workload reactive was built for. We kept that one endpoint reactive. One endpoint. Out of three services. That ratio is the real lesson.

December 2024 — Decision: ban WebFlux in new services by default

Reasoning: not because reactive is bad. Because reactive is a specialized tool for a specific problem (very high concurrency with long-lived or streaming connections, where thread-per-request genuinely does not scale), and our codebase had exactly one place that matched that description. Making it the default meant paying its complexity tax everywhere to benefit one endpoint.

The new rule: blocking by default. Reactive requires a written justification, a measured baseline showing thread exhaustion is the actual bottleneck, and an architect sign-off. In the eighteen months since, that bar has been cleared zero times. The one streaming endpoint remains the only reactive code we run.
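What "thread exhaustion is the actual bottleneck" looks like in numbers can be sketched with a plain JDK thread pool. This is an illustration under assumptions, not the check used in this log: the class name, the helper names and the saturation rule (every worker busy and work queued) are all hypothetical. In a real service the equivalent figures would come from the servlet container's thread pool metrics, sampled under production load.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSaturationSketch {

    // Saturated means every worker is busy and requests are waiting for a thread.
    static boolean isSaturated(ThreadPoolExecutor pool) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize()
                && !pool.getQueue().isEmpty();
    }

    static String describe(ThreadPoolExecutor pool) {
        return "active=" + pool.getActiveCount()
                + " max=" + pool.getMaximumPoolSize()
                + " queued=" + pool.getQueue().size();
    }

    // Demo helper: starts `tasks` long-running tasks on a pool of `threads`
    // workers and reports whether the pool counts as saturated at that moment.
    static boolean saturatedAfterSubmitting(int threads, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(threads, tasks));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // stand-in for a slow request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // wait until every task that can run is running
            System.out.println(describe(pool));
            return isSaturated(pool);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("2 threads, 5 requests: " + saturatedAfterSubmitting(2, 5));
        System.out.println("4 threads, 2 requests: " + saturatedAfterSubmitting(4, 2));
    }
}
```

A pool that is rarely or never saturated under peak load is evidence that threads are not the constraint, which is the case a written justification would have to rule out.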

What the log adds up to

Read the dates back. March: confident decision, no measurement. April through July: mounting cost, no measurement, decisions driven by sunk cost. August: someone finally measures. September onward: the data makes every subsequent decision easy.

The entire expensive arc between March and August existed because we made an architectural decision based on a conference benchmark instead of our own numbers, and then defended it emotionally instead of measuring it. The reactive rewrite was not the mistake. The mistake was the five months between adopting it and checking whether it worked.

I want to be precise about the technical conclusion, separate from the process one.

For an IO-bound service that is bottlenecked on a slow database or a slow downstream, reactive programming does not increase throughput, because throughput is bounded by the slow dependency, not by your thread count. It does increase latency, because of operator and scheduler overhead. It does dramatically increase debugging difficulty, because stack traces stop describing your logical flow. It does fragment your team, if adopted partially. It does require specialized tooling (BlockHound) to detect a failure mode (accidental blocking) that does not exist in the blocking model at all.

Reactive earns its cost in exactly one situation: when you have so many concurrent connections that one-thread-per-request genuinely exhausts memory or scheduler capacity, and especially when those connections are long-lived or streaming. That situation is real. It is also rare. If you are not certain you are in it, you are not in it.

What we run now

Blocking Spring MVC. RestTemplate, plus WebClient called with block() where convenient. JDBC with a properly tuned HikariCP pool. Virtual threads, since we moved to a JDK that supports them, which give us most of the thread-scaling benefit reactive promised, with none of the programming-model cost, because the code still reads top to bottom and the stack traces still describe reality.

That last point deserves its own sentence. Virtual threads delivered the actual benefit we went to reactive for, lots of concurrent IO-bound requests without lots of platform threads, while letting us keep blocking code that any engineer can read and debug. The thing we rewrote three services to get, we eventually got for free, by changing a runtime flag, in code we never had to make harder to read.
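A minimal sketch of what that looks like at the JDK level, assuming JDK 21 or later. It is not the services' code: handleRequest and handleAll are hypothetical names, and Thread.sleep stands in for a blocking JDBC or HTTP call. In Spring Boot 3.2 and later the corresponding switch is the property spring.threads.virtual.enabled=true, though the log does not say which flag was actually used.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualThreadSketch {

    // Plain blocking code, read top to bottom. The sleep is a stand-in for blocking IO.
    static String handleRequest(int id) {
        try {
            Thread.sleep(Duration.ofMillis(20));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        String kind = Thread.currentThread().isVirtual() ? "virtual" : "platform";
        return "order-" + id + ":" + kind;
    }

    // Runs every request on its own virtual thread and collects the results in order.
    static List<String> handleAll(int requests) {
        List<Future<String>> futures = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                final int id = i;
                futures.add(executor.submit(() -> handleRequest(id)));
            }
        } // close() waits for all submitted tasks to finish

        List<String> results = new ArrayList<>();
        for (Future<String> future : futures) {
            try {
                results.add(future.get());
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> results = handleAll(1000);
        System.out.println(results.size() + " requests handled, first: " + results.get(0));
    }
}
```

A thousand concurrent blocking calls run here without a thousand platform threads, and an exception thrown inside handleRequest would still carry an ordinary top-to-bottom stack trace.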

The one streaming endpoint stays reactive. It is the right tool there. Everywhere else, boring blocking code, faster p99, six-frame stack traces, and a team that does not have to know two programming models to review a pull request.

The honest part

Reactive programming is not a mistake. Project Reactor is excellent engineering. The people who built it are smarter than me and solved genuinely hard problems.

WebFlux was the wrong default for us. Not wrong everywhere. Wrong for an IO-bound CRUD-and-orchestration backend that was never thread-bound in the first place. We adopted it because it was impressive, not because we had measured a problem it solved. We kept it because reverting felt like an admission. We removed it because, eventually, we measured, and the measurement did not care about our feelings.

If you are considering a reactive rewrite, the only sentence from this entire log that matters is the August one. Measure your baseline first. Find out whether threads are actually your bottleneck. If they are not, no amount of non-blocking elegance will help you, and you will spend five months and 30 to 45 milliseconds of p99 latency learning what one afternoon of measurement would have told you.

Still planning that WebFlux migration? Maybe it is right for you. Some workloads genuinely need it. Just measure before March, not in August. I did it the other way. This log is the receipt.

One service at a time. Then, ideally, none.

I would honestly like to read your reactive rewrite story in the comments, especially if it went the other way and reactive was the right call. Those stories exist. I just was not in one.

Everything I learned across this rewrite and the dozen other production decisions that shaped how I build backends now is collected in one place. The architectural judgment, the measurement discipline, the patterns that survive contact with real load, the mistakes documented so you do not have to make them in production.

The Ultimate Production Engineering System

$199. The most complete thing I have written. Built from the incidents in this series, not from theory.

More from the production trenches: substack.com/@devrimozcay1


Why We Removed All Reactive Code From Our Spring Boot Services (And Throughput Went Up) was originally published in Javarevisited on Medium.