How to Modify Kafka Streams Topologies in Production Without Breaking Everything: A Remapping Strategy
Every Kafka engineer has experienced that stomach-dropping moment: the business requirements have changed, and your carefully designed stream topology needs a complete overhaul—but you’re already in production with downstream consumers depending on your output topics. Do you risk the “burning domino effect” of cascading failures, or is there a safer way?
The OSO engineers recently faced exactly this challenge whilst working with a client operating over 250 microservices and 1000+ Kafka topics in a massive event-driven architecture. When new business logic required introducing an additional event into an existing join operation—transforming a two-way join into a three-way merge with state stores—the traditional approach would have meant significant downtime and potential data loss.
This isn’t just a theoretical problem. When you’re managing stock execution systems that directly impact delivery logistics, every minute of downtime translates to blocked shipments and frustrated customers. The challenge was clear: how do you fundamentally redesign your stream processing logic without disrupting the intricate web of dependencies that downstream services rely upon?
The answer lies in a technique called remapping—a lightweight abstraction that decouples internal processing logic from external topic contracts. By implementing remapping on top of Kafka Streams, it becomes possible to completely redesign internal stream topologies whilst maintaining zero downtime and preserving downstream consumer stability. This approach has proven effective not only for planned topology migrations but also as a critical tool for emergency production recovery scenarios.
Understanding the Challenge: When Business Logic Changes Break Your Topology
The Original Architecture
Before diving into the solution, it’s essential to understand what makes topology changes so risky in production environments. The OSO engineers were working with a stock execution generator microservice that exemplified the typical Kafka Streams pattern. This service used the high-level DSL to join two distinct event streams: order events coming from the order management system and delivery booking confirmations from the logistics platform.
The implementation was straightforward—a windowed join with a three-month time window to account for the maximum gap between an order being placed and delivery being arranged. Both events needed to arrive before the system could generate a stock execution record and pass it downstream to the various microservices responsible for inventory management, warehouse operations, and reporting.
Topics were declared directly within Java classes using Kafka Streams’ declarative approach, which is clean and readable but creates tight coupling between service logic and stream infrastructure. The output topic, stock_executed, had become a critical dependency for multiple downstream services. Some consumed it for real-time inventory updates, others for analytics, and still others as triggers for their own processing pipelines.
This is where the web of interdependencies becomes dangerous. Each consuming microservice had been designed with the assumption that stock_executed would always be available, would always contain the expected schema, and would always maintain its semantic meaning. Any disruption to this contract—even temporarily—could cascade through the entire system.
The Breaking Change
The business requirement seemed simple enough on the surface: introduce a stock reservation cancellation event into the processing logic. The inventory management team needed to reserve stock in advance, and if a reservation was cancelled, that needed to be processed before executing the final stock allocation. From a business perspective, this made perfect sense. From a technical perspective, it meant tearing down the entire topology and rebuilding it from scratch.
The two-event join using Kafka Streams DSL suddenly needed to become a three-event merge. But the DSL doesn’t handle three-way joins elegantly, especially when the third event has different timing characteristics. The solution required dropping down to the Processor API and implementing custom state stores to manage the complex stateful logic of waiting for three potentially out-of-order events before producing output.
This wasn’t a minor tweak to a filter condition or a small adjustment to a mapping function. This was a fundamental architectural change that would alter how the microservice maintained state, how it processed events, and how it guaranteed ordering and consistency. The state stores would need to be rebuilt from scratch with a completely different structure to accommodate the new three-event pattern.
Why Traditional Approaches Fail
When faced with this challenge, most teams consider three standard approaches, and each has significant drawbacks that make it unsuitable for production systems operating at scale.
The Big Bang Deployment is the most straightforward but riskiest option. You shut down the old service, deploy the new one with the redesigned topology, and hope everything works. The problems are numerous: during the transition period, events arriving at the input topics aren’t being processed, creating lag that must be caught up later. The new state stores start empty, so the service lacks the historical context it needs to make correct decisions. Most critically, if something goes wrong during deployment, rolling back means losing all the state that was built during the failed attempt.
For a system managing stock execution, this approach is simply unacceptable. Every minute of downtime means orders that can’t be fulfilled, deliveries that can’t be scheduled, and inventory that can’t be allocated. Even worse, the “burning domino effect” kicks in—downstream services that depend on stock_executed start throwing errors, their retry logic amplifies the load on the system, and soon you’re fighting fires across your entire architecture rather than just fixing one service.
Dual Running Systems sounds safer in theory. You run both the old and new topologies in parallel, gradually shifting traffic from one to the other. The resource cost is immediately apparent—you’re effectively doubling your infrastructure requirements. But the complexity is even worse. How do you prevent duplicate processing? How do you ensure that downstream consumers don’t receive the same event twice, once from each topology? How do you handle the cutover when you finally shut down the old system? And if you need to roll back, you’re stuck maintaining two versions indefinitely.
Consumer Migration pushes the problem downstream. Instead of changing your service, you ask all the consuming services to switch to a new topic with the new event structure. This sounds reasonable until you realise you’re now coordinating changes across multiple teams, each with their own priorities and release schedules. The timeline extends from days to weeks or months. The deployment risk multiplies because now you need every downstream service to deploy successfully. And you still haven’t solved the original problem—you’ve just made it everyone else’s problem too.
Each of these approaches shares a common flaw: they attempt to change too many things simultaneously. Infrastructure, state management, processing logic, and consumer contracts all change at once, creating a massive blast radius for any failure. What’s needed is a way to decompose this risky migration into smaller, reversible steps—and that’s exactly what remapping enables.
The Remapping Solution: Decoupling Internal Logic from External Contracts
What is Kafka Streams Remapping?
At its core, remapping is deceptively simple: it’s a lightweight abstraction layer that allows you to dynamically substitute topic names at deployment time through configuration rather than code changes. Instead of hardcoding topic names like stock_executed directly in your Kafka Streams topology, you reference a logical name that gets resolved to a physical topic name based on configuration properties.
The OSO engineers implemented this using Kubernetes ConfigMaps, though the principle applies equally to any configuration management system. When the microservice starts up, it reads remapping rules from the ConfigMap that define how logical topic names map to physical Kafka topics. A simple configuration entry like kafka.topic.remap.stock_executed=stock_executed_dummy tells the service to write to stock_executed_dummy instead of stock_executed whenever the topology references the logical name.
This might seem like a trivial level of indirection, but it fundamentally changes how you can manage topology migrations. The key insight is that by remapping the output topic, downstream consumers continue reading from stock_executed without any interruption whilst your microservice safely processes into a temporary destination. You’ve effectively created an isolation boundary that protects your consumers from the chaos of your internal changes.
The implementation integrates cleanly with Kafka Streams’ existing APIs. The remapping logic intercepts topic name resolution before stream construction begins, substituting physical names for logical ones based on the active configuration. This happens transparently—your topology code doesn’t need to know about remapping, and when remapping is disabled, topic names resolve directly without any performance overhead.
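To make the resolution step concrete, here is a minimal sketch. TopicRemapper is a hypothetical helper rather than part of Kafka Streams, and the kafka.topic.remap.* property names simply mirror the configuration entries described above; a production implementation would add validation and logging.

import java.util.Properties;

// Hypothetical helper (not part of Kafka Streams) that resolves logical topic names
// against kafka.topic.remap.* properties before the topology is built.
public class TopicRemapper {

    private static final String PREFIX = "kafka.topic.remap.";

    private final Properties config;

    public TopicRemapper(Properties config) {
        this.config = config;
    }

    // Returns the physical topic for a logical name, or the logical name unchanged
    // when no mapping exists or the mapping has been disabled.
    public String resolve(String logicalName) {
        String target = config.getProperty(PREFIX + logicalName);
        boolean enabled = Boolean.parseBoolean(
                config.getProperty(PREFIX + logicalName + ".enabled", "true"));
        return (target == null || !enabled) ? logicalName : target;
    }
}

With kafka.topic.remap.stock_executed=stock_executed_dummy present, resolve("stock_executed") returns the dummy topic; with the entry removed or disabled, it falls straight through to the production name.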
How Remapping Preserves Downstream Stability
The genius of remapping lies in how it maintains the illusion of continuity for downstream consumers whilst enabling radical internal changes. Consider what happens from the consumer’s perspective during a topology migration with remapping enabled.
Before the migration begins, consumers are happily reading from stock_executed, processing each message, and maintaining their consumer offsets as normal. When you deploy the new topology with remapping active, absolutely nothing changes from their perspective. They continue reading from stock_executed because the remapping only affects the producer side—the microservice writing to the topic.
Meanwhile, your new topology is churning away, consuming input events, building its state stores from historical data, and writing output to stock_executed_dummy. This parallel processing is completely invisible to downstream consumers. They experience zero interruption, zero schema changes, and zero unexpected behaviour. If they’re monitoring lag on stock_executed, they see nothing unusual. If they’re tracking message rates, everything looks normal.
The beauty emerges when you’re ready to cut over. You’ve validated that your new topology is working correctly by examining the messages in stock_executed_dummy. You’ve confirmed that the state stores contain the expected data. You’ve verified that the three-event merge logic produces correct results. Only then do you remove the remapping configuration and redeploy, at which point the new topology starts writing to stock_executed directly.
From the consumer’s perspective, this cutover is seamless. Messages simply start appearing with the new structure, but the topic name hasn’t changed, the subscription hasn’t changed, and there’s been no gap in message availability. If you’ve managed your schema evolution properly (perhaps using Schema Registry with backward compatibility), consumers can even handle the new message structure without code changes.
The rollback story is equally compelling. Suppose you deploy the new topology and discover a critical bug in production. With traditional deployment approaches, rolling back is fraught with risk—you’ve already lost the old state stores, events have been processed with the new logic, and downstream consumers might have already seen the new message format. With remapping, you simply restore the old code and remove the ConfigMap entries. The new topology stops writing to stock_executed_dummy, the old topology resumes writing to stock_executed, and consumers never knew anything went wrong.
Technical Architecture Considerations
Implementing remapping effectively requires attention to several technical details that can make or break your migration strategy. The ConfigMap structure needs to be carefully designed to support both simple one-to-one topic mappings and more complex scenarios like topic prefixing or pattern-based remapping.
A typical ConfigMap entry might look like this:
kafka.topic.remap.stock_executed: stock_executed_dummy
kafka.topic.remap.stock_executed.enabled: true
The .enabled flag provides a clean way to disable remapping without removing the configuration entirely, which is useful during rollback scenarios. Some implementations support more sophisticated remapping rules, such as applying a prefix to all remapped topics (stock_executed becomes v2_stock_executed) or using regular expressions to remap multiple topics with a single rule.
Integration with Kafka Streams requires hooking into the topology construction process. The cleanest approach is to wrap the StreamsBuilder or intercept calls to methods like stream(), table(), and to() that reference topic names. The wrapper checks each topic name against the remapping configuration and substitutes the physical name if a mapping exists.
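A sketch of that wrapper, building on the hypothetical TopicRemapper shown earlier, might look like this; only the entry points your topologies actually use need to be covered, and sink topics are resolved the same way before being passed to KStream#to().

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Thin wrapper that resolves logical topic names before delegating to the
// real StreamsBuilder.
public class RemappingStreamsBuilder {

    private final StreamsBuilder delegate = new StreamsBuilder();
    private final TopicRemapper remapper;

    public RemappingStreamsBuilder(TopicRemapper remapper) {
        this.remapper = remapper;
    }

    public <K, V> KStream<K, V> stream(String topic, Consumed<K, V> consumed) {
        return delegate.stream(remapper.resolve(topic), consumed);
    }

    public <K, V> KTable<K, V> table(String topic, Consumed<K, V> consumed) {
        return delegate.table(remapper.resolve(topic), consumed);
    }

    // Resolve sink topics here, then hand the result to KStream#to().
    public String sink(String topic) {
        return remapper.resolve(topic);
    }

    public Topology build() {
        return delegate.build();
    }
}

Because the resolution happens once, at construction time, there is no per-record overhead when remapping is disabled.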
State store naming deserves special attention because Kafka Streams generates state store names based on the topology structure. When you change from a DSL join to Processor API with custom state stores, the naming scheme changes. If you’re not careful, redeploying with remapping active could trigger a complete state store rebuild even though you don’t want one.
The solution is to explicitly name your state stores in both the old and new topologies. In the DSL, you can provide names to operations like groupBy() and join() using the Named parameter, and name the underlying stores through Grouped, Materialized, or StreamJoined. In the Processor API, you directly specify state store names when you register them. By keeping these names consistent across topology versions, you ensure that state stores are only rebuilt when you explicitly want them to be.
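The sketch below illustrates the idea with placeholder store and topic names; the exact changelog topic names still depend on the operation, so treat it as a starting point rather than a guarantee. The DSL join names its stores explicitly via StreamJoined, and the Processor API version registers a store under the same base name.

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class StoreNaming {

    // Old DSL topology: give the windowed join's state stores an explicit base name
    // instead of relying on the auto-generated KSTREAM-JOINSTATESTORE-NNNN names.
    static Topology dslVersion() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        KStream<String, String> bookings = builder.stream("delivery_bookings");
        orders.join(
                  bookings,
                  (order, booking) -> order + "|" + booking,
                  JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofDays(90)),
                  StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String())
                              .withStoreName("stock-execution-store"))
              .to("stock_executed");
        return builder.build();
    }

    // New Processor API topology: register its store under a deliberately chosen,
    // stable name so the changelog topic name is predictable across versions.
    static StoreBuilder<KeyValueStore<String, String>> processorApiStore() {
        return Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("stock-execution-store"),
                Serdes.String(), Serdes.String());
    }
}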
One subtlety that catches teams out is the interaction between remapping and Kafka Streams’ internal topics. When you use operations like repartition() or maintain changelog topics for state stores, Kafka Streams creates internal topics with auto-generated names. These internal topics aren’t typically remapped because they’re not part of your external contract. However, if your new topology changes the structure in a way that affects internal topics, you need to ensure the old internal topics are properly cleaned up after migration to avoid wasting storage.
The Production Rollout Process: A Step-by-Step Strategy
Phase 1: Preparation and Traffic Control
The foundation of a successful topology migration is meticulous preparation. Before making any changes to production systems, the OSO engineers established a clean baseline state that would make it possible to validate each step of the migration process.
The first critical step is gracefully shutting down all upstream producers. This might seem counterintuitive—why stop new data from flowing when the whole point is zero downtime? The answer lies in establishing a clear cutoff point that makes validation possible. By temporarily stopping producers, you create a moment where you know exactly what data should be in the system. Every input topic has a defined high-water mark, and every output topic should reflect the processing of all messages up to that point.
During this producer shutdown, it’s essential to verify that lag reaches zero on both input and output topics. Input lag reaching zero confirms that your current topology has consumed every message that was produced before the shutdown. Output lag reaching zero (from the downstream consumers’ perspective) confirms that those consumers have processed everything your topology produced. This gives you a clean slate for the migration.
The lag verification needs to be thorough. It’s not sufficient to check only the primary input topics—you need to verify internal repartition topics, changelog topics for state stores, and any intermediate topics that might be part of your topology. A common mistake is declaring the system ready for migration whilst an internal changelog topic still has significant lag, leading to incomplete state store population in the new topology.
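A sketch of that verification using the Kafka AdminClient follows; the group id and bootstrap address are placeholders. Because Kafka Streams commits offsets under its application.id, the check below covers input and internal repartition topics together, but a thorough run would still inspect changelog topics separately.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Kafka Streams uses the application.id as its consumer group id.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("stock-execution-generator")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}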
Once you’ve confirmed zero lag across the board, the next step is the offset reset strategy. Because the new topology uses different processing logic (transitioning from a two-event join to a three-event merge with state stores), it needs to reprocess historical data to build its state correctly. The offset reset period must match the temporal window of your processing logic.
In the stock execution generator case, the original topology used a three-month windowed join. This meant that events could legally arrive up to three months apart and still need to be joined together. Therefore, the new topology’s state stores needed to contain three months of historical context to make correct processing decisions. The offset reset was configured to go back precisely three months, ensuring the state stores would be fully populated with all relevant historical events.
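One way to express that reset with the AdminClient is sketched below, assuming a placeholder input topic and that the Streams application is stopped while its group offsets are altered; some teams use the kafka-streams-application-reset tool for the same purpose.

import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetResetToTimestamp {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        long startOfWindow = Instant.parse("2024-07-01T00:00:00Z").toEpochMilli();

        try (Admin admin = Admin.create(props)) {
            // Look up the partitions of the input topic ("orders" is a placeholder).
            TopicDescription description =
                    admin.describeTopics(List.of("orders")).all().get().get("orders");

            Map<TopicPartition, OffsetSpec> specs = new HashMap<>();
            description.partitions().forEach(p ->
                    specs.put(new TopicPartition("orders", p.partition()),
                              OffsetSpec.forTimestamp(startOfWindow)));

            // Translate the timestamp into concrete offsets; partitions with no
            // records after the timestamp come back as -1 and would need the
            // earliest offset instead (omitted here for brevity).
            Map<TopicPartition, OffsetAndMetadata> target = new HashMap<>();
            admin.listOffsets(specs).all().get().forEach((tp, info) ->
                    target.put(tp, new OffsetAndMetadata(info.offset())));

            // Rewind the (stopped) Streams application's consumer group.
            admin.alterConsumerGroupOffsets("stock-execution-generator", target).all().get();
        }
    }
}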
This offset reset is where remapping begins to show its value. Because the new topology will write to stock_executed_dummy rather than stock_executed, you can safely reprocess three months of historical data without producing duplicate messages in the production topic. Downstream consumers continue reading from stock_executed, which receives no new messages during this state store population phase, whilst the new topology churns through historical data in isolation.
Phase 2: Deployment with Remapping Active
With preparation complete, the actual deployment of the new topology becomes surprisingly straightforward. The ConfigMap is updated first, before any code deployment occurs, adding the remapping configuration that redirects stock_executed to stock_executed_dummy. This ordering is crucial—if you deployed the new code without the remapping configuration in place, even briefly, it would start writing to the production topic before its state stores were ready.
The ConfigMap update in a Kubernetes environment typically looks like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: stock-execution-generator-config
data:
  kafka.topic.remap.stock_executed: "stock_executed_dummy"
  kafka.offset.reset: "earliest"
  kafka.offset.reset.timestamp: "2024-07-01T00:00:00Z"
Once the ConfigMap is updated and synced across your Kubernetes cluster, the new version of the stock execution generator service is deployed. The deployment should use a rolling update strategy, but because this is effectively a new application.id from Kafka Streams’ perspective (or you’ve explicitly reset offsets), the behaviour is more like a fresh start than a rolling upgrade.
As the new service instances start up, they begin consuming from the input topics starting from three months ago. This is where the Processor API implementation begins building its state stores, accumulating the historical context needed for correct three-event merging. The service processes events and writes results to stock_executed_dummy, which is being created for the first time as messages arrive.
Monitoring during this phase is critical. The OSO engineers tracked several key metrics to ensure the state store population was progressing correctly. Consumer lag on the input topics shows how quickly historical data is being reprocessed. The lag should decrease steadily—if it stalls or increases, something is wrong with the processing logic or resource allocation. State store metrics reveal the size and entry count of the stores, which should grow predictably as historical events are processed.
The duration of this phase depends on several factors: the volume of historical data being reprocessed, the complexity of the new processing logic, and the resources allocated to the service. For a three-month window with high-throughput topics, it might take several hours for the lag to reach zero. This is perfectly acceptable because the remapping isolation means this reprocessing has no impact on production traffic.
One critical decision point is determining when the state stores are “ready”. Simply waiting for input lag to reach zero isn’t always sufficient. The state stores need to contain not just the raw events but also the processed state that represents the join conditions, aggregations, or other stateful computations your new logic performs. The OSO engineers validated readiness by examining the contents of stock_executed_dummy, comparing message rates and content patterns against the production stock_executed topic to confirm the new topology was producing equivalent results.
Phase 3: The Cutover – Removing the Remapping
The cutover phase is where remapping truly shines. After validating that the new topology is working correctly, producing expected output to stock_executed_dummy, and maintaining state stores with complete historical context, you’re ready to flip the switch and direct output to the production topic.
The process is deliberately simple to minimise the window of change. First, the stock execution generator service is shut down cleanly. Kafka Streams’ shutdown hooks ensure that any in-flight processing completes, consumer offsets are committed, and state stores are flushed to their changelog topics. This clean shutdown is essential—an abrupt termination could leave inconsistent state or uncommitted offsets that would cause problems during restart.
With the service stopped, the ConfigMap is updated to remove the remapping configuration. The line kafka.topic.remap.stock_executed: "stock_executed_dummy" is simply deleted. Some implementations prefer to set an explicit flag like kafka.topic.remap.stock_executed.enabled: false to make the change more visible in version control diffs, but the effect is the same—the next time the service starts, remapping will be disabled.
The service is then redeployed. Because the state stores were properly maintained during the previous phase (thanks to Kafka Streams’ changelog topics providing fault tolerance), they’re immediately available when the service restarts. The new Processor API topology instantiates its state stores, finds them already populated with three months of historical data, and begins processing new events from the input topics.
Critically, the service now writes to stock_executed directly. The first few messages that arrive at this topic after the cutover reflect the new three-event merge logic, including the stock reservation cancellation events that drove this entire migration. Downstream consumers, which have been patiently reading from stock_executed throughout the entire migration process, now start seeing these new message structures.
From the consumers’ perspective, nothing dramatic has happened. There was no gap in message availability—the service was only stopped for the brief moment required to update configuration and restart, typically measured in seconds. There were no duplicate messages because the remapped output went to a completely separate topic. There’s just a smooth transition from messages produced by the old topology to messages produced by the new one.
The lag monitoring during cutover deserves special attention. You should see input lag begin decreasing again as the service resumes processing the backlog that built up during the brief shutdown. Output lag (from downstream consumers’ perspective) should remain stable or decrease, confirming that consumers are successfully processing the new message format. Any unexpected increase in consumer lag suggests a compatibility issue that needs immediate attention.
Phase 4: Post-Migration Cleanup
The final phase of the migration is cleaning up the artefacts left behind by the remapping process. The stock_executed_dummy topic served its purpose as a safe destination for output during state store population, but it’s now consuming storage unnecessarily. Similarly, any internal topics created by the old topology (like join state store changelog topics from the DSL implementation) are no longer needed.
Before deleting anything, the OSO engineers performed a final validation period. The new topology ran in production for at least 24 hours, processing real traffic, whilst monitoring systems tracked error rates, processing latency, and downstream consumer health. Only after confirming that everything was stable did cleanup begin.
Topic deletion in Kafka is straightforward but irreversible, so it’s worth being methodical. The stock_executed_dummy topic is deleted first using the Kafka admin tools. If you’re in an environment with topic retention policies, you might choose to let the topic age out naturally rather than explicitly deleting it, which provides an additional safety margin if you need to examine its contents later.
The old state store changelog topics require more careful handling. These topics have names like stock-execution-generator-KSTREAM-JOINSTATESTORE-0000000003-changelog (the numbering depends on your topology structure). Before deleting them, verify that the new topology’s state stores are healthy and don’t reference these topics. A mistake here could cause the service to attempt rebuilding state stores from non-existent changelogs.
Internal repartition topics from the old topology also need cleanup. These typically have names like stock-execution-generator-KSTREAM-AGGREGATE-0000000005-repartition. Again, verify that the new topology isn’t using these topics before deletion. The safest approach is to use Kafka’s topic listing filtered by your application ID and compare against the list of topics the new topology actually uses (which you can obtain from Topology.describe()).
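A hedged sketch of that comparison is shown below; the handling of auto-generated internal names is deliberately simplified and would need adapting to your own naming conventions, so treat its output as candidates for review rather than a deletion list.

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyDescription;

public class OrphanedTopicCheck {

    // Returns topics carrying the application id prefix that the current topology
    // no longer appears to reference.
    static Set<String> orphanedTopics(Admin admin, Topology topology, String applicationId)
            throws Exception {
        Set<String> inUse = new HashSet<>();
        for (TopologyDescription.Subtopology sub : topology.describe().subtopologies()) {
            for (TopologyDescription.Node node : sub.nodes()) {
                if (node instanceof TopologyDescription.Source) {
                    Set<String> topics = ((TopologyDescription.Source) node).topicSet();
                    if (topics != null) {
                        topics.forEach(t -> {
                            inUse.add(t);
                            inUse.add(applicationId + "-" + t); // internal repartition form
                        });
                    }
                } else if (node instanceof TopologyDescription.Sink) {
                    String topic = ((TopologyDescription.Sink) node).topic();
                    if (topic != null) {
                        inUse.add(topic);
                        inUse.add(applicationId + "-" + topic);
                    }
                } else if (node instanceof TopologyDescription.Processor) {
                    // Changelog topics are derived from state store names.
                    ((TopologyDescription.Processor) node).stores().forEach(store ->
                            inUse.add(applicationId + "-" + store + "-changelog"));
                }
            }
        }

        return admin.listTopics().names().get().stream()
                .filter(t -> t.startsWith(applicationId + "-"))
                .filter(t -> !inUse.contains(t))
                .collect(Collectors.toSet());
    }
}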
State store directories on disk may also require attention, though Kafka Streams generally handles this automatically. If you explicitly changed the application ID during migration, the old state stores might persist in the local state directory. These can be safely deleted once you’ve confirmed the new topology is stable, freeing up disk space.
Documentation is the final, often-overlooked cleanup task. The OSO engineers updated their topology diagrams, configuration documentation, and runbooks to reflect the new Processor API implementation and the three-event merge logic. This is crucial for future engineers (including your future self) who need to understand how the system works or plan the next migration.
Beyond Topology Changes: Other Remapping Use Cases
Emergency Data Reprocessing in Production
Whilst planned topology migrations are the most obvious use case for remapping, the OSO engineers discovered that remapping’s most valuable application often comes during emergencies. When data has been processed incorrectly in production and needs selective reprocessing without a full topic replay, remapping provides an elegant solution.
The scenario plays out like this: an order processing system has a bug that causes certain edge-case orders to enter an invalid state. These orders represent delivery trucks that are now stuck, unable to proceed because the stock execution system thinks inventory has been allocated when it hasn’t actually been reserved. Customers are waiting for deliveries, and the problem needs to be fixed within hours, not days.
The traditional approach would be to fix the bug, deploy the corrected code, and then replay the entire input topic to reprocess all orders. But replaying weeks or months of data would take far too long, and it would generate duplicate processing for the vast majority of orders that were handled correctly the first time. What’s needed is selective reprocessing of only the problematic orders.
This is where remapping becomes an emergency response tool. The OSO engineers created an emergency microservice—a lightweight Kafka Streams application with a simple topology that consumes from a specially crafted topic containing only the problematic order IDs. This emergency service uses remapping to redirect its output to the production stock_executed topic, effectively injecting corrected stock execution records into the main processing pipeline.
The workflow is straightforward: identify the problematic order IDs through log analysis or database queries, publish those IDs to an emergency reprocessing topic, configure the emergency service to consume from this topic whilst remapping its output to stock_executed, and deploy. The emergency service processes only the specific orders that need correction, producing fixed stock execution records that downstream services consume normally.
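A minimal sketch of such an emergency service is shown below. The topic names, the record-rebuilding logic, and the TopicRemapper helper from earlier are all placeholders; the point is that the output destination comes from configuration, not code.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EmergencyReprocessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stock-execution-emergency");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The output topic is resolved through the same remapping configuration,
        // so it can point at a test topic first and at stock_executed for the
        // real run: only configuration changes, not code.
        TopicRemapper remapper = new TopicRemapper(props);
        String outputTopic = remapper.resolve("stock_executed");

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("emergency_order_ids")
               // Rebuild a corrected stock execution record for each order id;
               // the lookup/repair logic here is entirely application-specific.
               .mapValues(EmergencyReprocessor::rebuildExecutionRecord)
               .to(outputTopic);

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }

    private static String rebuildExecutionRecord(String orderId) {
        // Placeholder: fetch order, booking, and reservation state and emit
        // the corrected execution payload.
        return "corrected-execution-for-" + orderId;
    }
}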
What makes this approach powerful is its surgical precision. You’re not disrupting the main stock execution generator service, which continues processing new orders normally. You’re not replaying massive amounts of historical data. You’re simply injecting corrected records for specific cases where processing went wrong. And because remapping handles the output topic configuration, you can test the emergency service thoroughly by directing output to a test topic before flipping the remapping configuration to target production.
The OSO engineers used this technique multiple times to “unblock” stuck delivery trucks—hence the comment in the presentation about it saving their skin in production. When a dozen trucks are sitting idle because of a software bug, the ability to reprocess just those specific orders within minutes rather than hours can be the difference between meeting delivery commitments and failing customers.
State Store Type Migrations
Another valuable application of remapping is changing state store implementations without service interruption. Kafka Streams supports multiple state store types—RocksDB for persistent storage, in-memory stores for performance, and custom implementations for specialised requirements. Each has different performance characteristics, memory requirements, and operational considerations.
Suppose you’ve been using RocksDB state stores but profiling reveals that your working set fits comfortably in memory and the disk I/O from RocksDB is your bottleneck. Switching to in-memory stores could dramatically improve throughput, but how do you make this change safely in production?
The remapping strategy applies directly. Deploy a new version of your service configured with in-memory state stores and remapping enabled so output goes to a dummy topic. The service consumes historical data to populate its in-memory stores (much faster than RocksDB would populate from disk), and you validate that performance and correctness meet expectations. Once validated, remove remapping and redeploy so the service writes to the production topic.
The reverse scenario is equally applicable. If your in-memory state stores have grown beyond available RAM and you’re experiencing out-of-memory errors, you need to migrate to RocksDB’s disk-backed storage. Remapping lets you deploy the RocksDB version, populate its state stores while writing to a dummy topic, validate that disk I/O doesn’t create unacceptable latency, and only then cut over to production.
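In the DSL, the swap itself can be as small as changing the store supplier passed to Materialized, sketched here with placeholder store names and serdes:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class StoreTypeChoice {

    // RocksDB-backed (persistent) store for the aggregation.
    static Materialized<String, Long, KeyValueStore<Bytes, byte[]>> persistent() {
        return Materialized.<String, Long>as(Stores.persistentKeyValueStore("stock-counts"))
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.Long());
    }

    // In-memory equivalent registered under the same store name, so the changelog
    // topic (and therefore recovery) stays the same while the runtime storage changes.
    static Materialized<String, Long, KeyValueStore<Bytes, byte[]>> inMemory() {
        return Materialized.<String, Long>as(Stores.inMemoryKeyValueStore("stock-counts"))
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.Long());
    }
}

Either variant is then passed to the same aggregation call, for example groupByKey().count(persistent()), so the processing logic itself is untouched.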
State store migrations benefit from remapping’s isolation in a specific way: they give you the opportunity to tune state store configuration in a production-like environment before committing to the change. You can experiment with RocksDB’s block cache size, compression settings, and write buffer configuration whilst processing real production data volumes, all while remapping ensures that any mistakes in tuning don’t affect downstream consumers.
Testing New Business Logic Safely
Development and testing environments, no matter how carefully constructed, never perfectly replicate production conditions. Data volumes differ, traffic patterns vary, and rare edge cases that appear in production logs never manifest in test environments. When you’re introducing significant new business logic—like the three-event merge in the stock execution generator—you’d ideally like to run it against production data before committing fully.
Remapping enables a form of shadow testing where new topology versions run in the production environment, consuming real production data, but writing to isolated output topics that aren’t consumed by downstream services. This gives you high confidence that the new logic handles production conditions correctly before exposing it to consumers.
The OSO engineers used this pattern for A/B testing stream processing logic. The old topology continues writing to stock_executed whilst a new experimental topology (with remapping) writes to stock_executed_experimental. Both consume the same input events in real-time. Monitoring systems compare the outputs, flagging discrepancies that indicate bugs or logic errors in the experimental version.
This approach provides several advantages over traditional staged rollouts. You’re testing with actual production data volumes and patterns, not synthetic test data. You can run the comparison for days or weeks, accumulating confidence that the new logic handles not just normal cases but also the rare edge cases that only appear occasionally. And you can abort the experiment at any time without affecting production—just shut down the experimental service and delete its output topic.
For gradual rollouts, remapping enables traffic splitting. Configure your service instances with different remapping rules—some write to the production topic, others to an experimental topic—and gradually shift the ratio as confidence grows. This is more flexible than Kafka Streams’ built-in standby replicas because you control exactly which instances participate in the experiment and can adjust the ratio dynamically through configuration changes rather than redeployments.
Practical Takeaways: Implementing Remapping in Your Environment
Prerequisites for Successful Remapping
Before implementing remapping in your organisation, certain foundational capabilities need to be in place. These prerequisites aren’t unique to remapping—they’re generally good practices for operating event-driven architectures at scale—but remapping makes them essential rather than optional.
Configuration management infrastructure is the first requirement. Whether you’re using Kubernetes ConfigMaps, HashiCorp Consul, AWS Systems Manager Parameter Store, or even simple property files, you need a reliable way to manage environment-specific settings without recompiling or repackaging applications. The configuration system should support versioning (so you can track which remapping rules were active at any point in time), validation (to catch typos in topic names before they cause production incidents), and ideally hot-reloading (though remapping configuration changes typically require service restarts to take effect safely).
Monitoring infrastructure must provide comprehensive visibility into Kafka Streams internals. Consumer lag monitoring is essential—you need to track lag on input topics, output topics, and internal repartition and changelog topics. State store metrics reveal the health and size of your stores, critical for knowing when state population is complete during migrations. Throughput metrics on both the consumer and producer sides help you identify performance bottlenecks. And custom application metrics should track remapping state explicitly, making it obvious from dashboards whether remapping is currently active.
Deployment orchestration capabilities directly impact how smoothly remapping-based migrations execute. Zero-downtime deployments with health checks ensure that new service versions only begin receiving traffic after state stores are loaded and ready. Configuration injection must happen before application startup so remapping rules are in place from the moment Kafka Streams initialises. Rolling deployment strategies prevent the entire service from going offline simultaneously, though for topology migrations you often want all instances to cut over together to avoid split-brain scenarios where different instances are running different topologies.
The operational maturity of your team matters as much as technical infrastructure. Remapping-based migrations require careful planning, methodical execution, and close monitoring. You need runbooks that document the step-by-step process, rollback procedures that can be executed under pressure, and communication protocols for coordinating changes that affect multiple teams (even if those teams don’t need to change their own services, they should be aware that topology changes are in progress).
Key Implementation Decisions
Several design decisions will shape how remapping works in your specific environment. These decisions have trade-offs, and the right choice depends on your operational model, team structure, and risk tolerance.
Remapping granularity is the first decision: do you implement remapping at the application level or the topic level? Application-level remapping means each microservice has a single toggle that activates remapping for all its output topics simultaneously. This is simpler to manage and reason about—either remapping is active or it isn’t—but it’s less flexible. Topic-level remapping lets you selectively remap individual topics whilst leaving others unchanged, useful for complex migrations affecting only part of a service’s output, but it increases configuration complexity and the cognitive load of understanding which topics are remapped at any given moment.
State store strategies require planning around offset reset periods and state store population times. The offset reset period must cover the temporal window of your processing logic—if you have windowed joins or time-based aggregations, you need to reprocess enough historical data to fully populate the state stores. But longer offset reset periods mean longer state store population times, delaying when you can cut over to the new topology. The OSO engineers used three months because their windowed join required it, but many use cases can work with much shorter windows—days or even hours—if the processing logic doesn’t require extensive historical context.
Rollback procedures need to be documented and tested before you need them. When remapping is active and something goes wrong, what’s your rollback path? The simple answer is “remove the remapping configuration and redeploy the previous version”, but edge cases complicate matters. What if the new topology has already written messages to the production topic before you caught the problem? What if downstream consumers have already processed those messages and taken actions based on them? What if the old state stores have been deleted? Each scenario needs a documented procedure so on-call engineers can respond effectively under pressure.
Common pitfalls to avoid become obvious with experience, but learning them the hard way in production is painful. Forgotten cleanup—leaving dummy topics and old state stores around—is perhaps the most common mistake. These artefacts consume storage and create confusion for engineers examining the Kafka cluster months later. Automation helps: scripts that list all topics matching your application ID pattern and compare against what the current topology actually uses can identify orphaned topics for cleanup.
Insufficient offset reset periods cause subtle bugs that only manifest after migration. If you reset offsets back one month but your windowed join has a three-month window, the state stores will be missing historical context for events older than one month. Processing might appear to work initially, but certain edge cases will fail because expected joined events aren’t in the state store. The symptom—missing or incorrect output records—is obvious, but the cause can be difficult to diagnose after the fact.
Downstream schema changes represent a category error that remapping doesn’t solve. Remapping handles topology changes—modifications to how you process events internally—but it doesn’t protect downstream consumers from incompatible schema changes in the messages themselves. If your three-event merge produces messages with a different structure than the old two-event join, remapping won’t help consumers handle the new format. Schema evolution must be addressed separately, typically through backward-compatible schema design and Schema Registry compatibility enforcement.
Monitoring gaps during migration can hide problems until after cutover when they’re harder to fix. The dummy topic during remapping should be monitored just as carefully as the production topic—message rates, schema validation errors, and processing latency all need dashboards. If you’re not watching the dummy topic closely, you might miss signs that the new topology isn’t working correctly until you’ve already cut over to production.
Integration with Existing Tools
Remapping integrates cleanly with standard Kafka ecosystem tools, but understanding the integration points helps you leverage existing investments rather than building everything from scratch.
Spring Boot applications with Spring Kafka provide natural integration points for remapping. Spring’s externalised configuration model aligns perfectly with remapping needs—topic names can be injected from application.properties or application.yml files, which themselves can be overridden by environment variables or ConfigMaps in Kubernetes. A simple Spring configuration class can read remapping rules from properties and apply them transparently before creating Kafka Streams topologies.
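A sketch of that binding, assuming a hypothetical TopicRemapProperties class registered via @EnableConfigurationProperties or @ConfigurationPropertiesScan, might look like this:

import java.util.HashMap;
import java.util.Map;

import org.springframework.boot.context.properties.ConfigurationProperties;

// Binds entries such as kafka.topic.remap.stock_executed=stock_executed_dummy from
// application.yml, environment variables, or a mounted ConfigMap.
@ConfigurationProperties(prefix = "kafka.topic")
public class TopicRemapProperties {

    // Logical topic name -> physical topic name.
    private Map<String, String> remap = new HashMap<>();

    public Map<String, String> getRemap() {
        return remap;
    }

    public void setRemap(Map<String, String> remap) {
        this.remap = remap;
    }

    public String resolve(String logicalName) {
        return remap.getOrDefault(logicalName, logicalName);
    }
}

The topology configuration class can then inject this bean and resolve output topic names before building the Kafka Streams topology.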
Non-Kubernetes environments don’t lose the benefits of remapping—the technique is platform-agnostic. Traditional VM deployments can use configuration management tools like Ansible or Puppet to push updated configuration files during migrations. Cloud platforms like AWS can leverage Systems Manager Parameter Store or Secrets Manager for centralised configuration. The key requirement is simply the ability to change configuration and restart services in a controlled manner.
Open source implementations of remapping are available in various forms. Some organisations have published their internal remapping libraries, providing reference implementations you can adopt or adapt. The OSO engineers mentioned their open-source library specifically for this purpose. Examining existing implementations provides valuable insights into edge cases and design patterns that might not be obvious when building from scratch.
Commercial Kafka Streams management platforms increasingly include remapping-like capabilities, though they may use different terminology. Some call it “topology versioning”, others “blue-green deployments for streams”. The underlying principle—decoupling internal processing logic from external topic contracts—remains the same regardless of naming. When evaluating these platforms, look for support for the key remapping workflows: deploying new topologies with isolated output topics, validating correctness before cutover, and quick rollback if problems are discovered.
Lessons from Operating at Scale: 250 Microservices and Beyond
Architectural Patterns That Enable Safe Changes
The scale at which the OSO engineers operated—250+ microservices and 1000+ topics—reveals patterns that make remapping-based migrations not just possible but routine. These patterns emerged from years of operating massive event-driven architectures and represent hard-won lessons from production incidents.
Topic-to-topic communication as an architectural principle means microservices consume from topics and produce to topics, with no direct service-to-service calls. This creates natural boundaries for change. When one service needs to modify its internal processing logic, the impact is limited to the topics it produces—consumers don’t care how those topics are generated, only that they contain the data they need. Remapping reinforces this principle by making it safe to change processing logic whilst maintaining topic-level contracts.
Declarative topology definition pushes configuration out of code where possible. While the core processing logic necessarily lives in code, operational concerns like topic names, serialisation formats, and resource allocation can be externalised to configuration. This makes remapping natural—topic names were already configured rather than hardcoded, so adding remapping rules is a small incremental change rather than a fundamental redesign.
Interactive queries for state visibility become essential at scale. During a remapping-based migration, you need to know what’s in your state stores. Are they fully populated? Do they contain the expected data? Are join conditions being satisfied correctly? Kafka Streams’ interactive queries let you programmatically inspect state store contents, enabling automated validation during migrations rather than relying on manual sampling or hoping for the best.
The OSO engineers exposed HTTP endpoints on their Kafka Streams services specifically for state store inspection. During the stock execution generator migration, monitoring systems could query these endpoints to confirm that the new state stores contained stock reservation records, order records, and delivery booking records in the expected ratios. This automated validation provided confidence that the migration was proceeding correctly without requiring engineers to manually examine topic contents.
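A sketch of that inspection logic using Kafka Streams’ interactive query API is shown below; the store name and serde types are illustrative, and the HTTP layer that exposes it is omitted.

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateStoreInspector {

    private final KafkaStreams streams;

    public StateStoreInspector(KafkaStreams streams) {
        this.streams = streams;
    }

    // Number of entries held by this instance's copy of the store; an HTTP handler
    // can expose this for migration dashboards and automated readiness checks.
    public long entryCount(String storeName) {
        return store(storeName).approximateNumEntries();
    }

    // Point lookup, e.g. to confirm that a specific order made it into state.
    public String lookup(String storeName, String key) {
        return store(storeName).get(key);
    }

    private ReadOnlyKeyValueStore<String, String> store(String storeName) {
        return streams.store(StoreQueryParameters.fromNameAndType(
                storeName, QueryableStoreTypes.keyValueStore()));
    }
}

Note that each instance only sees its own partitions of the store, so a complete view requires querying every instance and aggregating the results.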
Monitoring and Observability Requirements
Operating event-driven architectures at scale demands observability that goes far beyond basic metrics. The monitoring infrastructure the OSO engineers relied on illustrates what’s necessary for safe topology migrations in production environments.
Multi-layered metrics provide different views of system health. Prometheus metrics collect high-frequency quantitative data—message rates, processing latencies, state store sizes, and consumer lag. These metrics feed into Grafana dashboards that provide real-time visibility during migrations. Splunk log aggregation captures qualitative information—error messages, warnings about unusual conditions, and traces of individual message processing. Together, these layers enable both broad system-level monitoring and deep diagnostic investigation when problems occur.
Dead letter queue patterns separate different types of failures and enable targeted recovery. The OSO engineers implemented DLQs for business errors (like orders with invalid data) separately from technical exceptions (like serialisation failures or network timeouts). This separation matters during migrations because business errors might indicate that your new processing logic handles edge cases differently than the old logic, whilst technical exceptions suggest infrastructure or code problems.
During the stock execution generator migration, monitoring DLQs for unusual patterns was critical. An increase in business errors to the DLQ after deploying the new topology might indicate that the three-event merge logic was rejecting certain combinations of events that the old two-event join accepted. This would warrant investigation and possibly rollback, even if the technical metrics looked healthy.
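One way to route business failures to their own DLQ is sketched below; the topic names and the validation predicate are placeholders, and technical failures are assumed to be handled separately by exception handlers.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;

// Routes records that fail business validation to a dedicated DLQ topic, keeping
// them separate from technical failures handled elsewhere (for example by
// deserialisation and production exception handlers feeding a different DLQ).
public class BusinessDlqRouting {

    // Placeholder check; a real implementation would validate the deserialised order.
    private static boolean isValidOrder(String key, String value) {
        return value != null && !value.isEmpty();
    }

    static void wire(StreamsBuilder builder) {
        KStream<String, String> orders = builder.stream("orders");

        orders.split()
              .branch((k, v) -> !isValidOrder(k, v),
                      Branched.withConsumer(invalid -> invalid.to("orders-business-dlq")))
              .defaultBranch(Branched.withConsumer(valid -> valid.to("orders-validated")));
    }
}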
24/7 on-call capabilities aren’t optional when operating at this scale. Topology migrations, even with remapping to reduce risk, still represent significant changes to production systems. The OSO engineers scheduled migrations during business hours when full engineering teams were available, but maintained on-call coverage to respond if something went wrong. The on-call rotation had access to runbooks detailing rollback procedures, escalation paths, and contact information for downstream team leads who might need to be notified if consumers experienced problems.
Tracing Challenges with Processor API
Moving from Kafka Streams DSL to the Processor API introduces specific observability challenges that the OSO engineers encountered during their migration. Understanding these challenges helps you plan for them rather than discovering them in production.
OpenTelemetry integration difficulties stem from how the Processor API manages state. When you use the DSL, Kafka Streams handles a lot of complexity behind the scenes, including maintaining trace context as messages flow through the topology. When you drop to the Processor API and explicitly manage state stores, that automatic trace propagation breaks down.
The specific problem occurs when records are stored in state stores. A record arrives with trace context in its headers, gets processed by your Processor implementation, and is then stored in a state store to await additional events (like the other legs of a three-event join). When the record is eventually retrieved from the state store and processed further, the original trace context might be lost because state store serialisation doesn’t automatically preserve headers.
Workarounds exist but require explicit code. The OSO engineers embedded correlation IDs in message headers and ensured those headers were preserved through state store serialisation. Custom extractors and injectors for trace context can be implemented, though this requires understanding OpenTelemetry internals. Some teams abandon distributed tracing entirely for Processor API implementations and rely instead on correlation ID logging and log aggregation to reconstruct message flow after the fact.
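One hedged sketch of that workaround, using the newer Processor API and a hypothetical x-correlation-id header, copies the identifier into the stored value so it survives the round trip through the state store:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.header.Header;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Persists the correlation id alongside the payload so it can be re-attached
// (and logged) when the buffered record is finally emitted after the other
// events of the merge arrive.
public class CorrelationPreservingProcessor implements Processor<String, String, String, String> {

    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.store = context.getStateStore("stock-execution-store");
    }

    @Override
    public void process(Record<String, String> record) {
        Header header = record.headers().lastHeader("x-correlation-id");
        String correlationId = header == null
                ? "unknown"
                : new String(header.value(), StandardCharsets.UTF_8);

        // Buffer the event together with its correlation id until the remaining
        // events needed for the merge arrive.
        store.put(record.key(), correlationId + "|" + record.value());
    }
}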
This is an active area of development in the Kafka Streams community. Future versions may provide better tracing support for Processor API implementations. In the meantime, plan for reduced observability compared to DSL-based topologies and compensate with comprehensive logging and correlation ID strategies.
Conclusion: Making Production Changes Predictable and Safe
The remapping strategy transforms one of Kafka Streams’ most challenging scenarios—modifying production topologies—into a predictable, low-risk operation. By decoupling internal processing logic from external topic contracts, teams gain the flexibility to evolve their architectures without the constant fear of breaking downstream dependencies.
The key insight is that most “big bang” deployment risks stem from attempting to change too many things simultaneously. Code, state management, processing logic, and consumer contracts all change at once, creating massive blast radius for failure. Remapping allows you to decompose a risky migration into discrete, reversible steps: first build the new topology safely in isolation, validate it thoroughly whilst processing real production data, then cut over to it only after confirmation that everything works correctly, and finally clean up the old resources.
For teams operating event-driven architectures at scale, implementing remapping capabilities should be considered essential infrastructure—not just for planned migrations, but as a critical tool for emergency production recovery. As the OSO engineers discovered, the technique that enables planned topology changes also becomes invaluable when selective data reprocessing is the only thing standing between you and a major incident at 2 AM.
The investment required to implement remapping is modest compared to the value it provides. A configuration management system, monitoring infrastructure, and deployment orchestration capabilities are all standard requirements for operating production Kafka Streams services anyway. Remapping simply adds a thin abstraction layer that makes those existing capabilities dramatically more powerful for managing change.
Whether you implement remapping from scratch or adopt existing open-source solutions, the payoff comes the first time you need to fundamentally change a production stream topology. The alternative—extended downtime, complex coordination across teams, or accepting the limitations of your initial design choices—simply doesn’t scale in modern event-driven systems where business requirements evolve rapidly and architectural flexibility is a competitive advantage.
The stock execution generator migration illustrates the pattern: from a two-event join using DSL to a three-event merge with Processor API and custom state stores, all whilst maintaining zero downtime and protecting downstream consumers. This same pattern applies whether you’re changing state store types, introducing new business logic, optimising performance, or recovering from production incidents.
Remapping isn’t magic—it doesn’t eliminate the complexity of distributed stream processing or make topology changes trivial. What it does is make that complexity manageable by providing isolation boundaries, validation opportunities, and rollback capabilities that dramatically reduce deployment risk. In an industry where production incidents from botched deployments can cost millions in lost revenue and customer trust, that risk reduction is invaluable.
As event-driven architectures continue to grow in scale and complexity, the techniques that enable safe change will become increasingly critical. Remapping represents one such technique, proven in production at massive scale, and available to any team willing to invest in the foundational capabilities that make it possible. The question isn’t whether you’ll need to change production topologies—that’s inevitable as business requirements evolve. The question is whether you’ll have the tools and techniques in place to do it safely when the time comes.