Building Multi-Region Orchestration with Apache Kafka: A Pull-Based Architecture

Modern enterprise applications increasingly require orchestration of long-running workflows that span multiple services across different geographical regions. Traditional HTTP-based orchestration approaches, whilst familiar, introduce significant coupling between applications and services, create back pressure challenges, and complicate service discovery in distributed environments.

The Challenge: How do you orchestrate workflows involving 100+ steps across stateful services deployed in multiple regions, whilst maintaining loose coupling, resilience, and the flexibility to refactor services without breaking dependent applications?

The Solution: The OSO engineers developed a Kafka-based orchestration platform that fundamentally reimagines how distributed workflows are coordinated. By replacing push-based HTTP communication with pull-based Kafka messaging and introducing novel abstractions like dynamic shard registries and recipe definitions, they achieved not only architectural decoupling but also 2x performance improvements—without changing the underlying data movement mechanisms.

This article explores the architecture, design decisions, and technical implementation of this Kafka-based orchestrator, providing insights for engineering teams grappling with similar challenges in multi-region, stateful service orchestration.

The Limitations of Traditional HTTP-Based Orchestration

Why HTTP Orchestration Falls Short at Scale

Tight Coupling Between Applications and Services: In traditional HTTP-based systems, user-facing applications needed explicit knowledge of backend service endpoints, load balancers, and network locations. Any service refactoring or re-architecture required changes to all calling applications, creating a brittle dependency graph that inhibited evolution. When a team wanted to break down a monolith into microservices or shift a service to a new regional deployment, every dependent application needed updating. This tight coupling turned routine architectural improvements into coordination nightmares spanning multiple teams and release cycles.

The Kitchen Sink Anti-Pattern: Workflow definitions contained both the logical structure—what steps to execute—and the implementation details—which services to call, where they’re located, and how to handle failures. This violated separation of concerns and made workflows difficult to maintain and evolve independently. A simple change to add a validation step or reorder operations required diving into code that also managed HTTP timeouts, retry logic, and circuit breaker patterns. The workflow logic and infrastructure concerns were tangled together in an unmaintainable mess.

Back Pressure and Rate Limiting Complexity: In push-based architectures, the caller must handle scenarios where services cannot accept load. Applications needed to implement sidecar logic and retry mechanisms, and to handle various HTTP error codes. When Service A pushed work to Service B via HTTP POST, and Service B was under heavy load, Service A had to manage the 429 Too Many Requests responses, implement exponential backoff, maintain a local queue of failed requests, and potentially shed load to prevent cascading failures. This complexity multiplied across every service-to-service interaction in the system.

Service Discovery and Multi-Region Challenges

Static Network Location Registration: Traditional orchestrators required explicit registration of service network locations. When services deployed to new regions or underwent re-architecture, orchestrators needed manual reconfiguration, creating operational overhead and deployment dependencies. The service registry became a bottleneck. Want to add capacity in a new region? First, update the orchestrator configuration. Want to test a canary deployment? Update the configuration again. Every infrastructure change required touching the orchestration layer.

Regional Data Locality Requirements: For stateful services where customer data must remain in specific regions for compliance—GDPR, data residency laws, financial regulations—HTTP-based orchestration struggled to route requests appropriately. Applications needed region-aware logic, further increasing coupling. If a customer’s data resided in the European Union, the workflow had to ensure that processing steps touching that data executed only on services deployed in EU regions. Encoding this logic into every application created duplicated complexity and increased the risk of compliance violations.

Redeployment Brittleness: Service redeployments or regional failures caused workflow interruptions. The lack of dynamic service health awareness meant orchestrators couldn’t adapt to changing infrastructure conditions in real-time. When operations teams needed to deploy a critical security patch during business hours, they faced a choice: interrupt running workflows or delay the deployment. The orchestrator had no mechanism to pause workflows, allow services to drain, redeploy, and resume processing seamlessly.

Kafka as an Orchestration Platform: Architectural Foundations

The Paradigm Shift: From Push to Pull

Inverting the Control Flow: Rather than applications pushing commands to services via HTTP, services pull work from Kafka topics. This inversion gives services control over their consumption rate, naturally handling back pressure and enabling each service to process at its optimal throughput without overwhelming downstream systems. A service experiencing high load simply slows its consumption rate from the Kafka topic. The orchestrator doesn’t need to implement retry logic or respect rate limits—the service consumes work as fast as it can handle, and unprocessed work waits safely in Kafka’s durable log.
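
To make the inversion concrete, here is a minimal Kotlin sketch of a service pulling step requests with the plain Kafka consumer client rather than the OSO SDK; the topic name, consumer group and handler are illustrative. The service polls only when it has capacity and commits offsets only after the work succeeds, so unprocessed requests simply wait in the log.

```kotlin
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import java.time.Duration
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(ConsumerConfig.GROUP_ID_CONFIG, "payment-service")              // one consumer group per service
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")              // commit only after work is done
        put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10")                   // cap the batch the service pulls
    }

    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("step-requests.payment-service"))         // illustrative request topic
        while (true) {
            // The service decides when and how much to poll; unpolled work waits safely in the log.
            val records = consumer.poll(Duration.ofMillis(500))
            for (record in records) {
                handleStepRequest(record.value())                           // business logic only
            }
            consumer.commitSync()                                           // offsets advance only on success
        }
    }
}

fun handleStepRequest(payload: String) {
    println("processing step request: $payload")
}
```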

Decoupling Applications from Service Implementation: Applications express workflow logic in high-level abstractions called recipes without knowing which services will execute steps or where those services are deployed. The orchestrator resolves execution at runtime based on current service availability and capability declarations. When an application defines a workflow with a “validate_customer_data” step, it doesn’t specify that Service X in Region Y should handle it. Instead, any service that’s registered the capability to execute “validate_customer_data” can pick up the work. This decoupling means services can be refactored, redeployed, or moved to new regions without impacting application code.

Topic-Based Communication Patterns: Inter-service communication occurs entirely through Kafka topics. Request topics carry commands from the control plane to services; response topics carry execution results back. This eliminates direct service-to-service HTTP calls and centralises message routing through Kafka’s robust delivery guarantees. The orchestrator publishes a step execution request to a topic. Services subscribed to that topic pull the request, process it, and publish the result to a response topic. The orchestrator consumes the response and determines the next step. No service ever needs to know the network location of another service.

Core Components of the Architecture

Recipe Abstraction: A declarative workflow definition language that describes the logical steps, conditional branches, and loop structures without specifying implementation details. Recipes are higher-level than traditional workflows, with first-class support for parallel processing, item enumeration in loops, and worker pool management. In the OSO implementation, recipes are expressed as YAML or Kotlin DSL. A recipe might define: run pre-checks, if successful then conditionally migrate users, set up the destination environment, run ETL jobs in parallel for each data partition, upload attachments, and finally validate the migration. Nowhere in this definition does it specify which service handles each step or where those services run.
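
The recipe format itself isn’t reproduced in the article, so the following Kotlin sketch is a hypothetical model of what such a declarative definition might look like; the type and field names are assumptions. The point to notice is that nothing in it names a service, an endpoint or a region, only logical step identifiers.

```kotlin
// Hypothetical recipe model: logical step identifiers only, no service endpoints or regions.
sealed interface RecipeStep
data class Step(val logicId: String) : RecipeStep
data class Conditional(val onSuccessOf: String, val then: List<RecipeStep>) : RecipeStep
data class ForEach(val itemsFrom: String, val parallelism: Int, val body: List<RecipeStep>) : RecipeStep

data class Recipe(val name: String, val steps: List<RecipeStep>)

val migrationRecipe = Recipe(
    name = "customer-migration",
    steps = listOf(
        Step("run_pre_checks"),
        Conditional(
            onSuccessOf = "run_pre_checks",
            then = listOf(Step("migrate_users"))
        ),
        Step("setup_destination_environment"),
        // Loop over data partitions, processed in parallel by whichever shards declare the capability.
        ForEach(itemsFrom = "data_partitions", parallelism = 8, body = listOf(Step("run_etl_job"))),
        Step("upload_attachments"),
        Step("validate_migration"),
    )
)
```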

Dynamic Shard Registry: A real-time registry tracking which service shards are available in which regions, what capabilities—logic identifiers—they can execute, and their current health status. Unlike static service discovery, this registry provides a dynamic view that updates as services scale, deploy to new regions, or experience degradation. When a service starts, it connects to the orchestrator and declares: “I am Service Alpha, running in Region EU-West, and I can execute these five logic identifiers.” The service sends periodic heartbeats indicating not just that it’s alive, but that it’s healthy and ready to process work. If the service becomes unhealthy—perhaps a dependent database is slow—it can signal degraded health, and the orchestrator routes work elsewhere.
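
A rough Kotlin sketch of what the registration and heartbeat messages could look like follows; the field names and health states are assumptions rather than the SDK’s published contract.

```kotlin
import java.time.Instant

// Hypothetical registry messages; field names and statuses are illustrative, not the actual protocol.
enum class HealthStatus { HEALTHY, DEGRADED, DRAINING }

data class ShardRegistration(
    val serviceName: String,          // e.g. "service-alpha"
    val region: String,               // e.g. "eu-west"
    val logicIds: Set<String>,        // capabilities this shard can execute
)

data class ShardHeartbeat(
    val serviceName: String,
    val region: String,
    val status: HealthStatus,         // DEGRADED tells the control plane to route work elsewhere
    val sentAt: Instant,
)

val registration = ShardRegistration(
    serviceName = "service-alpha",
    region = "eu-west",
    logicIds = setOf("validate_customer_data", "migrate_users", "run_etl_job"),
)
```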

Control Plane Service: The central orchestration brain deployed in each region. Control planes communicate via a message relay service to coordinate cross-region workflows whilst maintaining data locality. Each control plane maintains Kafka Streams topologies for managing domain entities: recipe executions, step executions, and buckets. When a user initiates a workflow in Region A, the local control plane creates a recipe execution entity. As the workflow progresses and requires a step to execute in Region B—perhaps to access stateful data that must remain in that region—the control plane in Region A coordinates with the control plane in Region B via the message relay service to execute that step.

Why Kafka’s Guarantees Matter for Orchestration

Durable Message Storage: Kafka’s log-based storage ensures workflow commands and state transitions persist even during service outages. Services can be offline during redeployment and resume processing from their last committed offset without losing work. In traditional HTTP orchestration, if a service was down when a request arrived, the request either failed immediately or the caller had to implement complex retry logic with persistent queues. With Kafka, the request simply waits in the topic. When the service comes back online after a deployment, it resumes consuming from where it left off.

Exactly-Once Semantics: For critical workflow steps, Kafka’s exactly-once processing guarantees prevent duplicate execution of non-idempotent operations, essential for financial transactions, data migrations, or state-changing operations. If a service processes a “charge customer £1,000” command and crashes before committing its offset, the exactly-once guarantee ensures the command isn’t reprocessed when the service restarts. The customer is charged once, not twice.
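
One place this guarantee is switched on is the control plane’s Kafka Streams configuration; the snippet below is a minimal illustration, with the application ID and bootstrap address purely illustrative. Plain consumer and producer services achieve the same effect with transactional producers and read-committed consumers.

```kotlin
import org.apache.kafka.streams.StreamsConfig
import java.util.Properties

// Illustrative Streams configuration enabling exactly-once processing.
val streamsProps = Properties().apply {
    put(StreamsConfig.APPLICATION_ID_CONFIG, "control-plane")        // name is illustrative
    put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    // Consuming, state updates, and producing are wrapped in one transaction, so a crash
    // between processing a command and committing its offset cannot double-apply the result.
    put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2)
}
```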

Ordering Guarantees Within Partitions: By partitioning topics by workflow execution ID, all steps for a given workflow maintain order, ensuring state transitions occur correctly even in distributed processing environments. A workflow with steps A, B, and C always processes in that order for a specific execution. Even though thousands of workflow instances might be executing concurrently across different partitions, each individual workflow progresses through its steps in the correct sequence.
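
A short sketch of the producing side makes this concrete: keying every step command by the recipe execution ID pins the whole workflow to a single partition, so the broker preserves step order. The topic name and execution ID below are illustrative.

```kotlin
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
        put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
        put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")   // retries cannot introduce duplicates
        put(ProducerConfig.ACKS_CONFIG, "all")                  // wait for in-sync replicas before acknowledging
    }

    val executionId = "recipe-exec-42"                          // illustrative workflow execution ID
    KafkaProducer<String, String>(props).use { producer ->
        // The same key lands every command for this execution on the same partition,
        // so steps A, B and C are delivered to the consuming service in order.
        listOf("step-A", "step-B", "step-C").forEach { step ->
            producer.send(ProducerRecord("step-requests", executionId, step))
        }
        producer.flush()
    }
}
```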

SDK-Based Integration: Type Safety and Developer Experience

The Case for SDKs Over HTTP APIs

Type Safety and Compile-Time Validation: By providing language-specific SDKs—initially Java and Kotlin—services gain compile-time checks for request and response contracts. Workflow step definitions become strongly typed interfaces rather than stringly-typed HTTP endpoints, catching integration errors before deployment. When a developer implements a step executor, they implement a typed interface. The request object has defined fields with specific types. If the workflow definition changes and adds a required field, services that haven’t updated their implementation fail to compile rather than failing at runtime in production.

Simplified Integration Path: Services add the SDK to their build, implement executor interfaces—ShortCommandExecutor for quick operations, LongCommandExecutor for multi-hour processes—and register their capabilities. The SDK handles all Kafka consumer and producer logic, heartbeats, and orchestrator communication protocols. From a service developer’s perspective, integrating with the orchestrator is similar to handling HTTP requests. They implement a method that takes a request object and returns a response object. The difference is they don’t write any HTTP handling code, configure endpoints, or manage load balancers. The SDK manages all the distributed systems complexity.
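
The SDK’s actual interfaces aren’t published, but a hypothetical Kotlin sketch along these lines shows the shape of the integration; the executor name comes from the article, whilst its signature and the request and response types here are assumptions.

```kotlin
// Hypothetical approximation of the SDK contract described in the article.
data class ValidateCustomerDataRequest(val customerId: String, val region: String)
data class ValidateCustomerDataResponse(val valid: Boolean, val errors: List<String> = emptyList())

// A "short command" executor: one typed request in, one typed response out.
interface ShortCommandExecutor<REQ, RES> {
    val logicId: String
    fun execute(request: REQ): RES
}

class ValidateCustomerData : ShortCommandExecutor<ValidateCustomerDataRequest, ValidateCustomerDataResponse> {
    override val logicId = "validate_customer_data"

    override fun execute(request: ValidateCustomerDataRequest): ValidateCustomerDataResponse {
        // Business logic only: no Kafka consumers, HTTP handlers, or retry code here.
        val valid = request.customerId.isNotBlank()
        return ValidateCustomerDataResponse(valid = valid)
    }
}
```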

Abstraction of Distribution Complexity: The SDK manages complexities like consumer group coordination, offset management, and handling scenarios where different service instances process request and response in long-running operations. Service developers focus solely on business logic. Consider a long-running operation that takes several hours. The service instance that receives the initial request might not be the same instance that completes the work. The SDK handles the coordination, ensuring the response makes it back to the orchestrator regardless of which instance processes which phase of the operation.

The Trade-Offs and Considerations

Language Ecosystem Lock-In: Each programming language requires a separate SDK implementation. Whilst Java and Kotlin services gain immediate benefits, expanding to Go, TypeScript, or Python services requires additional SDK development effort and ongoing maintenance across language ecosystems. The OSO engineers chose to start with JVM languages because they represented the majority of services in the target architecture. Adding support for other languages means replicating the SDK’s functionality—consumer management, heartbeating, registration protocols—in each language, which requires significant engineering investment.

Tighter Coupling to Platform: Services become coupled to the SDK’s lifecycle and version upgrades. However, this coupling is at the framework level rather than business logic level, and the benefits of type safety and simplified integration often outweigh the dependency management overhead. When the orchestrator team wants to add a new feature—perhaps distributed tracing or enhanced telemetry—they release a new SDK version. Services need to upgrade their dependency and potentially update their code. This is similar to upgrading any framework dependency, but it does create a coordination requirement across teams.

Internal Service Communication Patterns: The SDK approach works exceptionally well for internal service meshes where the organisation controls all services. For external API exposure or third-party integrations, traditional HTTP APIs remain necessary, requiring dual communication patterns. Services end up supporting two modes of operation: pulling work from the orchestrator via the SDK for internal workflows, and exposing HTTP endpoints for external clients. This isn’t necessarily problematic—many services already expose both synchronous APIs and asynchronous message handlers—but it does add to the complexity of service architecture.

Kafka Streams for State Management: Event Sourcing at Scale

Event Sourcing with Kafka Streams Topologies

Domain Entity Lifecycle Management: The control plane manages entities like RecipeExecution, StepExecution, and Bucket using Kafka Streams. Change events flow through stream topologies that fetch current entity state from state stores, apply the event transformation, persist updated state, and emit state change events for downstream processing. This is classic event sourcing. The true source of state is the sequence of events. The RecipeExecution entity doesn’t exist in a database row; it exists as the accumulation of all RecipeExecutionChangeEvents applied in sequence.

When a workflow starts, a RecipeCreated event enters the topology. The topology creates a new RecipeExecution entity with status “Running”. As steps complete, StepCompleted events flow through, updating the RecipeExecution’s progress. When all steps finish, a RecipeCompleted event transitions the entity to “Success” status. At any point, the current state of a workflow can be reconstructed by replaying its event stream.

Stateful Processing with State Stores: Kafka Streams state stores, backed by RocksDB, provide low-latency key-value storage for entity state. The aggregate operator in stream topologies handles the fetch-modify-persist cycle, ensuring consistent state evolution even as events arrive concurrently across distributed control plane instances. The elegance of this approach is that Kafka Streams manages all the complexity. The developer defines how to apply an event to an entity—a pure function—and Kafka Streams handles persistence, replication via changelog topics, and consistency across the distributed system.

Topology Design Patterns: A typical topology consumes change events, groups by key—such as recipe execution ID—aggregates to apply state transitions, streams updates as domain events, and maps to output topics. This pattern separates event ingestion, state management, and output generation into discrete, composable processing stages. The OSO engineers structured their topologies to match domain boundaries. One topology manages recipe executions, another manages step executions, a third handles the bucket abstraction for loop processing. Each topology is independently scalable and deployable.
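
A simplified topology in that shape might look like the following Kotlin sketch; the topic names, string-typed events and toy state machine are stand-ins for the real structured entities.

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Aggregator
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Grouped
import org.apache.kafka.streams.kstream.Initializer
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.state.KeyValueStore

fun buildRecipeExecutionTopology(builder: StreamsBuilder) {
    builder
        // Ingest change events, keyed by recipe execution ID.
        .stream("recipe-execution-change-events", Consumed.with(Serdes.String(), Serdes.String()))
        // Group by execution ID so every event for one workflow lands on the same state store entry.
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        // Fetch-modify-persist: apply each event to the current entity state.
        .aggregate(
            Initializer { "CREATED" },
            Aggregator { _: String, event: String, state: String -> applyEvent(state, event) },
            Materialized.with<String, String, KeyValueStore<Bytes, ByteArray>>(Serdes.String(), Serdes.String())
        )
        // Emit every state transition as a domain event for downstream processing.
        .toStream()
        .to("recipe-execution-state-changes", Produced.with(Serdes.String(), Serdes.String()))
}

// Simplified state machine; real entities would be structured records rather than strings.
fun applyEvent(state: String, event: String): String = when (event) {
    "RecipeCreated"   -> "RUNNING"
    "StepCompleted"   -> state          // progress tracking elided for brevity
    "RecipeCompleted" -> "SUCCESS"
    "RecipeFailed"    -> "FAILED"
    else              -> state
}
```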

Handling Concurrency and Parallelism

Bucket Entity for Loop Processing: When workflows contain loops processing multiple items, the Bucket entity serves as a staging area—conceptually a map of queues keyed by step name. This prevents fast-processing steps from blocking whilst slower downstream steps catch up, enabling true parallel processing of loop iterations. Imagine a data migration workflow that processes 10,000 customer records. The extraction step might process records faster than the validation step. Without buckets, the extraction step would have to wait for validation to catch up, serialising the entire process. With buckets, extracted records queue up in the bucket, and the validation step consumes them at its own pace, maximising throughput.

Separation of References and Data: To avoid exceeding Kafka’s message size limits in compacted topics—typically 1MB—bucket entities store only item references whilst actual item data resides in separate state stores. This architectural decision ensures state store fault tolerance without compromising on data volume handling. When a step produces a large output—perhaps a transformed dataset—it doesn’t write the entire dataset into the bucket entity. Instead, it writes the data to a separate state store and puts a reference in the bucket. Downstream steps read the reference from the bucket and fetch the actual data from the state store.
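
A hypothetical Kotlin model ties both ideas together: a map of queues keyed by step name that holds only small references, whilst the payloads live in a separate store. The class and field names are assumptions.

```kotlin
import java.util.ArrayDeque

// Hypothetical bucket model: the compacted-topic entity holds only references,
// while the actual payloads live in a separate state store keyed by reference.
data class ItemRef(val stateStoreKey: String)   // pointer to the real payload, kept small

class Bucket {
    // One queue per downstream step, so a fast upstream step never blocks on a slow consumer step.
    private val queues = mutableMapOf<String, ArrayDeque<ItemRef>>()

    fun enqueue(stepName: String, ref: ItemRef) {
        queues.getOrPut(stepName) { ArrayDeque() }.addLast(ref)
    }

    fun dequeue(stepName: String): ItemRef? = queues[stepName]?.pollFirst()

    fun depth(stepName: String): Int = queues[stepName]?.size ?: 0
}

fun main() {
    val bucket = Bucket()
    // The extraction step races ahead, queueing references for validation to consume at its own pace.
    (1..3).forEach { bucket.enqueue("validate_record", ItemRef("extracted/record-$it")) }
    println("pending for validate_record: ${bucket.depth("validate_record")}")
    println("next reference: ${bucket.dequeue("validate_record")}")
}
```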

Persistent State Stores for Rebalancing: During consumer rebalancing—common in distributed deployments—rebuilding large in-memory state stores from changelog topics causes application pauses. By using persistent RocksDB-backed state stores, the control plane can rebuild from local disk up to a certain point, dramatically reducing rebalancing impact and maintaining workflow processing throughput. When a new control plane instance joins the cluster or an existing instance restarts, it doesn’t need to replay the entire changelog topic. It reads recent state from its local RocksDB store and only replays recent changelog entries, reducing startup time from minutes to seconds.
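
In Kafka Streams terms the choice is made where the aggregate is materialised: Stores.persistentKeyValueStore gives the RocksDB-backed behaviour described here, whereas Stores.inMemoryKeyValueStore forces a full changelog replay after every restart. A minimal sketch follows, with the store name and serdes illustrative.

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore
import org.apache.kafka.streams.state.Stores

// A persistent, RocksDB-backed store survives restarts on local disk, so only the tail
// of the changelog needs replaying after a rebalance instead of the whole history.
val recipeExecutionStore: Materialized<String, String, KeyValueStore<Bytes, ByteArray>> =
    Materialized.`as`<String, String>(Stores.persistentKeyValueStore("recipe-executions"))
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String())
```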

Message Relay Service: Cross-Cluster Communication

The Challenge of Multi-Region Kafka Communication

Why Not Mirror Maker: Whilst Apache Kafka’s Mirror Maker is excellent for replicating entire topics between clusters, it’s optimised for high-throughput data replication. For orchestration, where individual command messages—kilobytes, not gigabytes—must route dynamically between regions based on runtime decisions, Mirror Maker’s configuration model and cost structure proved inefficient. Mirror Maker is designed to replicate everything from a source cluster to a destination cluster. To support n regions with full mesh connectivity, you’d need a replication flow for every ordered pair of regions. For six regions, that’s thirty Mirror Maker configurations to maintain. The cost and operational overhead didn’t align with the orchestration use case.

Header-Based Message Routing: The message relay service uses Kafka headers—analogous to postal addresses on envelopes—to determine message destinations. Each message carries target cluster and target topic metadata in headers. The relay service inspects these headers and routes messages accordingly without needing pre-configured topic mirrors. When a control plane in Region A needs to send a step execution command to Region B, it publishes the message to its local topic with headers indicating the target cluster and target topic. The message relay service reads the message, inspects the headers, and routes it appropriately.
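
A rough Kotlin sketch of both sides of this contract follows; the header names and outbox topic are assumptions, since the relay’s actual message format isn’t published.

```kotlin
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.nio.charset.StandardCharsets

// Header names are illustrative; the real relay's header contract may differ.
const val TARGET_CLUSTER_HEADER = "x-target-cluster"
const val TARGET_TOPIC_HEADER = "x-target-topic"

// Control plane side: address the message like an envelope before handing it to the local cluster.
fun addressedRecord(localOutboxTopic: String, executionId: String, payload: String,
                    targetCluster: String, targetTopic: String): ProducerRecord<String, String> {
    val record = ProducerRecord(localOutboxTopic, executionId, payload)
    record.headers()
        .add(TARGET_CLUSTER_HEADER, targetCluster.toByteArray(StandardCharsets.UTF_8))
        .add(TARGET_TOPIC_HEADER, targetTopic.toByteArray(StandardCharsets.UTF_8))
    return record
}

// Relay side: inspect the envelope and re-produce to the destination, with no pre-configured mirrors.
fun relay(record: ConsumerRecord<String, String>,
          producersByCluster: Map<String, KafkaProducer<String, String>>) {
    val cluster = record.headers().lastHeader(TARGET_CLUSTER_HEADER)?.value()?.toString(StandardCharsets.UTF_8)
    val topic = record.headers().lastHeader(TARGET_TOPIC_HEADER)?.value()?.toString(StandardCharsets.UTF_8)
    if (cluster == null || topic == null) return                  // not addressed for cross-region delivery
    val producer = producersByCluster[cluster] ?: return          // unknown destination: skip or dead-letter
    producer.send(ProducerRecord(topic, record.key(), record.value()))
}
```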

Cut and Paste Semantics: The service “cuts” messages from source topics in one region or cluster and “pastes” them into target topics in destination regions or clusters. This selective, message-level routing provides fine-grained control compared to topic-level replication tools. Only the specific messages that need cross-region routing incur network transfer costs. The majority of workflow processing—steps executing on services in the same region as the control plane—never leave the region, reducing latency and data transfer costs.

Technical Implementation with Kafka Connect

Custom Source and Sink Connectors: The OSO engineers built homegrown Kafka Connect connectors. The source connector pulls messages from source clusters and sorts them by target cluster in an intermediate “relayer cluster” local to the message relay service. The sink connector reads from this intermediate cluster and produces to target clusters. This design was chosen after evaluating existing tools and determining that none provided the required flexibility for dynamic, header-based routing with exactly-once guarantees.

The source connector acts as a smart consumer. It reads messages from a source topic, inspects the headers on each message, and writes the message to a region-specific topic in the local relayer cluster. Messages destined for Region A go to one topic, messages for Region B go to another. This sorting operation is critical for efficiency—it means the sink connectors in each region only need to consume messages relevant to their region.

Producer-Consumer Asymmetry Optimisation: Kafka producer-broker communication is chattier—more round trips—than consumer-broker communication. By always producing locally and consuming remotely, the architecture minimises cross-region network chatter, reducing latency and improving throughput for command message delivery. When a message needs to move from Region A to Region B, the relay service in Region A consumes from the local cluster—a fast operation—and produces to the local relayer cluster—also fast. The relay service in Region B consumes from the relayer cluster in Region A—crossing the network—and produces to its local cluster—fast again. Only one hop crosses the network.

Exactly-Once Guarantees for Ordering: The service provides exactly-once processing guarantees critical for maintaining message ordering. When multiple messages destined for the same target arrive interleaved, the service ensures they’re delivered in the correct order to prevent workflow state corruption. This was one of the trickiest aspects of the implementation. The OSO engineers had to carefully configure the Kafka Connect framework’s transactional settings and tune the connector’s batching behaviour to achieve exactly-once semantics across cluster boundaries whilst maintaining acceptable throughput.

Practical Takeaways for Engineering Teams

When to Consider Kafka-Based Orchestration

Multi-Region Stateful Services: If your workflows involve stateful services with data locality requirements—GDPR, compliance—deployed across multiple regions, Kafka-based orchestration naturally handles regional routing without complex HTTP-based service mesh configurations. The dynamic shard registry and control plane architecture make it straightforward to declare which services in which regions can handle which data residency requirements. Traditional orchestrators require extensive custom coding to encode these constraints.

Long-Running Workflows with 100+ Steps: Traditional orchestrators struggle with workflows spanning hours or days across dozens of services. Kafka’s durable storage and consumer offset management make workflow resumption after service redeployments trivial. When a service is redeployed mid-workflow, work waits in Kafka topics. The service restarts and continues exactly where it left off. With HTTP-based orchestration, managing state across redeployments requires sophisticated checkpointing logic in every service.

Frequent Service Refactoring: In organisations where services undergo continuous re-architecture—monolith decomposition, domain-driven design evolution—the decoupling provided by recipe abstractions and dynamic shard registries prevents workflow definition churn. Application teams define workflows once using logical step identifiers. Service teams can refactor, reorganise, or reimplement backend services without coordinating changes to workflow definitions across multiple application teams.

Implementation Considerations

Start with Control Plane and Data Plane Separation: Design your architecture with clear boundaries between workflow orchestration logic—control plane—and business logic execution—data plane. This separation enables regulatory compliance and simplifies multi-region deployments. In heavily regulated industries, you might need to ensure that customer data never leaves certain regions. With clear control plane and data plane separation, you can deploy control planes globally whilst ensuring data plane processing—where customer data is actually touched—remains regionally confined.

Invest in Dynamic Service Discovery: Static service registries create operational overhead. Build mechanisms for services to self-register capabilities, regions served, and health status. Use Kafka topics for registry updates to ensure all control plane instances have a consistent view. The health reporting aspect is particularly important. It’s not sufficient for a service to simply declare “I’m alive”. Services need to report “I’m alive and healthy and ready to accept work at full capacity” versus “I’m alive but degraded—route work elsewhere if possible”.

Evaluate SDK vs HTTP Trade-Offs: If your organisation is primarily polyglot with many language ecosystems, HTTP APIs may provide better initial reach. If you’re predominantly Java and Kotlin—or willing to invest in multi-language SDKs—the type safety and developer experience benefits are substantial. The OSO engineers made this trade-off consciously. They operated in an environment where JVM languages dominated, and the investment in a high-quality SDK paid dividends in reduced integration bugs and faster service onboarding.

Operational Lessons

Persistent State Stores Are Critical: For Kafka Streams applications managing significant state, configure persistent RocksDB-backed state stores. The rebalancing performance improvement outweighs the local disk storage cost, especially for workflows that cannot tolerate pause-the-world state rebuilding. In testing, the OSO engineers observed that large in-memory state stores could take five to ten minutes to rebuild during rebalancing, causing workflow processing to stall. Persistent stores reduced this to under thirty seconds.

Message Size Management: When using state stores with compacted changelog topics, implement reference-based patterns for large data items. Store references in the entity and actual data in separate stores to avoid hitting Kafka’s message size limits. This pattern also improves performance. Kafka is optimised for high-throughput messaging, not for moving megabyte-sized blobs. By keeping messages small and storing large data in state stores optimised for that purpose, you get better performance characteristics.

Cost Optimisation for Cross-Region Communication: Generic replication tools like Mirror Maker may introduce unnecessary costs for command-and-control messaging patterns. Consider purpose-built message relay services optimised for your specific communication patterns and cost structure. The OSO engineers calculated that using Mirror Maker would have cost approximately six times more than their custom relay service due to the overhead of replicating entire topics when only selective message routing was needed.

Conclusion: Rethinking Orchestration for Distributed Systems

The shift from HTTP-based, push-oriented orchestration to Kafka-based, pull-oriented coordination represents more than a technical implementation change—it’s a fundamental rethinking of how distributed systems should communicate in multi-region, stateful environments.

By inverting control flow and allowing services to pull work at their own pace, the OSO engineers eliminated entire classes of problems: back pressure management, service discovery brittleness, and tight coupling between applications and service implementations. The introduction of recipe abstractions created a clean separation between workflow logic and execution details, enabling services to evolve independently without breaking workflows.

Perhaps most remarkably, these architectural benefits came with tangible performance improvements—2x throughput gains in real-world workloads—demonstrating that elegant architecture and operational efficiency need not be at odds. The key enabler was Kafka’s inherent strengths: durable storage, ordering guarantees, consumer group coordination, and the Kafka Streams API for stateful processing.

For engineering teams wrestling with similar challenges—long-running workflows, multi-region deployments, frequent service refactoring, or compliance-driven data locality—this architecture provides a proven blueprint. The principles of pull-based coordination, dynamic service discovery, and Kafka Streams-based state management apply broadly across industries and use cases.

The question isn’t whether Kafka can serve as an orchestration platform—the OSO engineers have demonstrated it can. The question is whether your organisation’s orchestration challenges warrant this level of architectural sophistication. If you’re managing dozens of services across multiple regions with complex, long-running workflows, the answer is likely yes.

As organisations continue to expand globally and regulatory requirements around data residency become more stringent, the need for orchestration platforms that natively understand multi-region constraints will only grow. Kafka-based orchestration isn’t just a solution for today’s challenges—it’s an investment in architectural flexibility for tomorrow’s requirements.
