How to Build High-Performance Kafka Streams Applications That Scale Beyond Partition Limits
Every Kafka Streams engineer eventually hits the same wall: your application is processing one record at a time, one task at a time, and you’ve reached the maximum parallelism limit determined by your input topic partitions. The OSO engineers recently faced this exact challenge with a mission-critical deduplication application processing 50+ terabytes of data. Traditional scaling approaches—adding more nodes or increasing partitions—weren’t viable options. This is the story of how they built an asynchronous processing framework that delivered a 6x performance improvement whilst maintaining all processing guarantees.
The fundamental issue lies in Kafka Streams’ sequential processing model. Whilst this design provides excellent consistency guarantees and simplicity, it creates bottlenecks when dealing with high-latency operations or when you’ve exhausted your partition-based parallelism. The challenge becomes particularly acute for applications making remote state store calls, external API requests, or handling workloads with significant per-record processing overhead. When the OSO engineers encountered an application that could only process 250 records per second per thread despite running 96 threads, they knew they needed a fundamentally different approach.
Understanding the Sequential Processing Bottleneck
Kafka Streams’ processing model is elegantly simple: each task processes one record at a time, from start to finish, before moving on to the next. This sequential execution guarantees ordering and simplifies debugging, but it creates significant limitations when scaling high-performance applications.
The core constraint stems from the direct relationship between input topic partitions and processing parallelism. Each partition maps to exactly one task, and each task can only be processed by a single stream thread. This means your maximum parallelism is forever capped by the number of partitions in your input topics. Once you’ve assigned one task per thread, adding more compute resources becomes pointless—additional threads will simply sit idle.
The problem becomes particularly severe with high-latency operations. When a processor makes a remote state store call that takes 50 milliseconds, that entire thread is blocked for the duration. No other records can be processed by that thread until the remote operation completes and the record finishes its journey through the topology. The arithmetic is unforgiving: a thread that spends 50 milliseconds blocked per record can complete at most 20 records per second. The cumulative effect is devastating: if you have multiple high-latency operations per record, your throughput plummets regardless of available CPU or memory resources.
Consider a typical deduplication scenario. The OSO engineers encountered an application that performed a simple putIfAbsent operation—checking if a record exists in the state store, and if not, adding it and forwarding downstream. With remote state stores, this became a read followed by a write operation, each potentially taking tens of milliseconds. With very low duplicate rates, nearly every record resulted in a cache miss, forcing expensive remote lookups. The sequential processing model meant each thread could only handle one of these expensive operations at a time, severely limiting overall application throughput.
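The shape of that processor, written against the Kafka Streams Processor API, looks something like the sketch below. The store name and types are illustrative rather than OSO's actual code, but the blocking pattern is the important part: every record pays for up to two remote round trips before the thread can move on.

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch of the sequential deduplication pattern described above.
// The store name and types are illustrative, not OSO's production code.
public class DedupProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("dedup-store");
    }

    @Override
    public void process(final Record<String, String> record) {
        // With a remote state store, this read alone can take tens of
        // milliseconds on a cache miss, and the stream thread blocks here.
        if (store.get(record.key()) == null) {
            // Second remote round trip: the write.
            store.put(record.key(), record.value());
            // Only previously unseen keys are forwarded downstream.
            context.forward(record);
        }
        // The thread cannot start the next record until all of this completes.
    }
}
```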
Designing an Asynchronous Processing Framework
When faced with an immovable deadline and a production system approaching capacity limits, the OSO engineers had to think beyond traditional scaling approaches. They couldn’t increase partitions without affecting the entire data pipeline, and they couldn’t add more nodes since they’d already reached one task per thread. The solution required breaking the fundamental assumption of sequential processing whilst preserving all correctness guarantees.
The breakthrough insight was recognising that true sequential ordering is only required within each key, not across an entire partition. Records with different keys can be processed in parallel without violating any ordering constraints. This realisation opened the door to processing multiple records simultaneously within a single task, effectively scaling parallelism to the number of unique keys rather than the number of partitions.
The architecture the OSO engineers developed centres on augmenting existing stream threads with dedicated worker thread pools. Instead of executing processor logic directly, stream threads now schedule work to these async thread pools and immediately return to processing the next record. This allows multiple records to be in flight simultaneously, dramatically improving resource utilisation when individual operations have high latency.
The framework operates through a concept called async events. When a record enters an async processor, the system creates an async event that tracks all information needed to complete processing. The stream thread hands this event to the async thread pool and continues processing other records. The async thread pool executes the processor logic in parallel, potentially handling dozens of records simultaneously. Once processing completes, the results are coordinated back to the stream thread for state store updates and downstream forwarding.
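A heavily simplified sketch of that division of labour is shown below. The AsyncEvent type and method names are assumptions for illustration, not the framework's actual internals, but the structure is the one described above: the stream thread schedules and drains, while the worker pool runs the processor logic.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Heavily simplified sketch of the async-event flow; names are illustrative.
final class AsyncScheduler<K, V> {

    static final class AsyncEvent<K, V> {
        final K key;
        final V value;
        AsyncEvent(final K key, final V value) { this.key = key; this.value = value; }
    }

    private final ExecutorService workers = Executors.newFixedThreadPool(8);
    // Finished events wait here until the stream thread collects them.
    private final Queue<AsyncEvent<K, V>> completed = new ConcurrentLinkedQueue<>();

    // Called on the stream thread: hand the event off and return immediately,
    // freeing the thread to pick up the next record.
    void schedule(final AsyncEvent<K, V> event, final Consumer<AsyncEvent<K, V>> userLogic) {
        workers.submit(() -> {
            userLogic.accept(event);   // processor logic runs in parallel here
            completed.add(event);      // hand back for finalisation
        });
    }

    // Called on the stream thread between records: finalise every event whose
    // processing has finished (state store updates, downstream forwards).
    void drainCompleted(final Consumer<AsyncEvent<K, V>> finalizer) {
        AsyncEvent<K, V> event;
        while ((event = completed.poll()) != null) {
            finalizer.accept(event);
        }
    }
}
```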
This design required careful consideration of thread safety and coordination. The framework intercepts state store operations and record forwarding, ensuring these operations execute in the stream thread to maintain single-threaded access to Kafka Streams internals. The async processor itself can execute arbitrary logic in parallel, but all interactions with the broader Kafka Streams framework remain properly synchronised.
Maintaining Processing Guarantees in Parallel Execution
Moving from sequential to parallel processing introduces significant complexity around maintaining correctness guarantees. The OSO engineers had to ensure their framework preserved exactly-once semantics, proper ordering, and all existing processing guarantees whilst enabling dramatic performance improvements.
The key insight for maintaining ordering was implementing same-key ordering rather than strict sequential ordering. The framework includes a scheduling queue that ensures records with identical keys are processed in offset order, whilst records with different keys can execute in parallel. This preserves the semantic correctness applications depend on whilst unlocking parallelism across the key space.
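The sketch below illustrates one way such a gate could work; the names and structure are assumptions, not the framework's actual implementation. Records for a given key queue up behind that key's in-flight record, while unrelated keys pass straight through to the worker pool.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative same-key ordering gate: records for one key are released one
// at a time in offset order; different keys can be in flight concurrently.
// Called only from the stream thread, so no synchronisation is needed here.
final class KeyOrderGate<K, R> {

    // Pending records per key; the head of each queue is the in-flight record.
    private final Map<K, Queue<R>> perKey = new HashMap<>();

    // Returns true if the record may be scheduled now, or false if an
    // earlier record with the same key is still in flight.
    boolean tryAcquire(final K key, final R record) {
        final Queue<R> queue = perKey.computeIfAbsent(key, k -> new ArrayDeque<>());
        queue.add(record);
        return queue.size() == 1; // no earlier same-key record outstanding
    }

    // Called when a record finishes. Returns the next same-key record to
    // schedule, or null if none is waiting.
    R release(final K key) {
        final Queue<R> queue = perKey.get(key);
        queue.poll();                 // remove the completed record
        if (queue.isEmpty()) {
            perKey.remove(key);
            return null;
        }
        return queue.peek();          // next record for this key, in offset order
    }
}
```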
State store synchronisation required particular attention. Since Kafka Streams internals aren’t thread-safe, the framework intercepts all state store operations performed by async processors. Rather than executing writes immediately, these operations are captured and scheduled for execution in the stream thread. This ensures all state mutations occur in a single thread whilst allowing the actual processing logic to run in parallel.
The framework handles exactly-once semantics through careful coordination at commit boundaries. Before any offset commit occurs, the system blocks until all in-flight async operations complete. This ensures that when Kafka Streams commits offsets indicating successful processing, all corresponding records have actually finished processing. If any async operations fail, the framework can properly handle rollback scenarios within existing transaction boundaries.
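In its simplest form, that coordination amounts to counting in-flight records and blocking the commit path until the count drains to zero. The sketch below, with hypothetical names, shows the idea:

```java
// Illustrative commit-boundary coordination: committed offsets must never
// get ahead of records that are still being processed asynchronously.
final class InFlightTracker {

    private int inFlight = 0;

    // Called on the stream thread when a record is handed to the async pool.
    synchronized void recordScheduled() {
        inFlight++;
    }

    // Called when an async record completes (processing plus finalisation).
    synchronized void recordCompleted() {
        inFlight--;
        if (inFlight == 0) {
            notifyAll();
        }
    }

    // Called from the commit path: blocks until every scheduled record has
    // finished, so the offsets about to be committed reflect real progress.
    synchronized void awaitQuiescence() throws InterruptedException {
        while (inFlight > 0) {
            wait();
        }
    }
}
```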
Record forwarding to downstream processors follows a similar pattern. Async processors don’t directly forward records—instead, they register forwarding intentions with their async event. The framework executes these forwards in the stream thread once processing completes, maintaining proper coordination with the rest of the topology.
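Putting the two interception points together, a minimal sketch might look like the following, again with illustrative names: the async thread records its intended puts and forwards, and the stream thread replays them once processing finishes.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Illustrative sketch: the async thread records intended effects; the stream
// thread replays them, keeping all Kafka Streams interactions single-threaded.
final class CapturedEffects<K, V> {

    private final List<KeyValue<K, V>> pendingPuts = new ArrayList<>();
    private final List<Record<K, V>> pendingForwards = new ArrayList<>();

    // Called from the async thread in place of store.put(key, value).
    synchronized void capturePut(final K key, final V value) {
        pendingPuts.add(KeyValue.pair(key, value));
    }

    // Called from the async thread in place of context.forward(record).
    synchronized void captureForward(final Record<K, V> record) {
        pendingForwards.add(record);
    }

    // Called from the stream thread once async processing completes: every
    // state mutation and forward executes where Kafka Streams expects it.
    synchronized void replay(final KeyValueStore<K, V> store,
                             final ProcessorContext<K, V> context) {
        pendingPuts.forEach(kv -> store.put(kv.key, kv.value));
        pendingForwards.forEach(context::forward);
        pendingPuts.clear();
        pendingForwards.clear();
    }
}
```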
Real-World Performance Optimisation Results
The production deployment of this async processing framework delivered remarkable results that exceeded initial expectations. The application in question was a business-critical deduplication service sitting at the head of a major data pipeline, processing events with a 30-day TTL window across 50+ terabytes of state.
Before optimisation, the application struggled with severe performance limitations. Despite running 96 stream threads across multiple nodes, each thread could only process approximately 250 records per second. The primary bottleneck was the high latency of remote state store operations, particularly read operations that resulted in cache misses due to the low duplicate rate in the incoming data.
The transformation was immediate and dramatic. After deploying the async processing framework, the same application achieved identical throughput using only 12 stream threads. More importantly, each thread’s processing capacity increased from 250 to 1,500 records per second—a 6x improvement in per-thread performance. This translated to massive cost savings and created substantial headroom for future growth.
The success stemmed from the framework’s ability to overlap high-latency operations. Rather than waiting for each remote state store call to complete before processing the next record, the system could maintain dozens of concurrent operations. This dramatically improved resource utilisation and allowed the application to achieve throughput levels that would have been impossible with sequential processing.
Perhaps most importantly, the deployment carried minimal risk. The framework operates as a wrapper around existing processors, allowing teams to enable or disable async processing through configuration changes without modifying application logic or requiring downtime. This made it possible to validate the approach in production whilst maintaining rollback options.
Practical Implementation Guide
Implementing async processing in your Kafka Streams applications requires minimal code changes whilst delivering substantial performance improvements for the right workloads. The OSO engineers designed the framework to integrate seamlessly with existing applications, requiring only configuration changes and processor wrapping.
The implementation process begins with identifying suitable processors for async execution. The framework works best with processors that perform high-latency operations—remote state store calls, external API requests, database queries, or CPU-intensive computations. Lightweight operations that complete quickly may not benefit from async execution due to the overhead of thread coordination.
Converting a processor to async execution requires wrapping the processor supplier with an async wrapper. This typically involves changing a single line of code in your topology definition. The wrapped processor maintains the same interface and behaviour whilst executing asynchronously. You can selectively apply async processing to specific processors whilst leaving others unchanged, allowing for targeted optimisation of performance bottlenecks.
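As a hypothetical illustration, reusing the DedupProcessor sketch from earlier: the asyncWrap helper below stands in for whatever entry point your framework actually provides, and is written as a pass-through purely so the example compiles.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;

public class DedupTopology {

    // Stand-in for the framework's real wrapper entry point. The name and
    // this pass-through body are purely illustrative; the real wrapper
    // returns a supplier whose processors execute on the async pool.
    static <KIn, VIn, KOut, VOut> ProcessorSupplier<KIn, VIn, KOut, VOut> asyncWrap(
            final ProcessorSupplier<KIn, VIn, KOut, VOut> inner) {
        return inner;
    }

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        final KStream<String, String> events = builder.stream("events");

        // Before: events.process(DedupProcessor::new, "dedup-store");
        // After: the same line, with the supplier wrapped. The interface and
        // behaviour are unchanged; only the execution model differs.
        events.process(asyncWrap(DedupProcessor::new), "dedup-store")
              .to("deduplicated-events");

        // ("dedup-store" must still be registered with the builder via
        // builder.addStateStore(...), omitted here for brevity.)
    }
}
```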
Configuration focuses primarily on sizing the async thread pools. Each stream thread receives its own dedicated async thread pool, allowing you to scale processing capacity independently from the number of stream threads. The optimal pool size depends on your specific workload characteristics—operations with higher latency typically benefit from larger pools that can maintain more concurrent operations.
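A sketch of what that configuration might look like is below. The async property name is invented for illustration (check your framework's documentation for the real setting), while the standard Kafka Streams properties are genuine.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class AsyncConfig {

    public static Properties buildConfig() {
        final Properties props = new Properties();
        // Standard Kafka Streams configuration.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 12);

        // Hypothetical framework property: async workers per stream thread.
        // Higher per-operation latency generally justifies a larger pool,
        // since more operations can usefully overlap.
        props.put("async.processor.threads.per.stream.thread", 16);
        return props;
    }
}
```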
The framework includes comprehensive monitoring and observability features. Async events track processing state and duration, providing visibility into parallel execution efficiency. You can monitor metrics like async thread pool utilisation, average processing times, and queue depths to optimise performance and identify potential issues.
Deployment safety was a primary design consideration. The framework can be enabled and disabled through configuration changes without requiring application restarts or topology modifications. This allows for safe production validation and provides rollback options if issues arise. Backwards and forwards compatibility ensures you can deploy the framework gradually across your application portfolio.
Trade-offs and Limitations
Like any performance optimisation technique, async processing involves trade-offs that teams must carefully consider. Understanding these limitations helps determine when async processing provides value and when alternative approaches might be more appropriate.
The framework cannot address all types of performance issues. Partition skew remains a challenge since tasks are still statically assigned to stream threads. If you have one heavily loaded partition whilst others are idle, async processing won’t help redistribute that load. The framework scales processing within each task but doesn’t change task assignment patterns.
Applications requiring strict cross-key ordering must carefully evaluate async processing compatibility. If your use case depends on processing records in exact offset order across different keys within a partition, the same-key ordering guarantee may be insufficient. However, most real-world applications only require ordering within individual keys, making this limitation rarely problematic in practice.
There’s a small overhead associated with coordinating records between stream threads and async thread pools. For very lightweight operations that complete in microseconds, this overhead might outweigh the benefits of parallel execution. The framework works best when individual operations take at least several milliseconds, allowing the parallelism benefits to exceed coordination costs.
Memory usage patterns change with async processing since multiple records may be in flight simultaneously. Applications must ensure adequate memory allocation for the increased concurrent processing load. Additionally, async thread pools consume additional threads beyond the standard stream threads, requiring consideration of overall thread management in your deployment environment.
The framework currently operates at the processor level, making it most suitable for Processor API applications or those using the process DSL operator. Whilst it can work with other DSL operators through processor extraction, this may require more complex integration for some use cases.
Breaking Through Performance Barriers
The async processing framework developed by the OSO engineers demonstrates that Kafka Streams applications can scale far beyond traditional partition-based limits whilst maintaining all correctness guarantees. By recognising that same-key ordering is sufficient for most use cases, they unlocked the ability to process multiple records in parallel within individual tasks.
This approach represents more than just a performance optimisation—it’s a fundamental shift in how we think about stream processing scalability. Rather than being constrained by input topic partitioning decisions made early in system design, applications can now scale processing capacity independently. This flexibility becomes increasingly valuable as systems grow and processing requirements evolve.
The 6x performance improvement achieved in production validates the approach whilst demonstrating the significant value available when traditional scaling approaches reach their limits. More importantly, the framework’s design ensures these performance gains come without sacrificing the reliability, simplicity, and correctness guarantees that make Kafka Streams valuable for mission-critical applications.
Looking forward, this work provides a foundation for broader changes in stream processing architecture. Future developments in Kafka Streams may incorporate similar concepts natively, potentially through initiatives like the processing threads model that would break the static relationship between consumers and processing threads.
For engineering teams facing similar scaling challenges, async processing offers a practical solution that can be implemented quickly whilst maintaining production stability. The framework proves that with careful design and implementation, you can break through performance barriers without abandoning the tools and patterns your teams already understand and trust.
The key insight—that parallelism can be achieved at the key level rather than the partition level—opens new possibilities for building high-performance stream processing applications. As data volumes continue to grow and latency requirements become more stringent, techniques like async processing will become essential tools for building systems that can scale with business demands whilst maintaining the reliability modern applications require.