From Block to Object Storage: Engineering a Cost-Efficient Data Streaming Platform at Massive Scale

When you’re processing 3.5 terabytes of data per minute and handling 73 billion messages across 30+ Kafka clusters, every architectural decision compounds into millions in annual costs. The OSO engineers recently carried out an extensive migration from EBS-based Kafka deployments to object storage-backed streaming pipelines, uncovering critical insights about reliability, cost efficiency, and operational simplicity that challenge conventional wisdom about data streaming architecture.

This isn’t just another theoretical discussion about cloud storage options. This is about real-world engineering decisions that reduced object storage API requests by 71%, eliminated cross-availability zone charges that were bleeding budgets dry, and fundamentally rethought how we approach stateful stream processing in cloud-native environments.

The central insight? Disaggregated architectures using object storage fundamentally outperform traditional block storage approaches for high-throughput streaming workloads. But the migration requires rethinking established patterns around batching, coordination, and state management to achieve optimal cost-performance characteristics.

The Hidden Costs of Block Storage in Cloud-Native Streaming

Traditional wisdom suggests that block storage like EBS is the natural choice for data streaming platforms. After all, Kafka was designed with local disk in mind, and the low latency characteristics of block storage seem ideal for real-time processing. But at massive scale, this conventional approach reveals fundamental flaws that impact both reliability and cost.

The “Slow Volume” Problem

One of the most painful discoveries in running large-scale Kafka clusters on block storage is the "slow volume" phenomenon. You're operating dozens of brokers, each with EBS volumes attached, and everything appears healthy in your monitoring dashboards. Then suddenly, write latencies spike. Your pipelines start backing up. Messages queue in producers. And the culprit? A single EBS volume has degraded performance, not failed completely, just become inexplicably slow.

This isn’t a theoretical concern. Engineers from PlanetScale documented this exact behaviour, and the OSO engineers encountered it repeatedly across production environments. The volume isn’t dead—it responds to health checks, shows no errors in CloudWatch—it’s just operating at a fraction of its normal throughput. When you’re maintaining strict latency SLAs measured in seconds, a single slow volume can cascade through your entire pipeline.

Kafka has developed workarounds for this issue. The health-aware sticky partitioner in the producer API can detect slow brokers and route traffic away from them. But these are precisely that: workarounds. They add complexity, don’t work in all scenarios, and fundamentally don’t solve the underlying reliability problem with block storage in cloud environments.

The OSO engineers implemented early detection systems, monitoring volume performance metrics and automatically replacing degraded volumes before they impacted production traffic. This works, but it’s operational overhead that shouldn’t exist in the first place.

Capacity Planning and Operational Complexity

Block storage forces you into a perpetual capacity planning exercise. Every topic needs retention configured. Every workload spike needs headroom. And in managed services like Amazon MSK, increasing disk capacity takes six to 24 hours.

Think about what that means operationally. You detect that disk usage is approaching 80%. You trigger a capacity increase. Then you wait. For potentially an entire day. During which time, if traffic spikes further or a configuration error causes unexpected data accumulation, you're in a failure scenario. Brokers go down. Data loss occurs. And you're helpless to respond quickly.

The alternative is over-provisioning. Pay for storage capacity you don’t currently need, just to handle worst-case scenarios. This seems prudent until you multiply it across 30+ clusters, each with multiple brokers, each with terabytes of provisioned storage sitting mostly empty. The waste compounds.

And beyond simple capacity, there’s the operational burden. Disk full scenarios require runbooks. Capacity alerts need tuning. Emergency response procedures need testing. You’re dedicating engineering time to managing storage infrastructure rather than building features or optimising pipelines.

The Cross-Availability Zone Tax

This is where block storage’s cost structure becomes truly problematic at scale. EBS volumes exist in a single availability zone. For high availability—which is non-negotiable for production systems—you need data replicated across multiple zones. In Kafka, this means your replication factor includes brokers in different zones.

Every write to a replicated topic incurs cross-AZ data transfer charges. In AWS, this is $0.01 per GB in each direction. That sounds trivial until you’re processing terabytes per minute. Suddenly, data transfer costs dwarf compute and storage costs combined.
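To make that concrete, here is a back-of-the-envelope calculation using the $0.01 per GB figure above. The assumption that a replication factor of 3 spreads replicas across three zones (so two follower copies of every byte cross an AZ boundary) is illustrative, not a description of any specific cluster.

```python
# Back-of-the-envelope cross-AZ replication cost per produced terabyte.
# Illustrative assumption: replication factor 3 spread across 3 AZs, so two
# follower copies of every byte cross an AZ boundary.
cross_az_usd_per_gb = 0.01 + 0.01      # $0.01 charged on each side of the transfer
follower_copies_crossing_az = 2

produced_gb = 1024                     # one terabyte of producer traffic
cost = produced_gb * follower_copies_crossing_az * cross_az_usd_per_gb
print(f"Replication transfer cost per produced TB: ${cost:.2f}")
# ~ $40.96 per TB, before any producer or consumer traffic that crosses zones
```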

The OSO engineers implemented optimisations like fetch-from-closest-replica in consumer configurations, which allows reads to occur from brokers in the same availability zone. This helps significantly for read-heavy workloads. There’s now even a proposal to implement similar optimisation for producers. But these are optimisations layered atop a fundamentally inefficient model.
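As an illustration of what fetch-from-closest-replica looks like in practice, the sketch below configures a consumer with the Python confluent-kafka client. It assumes brokers already set broker.rack to their availability zone and enable the rack-aware replica selector; the bootstrap server, group, topic, and zone names are placeholders.

```python
from confluent_kafka import Consumer

# Follower fetching (KIP-392): the consumer advertises which "rack" (here, an AZ)
# it runs in, and rack-aware brokers serve reads from the in-zone replica.
# Brokers must set broker.rack=<az> and
# replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector.
consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",    # placeholder
    "group.id": "example-consumer-group",    # placeholder
    "client.rack": "eu-west-1a",             # the AZ this consumer runs in
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["example-topic"])
msg = consumer.poll(timeout=1.0)
if msg is not None and msg.error() is None:
    print(msg.value())
consumer.close()
```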

The underlying problem persists: block storage’s single-AZ deployment model forces expensive data movement to achieve reliability. And when you operate multiple clusters across different VPCs, the costs multiply further. Transit gateways connecting VPC networks for cross-cluster replication add another layer of networking charges. The OSO engineers spent substantial time identifying and eliminating cross-VPC traffic patterns specifically to avoid these costs.

When Block Storage Still Makes Sense

None of this means block storage is always wrong. For workloads with strict latency requirements—single-digit millisecond SLAs—block storage still delivers superior performance. Write latencies to EBS are consistently low, and for single-AZ architectures where high availability is less critical, block storage remains a perfectly valid choice.

The introduction of GP3 volumes in AWS represented a significant improvement over GP2. The ability to provision throughput independently from capacity allowed the OSO engineers to eliminate IO bottlenecks without over-provisioning storage. In clusters where GP3 replaced GP2, throughput increased enough to reduce broker counts and consolidate clusters, saving substantial costs even whilst remaining on block storage.

The key insight isn’t that block storage is bad—it’s that block storage isn’t universally optimal. Different workloads have different requirements, and object storage opens up architectural patterns that weren’t previously feasible.

Object Storage as the Foundation for Disaggregated Streaming

Object storage services like S3 represent a fundamentally different approach to data persistence. Rather than attaching volumes to specific compute instances in specific availability zones, object storage provides a distributed, multi-AZ service accessed over HTTP APIs. This shift has profound implications for how we architect streaming platforms.

Multi-AZ by Default

The most immediate benefit of object storage is that it eliminates cross-AZ replication costs. When you write data to S3, it’s automatically replicated across availability zones within a region. There’s no additional charge for this replication. You’re not moving data between zones yourself—the storage service handles it transparently.

For high-throughput streaming workloads, this changes the economics entirely. The OSO engineers no longer needed to worry about which AZ a broker was in, or optimise producer and consumer configurations to minimise cross-AZ traffic. The data is accessible from any zone at the same cost.

This also simplifies networking architecture significantly. No more transit gateways for cross-VPC replication. No more VPC peering arrangements carefully designed to minimise data transfer. Object storage is accessed via regional endpoints, and the internal data movement is the storage service’s problem, not yours.

Operational Simplicity

With object storage, disk capacity ceases to be an operational concern. You don’t provision storage—you simply write data and pay for what you use. There’s no “disk full” scenario. No emergency capacity increases. No over-provisioning calculations.

The OSO engineers never had to replace an S3 bucket because it was performing poorly, which is something they routinely had to do with EBS volumes. Object storage’s reliability characteristics are so strong that if S3 is down, it’s considered a region-level failure. The entire cloud region is effectively unavailable, which simplifies disaster recovery planning immensely.

Retention policies become simpler too. Rather than carefully calculating how much disk space to provision based on retention periods and traffic estimates, you simply configure lifecycle rules. Object storage handles the deletion automatically. And if you need to adjust retention, it’s a configuration change, not a capacity planning exercise.
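As a minimal sketch of that idea with boto3, the rule below expires objects under a prefix after seven days; the bucket name, prefix, and retention period are illustrative placeholders rather than anyone's actual configuration.

```python
import boto3

s3 = boto3.client("s3")

# Expire streaming data after 7 days instead of provisioning disk for it.
# Bucket name, prefix, and retention period are illustrative placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-streaming-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-processed-stream-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "pipelines/orders/"},
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```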

Cost Structure Advantages

The cost per GB stored in object storage is significantly lower than block storage. For S3 Standard, it’s $0.023 per GB-month, compared to $0.10 per GB-month for GP3 EBS volumes. That’s more than a 4x difference in storage costs alone.

But the cost advantages extend beyond simple storage pricing. S3 offers multiple storage classes optimised for different access patterns. Standard for frequently accessed data, Infrequent Access for occasional access, Glacier for archival. You can implement lifecycle policies that automatically transition data between classes based on age or access patterns.

The introduction of S3 Express One Zone represented a step change in object storage capabilities. It offers significantly lower latency than S3 Standard and, crucially, much lower cost per request. For some OSO pipelines, migrating from S3 Standard to S3 Express One Zone reduced costs by a factor of four to five, for identical workloads, simply by changing storage class.

This kind of innovation happens quarterly in the object storage space. New features, new storage classes, new optimisations. Block storage evolves too, but the pace of innovation in object storage is dramatically faster, driven by massive-scale users pushing the boundaries of what’s possible.

The Latency Trade-Off

Object storage isn’t without trade-offs. Latency is higher than block storage. A read from EBS might take single-digit milliseconds. A read from S3 Standard might take 50-100 milliseconds. S3 Express One Zone improves this to 10-20 milliseconds, but it’s still higher than block storage.

For streaming workloads with strict latency requirements, this matters. If you need to acknowledge writes in under 10 milliseconds, object storage alone won’t meet that SLA. But for many workloads, the latency difference is acceptable, especially when you consider the cost savings and operational simplicity.

The key is matching storage architecture to workload characteristics. Not every pipeline needs to use object storage. But for high-throughput, cost-sensitive workloads where latency requirements are measured in seconds rather than milliseconds, object storage enables architectures that simply aren’t economically viable with block storage.

Beyond Tiered Storage—Direct Object Storage Integration

Kafka’s tiered storage feature seems like the obvious path to leveraging object storage. You write data to Kafka as normal, it lands on broker disks, and once it ages sufficiently, Kafka automatically moves it to object storage. When consumers need old data, Kafka fetches it from object storage transparently.

This works, and it provides real benefits. Broker disk sizes can shrink. Adding and removing brokers becomes faster because there’s less local state to replicate. For workloads with long retention requirements and primarily hot-data access patterns, tiered storage makes sense.

But tiered storage doesn’t address all use cases, and for some workloads, it introduces unnecessary complexity without delivering benefits.

When Tiered Storage Doesn’t Help

Consider a pipeline with 30-minute retention. Data arrives, gets processed by downstream consumers within minutes, and then gets deleted. For this workload, tiering data to object storage doesn’t help—the data never becomes cold enough to tier. You’re still paying for broker disk capacity even though the data could theoretically live in object storage from the start.

Or consider pipelines where data is always hot—continuous reprocessing, multiple consumers with different lag characteristics, or streaming analytics that need rapid access to recent history. For these workloads, data never really becomes cold. Tiering it to object storage just means additional complexity and potential latency when accessing tiered data.

Tiered storage also doesn’t help if you want to consume data directly from object storage without going through Kafka brokers. The tiered data is managed by Kafka, accessed via Kafka APIs. If you want to run batch jobs against the data, or query it with SQL engines, or process it with tools that don’t speak Kafka protocol, you’re out of luck.

For the use cases the OSO engineers were optimising, tiered storage simply didn’t provide the flexibility needed. Short retention pipelines, always-hot data, and the desire to consume data in ways beyond the Kafka protocol all pushed towards different architectures.

The Serverless Kafka Evolution

Serverless Kafka implementations like WarpStream take tiered storage to its logical conclusion: remove broker disks entirely. Data goes directly from producers to object storage. Brokers become stateless proxies that coordinate reads and writes but don’t persist data themselves.

This architecture delivers real benefits. Brokers are truly stateless and can scale horizontally without state migration. Storage scales independently from compute. Cross-AZ costs are eliminated because data only moves to object storage, which handles multi-AZ replication internally.

The OSO engineers evaluated serverless Kafka when WarpStream launched. But there was a critical requirement: the solution needed to be open source and part of an established foundation. At the time, there was no open source serverless Kafka implementation. There’s now a Kafka Improvement Proposal (KIP) to make this part of Apache Kafka itself, but it wasn’t available when decisions needed making.

More fundamentally, there was a deeper question: if the goal is to write data to object storage efficiently, do we need the Kafka protocol at all?

Questioning the Kafka Protocol

The Kafka protocol is well-designed for what it does: reliable, ordered message delivery with strong consistency guarantees. But it’s also complex. Producers need to discover broker topology, manage partition assignments, handle retries and idempotency. Consumers need to manage group coordination, rebalancing, offset tracking.

All of this complexity makes sense when you’re interacting with a distributed system of brokers managing local state. But if the target is object storage, and you control both the producers and consumers, is the protocol overhead necessary?

For some OSO pipelines, the answer was no. These weren’t pipelines where arbitrary external systems needed to integrate. Both ends of the pipeline were services under OSO control, written specifically for the use case. The Kafka protocol’s generality wasn’t required—it was just overhead.

If you don’t need the Kafka protocol, you can write to object storage directly. But you can’t do this naively. Every message can’t be an individual S3 write—the request costs would be astronomical, and the latency would be unacceptable. You need batching, exactly like Kafka producers batch messages before sending them to brokers.
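A minimal sketch of that batching idea, assuming boto3 and newline-delimited records; the bucket, key layout, and flush thresholds are illustrative placeholders.

```python
import time
import uuid
import boto3

s3 = boto3.client("s3")

BUCKET = "example-streaming-bucket"     # placeholder
MAX_BATCH_BYTES = 8 * 1024 * 1024       # flush at ~8 MB (illustrative)
MAX_BATCH_AGE_S = 5.0                   # or after 5 seconds (illustrative)


class BatchingWriter:
    """Accumulate records in memory and write them to S3 as one object per flush."""

    def __init__(self):
        self.buffer = bytearray()
        self.opened_at = time.monotonic()

    def append(self, record: bytes) -> None:
        self.buffer += record + b"\n"
        if (len(self.buffer) >= MAX_BATCH_BYTES
                or time.monotonic() - self.opened_at >= MAX_BATCH_AGE_S):
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # One PUT per batch instead of one per message keeps request costs sane.
        key = f"pipelines/orders/{int(time.time())}-{uuid.uuid4().hex}.ndjson"
        s3.put_object(Bucket=BUCKET, Key=key, Body=bytes(self.buffer))
        self.buffer = bytearray()
        self.opened_at = time.monotonic()
```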

You also need coordination on the consumer side. How do consumers discover new data to process? How do they track progress? How do they handle failures and ensure exactly-once processing? These are all problems Kafka solves, and they don’t disappear just because you’ve removed Kafka from the architecture.

The solution requires a different abstraction layer. Something that can batch writes efficiently, coordinate consumers, track processing progress, and integrate naturally with stream processing frameworks. This is where Apache Flink enters the picture.

Apache Flink as the Object Storage Gateway

Apache Flink is a stream processing framework that excels at exactly this type of integration challenge. It provides abstractions for connecting to diverse data sources and sinks, built-in support for stateful processing, and sophisticated mechanisms for fault tolerance and exactly-once semantics.

For building object storage-backed streaming pipelines without Kafka, Flink proved to be the ideal foundation.

Flink’s Role in Disaggregated Architecture

Flink’s architecture naturally supports disaggregated storage. It separates compute (the processing of data) from state (the data itself). Historically, Flink stored state on local disks of TaskManager instances. But Flink has always supported checkpointing state to durable storage like S3, and the evolution towards treating remote storage as the primary storage location was a natural progression.

Flink can connect to Kafka, obviously—that’s a primary use case. But it can also connect to file systems, databases, message queues, and custom sources. This flexibility meant the OSO engineers could build pipelines that consumed from Kafka on one end and wrote to object storage on the other. Or consumed from object storage and wrote back to Kafka. Or bypassed Kafka entirely and used object storage for both ingestion and delivery.

The key insight is that Flink provides the stream processing semantics—windowing, aggregation, joins, exactly-once processing—whilst allowing flexibility in how data is physically stored and moved. You’re not locked into Kafka’s storage model. You can choose the storage backend that makes sense for each pipeline.

Flink’s Own Disaggregation Journey

Flink 2.0 represents a major architectural evolution towards disaggregated state management. Previously, state was stored on local disks, with periodic checkpoints to remote storage for fault tolerance. Flink 2.0 inverts this: remote storage (like S3) becomes the primary state store, with local disks serving as an optional cache.

This change addresses several cloud-native challenges. Local disk capacity no longer constrains state size—you can maintain terabytes of state backed by object storage. Checkpoints become dramatically faster because state is continuously streamed to remote storage rather than dumped periodically. Recovery times drop significantly because there’s no need to download state from remote storage—it’s already there.

The Flink community reports up to 94% reduction in checkpoint duration and up to 49x faster recovery after failures. These aren’t incremental improvements—they’re step changes in operational characteristics.

Flink 2.0 achieves this through several innovations. The ForSt state backend (a fork of RocksDB optimised for remote storage) handles the low-level state management. An asynchronous execution model allows state reads and writes to happen without blocking the main processing thread. And careful optimisation of the file system interface reduces the number of remote storage API calls needed.

But even with these improvements, there’s room for further optimisation specifically for object storage workloads. The file system interface in Flink was designed for generic file systems, not specifically for object storage. Some operations—like listing directories to discover new files—translate poorly to object storage’s API model.

This is where table formats become relevant.

Table Formats: Iceberg vs Paimon

Table formats like Apache Iceberg and Apache Paimon provide structured ways to organise data in object storage. They handle concerns like schema evolution, time travel, ACID transactions, and efficient querying. But they also influence how data is written and read, which matters for streaming use cases.

The OSO engineers initially explored Apache Iceberg. It’s mature, widely adopted, and has a file IO layer specifically optimised for object storage. The engineers from Tabular (the company behind Iceberg) recognised that operations cheap in traditional file systems—like listing directories—are expensive in object storage. So they built a custom file IO implementation that minimises these operations.

Iceberg worked well for certain use cases, particularly stateful pipelines where data needed to be queryable and consolidated in a single location. But two issues emerged. First, as throughput increased, latency increased proportionally. Higher throughput meant more data being batched and written, which meant larger files and longer time-to-visibility. For some pipelines, this latency degradation was unacceptable.

Second, Iceberg lacked consumer offset tracking. Kafka tracks consumer group offsets automatically. Iceberg, being designed primarily for batch and interactive query workloads, doesn’t have an equivalent mechanism. For streaming workloads where you need to track processing progress and resume from the correct position after failures, this was a significant gap.

Apache Paimon addressed both issues. It’s designed specifically with streaming workloads in mind, developed by engineers closely involved with the Flink community. Paimon includes consumer tracking, allowing streaming jobs to maintain their position in the data and resume correctly after failures. And latency characteristics are better suited to streaming—low and consistent even as throughput scales.
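To illustrate the consumer-tracking point, a streaming read of a Paimon table from PyFlink might look like the sketch below. It assumes the Paimon Flink connector and an S3 filesystem plugin are available to the job; the warehouse path, table, and consumer name are placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalogue backed by object storage (warehouse path is a placeholder).
t_env.execute_sql("""
    CREATE CATALOG paimon_cat WITH (
        'type' = 'paimon',
        'warehouse' = 's3://example-streaming-bucket/warehouse'
    )
""")
t_env.use_catalog("paimon_cat")

# The consumer-id option asks Paimon to track this job's read position,
# so the stream can resume from the right place after a restart.
t_env.execute_sql("""
    SELECT * FROM orders /*+ OPTIONS('consumer-id' = 'enrichment-job') */
""").print()
```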

Paimon isn’t perfect. Request patterns still need optimisation. As files accumulate, head requests (checking if a file exists) increase even if no new data is being written. The OSO engineers had to implement aggressive retention policies to delete processed files quickly and keep request counts manageable.

But for stateful pipelines—workloads where you want data consolidated in object storage, queryable, with schema evolution and time travel capabilities—Paimon proved to be the right choice. The OSO engineers continue to use it extensively and expect it to improve further as the Paimon community evolves the project.

Request Optimisation—The 71% Reduction Strategy

Using object storage efficiently requires careful attention to request patterns. Object storage has a different cost model than block storage. You pay per request, not just for storage and throughput. An inefficient implementation can rack up millions of API calls per hour, and at fractions of a cent per thousand requests, costs can spiral surprisingly quickly.

The OSO engineers focused intensely on minimising object storage requests whilst maintaining low latency and high throughput. The optimisations they developed reduced request counts by 71% compared to initial implementations—a dramatic improvement that translated directly to reduced operational costs.

Producer-Side Optimisation: Key-By Operations

One of the most effective optimisations involved how data is partitioned before writing to object storage. In a typical Flink pipeline, you might have multiple parallel instances processing data, and each instance writes data for multiple downstream destinations (partitions, topics, or keys).

Without optimisation, each Flink instance writes data for every destination it encounters. If there are 100 destinations and 10 parallel instances, and traffic is spread evenly, each instance ends up writing a file per destination at every checkpoint: close to 100 files per instance, or roughly 1,000 files in total.

Here's the problem: object storage request costs are often dominated by the number of separate write operations, not the total data volume. Writing 1,000 small files costs more in API requests than writing 100 larger files, even if the total data is the same.

The solution is to use Flink’s keyBy operator to shuffle data before writing. Group all records for a given destination and route them to a specific Flink instance. Now, instead of every instance writing a little bit of data for every destination, each instance writes a lot of data for a few destinations.

This consolidation means fewer files per checkpoint. If you have 100 destinations and 10 parallel instances, and you route all data for each destination to a specific instance, you write exactly 100 files—one per destination. No more, no less. And each file is larger, containing all the data for that destination from all upstream sources.

The OSO engineers measured a 71% reduction in request counts after implementing key-by operations. Seventy-one percent. This single change delivered more cost savings than most other optimisations combined.
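A minimal PyFlink sketch of the pattern, assuming the S3 filesystem plugin is configured; the source collection, sink path, and thresholds are illustrative placeholders rather than a production job.

```python
from pyflink.common import Types
from pyflink.common.serialization import Encoder
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.file_system import FileSink, RollingPolicy

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # one checkpoint per minute (illustrative)

# (destination, payload) pairs; a real pipeline would read these from an upstream source.
records = env.from_collection(
    [("dest-a", "payload-1"), ("dest-b", "payload-2"), ("dest-a", "payload-3")],
    type_info=Types.TUPLE([Types.STRING(), Types.STRING()]),
)

sink = (
    FileSink.for_row_format(
        "s3://example-streaming-bucket/output",   # placeholder base path
        Encoder.simple_string_encoder(),
    )
    .with_rolling_policy(RollingPolicy.on_checkpoint_rolling_policy())
    .build()
)

# key_by routes every record for a given destination to one parallel instance,
# so each checkpoint rolls one larger file per destination instead of one per
# destination per instance.
(records
    .key_by(lambda r: r[0])
    .map(lambda r: r[1], output_type=Types.STRING())
    .sink_to(sink))

env.execute("key-by-consolidation-sketch")
```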

There are trade-offs, of course. Shuffling data means increased network traffic within the Flink cluster. Data that could have been processed locally now moves between TaskManagers. This consumes network bandwidth and adds latency.

You also introduce the possibility of data skew. If one destination receives significantly more traffic than others, the Flink instance handling that destination becomes a bottleneck. Auto-scaling becomes more complex because you can’t simply add more instances—you need to rebalance which instances handle which destinations.

But compared to the cost savings from reduced API requests, these trade-offs were acceptable. Internal network traffic is essentially free within a single availability zone. Data skew can be managed with careful key selection and monitoring. And the operational complexity of managing skewed workloads is lower than the financial cost of inefficient object storage usage.

Consumer-Side Challenges: File Discovery

On the consumer side, the challenge is discovering new files to process. Kafka makes this simple: consumers subscribe to topics and receive messages as they arrive. With object storage, there’s no equivalent push mechanism. Consumers need to actively discover when new data is available.

The naïve approach is to list the object storage bucket periodically, checking for new files. This works, but it’s extremely inefficient. If you have 1,000 processed files and you list the bucket every 2 seconds, that’s 43,200 list operations per day, checking the same files repeatedly.

Flink’s file source connector does this, and it compounds the problem. Due to how Flink’s file system interface works, even files that have already been processed trigger multiple API calls. For 100 files, Flink might make 200-300 requests every few seconds just to confirm there’s nothing new to process.

This isn’t Flink’s fault—the file system interface was designed for HDFS and local file systems, where listing directories is cheap. Object storage is fundamentally different. List operations are expensive, and doing them frequently for large sets of files burns through API request budgets quickly.

The OSO engineers implemented several mitigations. First, they reduced polling frequency. Instead of checking every 2 seconds, check every 10 seconds or 30 seconds. This trades latency for cost—data takes longer to be discovered and processed, but you make fewer API calls.

Second, they implemented aggressive file retention. As soon as a file is processed, delete it. Don’t leave processed files sitting around because every list operation pays a cost proportional to the number of files, even if they’re already processed.
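A sketch of that aggressive retention with boto3: once a consumer has durably committed its results, it batch-deletes the data files it has finished with. The bucket and keys are placeholders.

```python
import boto3

s3 = boto3.client("s3")


def delete_processed(bucket: str, keys: list[str]) -> None:
    """Delete already-processed data files so later list calls stay cheap."""
    # delete_objects accepts at most 1,000 keys per request.
    for start in range(0, len(keys), 1000):
        chunk = keys[start:start + 1000]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in chunk], "Quiet": True},
        )


# Example usage with a placeholder key, called only after results are durably committed.
delete_processed("example-streaming-bucket",
                 ["pipelines/orders/1700000000-abc.ndjson"])
```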

This works, but it’s risky. If your retention is too aggressive and a failure occurs before processing completes, you might delete files that still need processing. Data loss. The OSO engineers carefully tuned retention policies to balance cost savings against reliability, but it’s not an ideal solution.

Alternative Coordination Mechanisms

Several alternative approaches emerged from examining how other organisations solved this problem. Some companies use object storage event notifications. S3 can send a notification to an SQS queue or SNS topic whenever a new object is created. Consumers subscribe to the queue and receive notifications about new files immediately.

This eliminates polling entirely. No more list operations. No more checking for new files. The storage service tells you when data arrives. It’s elegant and efficient from a request-cost perspective.
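A hedged sketch of the pattern with boto3, using placeholder bucket, queue ARN, and queue URL: the one-off call wires ObjectCreated events into an SQS queue, and the consumer long-polls the queue instead of listing the bucket.

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# One-off setup: push ObjectCreated events to an SQS queue.
# The queue policy must allow S3 to send; all names here are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="example-streaming-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:new-stream-files",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)

# Consumer loop: long-poll the queue instead of listing the bucket.
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/new-stream-files"
while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        body = json.loads(message["Body"])
        for record in body.get("Records", []):
            key = record["s3"]["object"]["key"]
            print("new file to process:", key)   # hand off to the pipeline here
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
```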

The downside is latency. Object storage notifications aren’t instant. There can be delays—seconds or even minutes—between when a file is written and when the notification is delivered. For latency-sensitive pipelines, this is unacceptable.

Another approach, used by several organisations, is to use an external transactional store for coordination. When the producer writes a file to object storage, it also writes metadata about that file to a system like ZooKeeper or DynamoDB. Consumers query the transactional store to discover new files, then read the actual data from object storage.

This is fast and efficient. Transactional stores are designed for low-latency reads and writes. Consumers can discover new files in milliseconds. And you’re not paying for repeated list operations against object storage—you’re paying for much cheaper transactional store queries.
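A sketch of that pattern with DynamoDB as the transactional store; the table name, key schema, and attributes are illustrative, and error handling is omitted for brevity.

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "stream-file-manifest"   # placeholder table: partition key 'pipeline', sort key 'seq'


def register_file(pipeline: str, seq: int, s3_key: str) -> None:
    """Called by the producer after the object is durably written to S3."""
    dynamodb.put_item(
        TableName=TABLE,
        Item={
            "pipeline": {"S": pipeline},
            "seq": {"N": str(seq)},
            "s3_key": {"S": s3_key},
            "created_at": {"N": str(int(time.time()))},
        },
        # Refuse to overwrite an existing entry so producer retries stay idempotent.
        ConditionExpression="attribute_not_exists(seq)",
    )


def discover_files(pipeline: str, after_seq: int) -> list[str]:
    """Called by consumers to find files written after their committed position."""
    resp = dynamodb.query(
        TableName=TABLE,
        KeyConditionExpression="pipeline = :p AND seq > :s",
        ExpressionAttributeValues={":p": {"S": pipeline}, ":s": {"N": str(after_seq)}},
    )
    return [item["s3_key"]["S"] for item in resp["Items"]]
```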

The trade-off is operational complexity. Now you have an additional system to run and manage. If ZooKeeper goes down, your pipeline stops even though object storage is fine. You need to ensure the transactional store can scale to your write rate. And you’ve introduced consistency concerns—what happens if metadata is written but the object storage write fails?

Despite these challenges, this approach is promising and something the OSO engineers actively explored. The key is choosing the right transactional store and implementing careful error handling to maintain consistency between metadata and actual data.

Building a Custom Table Format

The most ambitious approach is to build a table format specifically optimised for low-latency streaming workloads on object storage. This is what the OSO engineers began developing as a next-generation connector for Flink.

The idea is to use object storage itself for coordination, but in a way that’s more efficient than simply listing files. Drawing on lessons from Paimon and ScionDB (a key-value database built on object storage), the goal is to create a lightweight metadata layer that tracks which files exist, which have been processed, and what work remains.

Producers write data files to object storage as before. But they also update a metadata structure—potentially another file in object storage, or a transactional manifest—that consumers can read to discover new data. This metadata is small and structured, so reading it is fast and cheap compared to listing potentially thousands of data files.

Consumers read the metadata, identify new files to process, update their progress, and coordinate with other consumers to avoid duplicate processing. All using object storage as the underlying storage substrate, but with a protocol specifically designed for streaming use cases.
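The connector is still taking shape, so the snippet below is only a deliberately simplified illustration of the idea: a single writer appends entries to one manifest object, and consumers read that manifest instead of listing data files. Concurrency control, failure handling, and manifest compaction, which are the hard parts, are left out, and all names are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-streaming-bucket"              # placeholder
MANIFEST_KEY = "pipelines/orders/_manifest.json"


def _read_manifest() -> dict:
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return {"next_seq": 0, "files": []}


def publish_file(data_key: str, body: bytes) -> None:
    """Producer side: write the data file, then record it in the manifest."""
    s3.put_object(Bucket=BUCKET, Key=data_key, Body=body)
    manifest = _read_manifest()
    manifest["files"].append({"key": data_key, "seq": manifest["next_seq"]})
    manifest["next_seq"] += 1
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(manifest).encode())


def discover(after_seq: int) -> list[str]:
    """Consumer side: one small read replaces listing thousands of data files."""
    manifest = _read_manifest()
    return [f["key"] for f in manifest["files"] if f["seq"] > after_seq]
```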

This approach combines the benefits of previous methods. No external dependencies beyond object storage itself. Low latency because metadata updates are lightweight. Efficient request patterns because consumers read structured metadata rather than listing all files.

It’s complex to implement correctly. You need to handle concurrent updates from multiple producers, ensure consistency in the face of failures, and optimise the metadata structure to minimise API calls. But the OSO engineers believe this approach represents the future of streaming on object storage and are actively working to make it open source for the broader community.

Practical Implementation Considerations

Theory is useful, but implementation is where abstractions meet reality. The journey to disaggregated, object storage-backed streaming architectures isn’t a binary switch. Different workloads have different requirements, and the right architecture depends heavily on your specific use case.

Matching Storage to Workload Characteristics

Block storage remains the right choice for certain workloads. If you have strict latency SLAs—acknowledging writes in under 10 milliseconds, for example—block storage is likely your only option. Object storage latency, even with optimisations like S3 Express One Zone, won’t meet these requirements.

Single-AZ architectures also favour block storage. If you’re not replicating across availability zones, you’re not paying cross-AZ charges, which eliminates one of object storage’s primary cost advantages. The operational simplicity of object storage still matters, but the cost case is weaker.

Low-throughput workloads might not benefit from object storage either. If you’re processing megabytes per minute rather than terabytes, the absolute cost difference between storage options is minimal. The engineering effort to migrate might not be justified by the savings.

Kafka tiered storage makes sense for topics with long retention requirements and cold data access patterns. If you retain data for 30 days but consumers typically only read the most recent hour, tiering the old data to object storage reduces broker disk requirements without impacting normal operations. This is especially true if you’re using managed services where broker disk costs are high.
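As an illustration, enabling tiered storage is a topic-level setting (Kafka 3.6+ with remote log storage enabled on the brokers); the sketch below creates such a topic with the Python confluent-kafka admin client, using placeholder broker, topic, and retention values.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})   # placeholder

topic = NewTopic(
    "orders-archive",                                   # placeholder topic name
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                # tier closed segments to object storage
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),  # keep data for 30 days in total
        "local.retention.ms": str(6 * 60 * 60 * 1000),  # but only ~6 hours on broker disks
    },
)

futures = admin.create_topics([topic])
futures["orders-archive"].result()   # raises if topic creation failed
```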

Direct object storage integration—bypassing Kafka entirely—makes sense for high-throughput, stateless pipelines where both ends are under your control. Cross-VPC data movement, cross-region replication, and cost-sensitive batch-to-streaming migrations all benefit from writing directly to object storage.

Stateful pipelines that need queryable, consolidated data stores benefit from table formats like Paimon. If your workload involves Change Data Capture, stream enrichment, or maintaining materialised views, a table format provides schema management, time travel, and query capabilities that raw files don’t offer.

Architectural Patterns That Emerged

The OSO engineers didn’t migrate everything to object storage overnight. They developed patterns based on workload characteristics and migrated incrementally.

For pipelines that needed Kafka semantics but suffered from cross-AZ costs, they implemented hybrid architectures. Real-time data flows through Kafka for low latency. Historical data is simultaneously written to object storage. Consumers can read from Kafka for recent data and query object storage for history. This mirrors what Pinterest built with “The Store,” where client libraries abstract whether data comes from Kafka or object storage.

For stateless pipelines simply moving data between services, they eliminated Kafka entirely. Flink jobs consume from the source, batch data, write to object storage, and downstream Flink jobs consume from object storage. No brokers, no Kafka protocol, no cross-AZ charges. Just efficient data movement through object storage.

For stateful pipelines requiring complex processing, they adopted Paimon as the storage layer. Change Data Capture from databases flows into Paimon tables via Flink. Stream processing jobs maintain state in Paimon. Downstream consumers query Paimon tables for the latest state. This consolidated storage approach simplified operational overhead significantly compared to managing separate Kafka topics, state stores, and output databases.

Migration Strategy Considerations

The migration path matters as much as the destination. The OSO engineers learned to start with stateless pipelines. These are lower risk because there’s no complex state to migrate, and the failure modes are simpler. If something goes wrong, you can roll back more easily.

Measuring success through clear metrics proved essential. Request count per GB processed became a key performance indicator. Not just cost per GB stored, but the efficiency of the entire pipeline measured in API calls. Latency at different percentiles (p50, p95, p99) tracked whether optimisations maintained acceptable performance characteristics. And actual cost per TB processed provided the bottom-line measure of whether changes delivered value.

The OSO engineers also maintained careful cost models comparing different approaches. What does it cost to process 1TB through Kafka with EBS storage versus object storage-backed Flink pipelines? Include compute, storage, networking, and API requests. Run the models for different throughput levels and retention requirements. This data-driven approach ensured architectural decisions were based on economics, not just engineering preferences.
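A toy version of such a model is sketched below. Every rate and ratio in it is an illustrative placeholder to be replaced with measured request counts and your own pricing; it exists only to show the shape of the comparison, not to reproduce OSO's numbers.

```python
def cost_per_tb_kafka_ebs(cross_az_replicas=2, transfer_usd_per_gb=0.02,
                          ebs_usd_per_gb_month=0.10, retention_fraction=0.05,
                          broker_compute_usd=2.0):
    """Illustrative cost to push 1 TB through Kafka on EBS (placeholder rates)."""
    gb = 1024
    transfer = gb * cross_az_replicas * transfer_usd_per_gb
    storage = gb * retention_fraction * ebs_usd_per_gb_month
    return transfer + storage + broker_compute_usd


def cost_per_tb_flink_s3(s3_usd_per_gb_month=0.023, retention_fraction=0.05,
                         requests_per_gb=2.0, usd_per_1k_requests=0.005,
                         flink_compute_usd=2.0):
    """Illustrative cost to push 1 TB through Flink writing to S3 (placeholder rates)."""
    gb = 1024
    storage = gb * retention_fraction * s3_usd_per_gb_month
    requests = gb * requests_per_gb / 1000 * usd_per_1k_requests
    return storage + requests + flink_compute_usd


print(f"Kafka + EBS : ${cost_per_tb_kafka_ebs():.2f} per TB")
print(f"Flink + S3  : ${cost_per_tb_flink_s3():.2f} per TB")
```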

The Cloud-Native Future

The migration from block storage to object storage represents more than optimisation—it’s a fundamental shift in how we architect data streaming platforms. The OSO engineers’ experience processing massive data volumes demonstrated that disaggregated architectures built on object storage deliver superior reliability, dramatic cost savings, and operational simplicity compared to traditional approaches.

The numbers tell the story. A 71% reduction in API requests through intelligent batching and key-by operations. Elimination of cross-AZ charges that previously represented the largest single cost component. Four to five times cost reduction in specific pipelines through storage class optimisation. Faster recovery times, simpler operational procedures, and infrastructure that scales independently in compute and storage dimensions.

But this transition required rethinking established patterns. Kafka’s tiered storage, whilst useful for specific workloads, doesn’t address all use cases. Serverless Kafka solutions improve the model but still introduce network hops that aren’t necessary for controlled environments. The most efficient approach for high-throughput, stateless pipelines often bypasses the Kafka protocol entirely, using stream processing frameworks like Flink to write directly to object storage with carefully optimised batching and coordination mechanisms.

The key insight is matching storage architecture to workload characteristics. Not every pipeline benefits from object storage, and not every object storage implementation delivers optimal cost-performance. Success requires understanding trade-offs: latency versus cost, operational simplicity versus control, protocol compatibility versus efficiency.

As object storage services continue to innovate—new storage classes, lower latency, reduced request costs—the advantages of disaggregated architectures will only grow. Stream processing frameworks like Flink mature their object storage integrations, with Flink 2.0’s disaggregated state management representing a watershed moment. Table formats evolve to better support streaming use cases, with projects like Paimon explicitly targeting real-time workloads.

The patterns established today through careful engineering and measured experimentation will define best practices for the next generation of real-time data platforms. The future of data streaming is disaggregated. The future is cloud-native. And the future writes to object storage, because that’s where the economics and architecture align for massive-scale, high-throughput workloads.

For organisations processing terabytes per minute across distributed clusters, every optimisation matters. The journey from block storage to object storage isn’t just about reducing costs—it’s about building platforms that scale economically, operate reliably, and evolve continuously with cloud infrastructure innovations. The OSO engineers proved this is possible, one request optimisation at a time.
