## Uber Pushes for Kafka Tiered Storage, Sparking Efficiency Debate

**Keywords:** Uber, Kafka, Storage

**August 20, 2024** – Ride-hailing company Uber recently announced the addition of new tiered storage capabilities to Apache Kafka, aiming to address scalability and efficiency issues in large Kafka clusters. The move has sparked industry debate over how efficient, and how complex, tiered storage is in real-world deployments.

Uber’s new feature allows Kafka to extend its storage from local broker disks to remote storage systems such as HDFS, Amazon S3, Google Cloud Storage, and Azure Blob Storage. This means Kafka clusters can scale storage independently of compute resources, reducing costs and operational complexity.
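As a rough illustration of what opting in looks like, the sketch below builds the topic-level settings for a tiered topic in open-source Apache Kafka (3.6 or later). The property names `remote.storage.enable`, `local.retention.ms`, and `retention.ms` are Kafka topic configs; the helper function, its name, and the retention values are assumptions made for this example and do not come from Uber's announcement. The broker side also needs `remote.log.storage.system.enable=true` and a remote storage plugin, which is not shown here.

```python
def tiered_topic_config(local_retention_hours: int, total_retention_days: int) -> dict:
    """Build topic-level settings for a tiered-storage topic (hypothetical helper)."""
    local_ms = local_retention_hours * 60 * 60 * 1000
    total_ms = total_retention_days * 24 * 60 * 60 * 1000
    # Data cannot stay on local disk longer than it is retained overall.
    if local_ms > total_ms:
        raise ValueError("local retention cannot exceed total retention")
    return {
        "remote.storage.enable": "true",     # copy closed segments to the remote tier
        "local.retention.ms": str(local_ms), # how long data stays on broker disks
        "retention.ms": str(total_ms),       # how long data is kept overall
    }


# Example values are assumptions: 6 hours on local disk, 30 days in total.
config = tiered_topic_config(local_retention_hours=6, total_retention_days=30)
for key, value in config.items():
    print(f"{key}={value}")
```

The resulting key/value pairs could be passed to whatever tool is used to create or alter the topic.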

Uber states that the feature was inspired by the limitations of the traditional way of scaling Kafka clusters. Adding broker nodes just to gain disk space also adds memory and CPU that go unused, which lowers storage cost efficiency. Tiered storage separates storage from processing, enabling more efficient data management.

The tiered storage architecture comprises two storage tiers: local and remote. The local tier consists of the broker’s local storage and holds latency-sensitive data; the remote tier, such as HDFS or cloud object storage, holds historical data.
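The split between the two tiers can be pictured as a routing decision on the requested offset. The following is a toy model, not Kafka's broker code: the `PartitionLog` class and `tier_for_offset` function are hypothetical names, and the model assumes that anything older than the first offset still held on local disk is read from the remote tier.

```python
from dataclasses import dataclass


@dataclass
class PartitionLog:
    log_start_offset: int        # oldest offset still retained in either tier
    local_log_start_offset: int  # oldest offset still on the broker's disk
    log_end_offset: int          # next offset to be written


def tier_for_offset(log: PartitionLog, offset: int) -> str:
    """Return which tier would serve a fetch at the given offset (toy model)."""
    if offset < log.log_start_offset or offset > log.log_end_offset:
        return "out_of_range"
    if offset < log.local_log_start_offset:
        return "remote"  # historical data, read back from HDFS or object storage
    return "local"       # recent data, served from the broker's own disk


# Example offsets are made up for illustration.
log = PartitionLog(log_start_offset=100, local_log_start_offset=5000, log_end_offset=9000)
for requested in (50, 100, 4999, 5000, 9000):
    print(requested, tier_for_offset(log, requested))
```

In this model a consumer reading near the tail never touches the remote tier, while a consumer replaying old data does, without either of them needing different client code.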

Red Hat believes that the advantages of tiered storage include:

* **Elasticity:** Compute and storage resources can be scaled independently.
* **Isolation:** Latency-sensitive data can be served through the local tier, while historical data is served through the remote tier without changes to Kafka clients.
* **Cost-effectiveness:** Remote object storage systems are typically cheaper than fast local disks, making Kafka storage cheaper and virtually unlimited.
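The cost argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a simplified model under stated assumptions: local data is stored once per replica, the remote tier is billed for a single copy, and request and transfer fees are ignored. The function name and all prices are placeholders chosen for the example, not quotes from any vendor.

```python
def monthly_storage_cost(retained_tb, hot_tb, replication_factor,
                         disk_price_per_tb, object_price_per_tb):
    """Return (local_only_cost, tiered_cost) per month under a simplified model."""
    # Without tiering, every replica keeps the full retained data on local disk.
    local_only = retained_tb * replication_factor * disk_price_per_tb
    # With tiering, only the hot set is replicated on local disk; the rest is
    # assumed to be billed once in object storage.
    tiered = (hot_tb * replication_factor * disk_price_per_tb
              + (retained_tb - hot_tb) * object_price_per_tb)
    return local_only, tiered


# Placeholder numbers: 100 TB retained, 5 TB hot, 3 replicas,
# 80 per TB-month for disk and 23 per TB-month for object storage.
local_only, tiered = monthly_storage_cost(100, 5, 3, 80, 23)
print(f"local only: {local_only}, tiered: {tiered}")
```

How large the gap is in practice depends heavily on the real prices, the size of the hot set, and the read traffic against the remote tier.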

AWS further develops this concept through Amazon Managed Streaming for Apache Kafka (Amazon MSK) tiered storage. AWS believes that tiered storage can significantly improve Kafka cluster availability and elasticity, with key advantages including:

* **Faster Broker Recovery:** Data is automatically moved over time from faster Amazon Elastic Block Store (Amazon EBS) volumes to a more cost-effective storage tier, which speeds up recovery after a broker failure.
* **Efficient Load Balancing:** Tiered storage can reduce the amount of data that needs to be moved during partition reassignment, improving load balancing efficiency.
* **Faster Scaling:** Tiered storage allows MSK clusters to scale seamlessly, without large data transfers or prolonged partition rebalancing.
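The load balancing and scaling points both come down to how many bytes must be copied when a partition moves to another broker. The sketch below is a rough estimate under assumptions: with tiering, only the data still on local disk is copied, and the copy runs at a constant throughput. The function name and all sizes are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def reassignment_seconds(partition_bytes, local_bytes, throughput_bytes_per_sec, tiered):
    """Estimate how long copying one partition replica to a new broker takes."""
    # With tiered storage, only the local portion is assumed to be copied;
    # older segments stay in the remote tier and are not moved.
    bytes_to_copy = local_bytes if tiered else partition_bytes
    return bytes_to_copy / throughput_bytes_per_sec


# Made-up example: 500 GiB retained, 20 GiB still local, 50 MiB/s copy rate.
without_tiering = reassignment_seconds(500 * GIB, 20 * GIB, 50 * MIB, tiered=False)
with_tiering = reassignment_seconds(500 * GIB, 20 * GIB, 50 * MIB, tiered=True)
print(f"without tiering: {without_tiering:.0f} s, with tiering: {with_tiering:.0f} s")
```

Under these assumptions the copy time scales with the local retention window rather than with total retention.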

However, not everyone is optimistic about tiered storage. Richard Artoul of WarpStream believes that while tiered storage can help reduce costs, it may introduce new complexities and potential failure modes. He points out that managing two storage tiers increases complexity, potentially adding operational overhead and impacting system reliability. Additionally, fetching data from remote storage may introduce latency, affecting real-time processing capabilities.

The real-world performance of tiered storage still needs more time and case studies to validate. While the technology holds potential for cost reduction and efficiency improvements, its complexity and potential performance impact need to be weighed as well. Its future will depend on how well it balances those cost and efficiency gains against the added complexity and performance risks.

Source: https://mp.weixin.qq.com/s/kcLcdhMXqSEorjqLr0dt7w