Date: August 30, 2024
Canva, the online design platform, has moved its product analytics pipeline from a combination of Amazon Simple Notification Service (SNS) and Simple Queue Service (SQS) to Amazon Kinesis Data Streams (KDS). The change cut costs by 85% while the platform handles 25 billion events per day, according to an article by Rafal Gancarz on InfoQ.
Background and Evaluation Process
Canva’s product analytics platform is a critical component of its service, supporting various user-facing features such as personalization, recommendations, usage statistics, and insights. The platform also plays a pivotal role in enabling A/B testing for new product features. Given the scale of operations, the data pipeline required not only high throughput but also high availability (99.999% uptime), cost-effectiveness, reliability, and user-friendliness.
For the Minimum Viable Product (MVP), the Canva team responsible for the event-driven architecture (EDA) used a combination of AWS SQS and SNS. These services were easy to set up and provided excellent elasticity and scalability. However, they accounted for 80% of the cost of running the architecture, prompting the team to look for alternatives that could meet the performance requirements at a lower cost.
The Decision for KDS
The team evaluated two other AWS services: Amazon Managed Streaming for Apache Kafka (MSK) and Amazon Kinesis Data Streams (KDS). After a thorough comparison of costs, performance, and maintainability, KDS emerged as the winner. KDS offered a significantly lower cost (85% cheaper than SQS+SNS) and a low maintenance burden, despite having a higher latency of 10-20 milliseconds compared to MSK, which was deemed acceptable.
Optimizations and Cost Savings
To further enhance the cost-effectiveness of the KDS solution, the team implemented event batching and zstd compression, achieving a compression ratio of 10x with a batch compression delay of 100 milliseconds. This move alone was estimated to save $600,000 annually.
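The batching-plus-compression idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the EventBatcher class, its parameters, and the JSON encoding are hypothetical (Canva's events are defined with Protocol Buffers), and zlib from the Python standard library stands in for zstd so the example runs without extra dependencies. Only the 100-millisecond delay comes from the article.

```python
import json
import time
import zlib


class EventBatcher:
    """Buffers events and emits one compressed payload per batch (illustrative only)."""

    def __init__(self, max_delay_s=0.1, max_events=500,
                 compress=zlib.compress, clock=time.monotonic):
        self.max_delay_s = max_delay_s    # 100 ms batching delay, as in the article
        self.max_events = max_events      # hypothetical size cap per batch
        self.compress = compress          # zlib here; a zstd compressor could be passed in
        self.clock = clock
        self._events = []
        self._first_at = None

    def add(self, event):
        """Buffer one event; return a compressed batch when it is time to flush, else None."""
        if not self._events:
            self._first_at = self.clock()
        self._events.append(event)
        waited = self.clock() - self._first_at
        if len(self._events) >= self.max_events or waited >= self.max_delay_s:
            return self.flush()
        return None

    def flush(self):
        """Compress and return everything buffered so far, or None if the buffer is empty."""
        if not self._events:
            return None
        raw = "\n".join(json.dumps(e, sort_keys=True) for e in self._events)
        self._events = []
        self._first_at = None
        return self.compress(raw.encode("utf-8"))


def decode_batch(payload, decompress=zlib.decompress):
    """Reverse of EventBatcher.flush: decompress and parse one event per line."""
    lines = decompress(payload).decode("utf-8").split("\n")
    return [json.loads(line) for line in lines]
```

In this sketch the delay is only checked when a new event arrives; a production batcher would also flush from a timer so that a quiet producer does not hold events indefinitely. Sending one compressed batch as a single stream record is what reduces the bytes, and therefore the shard capacity, that the stream has to carry.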
One challenge with using KDS was its high tail latency (over 500 milliseconds) and the hard write-throughput limit of 1 MB/s per shard. To address this, the team developed fallback logic using SQS queues, which resulted in a p99 latency of under 20 milliseconds, while the monthly cost for SQS was less than $100. This fallback option also served as a failover mechanism in case KDS experienced severe service degradation or interruptions.
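A rough sketch of how such a fallback might be structured is shown here. It is not Canva's implementation: the function name, the callables send_primary and send_fallback, and the 20-millisecond timeout are hypothetical placeholders standing in for a KDS write and an SQS write.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for primary-path writes (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def send_with_fallback(payload, send_primary, send_fallback, timeout_s=0.02):
    """Try the primary stream first; use the fallback queue on timeout or error.

    Returns "primary" or "fallback" to show which path accepted the payload.
    """
    future = _pool.submit(send_primary, payload)
    try:
        future.result(timeout=timeout_s)
        return "primary"
    except FutureTimeout:
        # Slow write (tail latency): stop waiting. The primary call may still
        # complete later, so downstream consumers must tolerate duplicates.
        pass
    except Exception:
        # Throttling or an outage on the primary path.
        pass
    send_fallback(payload)
    return "fallback"
```

The design choice illustrated is that the caller's latency is bounded by the timeout plus the fallback write, rather than by the slowest primary write. Because a timed-out primary write is not cancelled, the same payload can arrive on both paths, so this approach assumes consumers deduplicate events.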
Ensuring Compatibility and Data Quality
Canva’s engineers used Protocol Buffers to define event schemas and to evolve event definitions over time. The company already uses Protocol Buffers to define contracts between microservices, but for event definitions the team required full backward and forward compatibility. The team also created an in-house code generation tool on top of protoc, named Datumgen, which validates compatibility requirements and generates code in multiple languages. Additionally, the tool extracts metadata from event definitions to enrich the event directory, which includes details about technical and business owners as well as field descriptions.
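The article does not describe which rules Datumgen enforces, so the following is only a sketch of what a compatibility check of this kind could look like. It applies two widely recommended Protocol Buffers practices: a field number keeps its type across versions, and a removed field's number is reserved rather than reused. The function name and the dictionary representation of a message are hypothetical.

```python
def compatibility_errors(old_fields, new_fields, reserved=frozenset()):
    """Compare two versions of a message described as {field_number: (name, type)}.

    Returns a list of human-readable problems; an empty list means none of the
    rules checked here was violated.
    """
    errors = []
    for number, (name, ftype) in sorted(old_fields.items()):
        if number not in new_fields:
            # A removed field is acceptable only if its number is reserved.
            if number not in reserved:
                errors.append(f"field {number} ({name}) removed without reserving its number")
            continue
        _, new_type = new_fields[number]
        if new_type != ftype:
            errors.append(f"field {number} ({name}) changed type {ftype} -> {new_type}")
    for number in sorted(new_fields):
        # A reserved number must never be assigned to a field again.
        if number in reserved:
            errors.append(f"field {number} reuses a reserved number")
    return errors
```

For example, adding a new field with an unused number produces no errors, while changing an existing field from string to int64 produces one. The type rule here is deliberately stricter than the Protocol Buffers wire format, which treats some type changes as compatible.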
This meticulous approach to documentation and event schema management helps Canva maintain high data quality, avoid costly runtime issues caused by schema incompatibility, and let engineers discover the product analytics events that are available.
Conclusion
Canva’s adoption of Amazon KDS has not only led to substantial cost savings but also enhanced the performance and reliability of its product analytics platform. By leveraging KDS and implementing innovative optimization strategies, Canva has set a benchmark for cost-effective and efficient data processing in the tech industry. This move underscores the importance of continuously evaluating and optimizing technology stacks to meet the evolving demands of high-scale operations.