AWS S3 as Storage for Data Products

Category: Data Platform Platform: AWS

Context

How do we store analytical data, so that they can be processed efficiently and shared with other teams as data products?

Most teams use Apache Kafka to exchange data and publish domain events. On average, we expect a total of 10 TB analytical data per team.

Many domain teams use AWS services for their operational systems.

Decision

We use AWS S3 as storage for data products.

Consequences

Scalable storage, with a pay-per-use cost model
Costs ~30 USD/TB for standard storage class
Managed connectors available, such as Amazon S3 Sink Connector for Confluent Platform
Engineers need to learn and use appropriate file formats
Data syntax needs to be configured in a meta store (Glue Metastore)
No native integration for data analytics tools (Looker, Tableau, …)
AWS IAM to manage access management

Considered Alternatives

AWS Redshift

Automation

S3 buckets can be created with Terraform

This site is open source. Improve this page.