File Format
Category: Interoperability
Platform: Databricks, Azure Synapse Analytics, Generic Data Lake
Context
Data products are stored as files on Azure Data Lake Storage Gen2 (Data Product Storage).
To ensure interoperability and consistent usage patterns, we want to agree on a common file format.
We assume that data products will frequently be combined across domains.
Decision
We use JSON as the file format for data products.
Entries are separated by newlines (NDJSON, newline-delimited JSON).
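As a minimal sketch of the agreed format, written and read with Python's standard json module; the record fields (order_id, amount) and the file name are illustrative, not part of this policy:

```python
import json

# Each entry is one JSON object on its own line (NDJSON).
orders = [
    {"order_id": "o-1001", "amount": 42.50},
    {"order_id": "o-1002", "amount": 13.99},
]

# Write: one serialized object per line, separated by "\n".
with open("orders.ndjson", "w") as f:
    for order in orders:
        f.write(json.dumps(order) + "\n")

# Read: deserialize line by line, skipping blank lines.
with open("orders.ndjson") as f:
    records = [json.loads(line) for line in f if line.strip()]

assert records == orders
```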
Consequences
- Supports complex structures, such as arrays and objects
- No need to manage a schema to write data; readers can infer it at read time (see the Spark sketch after this list)
- JSON is a simple format, familiar to all engineers
- Widely supported by many tools (such as Kafka connectors), which makes data ingestion simple
- High I/O and retrieval costs when querying data sets, as plain JSON files are uncompressed
- Queries require full reads, which makes JOIN operations slow and expensive compared to column-oriented file formats
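As an illustration of the schema-on-read consequence above, a Spark-based consumer (e.g. on Databricks or a Synapse Spark pool) can read NDJSON without a predefined schema; the storage path below is a placeholder, and `spark` is the session predefined in notebook environments:

```python
# Hypothetical data product path on Data Product Storage (ADLS Gen2).
path = "abfss://dataproducts@storageaccount.dfs.core.windows.net/orders/"

# spark.read.json handles newline-delimited JSON natively and
# infers the schema by scanning the data.
df = spark.read.json(path)
df.printSchema()
```

Note that schema inference costs an extra pass over the data, which adds to the read costs listed above.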
Follow-Up Questions
- Partitioning
- How to document the schema?
- Timestamp format
Automation
- Automated testing: Periodically query all data products and try to deserialize the latest file of each (see the sketch below)
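A sketch of such a check, assuming data products are available as directories of .ndjson files and the newest file is picked by modification time; the mount path and naming convention are assumptions, not part of this policy:

```python
import json
from pathlib import Path

def check_latest_file(product_dir: Path) -> bool:
    """Try to deserialize every line of the most recent file of a data product."""
    files = sorted(product_dir.glob("*.ndjson"), key=lambda p: p.stat().st_mtime)
    if not files:
        return False  # no files yet; treat as a failed check
    try:
        with files[-1].open() as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises on malformed JSON
        return True
    except json.JSONDecodeError:
        return False

# Example run over a local mount of Data Product Storage (path is a placeholder).
failing = [d.name for d in Path("/mnt/dataproducts").iterdir()
           if d.is_dir() and not check_latest_file(d)]
print("Non-conforming data products:", failing)
```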