Data Product Specification
Category: Interoperability
How do we specify the syntax and semantics of data products in a standardized way?
We specify data products with Agile Lab's Data Product Specification, illustrated by the following example descriptor:
id: urn:dmb:dp:my_domain:my_data_product:1
name: my data product
fullyQualifiedName: My Data Product
description: this data product represents the xxx functional entity
kind: dataproduct
domain: my_domain
version: 1.0.0
environment: development
dataProductOwner: tom_smith_corp.com
dataProductOwnerDisplayName: Tom Smith
email: mailto:distribution_list@corp.com
ownerGroup: dataproduct1_corp.com
devGroup: dataproduct1_dev_corp.com
informationSLA: 2WD
status: DRAFT
maturity: Strategic
billing: {}
tags: []
specific: {}
components:
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port
    name: my raw s3 port
    fullyQualifiedName: My Raw S3 Port
    description: s3 raw output port
    kind: outputport
    version: 1.0.1
    infrastructureTemplateId: microservice-id-1
    useCaseTemplateId: template-id-1
    dependsOn: []
    platform: CDP on AWS
    technology: s3_cdp
    outputPortType: Files
    creationDate: 2022-04-05T16:53:00.000Z
    startDate:
    processDescription: this output port is generated by a Spark job scheduled every day at 2 AM; the run lasts approximately 2 hours
    dataContract:
      schema: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes
        upTime: 99.9%
      termsAndConditions: only usable in the development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
    dataSharingAgreements:
      purpose: this output port provides a rich set of profitability KPIs related to the customer
      billing: $5 for each full scan
      security: in order to consume this output port, an additional security check with compliance must be done
      intendedUsage: the dataset is huge, so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domains, but not for customer care
      limitations: it is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
    tags:
      - tagFQN: experimental
        source: Tag
        labelType: Manual
        state: Confirmed
      - tagFQN: structured
        source: Tag
        labelType: Manual
        state: Confirmed
    sampleData: {}
    semanticLinking: {}
    specific:
      directory: history
      bucket: ms-datamesh-s3
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port
    name: my view impala port
    fullyQualifiedName: My View Impala Port
    description: impala view output port
    kind: outputport
    version: 1.1.0
    infrastructureTemplateId: microservice-id-2
    useCaseTemplateId: template-id-2
    dependsOn: [urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port]
    platform: CDP on AWS
    technology: impala_cdp
    outputPortType: SQL
    creationDate: 2022-04-05T17:00:00.000Z
    startDate:
    processDescription:
    dataContract:
      schema:
        - name: employeeId
          dataType: string
          description: global addressable identifier for an employee
          constraint: PRIMARY_KEY
          tags:
            - tagFQN: GlobalAddressableIdentifier
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_name
          dataType: string
          description: employee's first name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: last_name
          dataType: string
          description: employee's last name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: birthdate
          dataType: date
          description: employee's birthdate
          constraint: NOT_NULL
          tags: []
        - name: gender
          dataType: string
          description: employee's gender
          constraint: NOT_NULL
          tags: []
        - name: residential_address
          dataType: struct
          description: employee's residential address
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_hire_date
          dataType: date
          description: the date of the employee's first hire at mybank, regardless of whether the contract was temporary or permanent
          constraint: NOT_NULL
          tags: []
        - name: last_working_date
          dataType: date
          description: the last day the employee worked for mybank
          constraint: NULL
          tags: []
        - name: last_update
          dataType: date
          description: the last date the record was updated
          constraint: NULL
          tags: []
        - name: businessTs
          dataType: timestamp
          description: the business timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
        - name: writeTs
          dataType: timestamp
          description: the technical (write) timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes
        upTime: 99.9%
      termsAndConditions: only usable in the development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_view_impala_port
      biTempBusinessTs: businessTs
      biTempWriteTs: writeTs
    dataSharingAgreements:
      purpose: this output port provides a rich set of profitability KPIs related to the customer
      billing: $5 for each full scan
      security: in order to consume this output port, an additional security check with compliance must be done
      intendedUsage: the dataset is huge, so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domains, but not for customer care
      limitations: it is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
    tags: []
    sampleData:
      columns:
        - name
        - surname
      rows:
        - - Jace
          - Beleren
        - - Gideon
          - Jura
        - - Chandra
          - Nalaar
    semanticLinking: {}
    specific:
      database: my_database
      table: my_table
      location: /my_path
      schema:
        firstName: string
        lastName: string
      format: PARQUET
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_spark_workload
    name: my spark workload
    fullyQualifiedName: My Spark workload
    description: spark batch workload
    kind: workload
    version: 1.1.1
    infrastructureTemplateId: microservice-id-3
    useCaseTemplateId: template-id-3
    platform: CDP on AWS
    technology: spark
    workloadType: batch
    connectionType: DataPipeline
    tags: []
    readsFrom: [urn:dmb:ex:mainframe_db2_database]
    specific:
      artifactory: ms-datamesh-s3
      artefact: /path/to/my/spark/workload.jar
      service: my_cdp_service
      cluster: my_cde_cluster
      className: com.mycompany.MySparkApp
      args:
        - arg1
        - arg2
      driverCores: 1
      driverMemory: 4g
      executorCores: 4
      executorMemory: 4g
      numExecutors: 3
      schedule:
        cronExpression: 0 0 0,22 ? * * *
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_observability
    name: my observability
    fullyQualifiedName: My Observability
    description: observability for my data product
    kind: observability
    infrastructureTemplateId: microservice-id-4
    useCaseTemplateId: template-id-4
    version: 1.1.1
    endpoint: http://develop/my_domain/my_data_product/1.0.0/obs
    completeness:
    dataProfiling:
    freshness:
    availability:
    dataQuality:
    specific:
      restApiName: obs_api
      stageName: data_mesh
      bucket: ms-datamesh-s3
      obsEndpoint:
        - artifact: path/to/my/obs_dq.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_dq
          awsResourceName: my_data_product_obs_dq
          awsResourcePath: /data_quality
        - artifact: path/to/my/obs_workload.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_workload
          awsResourceName: my_data_product_obs_workload
          awsResourcePath: /workload
Source: https://github.com/agile-lab-dev/Data-Product-Specification/blob/main/example.yaml
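Because the descriptor is a single YAML document, basic conformance checks are easy to automate. The following Python sketch is illustrative, not an official validator: it assumes PyYAML is installed, and the required-field lists are inferred from the example above rather than taken from the specification's schema.

```python
# Minimal sketch: sanity-check a data product descriptor against the fields
# used in the example above. Illustrative only; the field lists are inferred
# from the example, not from an official schema. Requires PyYAML.
import yaml

REQUIRED_TOP_LEVEL = ["id", "name", "kind", "domain", "version",
                      "dataProductOwner", "components"]
REQUIRED_COMPONENT = ["id", "name", "kind", "version",
                      "infrastructureTemplateId", "useCaseTemplateId"]

def validate_descriptor(path: str) -> list[str]:
    """Return a list of human-readable problems found in the descriptor."""
    with open(path) as f:
        descriptor = yaml.safe_load(f)
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in descriptor:
            problems.append(f"missing top-level field: {field}")
    if descriptor.get("kind") != "dataproduct":
        problems.append("top-level kind must be 'dataproduct'")
    for component in descriptor.get("components", []):
        cid = component.get("id", "?")
        for field in REQUIRED_COMPONENT:
            if field not in component:
                problems.append(f"component {cid}: missing {field}")
        # Every output port carries its own data contract (see the note below).
        if component.get("kind") == "outputport" and "dataContract" not in component:
            problems.append(f"output port {cid}: missing dataContract")
    return problems

if __name__ == "__main__":
    for problem in validate_descriptor("example.yaml"):
        print(problem)
```

Run against the example above, this sketch should print nothing.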
dataContract has a one-to-one relationship with its output port; it cannot be used to describe bilateral data contracts between a data product provider and its consumers.
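Because of this one-to-one relationship, a consumer looks up the contract of the specific port it binds to instead of negotiating a separate agreement. A minimal sketch under the same assumptions as above (PyYAML; the file name and port id are taken from the example):

```python
# Minimal sketch: look up the data contract of a single output port.
# Illustrative only; field names follow the example descriptor above.
import yaml

def contract_for(descriptor_path: str, port_id: str) -> dict:
    """Return the dataContract of the output port with the given id."""
    with open(descriptor_path) as f:
        descriptor = yaml.safe_load(f)
    for component in descriptor["components"]:
        if component.get("kind") == "outputport" and component["id"] == port_id:
            return component["dataContract"]
    raise KeyError(f"no output port with id {port_id}")

contract = contract_for(
    "example.yaml",
    "urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port",
)
print(contract["SLA"]["upTime"])                 # e.g. 99.9%
print([col["name"] for col in contract["schema"]])  # column names
```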