Data Product Specification
Category: Interoperability
How do we specify the syntax and semantics of data products in a standardized way?
We specify data products with Agile Lab's Data Product Specification, illustrated by the following example descriptor:
id: urn:dmb:dp:my_domain:my_data_product:1
name: my data product
fullyQualifiedName: My Data Product
description: this data product represents the xxx functional entity
kind: dataproduct
domain: my_domain
version: 1.0.0
environment: development
dataProductOwner: tom_smith_corp.com
dataProductOwnerDisplayName: Tom Smith
email: mailto:distribution_list@corp.com
ownerGroup: dataproduct1_corp.com
devGroup: dataproduct1_dev_corp.com
informationSLA: 2WD
status: DRAFT
maturity: Strategic
billing: {}
tags: []
specific: {}
components:
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port
    name: my raw s3 port
    fullyQualifiedName: My Raw S3 Port
    description: s3 raw output port
    kind: outputport
    version: 1.0.1
    infrastructureTemplateId: microservice-id-1
    useCaseTemplateId: template-id-1
    dependsOn: []
    platform: CDP on AWS
    technology: s3_cdp
    outputPortType: Files
    creationDate: 2022-04-05T16:53:00.000Z
    startDate:
    processDescription: this output port is generated by a Spark job scheduled every day at 2 AM; the run lasts approximately 2 hours
    dataContract:
      schema: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes
        upTime: 99.9%
      termsAndConditions: only usable in the development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
    dataSharingAgreements:
      purpose: this output port provides a rich set of profitability KPIs related to the customer
      billing: $5 for each full scan
      security: in order to consume this output port, an additional security check with compliance must be done
      intendedUsage: the dataset is huge, so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domains, but not for customer care
      limitations: it is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
    tags:
      - tagFQN: experimental
        source: Tag
        labelType: Manual
        state: Confirmed
      - tagFQN: structured
        source: Tag
        labelType: Manual
        state: Confirmed
    sampleData: {}
    semanticLinking: {}
    specific:
      directory: history
      bucket: ms-datamesh-s3
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port
    name: my view impala port
    fullyQualifiedName: My View Impala Port
    description: impala view output port
    kind: outputport
    version: 1.1.0
    infrastructureTemplateId: microservice-id-2
    useCaseTemplateId: template-id-2
    dependsOn: [urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port]
    platform: CDP on AWS
    technology: impala_cdp
    outputPortType: SQL
    creationDate: 2022-04-05T17:00:00.000Z
    startDate:
    processDescription:
    dataContract:
      schema:
        - name: employeeId
          dataType: string
          description: global addressable identifier for an employee
          constraint: PRIMARY_KEY
          tags:
            - tagFQN: GlobalAddressableIdentifier
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_name
          dataType: string
          description: employee's first name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: last_name
          dataType: string
          description: employee's last name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: birthdate
          dataType: date
          description: employee's birthdate
          constraint: NOT_NULL
          tags: []
        - name: gender
          dataType: string
          description: employee's gender
          constraint: NOT_NULL
          tags: []
        - name: residential_address
          dataType: struct
          description: employee's residential address
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_hire_date
          dataType: date
          description: the date of the employee's first hire at mybank, regardless of whether the contract was temporary or permanent
          constraint: NOT_NULL
          tags: []
        - name: last_working_date
          dataType: date
          description: the last day the employee worked for mybank
          constraint: NULL
          tags: []
        - name: last_update
          dataType: date
          description: the last date the record was updated
          constraint: NULL
          tags: []
        - name: businessTs
          dataType: timestamp
          description: the business timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
        - name: writeTs
          dataType: timestamp
          description: the technical (write) timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes
        upTime: 99.9%
      termsAndConditions: only usable in the development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_view_impala_port
      biTempBusinessTs: businessTs
      biTempWriteTs: writeTs
    dataSharingAgreements:
      purpose: this output port provides a rich set of profitability KPIs related to the customer
      billing: $5 for each full scan
      security: in order to consume this output port, an additional security check with compliance must be done
      intendedUsage: the dataset is huge, so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domains, but not for customer care
      limitations: it is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
    tags: []
    sampleData:
      columns:
        - name
        - surname
      rows:
        - - Jace
          - Beleren
        - - Gideon
          - Jura
        - - Chandra
          - Nalaar
    semanticLinking: {}
    specific:
      database: my_database
      table: my_table
      location: /my_path
      schema:
        firstName: string
        lastName: string
      format: PARQUET
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_spark_workload
    name: my spark workload
    fullyQualifiedName: My Spark workload
    description: spark batch workload
    kind: workload
    version: 1.1.1
    infrastructureTemplateId: microservice-id-3
    useCaseTemplateId: template-id-3
    platform: CDP on AWS
    technology: spark
    workloadType: batch
    connectionType: DataPipeline
    tags: []
    readsFrom: [urn:dmb:ex:mainframe_db2_database]
    specific:
      artifactory: ms-datamesh-s3
      artefact: /path/to/my/spark/workload.jar
      service: my_cdp_service
      cluster: my_cde_cluster
      className: com.mycompany.MySparkApp
      args:
        - arg1
        - arg2
      driverCores: 1
      driverMemory: 4g
      executorCores: 4
      executorMemory: 4g
      numExecutors: 3
      schedule:
        cronExpression: 0 0 0,22 ? * * *
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_observability
    name: my observability
    fullyQualifiedName: My Observability
    description: observability for my data product
    kind: observability
    infrastructureTemplateId: microservice-id-4
    useCaseTemplateId: template-id-4
    version: 1.1.1
    endpoint: http://develop/my_domain/my_data_product/1.0.0/obs
    completeness:
    dataProfiling:
    freshness:
    availability:
    dataQuality:
    specific:
      restApiName: obs_api
      stageName: data_mesh
      bucket: ms-datamesh-s3
      obsEndpoint:
        - artifact: path/to/my/obs_dq.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_dq
          awsResourceName: my_data_product_obs_dq
          awsResourcePath: /data_quality
        - artifact: path/to/my/obs_workload.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_workload
          awsResourceName: my_data_product_obs_workload
          awsResourcePath: /workload
Source: https://github.com/agile-lab-dev/Data-Product-Specification/blob/main/example.yaml
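Because the descriptor is a single YAML document, basic conformance checks are easy to automate. The following Python sketch is illustrative, not an official validator: it assumes PyYAML is installed, and the required-field lists are inferred from the example above rather than taken from the specification's schema.

```python
# Minimal sketch: sanity-check a data product descriptor against the fields
# used in the example above. Illustrative only; the field lists are inferred
# from the example, not from an official schema. Requires PyYAML.
import yaml

REQUIRED_TOP_LEVEL = ["id", "name", "kind", "domain", "version",
                      "dataProductOwner", "components"]
REQUIRED_COMPONENT = ["id", "name", "kind", "version",
                      "infrastructureTemplateId", "useCaseTemplateId"]

def validate_descriptor(path: str) -> list[str]:
    """Return a list of human-readable problems found in the descriptor."""
    with open(path) as f:
        descriptor = yaml.safe_load(f)
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in descriptor:
            problems.append(f"missing top-level field: {field}")
    if descriptor.get("kind") != "dataproduct":
        problems.append("top-level kind must be 'dataproduct'")
    for component in descriptor.get("components", []):
        cid = component.get("id", "?")
        for field in REQUIRED_COMPONENT:
            if field not in component:
                problems.append(f"component {cid}: missing {field}")
        # Every output port carries its own data contract (see the note below).
        if component.get("kind") == "outputport" and "dataContract" not in component:
            problems.append(f"output port {cid}: missing dataContract")
    return problems

if __name__ == "__main__":
    for problem in validate_descriptor("example.yaml"):
        print(problem)
```

Run against the example above, this sketch should print nothing.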
dataContract has a one-to-one relationship with its output port; it cannot be used to describe bilateral data contracts between a data product provider and its consumers.
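Because of this one-to-one relationship, a consumer looks up the contract of the specific port it binds to instead of negotiating a separate agreement. A minimal sketch under the same assumptions as above (PyYAML; the file name and port id are taken from the example):

```python
# Minimal sketch: look up the data contract of a single output port.
# Illustrative only; field names follow the example descriptor above.
import yaml

def contract_for(descriptor_path: str, port_id: str) -> dict:
    """Return the dataContract of the output port with the given id."""
    with open(descriptor_path) as f:
        descriptor = yaml.safe_load(f)
    for component in descriptor["components"]:
        if component.get("kind") == "outputport" and component["id"] == port_id:
            return component["dataContract"]
    raise KeyError(f"no output port with id {port_id}")

contract = contract_for(
    "example.yaml",
    "urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port",
)
print(contract["SLA"]["upTime"])                 # e.g. 99.9%
print([col["name"] for col in contract["schema"]])  # column names
```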