Adding metadata from Create a Derived Table to Find MoJ data

Find MoJ data uses the Create a Derived Table service (CaDeT) as a source of metadata about the Analytical Platform. CaDeT uses a python package called dbt.

By default, all models and sources will be ingested into the Datahub catalogue, but they will not be shown in the Find MoJ data service.

Make a model visible

To make a model visible in Find MoJ data, add the dc_display_in_catalogue tag to that model. Config of CaDeT models is described in their documentation here.

For example, in dbt_project.yml you can include

models:
  courts:
    some_subdirectory:
      common_platform_derived:
        +tags:
          - dc_display_in_catalogue

This tag should be used for tables that users are expected to work with directly. Don’t add it to intermediate/staging tables.

Generated models

Some models are generated from a template. If you intend to include these generated models in the catalogue, set tags to $TAGS_WITH_DISPLAY_IN_CATALOGUE$ in the YAML template. This makes sure that the dc_display_in_catalogue tag is set in the resulting file.

For example, /model_templates/oasys/templates/models/risk/oasys/oasys__{table_name}.yml contains the following model definition:

- name: $MODEL_NAME$
  config:
    tags: $TAGS_WITH_DISPLAY_IN_CATALOGUE$

Set required metadata

When adding new entities to the catalgoue, we require that you specify some additional metadata in DBT. For example:

models:
  courts:
    +meta:
      dc_slack_channel_name: "#ask-data-engineering"
      dc_slack_channel_url: https://moj.enterprise.slack.com/archives/C8X3PP1TN
      dc_data_custodian: Joe.Bloggs

This metadata can be set at domain level, so for all tables in that domain, or individually on a per-table level.

The required fields are as follows:

field name	description	example
dc_slack_channel_name	The name of a slack channel to be used as a contact point for users of the catalogue service, including the leading ‘#’. Note: this is not the same as the owner channel for notifications.	`#data-engineering`
dc_slack_channel_url	The URL to the slack channel	`https://moj.enterprise.slack.com/archives/C8X3PP1TN`
dc_data_custodian	The Datahub user ID for the data custodian, usually in the form FirstName.LastName. This is a technical contact, not a data owner, information asset owner (IAO), or DBT owner.	`Joe.Bloggs`

Additional metadata

field name	description	example
dc_where_to_access_dataset	An enum representing how the data can be accessed by end users, eg a choice of [“AnalyticalPlatform”, “CourtsAPI”]. For DBT, this always defaults to AnalyticalPlatform.	AnalyticalPlatform

Ensure the data custodian has an account in Datahub

The user’s Datahub account must exist before you set the dc_data_custodian. This will happen automatically the first time they log into Datahub.

The user ID is visible in the URL of a user page in Datahub, e.g.

https://datahub-catalogue-dev.apps.live.cloud-platform.service.justice.gov.uk/user/urn:li:corpuser:Joe.Bloggs/owner%20of

Speak to Find MoJ data team if you would like us to manually add a set of users without them logging in.

Full example `dbt_project.yml` file

models:
  mojap_derived_tables:
    +materialized: table
    +group: default
    +meta:
      # Metadata to send Find MoJ data. Can be overriden
      # per domain/model/source
      dc_slack_channel_name: "#ask-data-modelling"
      dc_slack_channel_url: https://moj.enterprise.slack.com/archives/C03J21VFHQ9
      dc_where_to_access_dataset: AnalyticalPlatform
    bold:
      +meta:
        dc_data_custodian: jane.doe
      +group: bold
      bold_rr_pnc_ids:
      +tags:
        - bold_daily
        - dc_display_in_catalogue

Make a source visible

In CaDeT, dbt sources are not ingested. To ingest a source, contact the team and request that the source is ingested from glue.

Make a seed visible

All dbt seeds are ingested into the catalogue and shown by default.

Make a database visible

If your model has been tagged with dc_display_in_catalogue then by default it will have its parent database created in Find MoJ data, see guidance for registering a model/table

However, this database will have no associated metadata available to make it more useful to someone browsing Find MoJ data. But don’t worry you can create this metadata and this section will explain how.

Creating a metadata file for your CaDeT database

Create a branch off of latest main in the CaDeT github repoistory.

Add a new yaml file with filename as the name of your database in the folder mojap_derived_tables/database_metadata

Add your metadata to the file. The content of that file should be like the example below but with metadata relating to your specific database:

database_metadata:
  # The name of the database as it appears in prod
  name: example_derived
  # A description of the database that will aid people to understand what
  # it contains and whether it might be of use
  description: This is just an example, serves no other purpose
  # The data custodian (technical contact)
  # This should be the name as it appears before the @justice.gov.uk
  # If a functional mailbox or team email it can be given as full email
  # address
  dc_data_custodian: some.body12
  # Slack channel name for where people should direct any questions they
  # have about the data
  dc_slack_channel_name: "#ask-channel-example"
  # The url for the named channel above
  dc_slack_channel_url: "https://moj.enterprise.slack.com/archives/12"
  # A readable user friendly name for the database that can be used to
  # display in find-moj-data
  dc_readable_name: Example derived
  # A link to where a user of find-moj-data can find information on how to
  # access the data
  dc_access_requirements: "https://example-data-access.gov.uk"

Raise a pull request, when your file is complete, to have your metadata reviewed and merged to main.

When merged to main your new database metadata will be available in Find MoJ data the following day.

This page was last reviewed on 24 October 2024. It needs to be reviewed again on 24 April 2025 by the page owner #data-catalogue .