Domain-specific metadata schema development

status: in development

Introduction

To find a dataset in the National Health Data Catalogue, the dataset needs to be described well. The information that describes the dataset is called ‘metadata’; the way you structure the metadata and the terms you use is called the ‘metadata schema’. While the HRI core metadata schema, based on DCAT-AP 3.0, addresses the fundamental elements, it may fall short of fully describing a dataset. Even after this schema is expanded with health-related terms through integration with HealthDCAT-AP and DCAT-AP NL, it may not fully meet the needs of specific domains. To improve disciplinary dataset discovery, we may need domain-specific metadata (see also recommendation 4.4 of the Research Data Alliance document on dataset discoverability).

This document serves as a guide for working groups from different domains to develop their own domain-specific metadata schemas for the National Health Data Catalogue. The guide is organized in process steps (see figure below): first building the team, then collecting requirements from the domain, and finally turning these requirements into domain-specific metadata schemas that are in line with, and extend, the HRI core metadata schema. Each process step is described in more detail in its subpage, together with deliverables and examples.

We encourage working groups to actively provide feedback on the process, including what worked, what didn’t, and any additional steps that may be needed. For each step, we will collect examples or prototypes of the artifacts produced.

Audience

The intended audience of this document is working groups from different domains that would like to develop their own domain-specific metadata schema (or ‘petal’) and need guidance in navigating the process.

Scope and schema considerations

  • The development of a domain-specific metadata schema focuses on the metadata of datasets: the discoverability of datasets (or, more generally, resources) in the National Health Data Catalogue, and the information (metadata) on how to access or reuse these datasets. To a large extent, these discoverability aspects may already be covered by DCAT-AP and HealthDCAT-AP as implemented by the HRI core metadata schema. However, a domain may have additional wishes for the discoverability of its datasets, and those wishes are the scope of a domain-specific metadata schema.

  • The semantic modeling of the data (points) itself is currently out of scope, even though it may largely follow the same steps. This is also called metadata of data or data modeling, e.g. modeling descriptions and relations between variables, values and records in a dataset. This kind of data modeling (and its process) will be picked up in plateau 3 (2025).

  • The use of RDF (Resource Description Framework) provides a way to represent metadata in a machine-readable format and promotes the reuse of existing vocabularies and ontologies, which supports interoperability across different domains (see I1 of the FAIR guiding principles). By using well-established standards like DCAT-AP, FOAF and SKOS, you can describe datasets in a way that is consistent with our core HRI metadata schema as well as with other domains.

  • While it is highly encouraged to reuse as many of the already existing terms as possible, RDF also allows the creation of custom properties that meet specific needs of your domain, if no suitable terms exist.
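To make the last two points concrete, the following is a minimal sketch in plain Python (standard library only) of how one dataset description could combine reused terms from DCAT, Dublin Core Terms, FOAF and SKOS with a single custom property. Everything under example.org is an assumption made for illustration: the dataset, the publisher, the theme, the `PETAL` namespace and the `scannerFieldStrength` property are hypothetical and are not part of the HRI core metadata schema. The output is N-Triples, one of the standard RDF serialisations; in practice an RDF library would normally be used instead of hand-written formatting.

```python
# Namespaces of the reused, published vocabularies.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
# Hypothetical namespace for a domain-specific extension ('petal').
PETAL = "https://example.org/petal/imaging#"


def iri(value):
    """Format an IRI as an N-Triples term."""
    return f"<{value}>"


def literal(value, lang=None):
    """Format a string as an N-Triples literal, optionally language-tagged."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def describe_dataset():
    """Return triples for one made-up dataset, using reused and custom terms."""
    dataset = iri("https://example.org/dataset/brain-mri-2024")
    publisher = iri("https://example.org/org/example-hospital")
    theme = iri("https://example.org/theme/neuroimaging")
    return [
        # Reused terms: DCAT and Dublin Core Terms for the dataset itself.
        (dataset, iri(RDF + "type"), iri(DCAT + "Dataset")),
        (dataset, iri(DCT + "title"), literal("Brain MRI scans 2024", "en")),
        (dataset, iri(DCAT + "theme"), theme),
        (dataset, iri(DCT + "publisher"), publisher),
        # Reused terms: FOAF for the publishing organisation.
        (publisher, iri(RDF + "type"), iri(FOAF + "Agent")),
        (publisher, iri(FOAF + "name"), literal("Example Hospital")),
        # Reused terms: SKOS for the theme concept.
        (theme, iri(RDF + "type"), iri(SKOS + "Concept")),
        (theme, iri(SKOS + "prefLabel"), literal("Neuroimaging", "en")),
        # Custom term (hypothetical): assumed here to have no suitable
        # existing equivalent, so the domain defines its own property.
        (dataset, iri(PETAL + "scannerFieldStrength"), literal("3T")),
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(describe_dataset()))
```

Only the last triple uses a domain-defined property; all other statements reuse existing vocabularies, which keeps the description readable for any consumer that already understands DCAT, FOAF and SKOS.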

Prerequisites

  • The domain is defined and organized in such a way that a metadata taskforce of that domain has the opportunity and the mandate to speak for, and make decisions with, that domain about a domain-specific metadata schema. In general, we consider data source domains (e.g., omics, imaging, clinical data) and disease domains (e.g., oncology, cardiovascular, rare diseases), which may or may not be further subdivided into subdomains.

  • Developing the domain-specific metadata schema is primarily the responsibility of the domain, with consultation and support from the Health-RI hub. Implementation of the schemas is a shared responsibility between the domain and the hub.

Process overview

Figure 1. Schematic overview of the process. Steps 5 and 6 are in light grey because these are more about the implementation (into the catalogue) and governance processes and less about the modeling process.

Although it is depicted as a linear, sequential process, in practice the process is often nonlinear. The steps serve as a guide to the activities you carry out, and they may run in parallel. Agreeing on definitions, modeling the semantics, and getting community endorsement can be very cumbersome, so working on a schema through repeated cycles (iteratively) and starting small (incrementally) may be more efficient than trying to be perfect and complete from the start.

Timelines

[Add picture that Hannah uses + explain alternative design sprint approach].

Contributors and contributing

Authors

  • Ana Konrad - Health-RI

  • Niek van Ulzen - Health-RI

Reviewers

  • Alexander Harms - Erasmus UMC

  • Jolanda Strubel - Health-RI

  • Lucie Kulhankova - Health-RI

  • Rob Hooft - Health-RI

  • XiaoFeng Liao - Health-RI

Contributing

There are different ways in which you can get involved in developing this method and these pages, ranging from minimal to maximal involvement:

  • Giving Feedback. If you have an idea, notice something missing, encounter (textual) errors or would like to compliment the work, you can use the Confluence comments or send us an email.

  • Providing examples / prototypes. If you have good examples or prototypes of artifacts for a step (such as a requirements document, a list of competency queries, a list of properties and their definitions, a UML diagram of the conceptual or semantic model, etc.), we look forward to adding them to this Confluence page. Some polishing might be needed to make an example useful and comprehensible for others.

  • Reviewing. If you would like to review the text of one or more steps, then please contact us.

  • Co-Writing. If you would like to (co-)write one or more steps, then please contact us.

Ideally, this documentation is developed in parallel with the actual schema development, so that we can learn from practice and adapt the process steps accordingly. That said, we encourage and value any type of contribution.

Sources and further reading

The process and steps are partly based on: