Domain-specific metadata schema development

status: in review

Introduction

To find a dataset in the National Health Data Catalogue, the dataset needs to be described well. The information that describes the dataset is called ‘metadata’; the way you structure the metadata and the terms you use is called the ‘metadata schema’. The Health-RI core metadata schema, based on DCAT-AP 3.0 and soon to include extensions like HealthDCAT-AP, DCAT-AP NL, and extra classes for Project and Study, provides a basic framework for describing health resources. It helps users find resources using general details like health themes, participant numbers, and age ranges. However, since it is designed to be minimal, it may need extra elements for specific domains. For example, in imaging datasets, the core metadata can show that imaging data is included, but it doesn’t provide specific details like the type of contrast agent, imaging technique, or the part of the body being imaged. These extra details are important for meeting more specialized needs.

To improve disciplinary dataset discovery, we may need domain-specific metadata (see also recommendation 4.4 of the Research Data Alliance document on dataset discoverability).

This document serves as a guide for working groups of different domains to develop their own domain-specific metadata schemas for the National Health Data Catalogue if they find the core metadata schema doesn’t cover important elements in their field. The guide is described in process steps (see figure below), first building the team, then collecting requirements from the domain and finally turning this into domain-specific metadata schemas that are in line with and extend the HRI core metadata schema. The respective process steps are described in more detail in the subpages with deliverables and examples.

We encourage working groups to provide active feedback on the process, including what worked, what didn’t, and any additional steps that may be needed. Per step we will collect examples or prototypes of produced artifacts.

When the term “dataset” is used, it refers specifically to a dataset as defined in the context of the National Health Data Catalogue, which is based on the DCAT Dataset class.

Audience

The intended audience of this document is working groups from different domains that would like to develop their own domain-specific metadata schema (or ‘petal’) and need guidance in navigating the process.

Scope and schema considerations

  • When developing a domain-specific metadata schema, the focus is on the metadata of datasets: the discoverability of datasets (or, more generally, resources) in the National Health Data Catalogue, and the information (metadata) on how to access or reuse these datasets. To a large extent, these discoverability aspects may already be covered by DCAT-AP and HealthDCAT-AP as implemented by the HRI core metadata schema. However, after reviewing or applying the core metadata schema, domains may identify additional needs or requests for improving the discoverability of datasets within their specific fields. These can be addressed through a domain-specific metadata schema.

  • Currently the semantic modeling of the data itself - metadata of data or so-called data modeling (e.g. modeling descriptions and relations between variables, values and records in a dataset) - is out of scope, even though it may largely follow the same steps. This kind of data modeling (and the process) will be picked up in plateau 3 (2025).

  • The use of RDF (Resource Description Framework) provides a way to represent metadata in a machine-readable format and facilitates the reuse of existing vocabularies and ontologies, which supports interoperability across different domains (see I1 of the FAIR guiding principles). By using well-established standards like DCAT-AP, FOAF and SKOS, you can describe datasets in a way that is consistent with our core HRI metadata schema as well as with other domains.

  • While it is highly encouraged to reuse as many of the already existing terms as possible, RDF also allows the creation of custom properties that meet specific needs of your domain, if no suitable terms exist.
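To make the last two points more concrete, the sketch below describes one dataset with two reused terms (the DCAT Dataset class and the Dublin Core title property) and one custom property. It is only an illustration: the `https://example.org/...` namespace, the `contrastAgent` property, the dataset IRI and the helper functions are made up for this example and are not part of the HRI core metadata schema or of any agreed domain schema. It uses plain Python and writes N-Triples, one of the standard RDF serializations; in practice a working group would more likely use an RDF library or modeling tool.

```python
# Namespaces of existing vocabularies, plus one made-up extension namespace.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
IMG = "https://example.org/imaging-petal#"  # hypothetical, for illustration only


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang=None):
    """Render a string literal, escaping backslashes, quotes and line breaks."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def describe_dataset(dataset_iri, title, contrast_agent):
    """Return N-Triples lines describing one dataset."""
    subject = iri(dataset_iri)
    triples = [
        # Reused terms from existing vocabularies.
        (subject, iri(RDF + "type"), iri(DCAT + "Dataset")),
        (subject, iri(DCT + "title"), literal(title, "en")),
        # Custom property, assumed here because no existing term was chosen.
        (subject, iri(IMG + "contrastAgent"), literal(contrast_agent)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


lines = describe_dataset(
    "https://example.org/dataset/mri-001",  # hypothetical dataset IRI
    "Example brain MRI collection",
    "gadolinium",
)
print("\n".join(lines))
```

In a real schema, a value such as the contrast agent would preferably point to a concept from an existing controlled vocabulary rather than a free-text string; the string is used here only to keep the sketch short.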

Prerequisites before starting to work on the domain-specific metadata

  • The domain is defined and organized in such a way that a metadata taskforce of that domain has the possibility and the mandate to speak for, and make decisions with, that domain about a domain-specific metadata schema. In general, we consider data source domains (e.g., omics, imaging, clinical data) and disease domains (e.g., oncology, cardiovascular, rare diseases), either of which may be further subdivided into subdomains.

  • It is highly preferred that the domain or working group is already familiar with the Health-RI core metadata schema and has attempted to map a dataset from the specific domain to the core. This helps identify whether additional, more specific elements are needed to extend the schema.
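The mapping attempt mentioned in the second prerequisite can be as simple as a table of the fields your domain uses and the core term each one maps to. The sketch below shows one possible way to record such an attempt and list the gaps. The field names and the choice of terms are illustrative assumptions taken from the imaging example in the introduction; they are not an official mapping to the Health-RI core metadata schema.

```python
# Illustrative record of a mapping attempt: domain field -> core schema term.
# None means no suitable core term was found for that field.
attempted_mapping = {
    "title": "dct:title",
    "description": "dct:description",
    "health theme": "dcat:theme",
    "imaging technique": None,
    "contrast agent": None,
    "body part imaged": None,
}


def split_by_coverage(mapping):
    """Separate fields covered by the core from fields without a core term."""
    covered = {field: term for field, term in mapping.items() if term is not None}
    gaps = sorted(field for field, term in mapping.items() if term is None)
    return covered, gaps


covered, gaps = split_by_coverage(attempted_mapping)
print("Covered by the core schema:", covered)
print("Candidates for a domain-specific extension:", gaps)
```

The fields that end up in the list of gaps are a starting point for the requirements collection, not yet a decision to add new properties.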

Process overview

Figure 1. Schematic overview of the process. Steps 5 and 6 are in light grey because these are more about the implementation (into the catalogue) and governance processes and less about the modeling process.

Although the process is depicted as linear and sequential, in practice it is often nonlinear. The steps serve as a guide to the activities you carry out, and they may run in parallel. Agreeing on definitions, modeling the semantics, and getting community endorsement can be very cumbersome, so working on a schema in repeated cycles (iteratively) and starting small (incrementally) may be more efficient than trying to be perfect and complete from the start.

Timelines and outreach

The figure below gives a schematic example of what a timeline, including the involvement of your domain (see step 7), may look like.

Figure 2. Example timeline with model releases and domain involvement.

How long this process takes depends on several factors, for example the capacity of your metadata taskforce, how well your domain is organized, the scope of your schema, how much modeling work has already been done in your domain, and how you work with your metadata taskforce. When collecting requirements and asking for reviews, you often have to give your domain several weeks to respond. It may be worthwhile to consider design (sprint) approaches in which you block several days in a row with your taskforce to work only on the model (see for example here), instead of spreading meetings over several months.

Contributors and contributing

Authors

  • Ana Konrad - Health-RI

  • Niek van Ulzen - Health-RI

Reviewers

  • Alexander Harms - Erasmus UMC

  • Jolanda Strubel - Health-RI

  • Lucie Kulhankova - Health-RI

  • Rob Hooft - Health-RI

  • XiaoFeng Liao - Health-RI

Contributing

There are different ways in which you can get involved in developing this method and these pages, ranging from minimal to maximal involvement:

  • Giving Feedback. If you have an idea, miss something, encounter (textual) errors or would like to compliment the work, you can use the Confluence comments or send us an email.

  • Providing examples / prototypes. If you have good examples or prototypes of artifacts per step (such as a requirements document, a list of competency queries, a list of properties and their definitions, a UML diagram of the conceptual or semantic model, etc.), we look forward to adding them to this Confluence page. To make an example useful and comprehensible for others, some polishing might be needed.

  • Reviewing. If you would like to review the text of one or more steps, then please contact us.

  • Co-Writing. If you would like to (co-)write one or more steps, then please contact us.

Ideally, this documentation is developed in parallel with the actual schema development so that we can learn from practice and adapt the process (steps) accordingly, but we encourage and value any type of contribution.

Sources and further reading

The process and steps are partly based on:

Questions?

If you have any further questions not addressed in the process description, please reach out to the Health-RI Servicedesk (servicedesk@health-ri.nl).