Metadata onboarding on the National Catalogue

status: Ready for Review

 

Introduction

This document guides you through the data onboarding process and explains how to publish information about datasets on the National Health Data Catalogue. It is a living document and is updated regularly to reflect the current state of the catalogue. It is intended for researchers and data holders.

Data onboarding translates into making datasets accessible on the National Health Data Catalogue. By following the steps outlined in this guide, you can ensure that your data is effectively and correctly onboarded and can be made readily available for data users.

What is the National Health Data Catalogue?

The National Health Data Catalogue is an overview of health & life sciences research data in the Netherlands. It contains metadata about the available datasets, meaning it contains the description of the datasets and other resources. This description includes, for example, a date when the dataset was created, the authors, or a URL where the data can be found. The metadata hosted within the National Health Data Catalogue are sourced from a diverse range of origins and domains. These sources can span from electronic records to images, biomaterials, omics data, collections and many more.

The goal of the National Health Data Catalogue is to create an infrastructure for secondary use of data where researchers and other interested parties can find and access cross-domain data relevant to their research. The intent is to harvest currently available data from any health-care and life science domain in the Netherlands.

 

The catalogue aims to foster the FAIR data principles: making data Findable, Accessible, Interoperable, and Reusable for its users. You can find more information about FAIR here.

Where does the National Health Data Catalogue get metadata?

The Catalogue can harvest information from other catalogues, and can itself be harvested by other catalogues. This means that, once metadata is entered in one catalogue, it automatically becomes available in the other catalogues, so a data holder does not have to enter the metadata manually in every individual catalogue. Ideally, two catalogues are connected via a FAIR Data Point (FDP) that holds information about the data and shares it with the Catalogue (Figure 1).

There are several ways a data holder can onboard their data to the Catalogue. First, however, the data needs to be properly prepared and described. The Catalogue uses a Health-RI metadata schema based on DCAT v3 and DCAT-AP. Currently, this metadata schema uses relatively general and overarching classes and definitions, forming the so-called Core metadata schema. Learn more about the schema on GitHub: GitHub - Health-RI/health-ri-metadata at master. This core metadata schema will be further expanded in the future (see What is the future of the National Health Data Catalogue?).
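To make the idea of a DCAT-based description concrete, here is a minimal sketch in Python that renders one dataset description as Turtle. It uses only generic DCAT and Dublin Core terms (dcat:Dataset, dct:title, dct:description, dct:issued, dct:creator, dcat:landingPage). It is not the Health-RI Core schema, which defines its own mandatory properties and value types. The class name, the helper function, and all example.org addresses are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


@dataclass
class DatasetDescription:
    """Hypothetical container for a few generic dataset properties."""
    uri: str
    title: str
    description: str
    issued: date
    creators: list[str] = field(default_factory=list)  # URIs identifying the authors
    landing_page: str = ""

    def to_turtle(self) -> str:
        lines = [
            f"<{self.uri}> a dcat:Dataset ;",
            f"    dct:title {literal(self.title)} ;",
            f"    dct:description {literal(self.description)} ;",
            f'    dct:issued "{self.issued.isoformat()}"^^xsd:date',
        ]
        for creator in self.creators:
            lines[-1] += " ;"
            lines.append(f"    dct:creator <{creator}>")
        if self.landing_page:
            lines[-1] += " ;"
            lines.append(f"    dcat:landingPage <{self.landing_page}>")
        lines[-1] += " ."  # close the statement
        return PREFIXES + "\n" + "\n".join(lines) + "\n"


example = DatasetDescription(
    uri="https://example.org/dataset/1",
    title="Example cohort",
    description="Illustrative record only.",
    issued=date(2024, 2, 27),
    creators=["https://example.org/person/1"],
    landing_page="https://example.org/dataset/1/info",
)
print(example.to_turtle())
```

In practice an RDF library would normally do the serialisation. The hand-written Turtle here only keeps the sketch free of dependencies.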

 

afbeelding-20240227-144157.png
Figure 1. Connection of data (source) to the National Catalogue via an FDP

 

How to onboard your metadata to the Catalogue?

Several steps are needed to publish your metadata on the National Catalogue. Here we show the basic steps. You can find some onboarding examples and scenarios here. For technical documentation, please refer to the Health-RI GitHub: GitHub - Health-RI/health-ri-metadata at master.

General onboarding steps

1. Request

In this step a data holder/provider reaches out to Health-RI to request onboarding of the metadata, via our service desk: servicedesk@health-ri.nl. A Health-RI contact person is assigned to the data holder/provider and the request is internally registered.

2. Intake

The Health-RI contact person asks the data holder/provider to provide information about the data and resources available. If needed, a meeting can be initiated in this stage to provide more detailed information. The Health-RI contact person also facilitates contact or alignment with other onboarding projects in the same institute or node if possible.

In this step it is also crucial to check whether the FAIR prerequisites and ELSI guidelines are followed. More on the ELSI considerations can be found here: Make sure you can publish your metadata.

3. Planning

In this stage the data holder/provider explores the onboarding process and plans a strategy. The resulting strategy should be scalable and, where possible, usable for multiple data holders/providers (i.e. institute-level onboarding). The following questions need to be answered in this stage:

What?

What data will be onboarded?

What metadata (schema) are already in place?

What security protocols at the institute of the data holder/provider are applicable here?

What infrastructural solution exists at the institute and how can the onboarding pipeline fit?

What metadata can we share? Is there a part of the metadata schema that cannot be shared due to ELSI concerns?

How?

How will the data be exposed to the National Health Data Catalogue? Will you implement your own FAIR Data Point (FDP)? Can the metadata export be automated? Will the metadata be entered manually?

Who?

Who will be involved in the onboarding project from the side of the data holder/provider?

Who will be responsible for maintaining the metadata?

Who will be the contact point for the data holder/provider?

Who should be informed about the onboarding project (other data holders/providers, the board, the IT department)?

4. Implementation

Upon deciding on a strategy for the onboarding of the (meta)data, the data holder/provider needs to implement the plan. There are two main tasks in this stage that can be done in parallel.

a) Mapping metadata to the HRI metadata schema

To onboard metadata, the data holder/provider needs to map their local metadata to the Health-RI metadata schema. You can find general information about the metadata standards and the mapping process in the section below.
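As a rough illustration of what such a mapping can look like, the sketch below translates the field names of a local record into generic DCAT / Dublin Core property names and reports fields that have no mapping yet. The local field names ("name", "summary", "ward", and so on) and the target properties are hypothetical. The actual targets should be taken from the Health-RI metadata schema.

```python
# Hypothetical local field names mapped to generic DCAT / DCTERMS properties.
FIELD_MAP = {
    "name": "dct:title",
    "summary": "dct:description",
    "created_on": "dct:issued",
    "authors": "dct:creator",
    "url": "dcat:landingPage",
}


def map_record(local_record: dict) -> tuple[dict, list[str]]:
    """Return (mapped record, local fields that have no mapping yet)."""
    mapped = {}
    unmapped = []
    for key, value in local_record.items():
        target = FIELD_MAP.get(key)
        if target is None:
            unmapped.append(key)  # needs a decision: map it, or leave it out
        else:
            mapped[target] = value
    return mapped, unmapped


local = {"name": "Example cohort", "created_on": "2024-02-27", "ward": "B2"}
mapped, unmapped = map_record(local)
print(mapped)
print(unmapped)
```

Listing the unmapped fields explicitly helps with the planning question of which metadata can or cannot be shared.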

b) Implementing a metadata harvesting pipeline to Health-RI

To expose metadata to Health-RI, an intermediate system needs to be in place. The National Health Data Catalogue uses FAIR Data Points to harvest information. Basic information on FAIR Data Points can be found here: Exposing metadata

The FAIR Data Point should be implemented by the data holder/provider, ideally accompanied by an automated export pipeline (2.a Automate export from your local system, 2.b Example python code to upload metadata to FDP).

There are several approaches for implementation of a FAIR Data Point:

  1. Exposing your local system: 1.a Expose your local system

  2. Implementing a FAIR Data Point using FDP in a box: 1.b FDP in a box

  3. Manually add the information about your data to the National Catalogue via a Central FDP: 1.c Central FDP
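For the automated export mentioned above, the following sketch shows the general shape of pushing a Turtle document to an FDP over HTTP, using only the Python standard library. It is not the example code referenced in 2.b. The resource path ("dataset"), the bearer-token authentication, the text/turtle content type, and the fdp.example.org address are all assumptions. Check the documentation of your own FDP for the actual endpoints and authentication method.

```python
import urllib.request


def build_upload_request(fdp_url: str, resource: str, turtle: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST of Turtle metadata to an FDP."""
    endpoint = f"{fdp_url.rstrip('/')}/{resource}"
    return urllib.request.Request(
        endpoint,
        data=turtle.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/turtle",       # assumed content type
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
        },
    )


request = build_upload_request(
    "https://fdp.example.org/",  # hypothetical FDP address
    "dataset",                   # assumed resource path
    "<https://example.org/dataset/1> a <http://www.w3.org/ns/dcat#Dataset> .",
    token="PLACEHOLDER",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Separating request construction from sending makes the export step easy to test without contacting a live FDP.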

5. Harvesting

To have the exposed metadata harvested, the data holder/provider sends an onboarding request, including the details of the FAIR Data Point, to the Health-RI service desk: servicedesk@health-ri.nl. Health-RI then performs the harvesting.
The metadata is harvested into a testing environment where a check of the data is performed by HRI and the data holder/provider, before reaching the Catalogue. The Catalogue is currently updated daily for changes in the available FDPs, so changes in the metadata can take up to 24 hours to update.
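Before the joint check in the testing environment, a data holder may want to run a simple completeness check on their own records. The sketch below lists properties that are absent or empty in a mapped record. The list of required properties is a hypothetical placeholder, not the mandatory set of the Health-RI schema.

```python
# Hypothetical placeholder list; take the real mandatory properties from the schema.
REQUIRED = ("dct:title", "dct:description")


def missing_properties(record: dict, required=REQUIRED) -> list[str]:
    """List required properties that are absent or empty in a mapped record."""
    return [name for name in required if not str(record.get(name) or "").strip()]


print(missing_properties({"dct:title": "Example cohort", "dct:description": "  "}))
```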

6. Onboarded

If the metadata is approved by the data holder/provider, it is then onboarded to the National Health Data Catalogue. If possible, the data holder/provider is asked to share any issues and feedback with the Health-RI contact person.

 

 

 

 

Onboarding_presentation_v2.png
Figure 2. Onboarding process for the National Health Data Catalogue

 

Need help with onboarding?

Feel free to join our weekly Walk-in hour, where one of our colleagues is ready to help you with any issues. To register, please fill in this sign-up sheet. For information about the time, as well as the link to join, please see the HRI agenda (Agenda | Health-RI) or contact Lucie Kulhankova. We also collect workarounds for common issues in Known issues.

What is the future of the National Health Data Catalogue?

The current version of the Catalogue allows for a general description of the data and metadata. To allow more domain-specific searchability, the metadata descriptions will be expanded in the future. We can imagine the metadata as a sunflower, where the core represents the common values across domains and each domain has its own petal describing the specific metadata needs of the researchers in that domain (Figure 3). The expansion of the metadata will help researchers find data relevant to their research.

 

 

In the future, a request tool will be connected to the Catalogue. This tool will allow researchers and other users to request access to datasets they find relevant. The request will be processed and reviewed centrally in a secure environment, and users will be able to receive answers to their queries in the case of federated analysis.

We are currently developing, and will in the future implement, metadata schemas that allow us to share metadata from the National Health Data Catalogue with other Dutch data portals and the European Health Data portal. More information can be found here.

You can find more information about the intended structure and availability here. You can also follow the latest updates and developments here: Current developments.