4A Metadata mapping

STATUS: IN DEVELOPMENT

📌 Introduction

Before you can add your resource’s metadata to the National Health Data Catalogue, you will need to know what metadata is, where your metadata is located and which metadata the Catalogue requires. Regardless of how you will add your metadata to a FAIR Data Point (manually or automatically), you will need to map your metadata values to the Catalogue’s metadata schema.

In this section, we describe the basics of metadata and explain how to map your metadata to the Health-RI core metadata schema.

The information on this page is meant for professionals with or without experience in metadata, metadata mapping or semantic modeling, who want to add their metadata to the National Health Data Catalogue. Additional background information is included so that you can choose to learn on a “need to know” basis or dive deeper into the background of the metadata mapping process.

If you already understand the schema and want to go to metadata mapping immediately, you can follow this tutorial: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/290291734/Mapping+tutorial?atlOrigin=eyJpIjoiNjZjNmYzNDczMThmNGQyMDgzZTQ3ODg0ODAxZTAyNWUiLCJwIjoiYyJ9.

In the future, we hope to support you with scripts that can automatically transform metadata entered in a CSV template into RDF, ready to be added to the FDP.

🧠 What is metadata

Metadata is essentially data about data. It provides information that describes various aspects of your data, such as a description of its contents, the owner of the data, or the format of the data. In other words, metadata helps you understand and manage data effectively by providing additional information about it.

Specifically for the National Health Data Catalogue, users of the catalogue rely on the provided metadata to find relevant datasets and judge their usability. Therefore, as a data holder onboarding data, it is essential to provide detailed and complete metadata about your dataset(s). That way, you also adhere to principle F2 of the FAIR principles. If the metadata contains the right information, e.g. about the type of cancer that is relevant in a dataset, a data user will be able to find relevant and interesting datasets in the catalogue.

A metadata standard is a set of rules, guidelines and conventions that define how metadata should be structured, formatted and described within a particular domain or context. Adhering to such standards ensures consistency, interoperability and effective management of metadata across different systems, organizations and disciplines.

To find out more about where you can find metadata for your resource, go to the Metroline Step: Assess availability of your metadata.

🎯 Health-RI Metadata Schema

The National Health Data Catalogue currently uses a Core Metadata Schema: a set of minimal elements for describing each resource (e.g. dataset) with common metadata. It defines the requirements to find and reuse information across Health-RI nodes via the National Catalogue.

The Health-RI Metadata Schema is based on universally used metadata standards such as DCAT-AP, DCAT-AP NL and HealthDCAT-AP.

The first version of the Core Metadata Schema is based on DCAT-AP v3. V2 of the Health-RI core metadata schema also incorporates the (draft) HealthDCAT-AP and applies restrictions as defined in DCAT-AP-NL (Dutch DCAT-AP specification). You can find more information on the relation of the Health-RI metadata schema to other application profiles here.

Where do I find detailed descriptions of the core metadata schema?

For specific details on the schema, please visit the GitHub specifications, which are intended for data experts and data stewards. We are currently transitioning to a new version of the metadata schema, v2, available on GitHub here. Specifications from the official v1 release are available here.

🧩 Of which elements does the metadata schema consist?

The schema consists of classes and properties. Classes are the main entities describing the data, such as “Dataset”. Each class has a number of properties (related metadata fields) that specify the class further and each property has specific attributes such as “range”.
These classes, properties and their attributes are visualized in a UML (Unified Modeling Language) diagram. Below you find more detailed explanations of these elements.

Classes

The core metadata schema is split into several classes. Classes are the main entities describing the data, which can be used to represent the overall structure/context of the metadata describing datasets. Each class is identified by a URI (uniform resource identifier) consisting of the vocabulary (e.g. dcat) and the class name in that vocabulary (e.g. Dataset).
For example, all datasets (dcat:Dataset class) of an institute can be grouped under a dcat:Catalog. The dcat:Catalog contains information about the institute (e.g. a Radboudumc catalog being the umbrella of all Radboudumc datasets), while individual instances of dcat:Dataset describe the individual datasets published by the institute.
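
To make the "vocabulary plus class name" notation concrete, the sketch below expands a prefixed name such as dcat:Dataset into its full URI. The namespace URIs are the published ones for these vocabularies; the function name expand and the choice of prefixes are only illustrative and not part of the Health-RI schema.

```python
# Namespace URIs for a few prefixes used on this page.
# The selection is illustrative; the schema uses more vocabularies than these.
NAMESPACES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dcat:Dataset' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in NAMESPACES or not local_name:
        raise ValueError(f"Unknown prefix or malformed name: {prefixed_name!r}")
    return NAMESPACES[prefix] + local_name


print(expand("dcat:Dataset"))
print(expand("dcat:Catalog"))
```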

At the moment four classes (dcat:Dataset, dcat:Catalog, vcard:Kind and foaf:Agent) are mandatory in the Health-RI metadata model. The other classes (such as disco:Study) are not strictly necessary to onboard data to the National Health Data Catalogue, but using them can provide meaningful context for a dataset. For example, datasets can be listed alone, but they can also be associated with a (research) project (foaf:Project) and study (disco:Study) to highlight the context in which the datasets were established.

We also distinguish between main classes, like dcat:Dataset and dcat:Catalog, and supporting classes, like foaf:Agent and vcard:Kind. The latter describe certain attributes (e.g. contact details in the case of vcard:Kind), each with its own set of metadata elements (properties).
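
As a minimal sketch of how you might check a planned mapping against the four mandatory classes named above, the snippet below reports which of them are still missing. The function name and the example input are hypothetical.

```python
# The four classes this page lists as mandatory in the Health-RI metadata model.
MANDATORY_CLASSES = {"dcat:Dataset", "dcat:Catalog", "vcard:Kind", "foaf:Agent"}


def missing_mandatory_classes(used_classes):
    """Return the mandatory classes that do not occur in used_classes, sorted by name."""
    return sorted(MANDATORY_CLASSES - set(used_classes))


# Hypothetical mapping that describes datasets and a study, but no contact or agent yet.
planned = ["dcat:Catalog", "dcat:Dataset", "disco:Study"]
print(missing_mandatory_classes(planned))
```

Optional classes such as disco:Study are simply ignored by this check, since they are not required for onboarding.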

For an overview of the classes in metadata core v2 and their relations, see the figure below. Additionally, we provide some considerations and guidelines on mapping to the different classes here: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/1020624897/Recommendations+on+mapping+to+classes+in+the+v2+core+metadata?atlOrigin=eyJpIjoiZjcxMGNkZmFjNzk1NDQ0MDliMDE2ODE4NTRiZjhmZjkiLCJwIjoiYyJ9

Overview of all core Health-RI classes and relations between classes


All classes and their relations

Below we describe all possible relations between the main classes of the v2 Health-RI metadata schema.
For each connection, we describe from which class, via which property it is connected to another class.

For example, dcat:Catalog → dcat:dataset → dcat:Dataset means: the dcat:Catalog class has property dcat:dataset with range dcat:Dataset. In other words, the catalog class points to the dataset class via the dcat:dataset property.
It is very likely that you will not use all of these connections, but for the sake of completeness, we describe them all below.

Main connections
- dcat:Catalog → dcat:dataset → dcat:Dataset
  Establishes the connection between a catalog and a dataset in that catalog.
- dcat:Dataset → dcat:distribution → dcat:Distribution
  Connection between dataset and its distribution.
In case your dataset is part of a series or project:
- dcat:Dataset → dcat:inSeries → dcat:DatasetSeries
  Connection between dataset and a dataset series it belongs to. Different datasets from the same series will point to the same instance of dcat:DatasetSeries.
- dcat:Dataset → prov:wasGeneratedBy → disco:Study
  Connection between a dataset and a study, in which data generation/collection is described.
- disco:Study → dct:isPartOf → foaf:Project
  Connection between a study and the project of which the study is part.
Dataset to another Dataset
- dcat:Dataset → dct:source → dcat:Dataset
  If a dataset is based on another dataset, this property is used to reference the source dataset.
- dcat:Dataset → dct:hasVersion → dcat:Dataset
  Reference to another version of the same dataset.
Data Service to other classes
- dcat:Catalog → dcat:service → dcat:DataService
  Connection between a catalog and data service.
- dcat:DataService → dcat:servesDataset → dcat:Dataset
  Reference between a data service and the dataset it serves.
Catalog to another Catalog
- dcat:Catalog → dcat:catalog → dcat:Catalog
  Connection between related catalogs.
- dcat:Catalog → dct:hasPart → dcat:Catalog
  Establishing nested catalogs.
Special Distributions of a Dataset (both are introduced by HealthDCAT-AP).
- dcat:Dataset → healthdcatap:analytics → dcat:Distribution
  Relation to analytics distribution of a dataset. More information available here.
- dcat:Dataset → adms:sample → dcat:Distribution
  Relation to sample distribution of a dataset. More information available here.
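
The connections listed above can also be written down as (class, property, range) triples, which makes it easy to look up how two classes can be linked. The sketch below is only an illustration of that idea: the list mirrors the connections on this page, while the function names are hypothetical.

```python
# Each tuple reads: source class -> connecting property -> range (target class).
RELATIONS = [
    ("dcat:Catalog", "dcat:dataset", "dcat:Dataset"),
    ("dcat:Dataset", "dcat:distribution", "dcat:Distribution"),
    ("dcat:Dataset", "dcat:inSeries", "dcat:DatasetSeries"),
    ("dcat:Dataset", "prov:wasGeneratedBy", "disco:Study"),
    ("disco:Study", "dct:isPartOf", "foaf:Project"),
    ("dcat:Dataset", "dct:source", "dcat:Dataset"),
    ("dcat:Dataset", "dct:hasVersion", "dcat:Dataset"),
    ("dcat:Catalog", "dcat:service", "dcat:DataService"),
    ("dcat:DataService", "dcat:servesDataset", "dcat:Dataset"),
    ("dcat:Catalog", "dcat:catalog", "dcat:Catalog"),
    ("dcat:Catalog", "dct:hasPart", "dcat:Catalog"),
    ("dcat:Dataset", "healthdcatap:analytics", "dcat:Distribution"),
    ("dcat:Dataset", "adms:sample", "dcat:Distribution"),
]


def outgoing(source):
    """List (property, range) pairs for all connections starting at a class."""
    return [(prop, target) for cls, prop, target in RELATIONS if cls == source]


def properties_between(source, target):
    """List the properties that connect a source class to a target class."""
    return [prop for cls, prop, rng in RELATIONS if cls == source and rng == target]


print(outgoing("dcat:Catalog"))
print(properties_between("dcat:Dataset", "dcat:Distribution"))
```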

Note that you will most likely NOT need or make use of all available classes in the Health-RI core metadata schema (v2). Some classes are not applicable to all cases, e.g. in a case where an institute wants to describe only the available datasets, they might only use the dcat:Catalog and dcat:Dataset classes.
More information and considerations/guidelines for different use cases are described here: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/1020624897/Recommendations+on+mapping+to+classes+in+the+v2+core+metadata?atlOrigin=eyJpIjoiZjcxMGNkZmFjNzk1NDQ0MDliMDE2ODE4NTRiZjhmZjkiLCJwIjoiYyJ9

Properties

Each class has its own set of related metadata fields, so-called properties, that describe the entity (class) in more detail. For example, each dcat:Dataset contains the properties dct:title and dct:description, which are free text fields that provide a title and detailed description of the contents of the dataset. In another example, the class vcard:Kind (which is used to provide contact details of a resource) contains the property vcard:hasEmail to provide an email address in the metadata.
Each property has a number of attributes (i.e. requirement level, cardinality, range, property URI):

Requirement level: Each property has a requirement level, indicating whether it is mandatory, recommended or optional to fill this property in the respective class. Mandatory properties must always be filled. Recommended properties should be filled if the information is available. Optional properties can be filled, but are not always available or applicable.
Cardinality: Each property has a cardinality that further specifies the requirement level. Cardinalities are expressed as a pair of bounds (e.g. 0..1). The first bound indicates how many times the property has to be filled at minimum, the second indicates the maximum. The most commonly occurring cardinalities are:
- 0..n (also written as 0..*): The property is not mandatory, but can be filled many times.
- 0..1: The property is not mandatory, but may only be filled once at most.
- 1..n (also written as 1..*): The property must be filled (is mandatory), and can be filled many times.
- 1..1: The property must be filled (is mandatory), and exactly once.
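
The cardinality notation above can be checked mechanically. The following is a minimal sketch, assuming the cardinality is given as text in the "min..max" form used on this page; the function names are hypothetical.

```python
def parse_cardinality(text):
    """Turn '0..1', '1..n' or '1..*' into (minimum, maximum); maximum None means unbounded."""
    lower, _, upper = text.partition("..")
    minimum = int(lower)
    maximum = None if upper in ("n", "*") else int(upper)
    return minimum, maximum


def satisfies(cardinality, values):
    """Check whether the number of filled values fits the given cardinality."""
    minimum, maximum = parse_cardinality(cardinality)
    count = len(values)
    return count >= minimum and (maximum is None or count <= maximum)


print(satisfies("1..n", ["Title in English", "Titel in het Nederlands"]))
print(satisfies("0..1", ["first value", "second value"]))
print(satisfies("1..1", []))
```

In this sketch the first call passes, because 1..n allows any number of values from one upward, while the other two fail: 0..1 allows at most one value and 1..1 requires exactly one.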
Range/format: For each property, it is specified how it should be filled, specifying its range. This determines the format of the filled value per property, for example whether the property is to be filled with free text (rdfs:Literal), a date in a specific format (xsd:dateTime), or point to another class (for example, the range of dcat:service property in dcat:Catalog is dcat:DataService, establishing the connection between instances of the two classes via its IRI (Internationalized Resource Identifier)).
Controlled vocabularies: A number of properties have to be filled with values from so-called controlled vocabularies, a specific list of pre-defined values that can be linked to. For example, the property access rights in the Dataset class restricts the range to three specific values from an EU-controlled vocabulary for access rights. In the Health-RI model, we have added the relevant link to the controlled vocabulary to the description of the respective properties.
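
As an illustration of a controlled-vocabulary check, the sketch below accepts only three access-rights values. It assumes that the three values are the PUBLIC, RESTRICTED and NON_PUBLIC concepts of the EU Publications Office access-right vocabulary; please verify the authoritative list via the link in the property description of the schema. The function name is hypothetical.

```python
# Assumption: the three allowed access-rights values are these concepts from the
# EU Publications Office "access-right" authority table. Verify against the schema.
ACCESS_RIGHT_BASE = "http://publications.europa.eu/resource/authority/access-right/"
ALLOWED_ACCESS_RIGHTS = {
    ACCESS_RIGHT_BASE + "PUBLIC",
    ACCESS_RIGHT_BASE + "RESTRICTED",
    ACCESS_RIGHT_BASE + "NON_PUBLIC",
}


def is_allowed_access_right(value):
    """Return True only if value is one of the pre-defined vocabulary IRIs."""
    return value in ALLOWED_ACCESS_RIGHTS


print(is_allowed_access_right(ACCESS_RIGHT_BASE + "RESTRICTED"))
print(is_allowed_access_right("restricted access on request"))
```

The second call illustrates why free text is not sufficient here: the value must be the IRI of a vocabulary entry, not a description of it.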
Properties connecting classes: Classes also contain a specific set of properties that connect one class to another. For example, in the dcat:Catalog class of the Health-RI core, the dcat:dataset property establishes the connection between a catalogue and a dataset it contains.
Like other properties, these connecting properties have a requirement level (in our example, mandatory), cardinality (in our example, 1..n, meaning that each dcat:Catalog has to contain at least one dcat:Dataset), and a specified range (in this example, the property dcat:dataset has the range dcat:Dataset, indicating that this property in the dcat:Catalog class points to an instance of a dataset, via the IRI of the dcat:Dataset).
Below, you find an overview of all classes of the v2 core metadata schema of Health-RI with all possible relations between classes. Jump right to it here.
Property URI: Each property has a URI (uniform resource identifier), clearly identifying the element and the parent ontology from which the property is derived. For example, the property 'Contact point' in the class Catalogue has the property URI dcat:contactPoint, indicating that it concerns the property contact point derived from the DCAT vocabulary.
Note that property URIs always start with a lowercase letter, like dcat:contactPoint, while class URIs start with a capital letter, like dcat:Dataset.
Definitions and usage notes: Each property has a definition that further specifies the property, as well as a usage note, which describes in more detail how the property should be used. Definitions and usage notes of the v2 Health-RI core metadata schema are available on GitHub and in the associated Excel sheet.

By providing the metadata of all mandatory (and ideally also recommended) properties of required classes in the schema in the correct format, a data holder makes sure that the metadata conforms to the schema and is machine- and human-readable.
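
A simple conformance check along these lines could look like the sketch below. It is only an illustration: the schema excerpt is hypothetical (only dct:title as a mandatory 1..n property of dcat:Dataset is taken from this page), and the record, property selection and function name are assumptions made for the example.

```python
# Hypothetical schema excerpt: property -> (requirement level, minimum, maximum).
# A maximum of None means "n" (unbounded). Only dct:title is taken from this page.
DATASET_SCHEMA = {
    "dct:title": ("mandatory", 1, None),
    "dct:description": ("mandatory", 1, None),
    "dcat:keyword": ("recommended", 0, None),
}


def check_record(record, schema):
    """Return (errors, warnings) for one metadata record given as property -> list of values."""
    errors, warnings = [], []
    for prop, (level, minimum, maximum) in schema.items():
        count = len(record.get(prop, []))
        if count < minimum:
            errors.append(f"{prop}: expected at least {minimum} value(s), found {count}")
        elif maximum is not None and count > maximum:
            errors.append(f"{prop}: expected at most {maximum} value(s), found {count}")
        elif level == "recommended" and count == 0:
            warnings.append(f"{prop}: recommended but not filled")
    return errors, warnings


# Hypothetical record with a title but no description or keywords.
record = {"dct:title": ["Example cohort dataset"]}
errors, warnings = check_record(record, DATASET_SCHEMA)
print(errors)
print(warnings)
```

Missing mandatory properties are reported as errors, while missing recommended properties only produce warnings, which mirrors the difference between the two requirement levels.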

UML diagram

A UML diagram is a visual representation of a metadata schema. The UML of the v2 metadata schema of Health-RI is depicted below.
A UML diagram is divided by class (the boxes in the UML below), where each box represents a class of the schema. Within each class, the relevant properties are listed with the property URI, the range, requirement level and cardinality.
For example, in the UML below you see the box for dcat:Dataset (class), containing the mandatory property dct:title with range rdfs:Literal and cardinality [1..n]. The dcat:Dataset class also contains the property dcat:distribution with range dcat:Distribution (cardinality [0..n]). As you can see from the capital letter in the range of the property, this property is pointing to another class (dcat:Distribution) also present in the UML. The connection between these classes is also indicated by the open arrow from the dcat:Dataset class to the dcat:Distribution class.
While open arrows indicate connections between classes, closed arrows indicate that a certain class inherits all properties from another class. For example, dcat:Dataset inherits from dcat:Resource, indicating that all properties from dcat:Resource can also be used in dcat:Dataset. Note that this does not mean that the values are inherited as well, only the ('empty') properties.
Nested classes: It is possible that a class refers to (another instance of) the same class, e.g. dcat:Catalog pointing to itself via the property dct:hasPart. These kinds of nested structures can be used to describe the structure of an institution or infrastructure in more detail, for example if an institute (described by dcat:Catalog) is divided into several independent departments (each described with its own instance of dcat:Catalog) that produce and publish their own sets of dcat:Dataset.
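
The idea that properties are inherited, but values are not, can be sketched as follows. The property lists are illustrative placeholders rather than the actual lists of the schema, and the function name is hypothetical.

```python
# Illustrative property lists; the real ones are defined in the schema itself.
CLASS_PROPERTIES = {
    "dcat:Resource": ["dct:title", "dct:description"],
    "dcat:Dataset": ["dcat:distribution"],
}
# Closed arrows in the UML: child class -> parent class.
PARENT = {"dcat:Dataset": "dcat:Resource"}


def available_properties(cls):
    """Collect the properties a class may use, including those inherited from parents."""
    props = []
    while cls is not None:
        props = CLASS_PROPERTIES.get(cls, []) + props
        cls = PARENT.get(cls)
    return props


print(available_properties("dcat:Dataset"))
```

The function only returns property names. Each instance of dcat:Dataset still has to provide its own values for them.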
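
A nested catalog structure like the institute and department example can be pictured as a small tree. The sketch below collects all datasets under a top-level catalog by following dct:hasPart links; all IRIs are hypothetical example.org placeholders, and the function name is an assumption made for this illustration.

```python
# Hypothetical IRIs. HAS_PART models dct:hasPart (catalog -> sub-catalogs),
# DATASETS models dcat:dataset (catalog -> datasets it contains).
HAS_PART = {
    "https://example.org/catalog/institute": [
        "https://example.org/catalog/department-a",
        "https://example.org/catalog/department-b",
    ],
}
DATASETS = {
    "https://example.org/catalog/department-a": ["https://example.org/dataset/1"],
    "https://example.org/catalog/department-b": [
        "https://example.org/dataset/2",
        "https://example.org/dataset/3",
    ],
}


def all_datasets(catalog, seen=None):
    """Collect datasets of a catalog and all nested catalogs, guarding against cycles."""
    seen = set() if seen is None else seen
    if catalog in seen:
        return []
    seen.add(catalog)
    found = list(DATASETS.get(catalog, []))
    for child in HAS_PART.get(catalog, []):
        found.extend(all_datasets(child, seen))
    return found


print(all_datasets("https://example.org/catalog/institute"))
```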

Please note that in the current implementation of the Health-RI core schema, there is a limit to the theoretically indefinite flexibility that DCAT offers, especially since the National Health Data Catalogue cannot currently display these layers of nested structures. Read more about it here https://health-ri.atlassian.net/wiki/spaces/FSD/pages/290291734/Mapping+tutorial#%F0%9F%9A%A7-Current-limitations-in-model-flexibility .

UML diagram of the v2 core metadata schema of Health-RI