STATUS: IN DEVELOPMENT
📌 Introduction
Before you can add your resource’s metadata to the National Health Data Catalogue, you will need to know what metadata is, where your metadata is located and which metadata is needed for the Catalogue. Regardless of how you will add your metadata to a FAIR Data Point (manually or automatically), you will need to map your metadata values to the Catalogue’s metadata schema.
In this section, we describe the basics of metadata and explain how to map your metadata to the Health-RI core metadata schema.
The information on this page is meant for professionals with or without experience in metadata, metadata mapping or semantic modeling who want to add their metadata to the National Health Data Catalogue. Additional background information is included so that you can choose to learn on a “need to know” basis or dive deeper into the background of the metadata mapping process.
If you already understand the schema and want to go to metadata mapping immediately, you can follow this tutorial: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/290291734/Mapping+pipeline .
🧠 What is metadata?
Metadata is essentially data about data. It describes various aspects of your data, such as its content, its owner or its format. In other words, metadata helps you understand and manage data effectively by providing additional information about it.
For the National Health Data Catalogue specifically, users of the catalogue rely on the provided metadata to find relevant datasets and judge their usability. Therefore, as a data holder onboarding data, it is essential to provide detailed and complete metadata about your dataset(s). That way, you also adhere to principle F2 of the FAIR principles (data are described with rich metadata). If the metadata contains the right information, e.g. about the type of cancer that is relevant in a dataset, a data user will be able to find relevant and interesting datasets in the catalogue.
A metadata standard is a set of rules, guidelines and conventions that define how metadata should be structured, formatted and described within a particular domain or context. Adhering to such standards ensures consistency, interoperability and effective management of metadata across different systems, organizations and disciplines.
To find out more about where you can find metadata for your resource, go to the Metroline Step: Assess availability of your metadata.
🎯 Health-RI Metadata Schema
The National Health Data Catalogue currently uses a Core Metadata Schema: a set of minimal elements for describing each resource (e.g. dataset) with common metadata. It defines the requirements to find and reuse information across Health-RI nodes via the National Catalogue.
The Health-RI Metadata Schema is based on universally used metadata standards such as DCAT-AP, DCAT-AP NL and HealthDCAT-AP.
The first version of the Core Metadata Schema is based on DCAT-AP v3. V2 of the Health-RI core metadata schema also incorporates the (draft) HealthDCAT-AP and applies restrictions as defined in DCAT-AP-NL (Dutch DCAT-AP specification). You can find more information on the relation of the Health-RI metadata schema to other application profiles here.
Where do I find detailed descriptions of the core metadata schema?
For specific details on the schema, please visit the GitHub specifications intended for data experts and data stewards. We are currently transitioning to a new version of the metadata schema, v2, which is available on GitHub here. Specifications from the official v1 release are available here.
🧩 Of which elements does the metadata schema consist?
The schema consists of classes and properties. Classes are the main entities describing the data, such as “Dataset”. Each class has a number of properties (related metadata fields) that specify the class further and each property has specific attributes such as “range”.
These classes, properties and their attributes are visualized in a UML (Unified Modeling Language) diagram. Below you will find more detailed explanations of these elements.
Classes
The core metadata schema is split into several classes. Classes are the main entities describing the data, which can be used to represent the overall structure/context of the metadata describing datasets. Each class is identified by a URI (uniform resource identifier) consisting of the vocabulary (e.g. dcat) and the class name in that vocabulary (e.g. Dataset).
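To make the "vocabulary plus class name" notation concrete, here is a minimal sketch of how a prefixed name such as dcat:Dataset expands to a full URI. The prefix table is written by hand for illustration and covers only some of the prefixes used on this page; the helper name `expand` is hypothetical and not part of any Health-RI tooling.

```python
# Namespace URIs for some prefixes used on this page, as published by each vocabulary.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dcat:Dataset' into a full URI (hypothetical helper)."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES or not local_name:
        raise ValueError(f"Cannot expand {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


if __name__ == "__main__":
    print(expand("dcat:Dataset"))  # http://www.w3.org/ns/dcat#Dataset
    print(expand("dct:title"))     # http://purl.org/dc/terms/title
```

In practice an RDF library would normally handle this expansion for you; the sketch only shows what the short notation stands for.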
For example, all datasets (dcat:Dataset class) of an institute can be grouped under a dcat:Catalog. The dcat:Catalog contains information about the institute (e.g. a Radboudumc catalog being the umbrella of all Radboudumc datasets), while individual instances of dcat:Dataset describe the individual datasets published by the institute.
At the moment, four classes (dcat:Dataset, dcat:Catalog, vcard:Kind and foaf:Agent) are mandatory in the Health-RI metadata model. The other classes (such as disco:Study) are not strictly necessary to onboard data to the National Health Data Catalogue, but using them can provide meaningful context for a dataset. For example, datasets can be listed on their own, but they can also be associated with a (research) Project (foaf:Project) and a Study (disco:Study) to highlight the context in which the datasets were established.
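As a rough illustration of the mandatory classes, the sketch below checks whether a collection of metadata records contains at least one instance of each. The record layout (dictionaries with a "type" key) and the example.org IRIs are assumptions made for this example, not a format prescribed by Health-RI.

```python
from __future__ import annotations

# The four classes this page lists as mandatory in the Health-RI metadata model.
MANDATORY_CLASSES = {"dcat:Dataset", "dcat:Catalog", "vcard:Kind", "foaf:Agent"}


def missing_mandatory_classes(records: list[dict]) -> set[str]:
    """Return the mandatory classes that have no instance among the records."""
    present = {record["type"] for record in records}
    return MANDATORY_CLASSES - present


if __name__ == "__main__":
    # Hypothetical records; the IRIs are placeholders.
    records = [
        {"type": "dcat:Catalog", "iri": "https://example.org/catalog/1"},
        {"type": "dcat:Dataset", "iri": "https://example.org/dataset/1"},
        {"type": "foaf:Agent", "iri": "https://example.org/agent/1"},
    ]
    print(missing_mandatory_classes(records))  # {'vcard:Kind'}
```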
We also distinguish between main classes, like dcat:Dataset and dcat:Catalog, and supporting classes, like foaf:Agent and vcard:Kind. The latter describe certain attributes (e.g. contact details in the case of vcard:Kind), each with its own set of metadata elements (properties).
For an overview of the classes in metadata core v2 and their relations, see the figure below. Additionally, we provide some considerations and guidelines on mapping to the different classes on this Confluence page.
Overview of all core Health-RI classes and relations between classes
How are the different classes related to each other?
Note that you will most likely NOT need or make use of all available classes in the Health-RI core metadata schema (v2). Some classes are not applicable to all cases. For example, an institute that wants to describe only its available datasets might use only the dcat:Catalog and dcat:Dataset classes.
More information and considerations/guidelines for different use cases are described here.
Properties
Each class has its own set of related metadata fields, so-called properties, that describe the entity (class) in more detail. For example, each dcat:Dataset contains the properties dct:title and dct:description, which are free text fields that provide a title and a detailed description of the contents of the dataset. In another example, the class vcard:Kind (which is used to provide contact details of a resource) contains the property vcard:hasEmail to provide an email address in the metadata.
Each property has a number of attributes (i.e. requirement level, cardinality, range, property URI):
Requirement level: Each property has a requirement level, indicating whether it is mandatory, recommended or optional to fill this property in the respective class. Mandatory properties must always be filled. Recommended properties should be filled if the information is available. Optional properties can be filled, but are not always available or applicable.
Cardinality: Each property has a cardinality that further specifies the requirement level. Cardinalities are expressed as a pair of bounds (e.g. 0..1). The first bound indicates how many times the property has to be filled at minimum, the second indicates the maximum. The most commonly occurring cardinalities are:
0..n (also written as 0..*): The property is not mandatory, but can be filled many times.
0..1: The property is not mandatory, but may only be filled once at most.
1..n (also written as 1..*): The property must be filled (is mandatory), and can be filled many times.
1..1: The property must be filled (is mandatory), and exactly once.
Range/format: For each property, the range specifies how it should be filled. This determines the format of the filled value per property, for example whether the property is to be filled with free text (rdfs:Literal), a date in a specific format (xsd:dateTime), or a pointer to another class (for example, the range of the dcat:service property in dcat:Catalog is dcat:DataService, establishing the connection between instances of the two classes via its IRI (Internationalized Resource Identifier)).
Controlled vocabularies: A number of properties have to be filled with values from so-called controlled vocabularies, specific lists of pre-defined values that can be linked to. For example, the property access rights in the Dataset class restricts the range to three specific values from an EU controlled vocabulary for access rights. In the Health-RI model, we have added the relevant link to the controlled vocabulary to the description of the respective properties.
Properties connecting classes: Classes also contain a specific set of properties that connect one class to another. For example, in the dcat:Catalog class of the Health-RI core, the dcat:dataset property establishes the connection between a catalogue and a dataset it contains.
Like other properties, these connecting properties have a requirement level (in our example, mandatory), a cardinality (in our example, 1..n, meaning that each dcat:Catalog has to contain at least one dcat:Dataset), and a specified range (in this example, the property dcat:dataset has the range dcat:Dataset, indicating that this property in the dcat:Catalog class points to an instance of a dataset, via the IRI of the dcat:Dataset).
Below, you find an overview of all classes of the v2 core metadata schema of Health-RI with all possible relations between classes. Jump right to it here.
Property URI: Each property has a URI (uniform resource identifier) that clearly identifies the element and the parent ontology from which the property is derived. For example, the property 'Contact point' in the class Catalogue has the property URI dcat:contactPoint, indicating that it concerns the property contact point derived from the DCAT vocabulary.
Note that property URIs always start with a lowercase letter, like dcat:contactPoint, while class URIs start with a capital letter, like dcat:Dataset.
Definitions and usage notes: Each property has a definition that further specifies the property, as well as a usage note, which describes in more detail how the property should be used. Definitions and usage notes of the v2 Health-RI core metadata schema are available on GitHub and in the associated Excel sheet.
By providing the metadata for all mandatory (and ideally also recommended) properties of the required classes in the correct format, a data holder makes sure that the metadata conforms to the schema and is both machine- and human-readable.
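The cardinality rules described above can be sketched as a small check that counts how often each property is filled. This is an illustrative sketch only: the names `PropertySpec`, `parse_cardinality` and `check` are hypothetical, the property `ex:exampleProperty` is invented to show a 0..1 case, and the check looks only at value counts, not at range or controlled vocabularies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PropertySpec:
    uri: str           # property URI, e.g. "dct:title"
    cardinality: str   # e.g. "1..n", "0..1", "1..*"
    value_range: str   # e.g. "rdfs:Literal"; kept for reference, not checked here


def parse_cardinality(cardinality: str) -> tuple[int, int | None]:
    """Split '1..n' into (minimum, maximum); None means there is no upper limit."""
    lower, upper = cardinality.split("..")
    return int(lower), None if upper in ("n", "*") else int(upper)


def check(metadata: dict[str, list], specs: list[PropertySpec]) -> list[str]:
    """Return one message per property whose number of values violates its cardinality."""
    problems = []
    for spec in specs:
        count = len(metadata.get(spec.uri, []))
        minimum, maximum = parse_cardinality(spec.cardinality)
        if count < minimum:
            problems.append(f"{spec.uri}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{spec.uri}: expected at most {maximum}, found {count}")
    return problems


if __name__ == "__main__":
    specs = [
        PropertySpec("dct:title", "1..n", "rdfs:Literal"),
        PropertySpec("dcat:distribution", "0..n", "dcat:Distribution"),
        PropertySpec("ex:exampleProperty", "0..1", "rdfs:Literal"),  # invented for illustration
    ]
    dataset = {"ex:exampleProperty": ["first", "second"]}  # title is missing on purpose
    for problem in check(dataset, specs):
        print(problem)
```

A full conformance check would also have to verify ranges and controlled vocabulary values, which this sketch leaves out.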
UML diagram
A UML diagram is a visual representation of a metadata schema. The UML diagram of the v2 metadata schema of Health-RI is depicted below.
A UML diagram is divided into boxes, where each box represents a class of the schema. Within each class, the relevant properties are listed with their property URI, range, requirement level and cardinality.
For example, in the UML below you see the box for dcat:Dataset (class), containing the mandatory property dct:title with range rdfs:Literal and cardinality [1..n]. The dcat:Dataset class also contains the property dcat:distribution with range dcat:Distribution (cardinality [0..n]). As you can see from the capital letter in the range of the property, this property is pointing to another class (dcat:Distribution) also present in the UML. The connection between these classes is also indicated by the open arrow from the dcat:Dataset class to the dcat:Distribution class.
While open arrows indicate connections between classes, closed arrows indicate that a certain class inherits all properties from another class. For example, dcat:Dataset inherits from dcat:Resource, indicating that all properties from dcat:Resource can also be used in dcat:Dataset. Note that this does not mean that the values are inherited as well: only the ('empty') properties are.
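As an analogy, class inheritance in a programming language behaves similarly: a subclass gets the fields of its parent, but not the values stored in any particular parent object. The sketch below uses Python dataclasses as stand-ins for dcat:Resource and dcat:Dataset; placing title and description on the parent class is an assumption made for illustration, and the class names are not part of any Health-RI tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class Resource:
    """Stand-in for dcat:Resource with two example properties."""
    title: list[str] = field(default_factory=list)
    description: list[str] = field(default_factory=list)


@dataclass
class Dataset(Resource):
    """Stand-in for dcat:Dataset: inherits the fields above and adds its own."""
    distribution: list[str] = field(default_factory=list)


if __name__ == "__main__":
    resource = Resource(title=["A generic resource"])
    dataset = Dataset()
    # The inherited properties are available on Dataset ...
    print([f.name for f in fields(Dataset)])  # ['title', 'description', 'distribution']
    # ... but they start out empty: no value is copied from the Resource instance.
    print(dataset.title)  # []
```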
Nested classes: It is possible for a class to refer to another instance of the same class, e.g. dcat:Catalog pointing to another dcat:Catalog via the property dct:hasPart. This kind of nested structure can be used to describe the structure of an institution or infrastructure in more detail, for example if an institute (described by dcat:Catalog) is divided into several independent departments (each described with its own instance of dcat:Catalog) that produce and publish their own sets of dcat:Dataset.
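A nested catalogue structure of this kind can be sketched as a simple recursive data structure. The `Catalog` class, the `all_datasets` helper and the example.org IRIs below are hypothetical and only illustrate how dct:hasPart links catalogues to sub-catalogues while dcat:dataset links each catalogue to its datasets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Catalog:
    iri: str
    datasets: list[str] = field(default_factory=list)     # dataset IRIs linked via dcat:dataset
    has_part: list[Catalog] = field(default_factory=list)  # sub-catalogues linked via dct:hasPart


def all_datasets(catalog: Catalog) -> list[str]:
    """Collect dataset IRIs from a catalogue and, recursively, from its sub-catalogues."""
    found = list(catalog.datasets)
    for part in catalog.has_part:
        found.extend(all_datasets(part))
    return found


if __name__ == "__main__":
    # Placeholder IRIs for an institute with two departments.
    department_a = Catalog("https://example.org/catalog/dept-a",
                           datasets=["https://example.org/dataset/a1"])
    department_b = Catalog("https://example.org/catalog/dept-b",
                           datasets=["https://example.org/dataset/b1"])
    institute = Catalog("https://example.org/catalog/institute",
                        datasets=["https://example.org/dataset/i1"],
                        has_part=[department_a, department_b])
    for dataset_iri in all_datasets(institute):
        print(dataset_iri)
```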
Please note that in the current implementation of the Health-RI core schema, there is a limit to the theoretically indefinite flexibility that DCAT offers, especially since the National Health Data Catalogue cannot currently display these layers of nested structures. Read more about it below.
UML diagram of the v2 core metadata schema of Health-RI
Next steps
To map your metadata, you can follow the general tutorial https://health-ri.atlassian.net/wiki/spaces/FSD/pages/290291734/Mapping+pipeline. Then the metadata can be transformed into RDF format and exposed using a FAIR Data Point.
More information about this step can be found here: 4B Exposing metadata
Additional resources
Technical details on DCAT AP and FAIR Datapoints - Youtube video, Health-RI
HRI GitHub - You can find resources and examples on the Health-RI metadata GitHub.
Resources from the EU Open Data Explained, including a general training on metadata and basic and advanced level resources on DCAT and DCAT-AP.
FAIR Metrolines (note: some pages are under development):
Metroline Step: Assess availability of your metadata
Metroline Step: Register resource level metadata
Metroline Step: Analyse data semantics
Questions?
If you have questions about the onboarding process or would like to learn more, reach out to our service desk: https://www.health-ri.nl/health-ri-servicedesk