• In progress
  • Metadata mapping

    status: in development

     

    Introduction

    In this section, we describe the basics of metadata and explain what metadata mapping is. We also look at the Health-RI Core Metadata Schema and the metadata standards it builds upon. This page is intended for a general audience. For details on the standards and the schema, please visit the GitHub specifications intended for data experts and data stewards: https://github.com/Health-RI/health-ri-metadata/

    What is metadata

    Metadata is data about data. It describes various aspects of your data, such as what the data contains, who owns it and what type of data it is. In other words, metadata helps you understand and manage data effectively by providing additional information about it.

    Metadata serves as the backbone of effective data management and analysis in the life sciences and healthcare domains. It enables researchers, clinicians and policymakers to derive meaningful insights from vast amounts of data while ensuring its integrity, reliability and confidentiality. It does so by providing standardization, interoperability and machine readability for the shared metadata.

    Metadata standards

    A metadata standard is a set of rules, guidelines and conventions that define how metadata should be structured, formatted and described within a particular domain or context. Adhering to such standards ensures consistency, interoperability and effective management of metadata across different systems, organizations and disciplines.

    Health-RI uses the following widely adopted metadata standards:

    Dublin Core (DC): Dublin Core is a widely used metadata standard designed to provide a simple and standardized way to describe digital resources such as documents, web pages, images, videos and other types of content on the internet. It was originally developed in 1995 by the Dublin Core Metadata Initiative (DCMI).

    DCAT: Data Catalog Vocabulary is a metadata standard specifically designed for describing datasets and data catalogs on the web. DCAT is based on RDF (Resource Description Framework), which is a standard model for representing and exchanging metadata and data on the web in a machine-readable format (i.e. structured in a way that a computer can process).

    DCAT-AP: DCAT Application Profile for Data Portals in Europe is a metadata standard developed by the European Commission to facilitate the interoperability of data catalogs and portals across European countries. It builds upon the DCAT (Data Catalog Vocabulary) standard and extends it with additional requirements and recommendations tailored to the European context.
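
    Because DCAT is based on RDF, every piece of metadata is ultimately a statement with three parts: a subject, a predicate and an object. The minimal Python sketch below illustrates that idea using only the standard library. The subject IRI and the title are made-up examples, and the helper name expand is hypothetical; only the two namespace IRIs are the real DCAT and Dublin Core Terms namespaces.

```python
# Namespace prefixes commonly used in DCAT descriptions.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dct:title' into a full IRI."""
    prefix, local_part = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_part


# One RDF statement as a (subject, predicate, object) tuple.
# The subject IRI and the title are invented for illustration.
triple = (
    "https://example.org/dataset/1",
    expand("dct:title"),
    "Example dataset",
)
```

    A prefixed name such as dct:title is therefore only shorthand for a full web address, which is what lets different catalogues agree on what "title" means.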

    HRI Metadata Schema

    The National Health Data Catalogue currently works with a Core Metadata Schema. This Core Metadata Schema is a formal, shared conceptualisation of what is required to find and reuse information across Health-RI nodes via the National Catalogue. It represents a minimal set of elements for describing each resource (including datasets) with common metadata. The current version of the Core Metadata Schema includes DCAT v3 and a selection of mandatory DCAT-AP classes and their definitions.

    The set is split into several classes describing the data. At the moment four classes (Dataset, Catalog, Resource, and Agent) are mandatory. Each class is populated by a set of mandatory and recommended variables. You can find all of the descriptions of variables and classes here: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/121110529

     

     

    Figure (HRICoreMetadataSchemaReleasePlateau1.jpg): Health-RI Core Metadata Schema, with four mandatory classes (Dataset, Catalog, Resource, Agent) and their variables labelled as either mandatory or recommended.

     

    What is metadata mapping

    Metadata mapping and creation of a metadata schema will likely require involvement of a semantic expert, data steward or equivalent.

    Metadata mapping is the process of establishing connections between corresponding metadata values or fields across different systems. In simple terms, it ensures that the metadata describing your data is correctly transformed to the HRI metadata schema. It involves identifying pieces of metadata in one system and linking them to the equivalent fields or data elements in another system. This mapping keeps disparate datasets or databases consistent with each other, which makes data integration and interoperability possible. By associating equivalent metadata values or fields, metadata mapping lets systems exchange information and supports accurate data discovery, retrieval and interpretation.
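
    The core of the mapping described above can be sketched as a lookup table from local field names to target properties. This is a minimal illustration, not the Health-RI mapping pipeline: the local field names (study_name, summary, released_on, licence_url, internal_code) and the function map_record are hypothetical, and a real mapping also has to deal with value formats and nested classes.

```python
# Hypothetical mapping from local field names to target properties.
FIELD_MAP = {
    "study_name": "dct:title",
    "summary": "dct:description",
    "released_on": "dct:issued",
    "licence_url": "dct:license",
}


def map_record(record: dict) -> tuple:
    """Return (mapped, unmapped).

    mapped holds the values keyed by target property; unmapped lists the
    local field names for which no target property is defined yet.
    """
    mapped = {}
    unmapped = []
    for field, value in record.items():
        target = FIELD_MAP.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped


# Invented local record; "internal_code" has no counterpart in FIELD_MAP.
local_record = {
    "study_name": "PRISMA Questionnaire data",
    "summary": "The questionnaire data in CSV format.",
    "internal_code": "Q-17",
}
mapped, unmapped = map_record(local_record)
```

    Returning the unmapped fields separately makes gaps visible, so a data steward can decide whether a field needs a target property or can be left out.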

    Below is an example of metadata from the PRISMA study. It contains information about the data available:

    Class | Property | Property Label | Example
    --- | --- | --- | ---
    dcat:Catalog | dct:description | Description | The primary aim of the PRISMA study is to investigate the potential value of risk-tailored versus traditional breast cancer screening protocols in the Netherlands. Data collection took place between 2014-2019, resulting in ∼67,000 mammograms, ∼38,000 surveys, ∼10,000 blood samples and ∼600 saliva samples.
     | dct:publisher | Publisher | foaf:Agent
     | dct:title | Title | Personalised RISk-based MAmmascreening Study (PRISMA)
    dcat:Dataset | dcat:contactPoint | Contact Point | vcard:Kind
     | dct:creator | Creator | foaf:Agent
     | dct:description | Description | The extensive questionnaire covers a number of potential breast cancer risk predictors such as demographics, personal characteristics, reproductive characteristics, medication, lifestyle, health status, family history, psychosocial characteristics.
     | dct:issued | Release date | 2024-07-02T10:49:07
     | dct:identifier | Identifier | https://fdp.radboudumc.nl/dataset/37d6ad17-aa35-425c-946e-855838d3f9cc
     | dct:modified | Modified | 2024-09-09T08:54:32
     | dct:publisher | Publisher | foaf:Agent
     | dcat:theme | Theme | http://publications.europa.eu/resource/authority/data-theme/HEAL
     | dct:title | Title | PRISMA Questionnaire data
     | dct:license | License | https://data.ru.nl/doc/dua/RUMC-RA-DUA-1.0.html
    dcat:Distribution | dcat:accessURL | Access URL | DOI (not yet available)
     | dcat:mediaType | Format | https://www.iana.org/assignments/media-types/text/csv
     | dct:title | Title | PRISMA Questionnaire data - CSV format
     | dct:description | Description | The questionnaire data in CSV format.
    foaf:Agent | foaf:name | name | Radboudumc (Publisher)
     | dct:identifier | identifier | https://ror.org/05wg1m734 (Publisher)
    vcard:Kind | vcard:hasEmail | has email | firstname.lastname@radboudumc.nl
     | vcard:hasName | has name | J. Doe
    foaf:Agent | foaf:name | name | J. Doe (Creator)
     | dct:identifier | identifier | https://orcid.org/0000-0000-0000-0000 (Creator)

    Here is the same metadata mapped to the Health-RI Core Metadata Schema. It contains the same information, but it is now machine readable and in a format that is widely used on the web.

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Catalog description
    <https://fdp.radboudumc.nl/catalogue/prisma> a dcat:Catalog ;
        dct:title "Personalised RISk-based MAmmascreening Study (PRISMA)" ;
        dct:description "The primary aim of the PRISMA study is to investigate the potential value of risk-tailored versus traditional breast cancer screening protocols in the Netherlands. Data collection took place between 2014-2019, resulting in ∼67,000 mammograms, ∼38,000 surveys, ∼10,000 blood samples and ∼600 saliva samples." ;
        dct:publisher [
            a foaf:Agent ;
            foaf:name "Radboudumc (Publisher)" ;
            dct:identifier <https://ror.org/05wg1m734>
        ] ;
        dcat:dataset <https://fdp.radboudumc.nl/dataset/37d6ad17-aa35-425c-946e-855838d3f9cc> .

    # Dataset description
    <https://fdp.radboudumc.nl/dataset/37d6ad17-aa35-425c-946e-855838d3f9cc> a dcat:Dataset ;
        dct:title "PRISMA Questionnaire data" ;
        dct:description "The extensive questionnaire covers a number of potential breast cancer risk predictors such as demographics, personal characteristics, reproductive characteristics, medication, lifestyle, health status, family history, psychosocial characteristics." ;
        dct:issued "2024-07-02T10:49:07"^^xsd:dateTime ;
        dct:modified "2024-09-09T08:54:32"^^xsd:dateTime ;
        dct:identifier <https://fdp.radboudumc.nl/dataset/37d6ad17-aa35-425c-946e-855838d3f9cc> ;
        dct:creator [
            a foaf:Agent ;
            foaf:name "J. Doe (Creator)" ;
            dct:identifier <https://orcid.org/0000-0000-0000-0000>
        ] ;
        dct:publisher [
            a foaf:Agent ;
            foaf:name "Radboudumc (Publisher)" ;
            dct:identifier <https://ror.org/05wg1m734>
        ] ;
        dcat:theme <http://publications.europa.eu/resource/authority/data-theme/HEAL> ;
        dct:license <https://data.ru.nl/doc/dua/RUMC-RA-DUA-1.0.html> ;
        dcat:distribution <https://fdp.radboudumc.nl/distribution/csv> ;
        dcat:contactPoint [
            a vcard:Kind ;
            vcard:hasEmail <mailto:firstname.lastname@radboudumc.nl> ;
            vcard:fn "J. Doe"
        ] .

    # Distribution details (CSV); the access URL is a placeholder until a DOI is available
    <https://fdp.radboudumc.nl/distribution/csv> a dcat:Distribution ;
        dcat:accessURL <doi:not_yet_available> ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
        dct:title "PRISMA Questionnaire data - CSV format" ;
        dct:description "The questionnaire data in CSV format." .

    # Agent description (Publisher)
    <https://ror.org/05wg1m734> a foaf:Agent ;
        foaf:name "Radboudumc (Publisher)" ;
        dct:identifier <https://ror.org/05wg1m734> .

    # Agent description (Creator)
    <https://orcid.org/0000-0000-0000-0000> a foaf:Agent ;
        foaf:name "J. Doe (Creator)" ;
        dct:identifier <https://orcid.org/0000-0000-0000-0000> .

     

    To map your metadata, you first need to understand its structure and semantic meaning, as well as the ontology (vocabulary) used to describe your data in Resource Description Framework (RDF) format, in our case DCAT v3. The general outline of the mapping pipeline can be found here: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/edit-v2/290291734?draftShareId=ff45a2e2-80ee-49aa-b6d6-c04dedb6f9f8
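
    Once every local field has a target property, writing the result in an RDF syntax such as Turtle is largely mechanical. The sketch below shows the idea with the Python standard library only. The function names (literal, iri, to_turtle) and the example.org subject are hypothetical; the @prefix declarations are left out and would need to be added before the output is valid Turtle. In practice an RDF library would normally handle serialization.

```python
def literal(text: str) -> str:
    """Format a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def iri(value: str) -> str:
    """Wrap an absolute IRI in angle brackets."""
    return f"<{value}>"


def to_turtle(subject: str, rdf_type: str, properties: dict) -> str:
    """Serialize one resource as a Turtle statement block.

    properties maps prefixed property names to objects that were already
    formatted with literal() or iri().
    """
    lines = [f"{iri(subject)} a {rdf_type}"]
    for prop, obj in properties.items():
        lines.append(f"    {prop} {obj}")
    return " ;\n".join(lines) + " ."


# Invented subject IRI; the title and licence come from the example above.
turtle = to_turtle(
    "https://example.org/dataset/1",
    "dcat:Dataset",
    {
        "dct:title": literal("PRISMA Questionnaire data"),
        "dct:license": iri("https://data.ru.nl/doc/dua/RUMC-RA-DUA-1.0.html"),
    },
)
```

    Keeping literal values and IRIs in separate helpers reflects a distinction that matters in RDF: a title is text, whereas a licence or theme is a reference to another resource.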

    Next steps

    After mapping your data properties to the classes and variables of the HRI model, you need to validate the result. This step ensures that the new model both accurately represents the original data and adheres to the HRI metadata structure.

    Once your RDF data is ready, you can publish it to a FAIR Data Point, where it can be harvested by the Catalogue. More information about this step can be found here: https://health-ri.atlassian.net/wiki/spaces/FSD/pages/279183386
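
    One part of validation is checking that every mandatory variable of a class has a value. The sketch below shows that check in plain Python. The MANDATORY table is a hypothetical placeholder, not the official list: the authoritative mandatory and recommended variables are those in the schema specification. RDF data is commonly validated with SHACL shapes; this sketch only illustrates the principle.

```python
# Hypothetical mandatory properties per class, for illustration only.
MANDATORY = {
    "dcat:Dataset": {"dct:title", "dct:description", "dct:identifier"},
    "dcat:Catalog": {"dct:title", "dct:description"},
}


def missing_properties(rdf_type: str, properties: dict) -> list:
    """Return the mandatory properties that are absent or empty, sorted by name."""
    required = MANDATORY.get(rdf_type, set())
    present = {
        prop for prop, value in properties.items() if value not in (None, "")
    }
    return sorted(required - present)


# Invented record: the description is empty and the identifier is missing.
dataset = {
    "dct:title": "PRISMA Questionnaire data",
    "dct:description": "",
}
problems = missing_properties("dcat:Dataset", dataset)
```

    An empty list means all mandatory variables in the table are filled in; it does not confirm that the values themselves are correct, which still requires comparison with the original data.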

     

    Additional resources

    Technical details on DCAT-AP and FAIR Data Points - YouTube video, Health-RI

    HRI GitHub - You can find resources and examples on the Health-RI metadata GitHub.

    Resources from the EU Open Data Explained, including a general training on metadata and basic and advanced level resources on DCAT and DCAT-AP.

    FAIR Metrolines (note: some pages are under development):

    Metroline Step: Register resource level metadata

    Metroline Step: Analyse data semantics

    Metroline Step: Apply (meta)data model

    Metroline Step: Create or reuse a semantic (meta)data model