Mapping pipeline

status: in development

1 Introduction
2 General metadata mapping pattern
3 Additional resources
4 Questions?

Introduction

This section describes the metadata mapping process and the steps to follow. It is intended for data stewards, data experts, and people in equivalent roles. For a general overview, see the page 4A Metadata mapping.

General metadata mapping pattern

1. Understand your data and metadata to be onboarded

Before starting the mapping process, it is crucial to understand the structure of your metadata and the semantic meaning of each column.

Next, you need to extract and curate the metadata from the dedicated databases at source. The output of this step is metadata that is sourced, cleaned, wrangled, and ready to go through the transformation pipeline.

Also, in this step you decide how each piece of data relates to RDF concepts like classes, properties, and entities.

Each file (e.g., CSV or JSON) describes a dataset, resource, image or sample.

Each row in the CSV can be mapped to the target classes and properties of the Core Metadata Schema (GitHub repository Health-RI/health-ri-metadata, master branch).
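A quick way to get a first overview of the source metadata is to count how often each column is actually filled. The sketch below assumes the metadata has already been extracted to CSV; the sample content and the helper name profile_columns are hypothetical and only serve as illustration.

```python
import csv
import io

# Hypothetical sample metadata; in practice this would be your extracted CSV file.
SAMPLE_CSV = """identifier,title,description,theme
ds-001,Cohort study A,Baseline measurements,
ds-002,Imaging set B,,health
"""


def profile_columns(csv_text):
    """Return the row count and, per column, the number of non-empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            if value and value.strip():
                counts[name] += 1
    return total, counts


total_rows, filled = profile_columns(SAMPLE_CSV)
for column, count in filled.items():
    print(f"{column}: {count}/{total_rows} rows filled")
```

Columns that are rarely filled are worth checking before mapping, because they may turn out to be mandatory in the target schema.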

2. Understand the ontology (DCAT)

An ontology defines the vocabulary (classes, properties, etc.) used to describe your data in RDF.

In our case, we use DCAT v3 for transformation purposes and DCAT-AP for evaluation purposes. DCAT-AP is an application profile of DCAT: it adds constraints on top of the vocabulary, such as which properties are mandatory and which values they may take.

This step is vital for ensuring interoperability and making your data understandable and reusable by others.

3. Define URIs for each row

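Determine what each row in your CSV represents, for example a dataset or a sample, and assign that resource a URI.

The URI acts as a unique identifier for the resource in the RDF world, so it should stay stable once it has been published.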
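As an illustration, a URI can be derived from an identifier column in each row. This is a minimal sketch: the namespace https://example.org/dataset/ and the function name mint_uri are assumptions, and you would use a namespace that your organisation controls.

```python
from urllib.parse import quote

# Hypothetical base namespace; replace with a namespace your organisation controls.
BASE_URI = "https://example.org/dataset/"


def mint_uri(local_id, base=BASE_URI):
    """Build a URI from the value of a row's identifier column."""
    cleaned = local_id.strip()
    if not cleaned:
        raise ValueError("cannot mint a URI from an empty identifier")
    # Percent-encode every character that is not safe in a URI path segment.
    return base + quote(cleaned, safe="")


print(mint_uri("ds-001"))
print(mint_uri("cohort A/2023"))
```

Deriving the URI from an existing, persistent identifier rather than from the row number keeps it the same when the CSV is re-exported or reordered.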

4. Map Columns to Properties

Each column in the CSV usually corresponds to a property of your primary resources. Map each column to an RDF property defined in your ontology.

For instance, a column named "title" might map to the property dct:title (Dublin Core Terms), which DCAT uses to give a title to any dcat:Resource.
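The column mapping can be written down as a simple lookup table before any tool is chosen. In the sketch below the column names are hypothetical; the two namespace IRIs are the published DCAT and Dublin Core Terms namespaces. DCAT reuses Dublin Core Terms for titles and descriptions, so those columns point to dct: properties here.

```python
# Namespace prefixes for the vocabularies used in the mapping.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Hypothetical CSV column names mapped to prefixed property names.
COLUMN_TO_PROPERTY = {
    "title": "dct:title",
    "description": "dct:description",
    "keyword": "dcat:keyword",
    "theme": "dcat:theme",
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'dct:title' into a full IRI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


def property_for(column):
    """Return the full property IRI for a column, or None if it is unmapped."""
    prefixed = COLUMN_TO_PROPERTY.get(column)
    return expand(prefixed) if prefixed else None


print(property_for("title"))
print(property_for("internal_notes"))
```

Returning None for unmapped columns makes it easy to list which columns still need a decision.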

5. Convert Values

Transform the values in each cell into RDF literals or resources, depending on their nature. For literal values (e.g., names, descriptions), use the cell's content directly. For values representing relationships or references to other entities, you will need to create or use existing URIs, linking to controlled vocabularies.
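The difference between the two kinds of values can be sketched as follows, using N-Triples syntax for the output terms. The function names and the lookup table are hypothetical; the theme IRI is shown as an example of a controlled-vocabulary term and should be checked against the vocabulary you actually use.

```python
def to_literal(value, lang=None):
    """Serialise a cell value as an N-Triples literal, optionally language-tagged."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    literal = f'"{escaped}"'
    return f"{literal}@{lang}" if lang else literal


# Hypothetical lookup from local codes to controlled-vocabulary IRIs.
THEME_IRIS = {
    "health": "http://publications.europa.eu/resource/authority/data-theme/HEAL",
}


def to_resource(code, vocabulary=THEME_IRIS):
    """Serialise a coded cell value as an N-Triples IRI reference."""
    iri = vocabulary.get(code.strip().lower())
    if iri is None:
        raise KeyError(f"no controlled-vocabulary term for {code!r}")
    return f"<{iri}>"


print(to_literal("Cohort study A", lang="en"))
print(to_resource("Health"))
```

Raising an error for unknown codes, instead of silently emitting a literal, surfaces values that still need to be aligned with a vocabulary.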

6. Use a Mapping Language or Tool

Several languages and tools can automate the mapping process from CSV to RDF, such as:

RML (RDF Mapping Language): An extension of R2RML for mapping various file formats, including CSV, to RDF.
Tarql: A command-line tool for converting CSV files to RDF using SPARQL 1.1 CONSTRUCT queries.
OpenRefine: A tool for cleaning and transforming messy data, which can export RDF through extensions.

7. Create RDF Triples

Using the mappings you have defined, generate RDF triples for each row in your file. Each triple consists of a subject (the resource URI), a predicate (the property URI), and an object (the value or another resource URI).
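To make the subject, predicate, object structure concrete, the sketch below turns CSV rows into N-Triples lines using only the Python standard library. It is not a replacement for a mapping tool: the namespace, column names, and sample rows are hypothetical, and it assumes the identifier values are already safe to use in a URI.

```python
import csv
import io

BASE_URI = "https://example.org/dataset/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"

# Hypothetical column names mapped to full property IRIs.
COLUMN_TO_PROPERTY = {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
}

SAMPLE_CSV = """identifier,title,description
ds-001,Cohort study A,Baseline measurements
ds-002,Imaging set B,
"""


def escape(value):
    """Escape characters that are special inside an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(csv_text):
    """Return one N-Triples line per generated triple."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE_URI}{row['identifier']}>"
        # One triple stating what kind of resource the row describes.
        lines.append(f"{subject} <{RDF_TYPE}> <{DCAT_DATASET}> .")
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = (row.get(column) or "").strip()
            if value:  # empty cells produce no triple
                lines.append(f'{subject} <{prop}> "{escape(value)}" .')
    return lines


triples = rows_to_ntriples(SAMPLE_CSV)
print("\n".join(triples))
```

Each row yields one rdf:type triple plus one triple per filled, mapped column, so the second sample row produces no description triple.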

8. Validate and Refine

After converting your data, validate the RDF output to ensure it accurately represents your original data and adheres to the ontology's structure. You may need to refine your mappings or data to correct any issues.

The Health-RI RDF Validator, which uses SHACL shapes, can be found here. The GitHub repository is available here. (Note: this repository and all SHACL shapes are still under active development.)
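Before running a full SHACL validation, a lightweight pre-check can catch resources that lack mandatory properties. The sketch below is not SHACL and does not reproduce the Health-RI shapes; the set of required properties and the function name missing_properties are assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed set of mandatory properties; take the real list from the SHACL shapes.
REQUIRED = {
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/description",
}


def missing_properties(triples, required=REQUIRED):
    """Map each subject to the sorted list of required properties it lacks.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Subjects with all required properties are left out of the result.
    """
    seen = defaultdict(set)
    for subject, predicate, _obj in triples:
        seen[subject].add(predicate)
    return {
        subject: sorted(required - predicates)
        for subject, predicates in seen.items()
        if required - predicates
    }


sample = [
    ("https://example.org/dataset/ds-001", "http://purl.org/dc/terms/title", "Cohort study A"),
    ("https://example.org/dataset/ds-001", "http://purl.org/dc/terms/description", "Baseline"),
    ("https://example.org/dataset/ds-002", "http://purl.org/dc/terms/title", "Imaging set B"),
]
report = missing_properties(sample)
for subject, missing in report.items():
    print(subject, "is missing", missing)
```

A check like this only covers property presence; cardinality, datatypes, and value constraints still need the SHACL validator.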

9. Share and publish your validated metadata graph through a FAIR Data Point (FDP)

Once your RDF data is ready, consider how you will share or publish it to make it accessible to your community, for instance through an FDP. This might involve hosting it on a SPARQL endpoint, within a triple store, or through other data publishing platforms.