Mapping pipeline
status: in development
- 1 Introduction
- 2 General metadata mapping pattern
- 2.1 1. Understand your data and metadata to be onboarded
- 2.2 2. Understand the ontology (DCAT)
- 2.3 3. Define URIs for each row
- 2.4 4. Map Columns to Properties
- 2.5 5. Convert Values
- 2.6 6. Use a Mapping Language or Tool
- 2.7 7. Create RDF Triples
- 2.8 8. Validate and Refine
- 2.9 9. Share and Publish your validated metadata graph as FDP
- 3 Additional resources
- 4 Questions?
Introduction
In this section, we describe the process of metadata mapping and the steps you should take. This page is intended for data stewards, data experts, or equivalent roles. For a broader introduction, please refer to our Metadata mapping overview: https://health-ri.atlassian.net/wiki/pages/createpage.action?spaceKey=FSD&title=2.%20Metadata%20mapping&linkCreation=true&fromPageId=290291734
General metadata mapping pattern
1. Understand your data and metadata to be onboarded
Before starting the mapping process, it is crucial to understand the structure of your metadata and the semantic meaning of each column.
Next, extract and curate the metadata from the dedicated source databases. The output of this step is metadata that is sourced, cleaned, wrangled, and ready to go through the transformation pipeline.
In this step you also decide how each piece of data relates to RDF concepts such as classes, properties, and entities.
Each file (e.g., CSV or JSON) describes a dataset, resource, image, or sample.
Each row in the CSV can be mapped to the target classes and target properties in the Core Metadata Schema (GitHub: Health-RI/health-ri-metadata).
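One way to get a first impression of a CSV file is to count how often each column is actually filled in. The snippet below is a minimal sketch, not part of the Health-RI pipeline; the sample data, the column names, and the function name `profile_columns` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample metadata; in practice you would read your own file.
SAMPLE_CSV = "id,title,contact\nds1,Heart study,\nds2,,steward@example.org\n"


def profile_columns(text):
    """Return the number of rows and, per column, how many rows have a value."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            # Treat missing cells and whitespace-only cells as empty.
            if (row.get(name) or "").strip():
                counts[name] += 1
    return total, counts


total_rows, filled = profile_columns(SAMPLE_CSV)
for column, count in filled.items():
    print(f"{column}: {count}/{total_rows} rows filled")
```

Columns that are rarely filled in are worth checking before you decide whether and how to map them.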
2. Understand the ontology (DCAT)
An ontology defines the vocabulary (classes, properties, etc.) used to describe your data in RDF.
In our case, we use DCAT v3 for transformation purposes and DCAT-AP for evaluation purposes. DCAT-AP is a constraint model: it specifies which fields are mandatory and which other constraints apply.
This step is vital for ensuring interoperability and making your data understandable and reusable by others.
3. Define URIs for each row
Determine what each row in your CSV represents (for example, a dataset or a sample) and assign it a URI.
The URI acts as a unique identifier for the resource in the RDF world.
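A common approach is to combine a fixed base namespace with a stable identifier taken from the row. The following is a minimal sketch under that assumption; the base namespace `https://example.org/dataset/` and the function name `mint_uri` are hypothetical and should be replaced by a namespace you control.

```python
from urllib.parse import quote

# Hypothetical base namespace; use a namespace that your organisation controls.
BASE_URI = "https://example.org/dataset/"


def mint_uri(local_id, base=BASE_URI):
    """Build a URI from the identifier found in a row."""
    cleaned = local_id.strip()
    if not cleaned:
        raise ValueError("a row needs a non-empty identifier to get a URI")
    # Percent-encode every character that is not safe in a URI path segment.
    return base + quote(cleaned, safe="")


print(mint_uri("cohort 42"))
```

Using an identifier that does not change between exports keeps the URI stable when the metadata is regenerated.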
4. Map Columns to Properties
Each column in the CSV usually corresponds to a property of your primary resources. Map each column to an RDF property defined in your ontology.
For instance, a column named "title" might map to the property dcterms:title, which DCAT uses to describe instances of the dcat:Resource class.
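Such a mapping can be written down as a simple lookup table before any tool is involved. The sketch below is an illustration only: the column names and the helper `properties_for` are assumptions, while the two namespace URIs are the standard ones for Dublin Core Terms (from which DCAT reuses properties such as the title) and DCAT.

```python
# Standard namespace URIs for Dublin Core Terms and DCAT.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
}

# Hypothetical CSV column names mapped to prefixed property names.
COLUMN_TO_PROPERTY = {
    "title": "dcterms:title",
    "description": "dcterms:description",
    "keyword": "dcat:keyword",
    "theme": "dcat:theme",
}


def expand(curie):
    """Turn a prefixed name such as 'dcterms:title' into a full property URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def properties_for(header):
    """Split a CSV header into mapped columns (with URIs) and unmapped columns."""
    mapped = {
        column: expand(COLUMN_TO_PROPERTY[column])
        for column in header
        if column in COLUMN_TO_PROPERTY
    }
    unmapped = [column for column in header if column not in COLUMN_TO_PROPERTY]
    return mapped, unmapped


mapped, unmapped = properties_for(["title", "keyword", "internal_code"])
print(mapped)
print("Columns without a mapping:", unmapped)
```

Listing the unmapped columns explicitly makes it easier to decide whether they should be dropped or need a property of their own.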
5. Convert Values
Transform the values in each cell into RDF literals or resources, depending on their nature. For literal values (e.g., names, descriptions), use the cell's content directly. For values representing relationships or references to other entities, you will need to create or use existing URIs, linking to controlled vocabularies.
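The difference between the two kinds of values can be sketched as follows. This is a simplified illustration that produces N-Triples-style terms; the vocabulary entries use placeholder `example.org` URIs and the function names are assumptions, so substitute the controlled vocabulary you actually use.

```python
# Hypothetical lookup from cell values to controlled-vocabulary URIs.
THEME_VOCABULARY = {
    "health": "https://example.org/vocab/theme/health",
    "imaging": "https://example.org/vocab/theme/imaging",
}


def to_literal(value, lang=None):
    """Wrap a cell value as an RDF literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    literal = f'"{escaped}"'
    return f"{literal}@{lang}" if lang else literal


def to_resource(value, vocabulary):
    """Replace a cell value by the URI of the matching vocabulary term."""
    key = value.strip().lower()
    if key not in vocabulary:
        raise KeyError(f"no vocabulary term found for {value!r}")
    return f"<{vocabulary[key]}>"


print(to_literal('Study of "early" onset', lang="en"))
print(to_resource(" Health ", THEME_VOCABULARY))
```

Raising an error for unknown terms, rather than silently emitting a literal, surfaces values that still need curation.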
6. Use a Mapping Language or Tool
Several languages and tools can automate the mapping process from CSV to RDF, such as:
- RML (RDF Mapping Language): An extension of R2RML for mapping various file formats, including CSV, to RDF.
- Tarql (SPARQL for Tables): A command-line tool for converting CSV files to RDF using SPARQL CONSTRUCT queries.
- OpenRefine: A powerful tool for working with messy data, including features for converting data to RDF.
7. Create RDF Triples
Using the mappings you have defined, generate RDF triples for each row in your file. Each triple consists of a subject (the resource URI), a predicate (the property URI), and an object (the value or another resource URI).
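To make the subject, predicate, and object structure concrete, the sketch below turns one CSV row into N-Triples lines by hand. It is an illustration only, not a replacement for a mapping tool: the base namespace, the sample data, and the column names are assumptions, and typing every row as dcat:Dataset is a choice made for this example.

```python
import csv
import io

BASE_URI = "https://example.org/dataset/"  # hypothetical namespace
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"

# Hypothetical sample input with one row.
SAMPLE_CSV = "id,title,description\nds1,Heart study,Cohort of cardiac patients\n"


def escape(text):
    """Escape characters that are not allowed verbatim in an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row):
    """Create one type triple plus one triple per non-empty mapped column."""
    subject = f"<{BASE_URI}{row['id']}>"
    triples = [f"{subject} <{RDF_TYPE}> <{DCAT_DATASET}> ."]
    for column in ("title", "description"):
        value = (row.get(column) or "").strip()
        if value:
            triples.append(f'{subject} <{DCTERMS}{column}> "{escape(value)}" .')
    return triples


def csv_to_ntriples(text):
    """Convert every row of the CSV text into N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        lines.extend(row_to_triples(row))
    return lines


for line in csv_to_ntriples(SAMPLE_CSV):
    print(line)
```

Empty cells are skipped so that no triples with empty literals end up in the graph.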
8. Validate and Refine
After converting your data, validate the RDF output to ensure it accurately represents your original data and adheres to the ontology's structure. You may need to refine your mappings or data to correct any issues.
Health-RI provides an RDF Validator based on SHACL shapes, together with a GitHub repository. (Note: This repository and all SHACL shapes are still under active development.)
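Before running a full SHACL validation, a quick check for missing properties can catch obvious gaps. The sketch below is not a substitute for SHACL validation and does not use the Health-RI shapes; the set of required properties is an assumption chosen for illustration, and the function name `missing_properties` is hypothetical.

```python
# Assumed set of required properties; consult the actual shapes for the real list.
REQUIRED = {
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/description",
}


def missing_properties(triples, required=REQUIRED):
    """Return, per subject, the required properties that never occur."""
    seen = {}
    for subject, predicate, _obj in triples:
        seen.setdefault(subject, set()).add(predicate)
    return {
        subject: sorted(required - predicates)
        for subject, predicates in seen.items()
        if required - predicates
    }


# Hypothetical triples as (subject, predicate, object) tuples.
triples = [
    ("https://example.org/dataset/ds1", "http://purl.org/dc/terms/title", "Heart study"),
    ("https://example.org/dataset/ds1", "http://purl.org/dc/terms/description", "Cohort"),
    ("https://example.org/dataset/ds2", "http://purl.org/dc/terms/title", "Lung study"),
]

for subject, missing in missing_properties(triples).items():
    print(subject, "is missing", missing)
```

Subjects that appear in the result need their source rows or mappings refined before the graph is validated again.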
9. Share and Publish your validated metadata graph as FDP
Once your RDF data is ready, consider how you will share or publish it to make it accessible to your community, for instance through a FAIR Data Point (FDP). This might involve hosting it on a SPARQL endpoint, within a triple store, or through other data publishing platforms.
Additional resources
HRI SHACL shapes: health-ri-metadata/Formalisation(shacl)/Core at master · Health-RI/health-ri-metadata
Core Metadata Schema Specification
Example of a metadata graph: https://github.com/Health-RI/health-ri-metadata/tree/master/MapToDCAT-AP/Metadata%20graphs%20-%20Examples
Example of mapping: https://github.com/Health-RI/health-ri-metadata/tree/master/MapToDCAT-AP/Example
Image2Catalog (GitHub: Health-RI/img2catalog): Repository for a tool to help make XNAT into a FAIR Data Point. This tool queries an XNAT instance and generates DCAT-AP 3.0 metadata.
Questions?
If you have questions about the onboarding process or would like to learn more, reach out to our Health-RI Servicedesk.