Metroline Step: Create or reuse a semantic (meta)data model
Status: On hold. On 17-9-2024 it was decided to put this page on hold and focus on describing the petal process first. When that part is finished, parts of the information (see e.g. step 4) will be generalised for this page.
Short description
`Generating a semantic model is often the most time-consuming step of data FAIRification. However, we expect the modelling effort to diminish as more and more models are made available for reuse over time, especially if such models are treated as FAIR digital objects themselves. Thus, it is important to first check whether a semantic model already exists for the data and the metadata that may be reused. For cases where no semantic model is available a new one needs to be generated.` (Generic)
The semantic model for a dataset describes the meaning of entities (data elements) and their relations in the dataset accurately, unambiguously, and in a computer-actionable way [GOFAIR_Process]. This model can then be applied to the non-FAIR data to transform it into linkable data, which can be queried. Given that generating a semantic model is often the most time-consuming part of the FAIRification process, it is important to first check whether a semantic model is already available for reuse. Creating such a model from scratch requires domain expertise on the dataset and expertise in semantic modeling.
For metadata, semantic models describing generic items are available. For example, DCAT can be used to describe a data set [Generic].
Why is this step important
Semantic modelling makes your data and metadata machine-actionable, enabling secondary use of your data. After performing this step, your data is represented as FAIR Digital Objects (FDOs): digital objects identified by a Globally Unique, Persistent and Resolvable IDentifier (GUPRID) and described by metadata. This enables the transformed FAIR data set to be efficiently incorporated into other systems, analysis workflows, and unforeseen future applications.
How to
(I) Reusing a semantic (meta)data model
Given that generating a semantic model is often the most time-consuming part of the FAIRification process, it is highly recommended to first check whether a semantic model is already available for reuse.
If you would like to include your dataset in the National Health Data Catalogue, your metadata needs to use Health-RI’s Core Metadata Schema. For more information about this and how to apply it, please refer to the Metadata mapping section.
For metadata, semantic models describing generic items are available for reuse, e.g., DCAT for describing a dataset. Domain-specific items should be decided upon by each individual, self-identified domain and thereafter be described in a semantic metadata model [Generic].
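To illustrate what reusing such a generic model can look like in practice, the minimal sketch below builds a DCAT description of a dataset with Python and the rdflib library (an assumption of this example). The dataset IRI, title, description and publisher are invented placeholders, and a catalogue such as the National Health Data Catalogue will typically require additional properties defined in its metadata schema.

```python
# Minimal sketch of a DCAT dataset description using rdflib (assumed installed).
# The dataset IRI, title, description and publisher below are invented placeholders.
from rdflib import Graph, Literal, URIRef, RDF
from rdflib.namespace import DCAT, DCTERMS

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = URIRef("https://example.org/dataset/example-study")  # hypothetical identifier
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example study dataset", lang="en")))
g.add((dataset, DCTERMS.description, Literal("Illustrative dataset description.", lang="en")))
g.add((dataset, DCTERMS.publisher, URIRef("https://example.org/organisation/example-institute")))

print(g.serialize(format="turtle"))
```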
Some examples of existing metadata models:
The BEAT-COVID metadata model allows the metadata of BEAT-COVID data resources to be stored and managed by FAIR Data Points.
Some examples of existing data models:
EJP RD has defined a semantic model for its Common Data Elements (CDEs), the CARE-SM. In this model, the common data elements are annotated with ontology terms, and the relationships between the data elements are likewise defined by ontologies.
The OMOP Common Data Model is another data model that provides standardized vocabularies.
(II) Creating a semantic (meta)data model
Building a semantic (meta)data model can be broken down into four steps, followed by an optional evaluation step:
Step 1: Create a conceptual model
Start by creating an abstract conceptual model:
list the main concepts (classes) of the data elements to be FAIRified;
identify the relationships between these data elements.
It is important that both the data representation (format) and the meaning of the data elements (the data semantics) are clear and unambiguous (see Analyse data semantics).
To help you decide what to include in your model, you can start by creating a list of questions (competency questions). These can serve as a guide for identifying the most relevant (meta)data elements to model.
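As a purely illustrative example (all class, relation and question names below are invented), a first conceptual model for a small clinical dataset could be captured as plain structures before any ontology terms are chosen:

```python
# Sketch of a conceptual model for a hypothetical clinical dataset.
# All classes, relations and competency questions are invented examples.
classes = ["Patient", "Visit", "Measurement", "Biomarker"]

# (subject class, relation, object class)
relations = [
    ("Patient", "has visit", "Visit"),
    ("Visit", "has measurement", "Measurement"),
    ("Measurement", "measures", "Biomarker"),
]

competency_questions = [
    "Which biomarkers were measured for a given patient?",
    "How many visits does each patient have?",
    "Which measurements were taken during a given visit?",
]

for subject, relation, obj in relations:
    print(f"{subject} --[{relation}]--> {obj}")
```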
Step 2: Search for ontology terms
Next, the concepts and the relations between the data elements in the data set are substituted with machine-readable classes and properties from ontologies, vocabularies and thesauri. An ontology is a formal representation of domain knowledge in which concepts are organised hierarchically; ontologies generally serve the FAIRification process best. More information about ontologies can be found in the FAIR Cookbook and the RDMkit.
Ontologies, and the concepts and properties that they describe, can be found using search engines, such as:
Search engine | Short description |
---|---|
BioPortal | A repository of biomedical ontologies. |
Ontology Lookup Service (OLS) | A repository for biomedical ontologies (by EMBL-EBI) that aims to provide a single point of access to the latest ontology versions. You can browse the ontologies through the website as well as programmatically via the OLS API. |
OBO Foundry | Develops interoperable ontologies for the biomedical sciences. Participants follow and contribute to the development of a set of principles to ensure that ontologies are logically well-formed and scientifically accurate. |
BARTOC | The Basic Register of Thesauri, Ontologies & Classifications: a database of Knowledge Organization Systems (KOS) and KOS-related registries, with the goal of listing as many Knowledge Organization Systems as possible in one place. |
Ontobee | A web-based linked data server and browser specifically designed for ontology terms. It supports ontology visualisation, query and development, and provides a web interface for displaying the details and hierarchy of a specific ontology term. |
AgroPortal | A browser for ontologies for the agricultural sciences, based on NCBO BioPortal. |
Ontologies for different purposes can also be found in the FAIR cookbook, as well as on this page.
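Most of these registries also offer an API next to their web interface. The sketch below, which assumes network access, the requests package, and the OLS4 search endpoint shown (check the OLS API documentation for the current URL and response layout), retrieves candidate terms for a free-text query:

```python
# Sketch: search the EMBL-EBI Ontology Lookup Service (OLS) for candidate terms.
# Assumptions: network access, the `requests` package, and the OLS4 search
# endpoint below; the exact path and response layout may differ between OLS versions.
import requests

def search_ols(term: str, rows: int = 5):
    """Return (label, ontology, IRI) tuples for the top search hits."""
    response = requests.get(
        "https://www.ebi.ac.uk/ols4/api/search",
        params={"q": term, "rows": rows},
        timeout=30,
    )
    response.raise_for_status()
    docs = response.json().get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("ontology_name"), d.get("iri")) for d in docs]

for label, ontology, iri in search_ols("interleukin-6"):
    print(f"{label}\t{ontology}\t{iri}")
```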
When choosing an ontology, several selection criteria might apply (from FAIR cookbook):
Exclusion criteria:
Absent licence or terms of use (indicator of usability)
Restrictive licences or terms of use with restrictions on redistribution and reuse
Absence of term definitions
Absence of sufficient class metadata (indicator of quality)
Absence of sustainability indicators (absence of funding records)
Inclusion criteria:
Scope and coverage meets the requirements of the concept identified
Unique URI, textual definition and IDs for each term
Resource releases are versioned
Size of resource (indicator of coverage)
Number of classes and subclasses (indicator of depth)
Number of terms having definitions and synonyms (indicator of richness)
Presence of a help desk and contact point (indicator of community support)
Presence of term submission tracker/issue tracker (indicator of resource agility and capability to grow upon request)
Potential integrative nature of the resource (as indicator of translational application potential)
Licensing information available (as indicator of freedom to use)
Use of a top level ontology (as indicator of a resource built for generic use)
Pragmatism (as indicator of actual, current real life practice)
Possibility of collaborating: the resource accepts complaints/remarks that aim to fix or improve the terminology, and the resource organisation commits to fixing or improving the terminology promptly (for example, within one month of receipt)
Finding the right ontology might be time-consuming and require thorough searching and some practice, since the first ontology provided by a search tool might not always be the best fit. It may be difficult to decide which term from which ontology should be used, i.e., to match the detail in domain specific ontologies with the detail that is needed to describe data elements correctly. Terms used in human narrative do not always match directly with the ontological representation of the term.
If the search is unsuccessful, new ontology terms could be defined and added to existing ontologies or new ontologies could be developed. This is however a time-consuming process that should be undertaken with a team of experts from both the domain of the study as well as in consultation with ontology experts.
Step 3: Create a semantic data model from Steps 1 and 2.
Finally, combine the conceptual model and the ontology terms to create the detailed semantic data model. This model distinguishes between the data items (instances and their values) and their types (classes), is an exact representation of the data and exposes the meaning of the data in machine-readable terms.
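As a minimal, hypothetical sketch of what such a model can look like when expressed in RDF (using rdflib; the ontology classes and properties below are placeholders to be replaced by the terms selected in Step 2):

```python
# Sketch: a fragment of a semantic data model instantiated with rdflib.
# The ontology classes and properties below are placeholders; substitute the
# terms selected in Step 2 (e.g. from NCIT, SIO or UO).
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("https://example.org/study/")       # hypothetical instance namespace
ONT = Namespace("https://example.org/ontology/")   # placeholder for real ontology terms

g = Graph()
g.bind("ex", EX)
g.bind("ont", ONT)

patient = EX["patient/001"]
measurement = EX["measurement/001"]

g.add((patient, RDF.type, ONT.Patient))                       # class from an ontology
g.add((measurement, RDF.type, ONT.CytokineMeasurement))       # class from an ontology
g.add((patient, ONT.hasMeasurement, measurement))             # relation from the conceptual model
g.add((measurement, ONT.hasValue, Literal("12.5", datatype=XSD.decimal)))
g.add((measurement, ONT.hasUnit, ONT.PicogramPerMillilitre))  # unit term, e.g. from UO

print(g.serialize(format="turtle"))
```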
Step 4: Check the usability of your model
To check the usability of your model (a reality check), expose the model to actual (meta)data to identify errors and gaps in the model. Correct the model according to these errors and gaps.
Repeat this step until no major errors remain in light of the competency questions.
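One possible way to automate part of this reality check is to express the model's expectations as SHACL shapes and validate the instance data against them, for example with the pySHACL package (an assumption of this sketch; the shapes and data below are minimal invented examples):

```python
# Sketch: checking instance data against model expectations with SHACL.
# Assumes the rdflib and pyshacl packages; shapes and data are invented examples.
from pyshacl import validate
from rdflib import Graph

shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ont: <https://example.org/ontology/> .

ont:MeasurementShape a sh:NodeShape ;
    sh:targetClass ont:CytokineMeasurement ;
    sh:property [ sh:path ont:hasValue ; sh:minCount 1 ] .
"""

data_ttl = """
@prefix ex:  <https://example.org/study/> .
@prefix ont: <https://example.org/ontology/> .

ex:measurement-001 a ont:CytokineMeasurement .
"""

conforms, _, report_text = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)      # False: the example measurement has no ont:hasValue
print(report_text)
```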
[Optional] Step 5: Evaluation of semantic (meta)data models
To verify the semantic model, competency questions (CQs) can be used. CQs are an efficient way of testing models, since they are based on real questions. CQs are evaluated by means of the query used to answer them. In other words, if it is possible to write a query that returns proper answers to the question, then the CQ is validated.
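To make this concrete, the sketch below evaluates one hypothetical competency question ("Which measurements exist for each patient?") as a SPARQL query with rdflib, reusing the placeholder namespaces from the Step 3 sketch:

```python
# Sketch: evaluate a competency question as a SPARQL query with rdflib.
# The graph and namespaces are the placeholder examples from Step 3.
from rdflib import Graph, Namespace, RDF

EX = Namespace("https://example.org/study/")
ONT = Namespace("https://example.org/ontology/")

g = Graph()
g.add((EX["patient/001"], RDF.type, ONT.Patient))
g.add((EX["patient/001"], ONT.hasMeasurement, EX["measurement/001"]))

# CQ: "Which measurements exist for each patient?"
query = """
PREFIX ont: <https://example.org/ontology/>
SELECT ?patient ?measurement
WHERE { ?patient ont:hasMeasurement ?measurement . }
"""

for row in g.query(query):
    print(row.patient, row.measurement)
```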
In the BEAT-COVID project, the ontological models were evaluated using competency questions based on realistic questions posed by data model users. These CQs are proposed as a means to verify the scope (e.g., what is relevant to solve the challenges) and the relationships between concepts (e.g., checking for missing or redundant relationships). A preliminary set of CQs from meetings with domain experts is available on GitHub: beat-covid/fair-data-model/cytokine/competency-questions at master · LUMC-BioSemantics/beat-covid
Expertise requirements for this step
Experts that may need to be involved, as described in Metroline Step: Build the Team, include:
Semantic data modelling specialist: creates a new (meta)data model or applies an existing one, ensures that the semantic representation correctly represents the domain knowledge.
Domain expert: makes sure that the exact meaning of the data is understood by the modeller.
In the BEAT-COVID project, ontological models for data records were developed in collaboration with data collectors, data managers, data analysts and medical doctors [BEAT-COVID].
Practical examples from the community
This section should show the step applied in a real project. Links to demonstrator projects.
References & Further reading
[DCAT] https://www.w3.org/TR/vocab-dcat-3/
[Generic] https://direct.mit.edu/dint/article/2/1-2/56/9988/A-Generic-Workflow-for-the-Data-FAIRification
[GOFAIR_Process] https://www.go-fair.org/fair-principles/fairification-process/
[RDMKit_metadata] https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-vocabularies-or-ontologies
[BEAT-COVID] Applying the FAIR principles to data in a hospital: challenges and opportunities in a pandemic. Journal of Biomedical Semantics.
Tools and resources on this page
Add the tools and resources mentioned on this page. This should be a list of usable content and does not include textual resources such as journal references.
Training
Relevant training will be added in the future if available.
Suggestions
Visit our How to contribute page for information on how to get in touch if you have any suggestions about this page.