STATUS: IN DEVELOPMENT
Short Description
The semantic model for a dataset describes the meaning of entities (data elements) and their relations in the dataset accurately, unambiguously, and in a computer-actionable way [GOFAIR_Process]. This model can then be applied to the non-FAIR data to transform it into linkable data, which can be queried. Given that generating a semantic model is often the most time-consuming part of the FAIRification process, it is important to first check whether a semantic model is already available for reuse. Creating such a model from scratch requires domain expertise on the dataset and expertise in semantic modeling.
For metadata, semantic models describing generic items are available. For example, DCAT can be used to describe a data set [Generic].
Why is this step important
Semantic modelling makes your data and metadata machine-actionable, which enables secondary use of your data. After this step, your data is represented as FAIR digital objects (FDOs). FDOs are digital objects identified by a Globally Unique, Persistent and Resolvable IDentifier (GUPRID) and described by metadata. This enables the transformed FAIR data set to be efficiently incorporated in other systems, analysis workflows, and unforeseen future applications.
Expertise requirements for this step
Experts that may need to be involved, as described in Metroline Step: Build the Team, include:
Semantic data modelling specialist: creates a new (meta)data model or applies an existing one, and ensures that the semantic representation correctly represents the domain knowledge.
Domain expert: makes sure that the exact meaning of the data is understood by the modeler.
[BEAT-COVID project]
We developed ontological models for data record in collaboration with data collectors, data managers, data analysts and medical doctors.
How to
(I) Reusing a semantic (meta)data model
Given that generating a semantic model is often the most time-consuming part of the FAIRification process, it is highly recommended to first check whether a semantic model is already available for reuse.
If you would like to include your dataset in the National Health Data Catalogue, your metadata needs to use Health-RI’s Core Metadata Schema. For more information about the schema and how to apply it, please refer to section 2, Metadata mapping, in the documentation.
For metadata, semantic models describing generic items are available for reuse, e.g., DCAT for describing a data set. Domain-specific items should be decided on by each self-identified domain and then described in a semantic metadata model [Generic].
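As a minimal sketch of what reusing DCAT for a data set description can look like, the snippet below builds a JSON-LD description with the Python standard library. The data set IRI, title, description, and keywords are made-up placeholders; only the DCAT and Dublin Core Terms namespaces and term names are taken from the published vocabularies. A real catalogue will usually require more properties than shown here.

```python
import json

# Real vocabulary namespaces; everything under example.org is a placeholder.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(iri, title, description, keywords):
    """Return a JSON-LD dictionary describing one data set with DCAT terms."""
    return {
        "@context": CONTEXT,
        "@id": iri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
    }


# Hypothetical data set, for illustration only.
dataset = describe_dataset(
    "http://example.org/dataset/1",
    "Example cohort",
    "Hypothetical data set used for illustration.",
    ["example", "cohort"],
)
print(json.dumps(dataset, indent=2))
```

Because the description is plain JSON-LD, it can be loaded by any RDF library that supports that format.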
Some examples of existing metadata models:
The BEAT-COVID metadata model allows the metadata of BEAT-COVID data resources to be stored and managed by FAIR Data Points.
Some examples of existing data models:
EJP RD has defined a semantic model (CARE-SM) for its common data elements (CDEs). In this model, the CDEs are tagged with ontology terms, and the relationships between the data elements are also defined by ontologies.
The OMOP Common Data Model is another data model that provides standardized vocabularies.
(II) Creating a semantic (meta)data model
Building a semantic data model can be broken down into four steps:
Step 1: Create a conceptual model
Start by creating an abstract conceptual model:
List the main concepts (classes) of the data elements to be FAIRified.
Identify the relationships between the data elements.
It is important that both the data representation (format) and the meaning of the data elements (the data semantics) are clear and unambiguous (see Analyse data semantics).
To help you understand what you would like to include in your model, you can start by creating a list of questions (competency questions). These can serve as a guide to identify the most relevant (meta)data elements to model.
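One lightweight way to write down such a conceptual model, before any ontology terms are chosen, is sketched below. The concept names, relationships, and the competency question are invented for illustration and do not come from a real project; the `ConceptualModel` class is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptualModel:
    """Abstract model: concepts (classes) plus named relationships between them."""

    concepts: set = field(default_factory=set)
    relationships: list = field(default_factory=list)  # (subject, relation, object)
    competency_questions: list = field(default_factory=list)

    def relate(self, subject, relation, obj):
        # Adding a relationship also registers both concepts.
        self.concepts.update({subject, obj})
        self.relationships.append((subject, relation, obj))


# Hypothetical example content.
model = ConceptualModel()
model.relate("Patient", "has measurement", "Measurement")
model.relate("Measurement", "has unit", "Unit")
model.competency_questions.append(
    "Which measurements were taken for a given patient?"
)

print(sorted(model.concepts))
for subject, relation, obj in model.relationships:
    print(f"{subject} --{relation}--> {obj}")
```

Keeping the competency questions next to the concepts makes it easier to check later whether every question can be answered with the concepts and relationships listed.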
Step 2: Search for ontology terms
Next, the concepts and relations between the data elements in the data set are replaced with machine-readable classes and properties from ontologies, vocabularies and thesauri. An ontology is a formal representation of domain knowledge in which concepts are organized hierarchically; of these resource types, ontologies generally serve the FAIRification process best. More information about ontologies can be found in the FAIR Cookbook or RDMkit.
Ontologies, and the concepts and properties that they describe, can be found using search engines, such as:
Search engine | Short description |
---|---|
OLS | The Ontology Lookup Service (OLS, by EMBL-EBI) is a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions. You can browse the ontologies through the website as well as programmatically via the OLS API. |
BioPortal | BioPortal is a repository of biomedical ontologies. |
BARTOC | The Basic Register of Thesauri, Ontologies & Classifications (BARTOC) is a database of Knowledge Organization Systems (KOS) and KOS-related registries with the goal of listing as many Knowledge Organization Systems as possible in one place. |
Ontobee | Ontobee is a web-based linked data server and browser specifically designed for ontology terms. It supports ontology visualization, query, and development, and provides a web interface for displaying the details and hierarchy of a specific ontology term. |
Ontologies for different purposes can also be found in the FAIR Cookbook, as well as on this page.
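Searching for terms programmatically, as the OLS API allows, can be sketched as follows. The endpoint URL, the query parameters (`q`, `rows`, `ontology`), and the layout of the response (`response` / `docs` with `label`, `iri`, and `ontology_name`) are assumptions based on the OLS search service at the time of writing; please check them against the current OLS API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the OLS search endpoint; verify before use.
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(term, ontology=None, rows=5):
    """Build a search URL for a free-text term, optionally limited to one ontology."""
    params = {"q": term, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return OLS_SEARCH + "?" + urlencode(params)


def extract_hits(payload):
    """Pull (label, IRI, ontology) out of a response, assuming a Solr-style layout."""
    docs = payload.get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("iri"), d.get("ontology_name")) for d in docs]


def search(term, **kwargs):
    """Send the request and return the hits; this needs network access."""
    with urlopen(build_search_url(term, **kwargs), timeout=30) as response:
        return extract_hits(json.load(response))


print(build_search_url("body weight", ontology="ncit"))
```

Calling `search("body weight")` returns candidate terms from several ontologies, which then still need to be judged by a person against the selection criteria.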
When choosing an ontology, several selection criteria might apply (see also FAIR cookbook):
Update activity: Is it well maintained, i.e., are there frequent releases, are term requests handled, and are versioning and deprecation policies clear?
Documentation: Is it well documented? There should be enough metadata for each class in the artefact and enough metadata about the artefact itself.
Usability license: What license and terms of use does it mandate?
Structure: Does the ontology contain a good class and property structure? (This generally facilitates data integration.)
Format: What format does it come in?
Identifiers: Are there stable, persistent, resolvable identifiers for all terms?
Usage statistics: Who uses it and what resources are being annotated with it?
Framework: Is a general ontological framework used (such as OBO Foundry)?
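When several candidate ontologies are compared, the criteria above can be recorded as a simple checklist. The sketch below counts how many criteria each candidate meets; the criterion keys, the candidate names, and the equal weighting of all criteria are assumptions made for illustration, and a real selection will need expert judgement rather than a bare count.

```python
# One key per selection criterion listed above (names chosen for this sketch).
CRITERIA = [
    "well_maintained",
    "well_documented",
    "license_suitable",
    "good_structure",
    "suitable_format",
    "persistent_ids",
    "widely_used",
    "uses_framework",
]


def rank_candidates(assessments):
    """Sort candidates by number of criteria met (descending), ties by name."""
    scores = {
        name: sum(bool(answers.get(criterion)) for criterion in CRITERIA)
        for name, answers in assessments.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical assessments filled in by the team; unanswered criteria count as not met.
assessments = {
    "ontology A": {"well_maintained": True, "well_documented": True, "persistent_ids": True},
    "ontology B": {"well_maintained": True},
}
ranking = rank_candidates(assessments)
print(ranking)
```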
Finding the right ontology might be time-consuming and require thorough searching and some practice, since the first ontology provided by a search tool might not always be the best fit. It may be difficult to decide which term from which ontology should be used, i.e., to match the detail in domain specific ontologies with the detail that is needed to describe data elements correctly. Terms used in human narrative do not always match directly with the ontological representation of the term.
If the search is unsuccessful, new ontology terms could be defined and added to existing ontologies, or new ontologies could be developed. This is, however, a time-consuming process that should be undertaken by a team of experts from the domain of the study, in consultation with ontology experts.
Step 3: Create a semantic data model from Steps 1 and 2
Finally, combine the conceptual model and the ontology terms to create the detailed semantic data model. This model distinguishes between the data items (instances and their values) and their types (classes), is an exact representation of the data and exposes the meaning of the data in machine-readable terms.
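The combination of conceptual model and ontology terms can be sketched as a mapping that turns one data record into RDF statements, here written as N-Triples with the Python standard library. All IRIs under `http://example.org/` are placeholders that stand in for real ontology terms found in Step 2; the record layout (`patient_id`, `weight_kg`) is hypothetical. Only the `rdf:type` and `xsd:decimal` IRIs are real.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
EX = "http://example.org/"  # placeholder namespace

# Conceptual-model names mapped to term IRIs (placeholders for real ontology terms).
TERMS = {
    "Patient": EX + "term/Patient",
    "Measurement": EX + "term/Measurement",
    "has measurement": EX + "term/hasMeasurement",
    "has value": EX + "term/hasValue",
}


def iri(value):
    return f"<{value}>"


def literal(value, datatype):
    return f'"{value}"^^<{datatype}>'


def record_to_ntriples(record):
    """Express one record as instances, their types (classes), and their values."""
    patient = EX + "patient/" + record["patient_id"]
    measurement = EX + "measurement/" + record["patient_id"] + "-weight"
    triples = [
        (iri(patient), iri(RDF_TYPE), iri(TERMS["Patient"])),
        (iri(measurement), iri(RDF_TYPE), iri(TERMS["Measurement"])),
        (iri(patient), iri(TERMS["has measurement"]), iri(measurement)),
        (iri(measurement), iri(TERMS["has value"]),
         literal(record["weight_kg"], XSD_DECIMAL)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


lines = record_to_ntriples({"patient_id": "p001", "weight_kg": "72.5"})
print("\n".join(lines))
```

The first two statements give the types (classes) of the instances, while the last two carry the actual data; this is the distinction between data items and their types that the semantic model makes explicit.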
Step 4: Check the usability of your model
To check the usability of your model (a reality check), expose the model to actual (meta)data to identify errors and gaps in the model. Correct the model according to these errors and gaps.
Repeat this step until no major errors remain in light of the competency questions.
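A very simple form of such a reality check is sketched below: actual records are run against the fields the model knows about, and both gaps (fields in the data that the model does not cover) and errors (values that do not fit the expected type) are reported. The field names, the expected types, and the sample records are hypothetical.

```python
def check_records(records, model_fields):
    """Return (gaps, errors) found when exposing the model to actual records."""
    gaps = set()   # fields present in the data but absent from the model
    errors = []    # (row index, field, value) that do not fit the expected type
    for index, record in enumerate(records):
        for field_name, value in record.items():
            expected_type = model_fields.get(field_name)
            if expected_type is None:
                gaps.add(field_name)
                continue
            try:
                expected_type(value)  # conversion fails if the value does not fit
            except (TypeError, ValueError):
                errors.append((index, field_name, value))
    return gaps, errors


# Hypothetical model and data.
model_fields = {"patient_id": str, "weight_kg": float}
records = [
    {"patient_id": "p001", "weight_kg": "72.5"},
    {"patient_id": "p002", "weight_kg": "unknown", "smoker": "yes"},
]
gaps, errors = check_records(records, model_fields)
print("Gaps:", gaps)
print("Errors:", errors)
```

Here the check would suggest two corrections to the model: adding a concept for smoking status, and deciding how missing values such as "unknown" should be represented.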
[Optional] Evaluation of semantic (meta)data models
To verify the semantic model, competency questions (CQs) can be used. CQs are an efficient way of testing models, since they are based on real questions. A CQ is evaluated by means of the query used to answer it: if it is possible to write a query that returns proper answers to the question, the CQ is validated.
In the BEAT-COVID project, the ontological models were evaluated using competency questions based on realistic questions posed by users of the data model. These CQs serve to verify the scope of the model (e.g., what is relevant to solve the challenges) and the relationships between concepts (e.g., to check for missing or redundant relationships). A preliminary set of CQs from meetings with domain experts is available on GitHub: https://github.com/LUMC-BioSemantics/beat-covid/tree/master/fair-data-model/cytokine/competency-questions
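The idea that a CQ is validated when a query can answer it can be illustrated with a toy example. In practice the query would typically be written in a query language such as SPARQL and run against the transformed data; the sketch below instead uses a small in-memory pattern matcher as a stand-in, and the triples, property names, and competency question are hypothetical.

```python
def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Hypothetical data expressed with the model's relationships.
triples = [
    ("patient/p001", "hasMeasurement", "measurement/m1"),
    ("measurement/m1", "hasValue", "72.5"),
    ("patient/p002", "hasMeasurement", "measurement/m2"),
]


def cq_measurement_values(triples, patient):
    """CQ: 'Which measurement values were recorded for a given patient?'"""
    values = []
    for _, _, measurement in query(triples, subject=patient, predicate="hasMeasurement"):
        values += [o for _, _, o in query(triples, subject=measurement, predicate="hasValue")]
    return values


answers = cq_measurement_values(triples, "patient/p001")
print(answers)
print(cq_measurement_values(triples, "patient/p002"))
```

The question can be answered for the first patient, while the empty result for the second patient points to a measurement without a value, which is the kind of gap a CQ is meant to reveal.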
Practical Examples from the Community
This section should show the step applied in a real project. Links to demonstrator projects.
References & Further reading
[DCAT] https://www.w3.org/TR/vocab-dcat-3/
[Generic] https://direct.mit.edu/dint/article/2/1-2/56/9988/A-Generic-Workflow-for-the-Data-FAIRification
[GOFAIR_Process] https://www.go-fair.org/fair-principles/fairification-process/
[RDMKit_metadata] https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-vocabularies-or-ontologies
[BEAT-COVID project] https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-022-00263-7
Authors / Contributors
Experts whom you can contact for further information