Step 4. Conceptualization and semantic modeling

status: in development

Short description

With the inventory of terms and definitions at hand, borrowed as much as possible from widely used ontologies, the next task is conceptualisation: organizing related metadata fields (properties or attributes) into groups (classes) and determining how these classes relate to each other and to the Health-RI metadata schema. The metadata fields, groups and relations can be visualized in a UML diagram.

The next step is to turn this conceptual model into a semantic model: a more formal representation that leverages ontologies (such as DCAT, PROV-O, FOAF) to unambiguously define the concepts (the metadata fields, their grouping and the relations between the groupings). This ensures a shared understanding across systems and enables reasoning, interoperability and automated processing. All concepts are identified by Uniform Resource Identifiers (URIs), and the model is usually expressed in RDF (Resource Description Framework), a W3C recommendation for (meta)data interchange on the web. If a concept has already been modelled and defined within a certain community, we should adopt this existing definition rather than creating a new one. In certain situations, it may be necessary to expand upon the definition of an existing concept.
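
As a rough illustration of what "URIs for all concepts, expressed in RDF" can look like in practice, the Python sketch below maps a few conceptual field names onto existing DCAT and Dublin Core terms and prints the result as N-Triples. The field names on the left-hand side, the example.org dataset URI and the helpers `escape_literal` and `to_ntriples` are assumptions made for this illustration; they are not part of the Health-RI metadata schema.

```python
# Sketch: turn a conceptual record (plain field names) into RDF statements
# by mapping each field onto an existing ontology term identified by a URI.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"

# Conceptual field name -> property URI reused from Dublin Core Terms / DCAT.
# The field names are hypothetical; the URIs are existing terms.
FIELD_TO_PROPERTY = {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
    "keyword": "http://www.w3.org/ns/dcat#keyword",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_ntriples(subject_uri: str, record: dict) -> list:
    """Return one N-Triples line per statement about the dataset."""
    lines = [f"<{subject_uri}> <{RDF_TYPE}> <{DCAT_DATASET}> ."]
    for field, value in record.items():
        prop = FIELD_TO_PROPERTY[field]  # KeyError: field has no mapping yet
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append(f'<{subject_uri}> <{prop}> "{escape_literal(v)}" .')
    return lines


# Hypothetical example record and dataset URI.
example = {
    "title": "Example heart disease cohort",
    "keyword": ["heart disease", "cohort"],
}
triples = to_ntriples("https://example.org/dataset/1", example)
print("\n".join(triples))
```

A field without an entry in `FIELD_TO_PROPERTY` raises a `KeyError`, which makes gaps in the mapping visible early; in a real project a dedicated RDF library and the agreed schema would take the place of this hand-written mapping.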

This step requires substantial expertise in semantic modeling and usually takes many modeling sessions in which domain and modeling experts work intensively together to capture the semantics correctly. It is beyond the scope of this document to treat the (ontology-driven) modeling process comprehensively; our intention is to give some general pointers and considerations.

Besides interoperability and shared understanding, we would like to highlight another aspect of semantic models that is relevant to the search functionality of a catalog: reasoning. If the catalog knows that "heart disease" is a type of "cardiovascular condition", reasoning lets it return datasets about "heart disease" even if you only searched for "cardiovascular conditions". In this way, inferences based on the underlying semantic model can substantially enhance search in a catalog.
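
The kind of inference described here can be sketched in a few lines of Python. The example below is a toy illustration only: the concept hierarchy, the dataset identifiers and the functions `narrower_closure` and `search` are assumptions, and a real catalog would typically delegate this to a reasoner or a query over the ontology.

```python
# Sketch: expand a search term with all narrower concepts, so that a search
# for a broad concept also finds datasets annotated with a narrower one.

# Child concept -> parent concepts ("is a type of"); toy hierarchy.
BROADER = {
    "heart disease": {"cardiovascular condition"},
    "myocardial infarction": {"heart disease"},
    "asthma": {"respiratory condition"},
}

# Dataset id -> concepts it is annotated with (hypothetical catalog content).
DATASETS = {
    "ds-1": {"heart disease"},
    "ds-2": {"myocardial infarction"},
    "ds-3": {"asthma"},
}


def narrower_closure(concept: str) -> set:
    """Return the concept plus everything that is transitively a type of it."""
    result = {concept}
    changed = True
    while changed:  # repeat until no new narrower concept is found
        changed = False
        for child, parents in BROADER.items():
            if child not in result and parents & result:
                result.add(child)
                changed = True
    return result


def search(concept: str) -> list:
    """Return ids of datasets annotated with the concept or a narrower one."""
    wanted = narrower_closure(concept)
    return sorted(ds for ds, tags in DATASETS.items() if tags & wanted)


print(search("cardiovascular condition"))  # ['ds-1', 'ds-2']
```

Without the hierarchy, the same search would return nothing, because no dataset is annotated with "cardiovascular condition" itself.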

Deliverables

| Deliverable | Description |
| --- | --- |
| Semantic model | UML diagram with defined classes, properties, namespaces, and relations (including their types and cardinalities). |
| List of metadata field definitions used in the model | A detailed list of metadata elements including definitions, attributes, and relationships. |
| Modeling decision log | A documented record of decisions made throughout the development of the model. |

How

1. Work with one or more example datasets

Take a dataset from your domain as an example for the metadata modeling. Together with the scope statement (step 2), this helps to stay focused while modeling. It also helps modeling experts, who might lack domain expertise, to better understand the dataset and its context.

2. Organize modeling sessions

Arriving at a semantic model requires thorough, time-consuming discussions to reach a shared understanding of the meaning (semantics) of the concepts and their context. Points to consider for such modeling sessions:

  • Make sure you have both domain experts and modeling experts at the table

  • Provide everyone with (access to) the relevant prior information (e.g. the results from previous steps like the inventory list, scope statement, example dataset)

  • Have a whiteboard (either a physical one or a virtual one like draw.io or MS Whiteboard) present for quickly sketching diagrams and relations

  • Take enough time. Usually one hour is too short. Plan several sessions ahead, ideally without too much time in between

  • Keep notes, record the meeting (if online/hybrid) and log decisions

  • Think about work formats that suit the group size, borrowing from design techniques such as solution sketching: divide the group into subgroups or individuals who work independently, or, instead of discussing sketches directly in the whole group, have participants individually write down questions and ideas per sketch.

3. Create a conceptual model

Create a high-level conceptual model (preferably in the form of a UML diagram) that represents the domain’s key concepts and relationships. Take the Health-RI core/health metadata schema as a basis and

Tooling: draw.io, Visual Paradigm, Miro, Lucidchart, Astah, Excalidraw

4. Create a semantic model

Describe

Tooling: (Web)Protégé, Metaphactory, TopBraid EDG, PoolParty

Considerations

The Generic FAIRification process article notes that making optimal choices demands good searching skills and experience. For instance, it is generally insufficient to simply pick the first ontology in the list returned by an ontology search tool. One should also check the license, usage statistics, update activity, whether the ontology has a good class and property structure (which generally facilitates data integration), and whether it follows a general ontological framework (such as OBO Foundry [15]). Even then, it can be very difficult to decide which term from which ontology to use, i.e., to match the level of detail in domain-specific ontologies with the detail needed to describe data elements correctly. Terms used in human narrative do not always map directly onto their ontological representation. If the search is unsuccessful, new terms could be defined and added to existing ontologies, or new ontologies could be developed. This is, however, a time-consuming process that should be undertaken by a team of domain experts in consultation with ontology experts.

For selection criteria, see https://faircookbook.elixir-europe.org/content/recipes/interoperability/selecting-ontologies.html#selecting-terminologies (use within the domain, DCAT compatibility, licensing, mappings to other ontologies, maintenance, logical consistency and reasoning support, etc.).

 

Model extension

Within DCAT and DCAT-AP, the term "resource" generally encompasses all objects that can be described using RDF. However, there are specific categories and attributes used to indicate the different types of resources:

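  • dcat:Dataset is a type of dcat:Resource representing a collection of data.

  • dcat:Distribution represents an available form or representation of a dataset. Note that in the DCAT specification it is a class of its own, linked from a dataset via dcat:distribution, and not a subclass of dcat:Resource.

  • dcat:Catalog is a type of dcat:Resource (more specifically, of dcat:Dataset) representing a curated collection of metadata about datasets and other resources.

  • dcat:DataService, introduced in DCAT version 2, is a type of dcat:Resource representing a service that provides access to data.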

In DCAT and DCAT-AP, the vocabulary is focused on datasets. Nonetheless, users may need to describe a variety of domain-specific resources, such as biobanks or patient registries. For such cases, we propose the following options for extending DCAT to accurately represent your resource type:

  • Use dcat:Resource directly: If the asset you are dealing with does not fit the dcat:Dataset definition, you can use the broader class dcat:Resource, which can represent almost any type of asset. However, this approach may leave users unsure about the nature of the asset. The asset type can be defined further with specific vocabularies over time.

  • Extend with custom classes: If there is a need to represent specific resources, such as biobanks or patient registries, it may be beneficial to supplement the core DCAT vocabulary with custom classes. For example:

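@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix :     <https://example.org/ns#> .   # placeholder namespace for the custom classes

:Collection a rdfs:Class ;
    rdfs:subClassOf dcat:Resource .

and

:PatientRegistry a rdfs:Class ;
    rdfs:subClassOf dcat:Dataset .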

When creating custom classes, it is essential to provide detailed metadata for each type of resource, so that users and systems can distinguish between them and understand their subtle differences, for instance the distinction between a collection and a dataset. Specific and unambiguous definitions are therefore crucial.
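
One way to make sure that no custom class is published without this documentation is to generate the class definitions from a small structure that requires a label and a definition. The Python sketch below is only an illustration under stated assumptions: the `CustomClass` structure, the `to_turtle` helper and the placeholder definition text are hypothetical, and the generated fragment assumes that the `rdfs:`, `dcat:` and `:` prefixes are declared elsewhere in the Turtle file.

```python
# Sketch: render a custom class as Turtle and refuse classes that lack
# a human-readable label or a definition.

from dataclasses import dataclass


@dataclass
class CustomClass:
    name: str        # local name in the custom ":" namespace
    parent: str      # prefixed DCAT class that is being specialised
    label: str       # human-readable name
    definition: str  # what sets this class apart from its parent


def quote(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_turtle(cls: CustomClass) -> str:
    """Render the class as a Turtle fragment; reject undocumented classes."""
    if not cls.label.strip() or not cls.definition.strip():
        raise ValueError(f"{cls.name}: label and definition are required")
    return (
        f":{cls.name} a rdfs:Class ;\n"
        f"    rdfs:subClassOf {cls.parent} ;\n"
        f'    rdfs:label "{quote(cls.label)}"@en ;\n'
        f'    rdfs:comment "{quote(cls.definition)}"@en .'
    )


registry = CustomClass(
    name="PatientRegistry",
    parent="dcat:Dataset",
    label="Patient registry",
    # Placeholder text; the real definition must come from the modeling sessions.
    definition="Placeholder definition, to be agreed with domain experts.",
)
print(to_turtle(registry))
```

The definition text is where the difference between, for example, a collection and a dataset should be spelled out, ideally using the wording recorded in the modeling decision log.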

HRI hub involvement in this step

In this step Health-RI should be consulted.

Further reading