Metroline Step: Analyse data semantics
status: in development
Short description
Understanding and clearly defining the meaning (semantics) of (meta)data is an important preparation for creating the semantic model, as well as for data collection via, for example, electronic case report forms (eCRFs). In this step, the aim is to ensure you gain a clear and unambiguous understanding of the (meta)data. The step provides guidance for both existing data and data that is to be collected.
To illustrate the issue, consider the example where you receive a dataset with a variable called “date”. Without clearly defined semantics, it is unclear whether this means “date of data collection”, “admission date”, “date of birth”, or something else. This must be resolved before you can start with the semantic (meta)data model. In the How to section of this page we provide instructions to achieve this.
Thus, the outcome of this step is a set of data elements (variables) with clear and unambiguous semantics, known as a codebook. For metadata, the outcome is a […?]. Note that finding machine-actionable items from ontologies for the data elements is not yet part of this step, but is described in Create or reuse a semantic (meta)data model.
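To make the notion of a codebook concrete, here is a minimal sketch of one represented as a data structure. The variable names, definitions and value ranges are hypothetical, chosen only to mirror the "date" and "sex" examples above:

```python
# A minimal, hypothetical codebook: each variable gets an unambiguous
# definition and an explicit value range.
codebook = {
    "date": {
        "definition": "Date of data collection, ISO 8601 (YYYY-MM-DD)",
        "value_range": "date",
    },
    "sex": {
        "definition": "Biological sex at birth",
        "value_range": ["male", "female"],
    },
}

def describe(variable: str) -> str:
    """Return the unambiguous definition recorded for a variable."""
    return codebook[variable]["definition"]

print(describe("date"))  # Date of data collection, ISO 8601 (YYYY-MM-DD)
```

A codebook like this resolves the "date" ambiguity described above: anyone receiving the dataset can look up exactly what each variable means.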
Why is this step important
Several of the Metroline steps that follow rely on being familiar with your data. For example, in order to create or reuse your semantic (meta)data model, it is crucial to understand the meaning and relationships of variables.
While performing this step, keep your FAIRification goals in mind. If you have a clear idea of your FAIRification goals, it might be easier to define what elements should be present in your (meta)data and how these elements should be represented.
For example, in a dataset a variable used to collect sex-related data might be called ‘sex’. If the semantics of such a variable are not provided or not analysed, it would be unclear whether it means ‘biological sex at birth’, ‘phenotypic sex’ or ‘gender’. These issues have to be solved before you start with the semantic (meta)data model.
How to
This “How to” describes five steps to make your (meta)data clear and unambiguous. Since projects vary in their levels of (meta)data semantics, we created two diagrams to help you navigate which steps are relevant for you:
Use the first (green) diagram for analysing data semantics.
Use the second (yellow) diagram for analysing metadata semantics.
By completing these steps, you will end up with clear and unambiguous semantics for your (meta)data.
To ensure this, it is essential to analyse various aspects of the data elements and variables involved, such as:
The definition/description of data elements. For example, a variable called “sex” could refer to “Biological sex” or “Administrative gender”.
Value ranges for data elements. For example, in system A, sex allows for male and female, while in system B, sex also allows for intersex.
Relationship between data elements. For example, the “sex” variable is one attribute of “patient”, which may imply that the semantics of this “sex” variable is “sex of patient”.
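The value-range aspect can be checked programmatically. The sketch below, with hypothetical systems A and B, shows how comparing the allowed value sets of the same variable in two systems surfaces a mismatch that must be resolved:

```python
# Hypothetical value ranges for the same "sex" variable in two systems.
system_a = {"sex": {"male", "female"}}
system_b = {"sex": {"male", "female", "intersex"}}

# Values allowed in system B but not in system A must be resolved
# (e.g. by agreeing on a shared value range) before the data can be combined.
extra_in_b = system_b["sex"] - system_a["sex"]
print(extra_in_b)  # {'intersex'}
```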
To see which steps are relevant for your (meta)data, please follow the diagrams below.
For analysing data semantics:
For analysing metadata semantics:
For easier understanding, we will follow the example dataset containing patient information with the following metadata:
| Metadata Field | Value |
|---|---|
| Dataset Name | Health Data |
| Date of Upload | 01/02/2023 |
| Keywords | BP, HR, Conditions |
| Creator | Dr. Smith |
| Description | Patient health data including BP and HR |
| Format | CSV |
| Source | Hospital A |
| Rights | Open |
In this example, we are working with existing metadata. According to the diagram, we should start with Step 1 - Compile information.
Step 1 - Compile information
Compile all the information about data elements, data values and data structure. Examine the data in the way it is currently stored, including its format (e.g. JSON, CSV) and how the information is organised within it. This step helps to identify inconsistencies, ambiguities and errors in the data.
a) For existing (meta)data: Locate all relevant sources in which the (meta)data is stored. Compile information about the following:
Which variables are present in the (meta)data?
For example, check the eCRFs used to collect the data.
What are the value ranges for each variable, i.e. what type of values can each variable take?
For example, Dataset Name has a range of Text (e.g. “Health Data”).
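For tabular data, a first pass at compiling this information can be automated. The sketch below uses Python's standard `csv` module on an inline sample (the file contents and variable names are hypothetical) to list which variables are present and which values each one actually takes:

```python
import csv
import io

# In practice you would open the real file; here we use an inline sample.
sample = io.StringIO(
    "patient_id,sex,age\n"
    "P1,male,34\n"
    "P2,female,28\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

# Which variables are present in the data?
variables = reader.fieldnames

# Which values does each variable actually take?
observed = {v: sorted({row[v] for row in rows}) for v in variables}
print(variables)        # ['patient_id', 'sex', 'age']
print(observed["sex"])  # ['female', 'male']
```

The observed values are a starting point only: they show what is in the data, not the intended value range, which still needs to be confirmed with the data owner.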
In our example, we are working on FAIRifying the already existing metadata of a dataset. Let’s compile and examine the information we have.
| Value | Metadata field / Variable | Description of the field | Value range |
|---|---|---|---|
| Health Data | Dataset Name | The name of the dataset. | Text |
| 01/02/2023 | Date of Upload | The date on which the dataset was uploaded. | Date values in the format MM/DD/YYYY |
| BP, HR, Conditions | Keywords | Terms that describe the main topics of the dataset. | Text |
| Dr. Smith | Creator | The person or organisation that created the dataset. | Text, in our example title and last name |
| Patient health data including BP and HR | Description | A brief summary of the dataset. | Text |
| CSV | Format | The file format of the dataset. | Text, in our example a short string indicating the file format |
| Hospital A | Source | The origin of the dataset. | Text, in our case the name of the institution |
| Open | Rights | The usage rights or licence of the dataset. | Text |
We now have a compiled table of information for our example metadata. Looking at the flowchart on the right, we see that we can now proceed to Step 3: Check data semantics.
b) In case you are aiming to collect FAIR (meta)data from the start:
Start by identifying the data elements/variables you plan to collect. Defining these elements early ensures that your dataset is structured, standardised and aligned with the FAIR principles. To guide this process, consider using Competency Questions (CQs) - questions that help clarify which data is necessary and how it will be used.
Here are some resources you can refer to in this step:
Creating data/variable dictionary - helps define variable names, formats and descriptions for your data.
Creating a metadata profile - describes how to create a minimal set of metadata with the help of competency questions.
Once you have identified your data elements, determine the expected value range for each variable where possible. This helps maintain data quality and consistency. Examples include:
‘Biological sex at birth’ → male, female
‘Age’ → 0–110
Clearly defining these aspects from the start supports data integrity, interoperability, and future reuse.
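Once expected value ranges are defined, collected values can be validated against them automatically. A minimal sketch, with hypothetical variable names based on the examples above:

```python
# Hypothetical expected value ranges, defined before collection starts.
value_ranges = {
    "biological_sex_at_birth": {"male", "female"},
    "age": range(0, 111),  # 0-110 inclusive, expressed in years
}

def is_valid(variable: str, value) -> bool:
    """Check a collected value against its declared value range."""
    return value in value_ranges[variable]

print(is_valid("biological_sex_at_birth", "female"))  # True
print(is_valid("age", 120))                           # False
```

Running such checks at collection time catches out-of-range values before they enter the dataset.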
Existing data (1a)
Existing metadata (1a)
New data (1b)
New metadata (1b)
Step 2 - Check for an existing standard/codebook
a) For existing (meta)data: check if it comes with a codebook or metadata standard. If it does and the codebook is clear, you can use it for your (meta)data and this step is done.
If the codebook is not helpful, contact the owner of the data to clarify the semantics, so that you don’t misinterpret the data. If additional work is still needed to make the data clearer, follow the steps below.
b) For new (meta)data: Before defining your own, check if there is a codebook or metadata standard you can use. Using established standards helps to make your (meta)data more interoperable and reusable.
You can find relevant standards and codebooks by:
Consulting domain experts (for data) or FAIR data stewards/semantic experts (for metadata)
Exploring community-driven repositories
Some of the resources you could explore are listed below:
For clinical and biomedical data:
For metadata standards:
FAIRsharing - A curated registry of metadata standards, repositories, and data policies across scientific domains.
RDA Metadata Standards Catalog - A collection of metadata standards maintained by the Research Data Alliance (RDA).
Dublin Core Metadata Initiative - A widely adopted metadata schema for dataset descriptions.
DCAT - A W3C standard for structuring datasets in catalogs.
ISA Model - A metadata framework for describing life science experiments.
Next steps:
If you find a fully compatible codebook or metadata standard, apply it directly.
If you find a partial match, use the relevant elements and document any modifications for missing components.
If no suitable standard exists, proceed to Step 3 - Check data semantics, where you will define and structure metadata elements to ensure consistency and interoperability.
Health-RI, together with domain representatives, will be aiming to develop domain-specific national data standards in the future.
You can find more about metadata standards and ontologies at the following link: Links and Additional reading - How to FAIR
Step 3 - Check (meta)data semantics
Check the data semantics. Is the meaning of the data elements clear and unambiguous? For data elements with ambiguous meaning, try to improve their definition. For this, it might help to look into the intended value range of a variable - is the exact range known and is it clear enough?
In the example of collecting data on a patient’s 'sex', it might be unclear if it means ‘biological sex at birth' or ‘gender’. In another example, 'age' of a subject can be expressed in years, but in some cases, such as studies with small children, could also be expressed in months. It should therefore be clearly stated if the value range for age should be expressed in years or months.
The table below shows the issues with our current metadata, together with suggested improvements to make their meaning clearer.
This recipe in the FAIR cookbook gives some additional guidance on specifying the semantics of elements of your data.
| Metadata Field | Original Value | Issue | New variable description | New Value Range | New Value |
|---|---|---|---|---|---|
| Dataset Name | Health Data | Generic and not descriptive. | The name of the dataset. | Text | Patient Health Records 2023 |
| Date of Upload | 01/02/2023 | Ambiguous format. | Date when the dataset was uploaded, in ISO 8601 format (YYYY-MM-DD). | Date in ISO 8601 format (YYYY-MM-DD) | 2023-01-02 |
| Keywords | BP, HR, Conditions | Abbreviations used without context. | Keywords describing the main topics covered by the dataset. | Text | Blood Pressure, Heart Rate, Hypertension |
| Creator | Dr. Smith | Generic name without additional identifying information. | Full name and affiliation of the dataset creator, as well as ORCID. | Text and ORCID identifier | Dr. John Smith, Hospital A, ORCID: 0001-0002-3456-7890 |
| Description | Patient health data including BP and HR | Lacks detail. | Extended description providing context and details about the dataset. | Text | Detailed patient health records including measurements of blood pressure (BP) and heart rate (HR), along with diagnosed medical conditions and prescribed medications. |
| Format | CSV | Broad category, can be more detailed. | Data format and version. | Text | CSV, version 1.0 |
| Source | Hospital A | Lacks detail, too generic. | Specific department and institution where the data was sourced. | Text and ROR identifier | Hospital A, Department of Cardiology, https://ror.org/example |
| Rights | Open | Too broad. | Licensing terms specifying the rights for data usage. | URL to CC License | |
Step 4 - Check relationships
Compile information about the relationships between (meta)data elements. For example, if the dataset is in a relational database, the relational schema provides information about the dataset structure, the types involved (the field names), cardinality, etc.
For example, the variable 'biological sex at birth' in a dataset is an attribute of ‘patient’.
In our metadata example above, the Creator (Dr. John Smith) is an employee of the Source (Hospital A, Department of Cardiology) of the dataset.
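Such relationships can also be made explicit in a structured representation. A minimal sketch, with hypothetical class names, modelling the "employee of" relationship between Creator and Source from the example above:

```python
from dataclasses import dataclass

# Hypothetical classes making the relationship between metadata
# elements explicit: the dataset's Creator is affiliated with its Source.
@dataclass
class Organisation:
    name: str
    department: str

@dataclass
class Person:
    name: str
    affiliation: Organisation  # the "employee of" relationship

source = Organisation("Hospital A", "Department of Cardiology")
creator = Person("Dr. John Smith", affiliation=source)

print(creator.affiliation.name)  # Hospital A
```

Capturing the relationship in the structure itself, rather than in free text, prepares the ground for the semantic (meta)data model in the next Metroline step.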
Step 5 - Check for FAIR features
In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).
In our example above, the ORCID of the creator is a unique persistent identifier (F1) for a person.
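A simple automated check can flag whether an identifier at least looks like an ORCID iD. The sketch below checks only the format (four groups of four characters, where the final character may be 'X'), not the identifier's checksum or its existence in the ORCID registry:

```python
import re

# Format check (not a checksum or registry lookup) for ORCID iDs.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def looks_like_orcid(identifier: str) -> bool:
    """Return True if the string matches the ORCID iD format."""
    return ORCID_PATTERN.match(identifier) is not None

print(looks_like_orcid("0001-0002-3456-7890"))  # True
print(looks_like_orcid("Dr. Smith"))            # False
```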
After having performed the relevant parts of this Metroline step, proceed to the next: Metroline Step: Create or reuse a semantic (meta)data model
Expertise requirements for this step
Below are experts that may need to be involved, as described in Metroline Step: Build the Team.
Semantic data / modelling specialists: can help with understanding the data’s structure and ambiguity.
Domain expert: can help with understanding the data’s elements and their potential ambiguity.
FAIR data steward: can help with identifying FAIR features in the dataset, or in the data to be collected, as well as ambiguity in the data elements.
Practical examples from the community
Examples of how this step is applied in a project (link to demonstrator projects).
Training
Relevant training will be added soon.
Suggestions
This page is under construction. Learn more about the contributors here and explore the development process here. If you have any suggestions, visit our How to contribute page to get in touch.