STATUS: IN DEVELOPMENT
Short description
‘… selecting a relevant subset of the data and defining driving user question(s) are highly relying on being familiar with the data’ (Generic)
In this step, the aim is to gain more insight into the existing data, or the data that you aim to collect. Clearly defining the meaning (semantics) of the data is an important step for creating the semantic model, as well as for data collection via, for example, electronic case report forms (eCRFs).
To understand 'semantics', different aspects of the data elements/variables should be analysed:
the definition/description of data elements
the values that are allowed to be chosen (e.g., in system A, ‘sex’ allows for male and female, while in system B, ‘sex’ also allows for intersex; such a difference reflects a gap between their semantics)
the relationship between data elements (e.g., the ‘sex’ variable is an attribute of the ‘patient’ profile, which may imply that the semantics of this ‘sex’ variable is ‘sex of the patient’)
The outcome of this step should be a set of data elements (variables) with clear and unambiguous semantics (a codebook), which reflect the information you want to collect or share. Be aware that finding machine-actionable items from ontologies for the data elements is not yet part of this step, but is described in Create or reuse a semantic (meta)data model.
Why is this step important
Several of the Metroline steps that follow rely on being familiar with your data. For example, in order to create or reuse a semantic (meta)data model, you need to know which data elements you have, what they mean and which values they can take.
While performing this step, keep in mind that the meaning of your data should also be clear to people who were not involved in collecting it.
For example, in a dataset a variable to collect sex-related data might be called ‘sex’. If the semantics of such a variable are not provided or analysed, it would be unclear whether it means ‘biological sex at birth’, ‘phenotypic sex’ or ‘gender’. These issues have to be resolved before you start with the semantic (meta)data model.
How to
To see which steps are relevant for your (meta)data, please follow the diagrams below.
For analysing data semantics:
For analysing metadata semantics:
For easier understanding, we will follow an example dataset containing patient information with the following metadata:
| Metadata Field | Value |
|---|---|
| Dataset Name | Health Data |
| Date of Upload | 01/02/2023 |
| Keywords | BP, HR, Conditions |
| Creator | Dr. Smith |
| Description | Patient health data including BP and HR |
| Format | CSV |
| Source | Hospital A |
| Rights | Open |
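For the steps that follow, it can help to also keep this example metadata in a machine-readable form. The snippet below is just one possible representation (a plain Python dictionary); it contains exactly the fields and values from the table above.

```python
# The example metadata from the table above as a plain Python dictionary,
# so the later steps can refer to it programmatically.
dataset_metadata = {
    "Dataset Name": "Health Data",
    "Date of Upload": "01/02/2023",
    "Keywords": "BP, HR, Conditions",
    "Creator": "Dr. Smith",
    "Description": "Patient health data including BP and HR",
    "Format": "CSV",
    "Source": "Hospital A",
    "Rights": "Open",
}
```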
In this example we are working with existing metadata, which is why we start with Step 1 - Compile information.
Step 1 - Compile information
Compile all the information about data elements, data values, and data structure. Examine the data in whatever format and structure it is available in. This step helps to identify inconsistencies, ambiguities, and errors in the data.
a) For existing (meta)data: Locate all relevant sources in which the (meta)data is stored. Compile information about the following:
Which variables are present in the (meta)data (e.g. in the eCRFs)?
What are the value ranges for each variable?
In our example, we are FAIRifying the already existing metadata of a dataset. Let’s compile and examine the information we have.
| Metadata field / Variable | Description of the field | Value range |
|---|---|---|
| Dataset Name | The name of the dataset | Text |
| Date of Upload | The date on which the dataset was uploaded | Date values in the format MM/DD/YYYY |
| Keywords | Terms that describe the main topics of the dataset | Text |
| Creator | The person or organisation that created the dataset | Text, in our example title and last name |
| Description | A brief summary of the dataset | Text |
| Format | The file format of the dataset | Text, in our example a short string indicating the file format |
| Source | The origin of the dataset | Text, in our example the name of the institution |
| Rights | The usage rights or licence of the dataset | Text |
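If the data itself is available in a tabular format, compiling the variables and their observed value ranges can also be done programmatically. The sketch below is only an illustration: it assumes a hypothetical CSV file called `patient_data.csv` and uses pandas to list each column with its data type and observed values.

```python
import pandas as pd

# Profile a (hypothetical) tabular dataset: list every variable, its data
# type and the values or range observed in the data.
df = pd.read_csv("patient_data.csv")  # hypothetical file name

for column in df.columns:
    series = df[column]
    if pd.api.types.is_numeric_dtype(series):
        observed = f"min={series.min()}, max={series.max()}"
    else:
        observed = f"values={sorted(series.dropna().unique())}"
    print(f"{column} ({series.dtype}): {observed}")
```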
b) In case you aim to collect FAIR (meta)data from the start:
Which data elements/variables are you planning to collect? For this, the competency questions (CQs) might provide some guidance.
If possible, determine the value range for each data element (e.g. for ‘biological sex at birth', values could be ‘male’, ‘female’; while for 'age’, the value range might be 0-110)
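The result of this step can already be written down as a draft codebook. Below is a minimal sketch: the variable names, definitions and allowed values are invented examples based on the ones mentioned above, not a prescribed standard.

```python
# A minimal draft codebook: for each planned variable, an unambiguous
# definition and the intended value range. All entries are illustrative.
draft_codebook = [
    {
        "variable": "sex_at_birth",
        "definition": "Biological sex at birth (not gender identity)",
        "allowed_values": ["male", "female"],
    },
    {
        "variable": "age",
        "definition": "Age of the subject at inclusion, in whole years",
        "allowed_values": range(0, 111),  # 0-110
    },
]

for entry in draft_codebook:
    print(f"{entry['variable']}: {entry['definition']}")
```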
Step 2 - Check for an existing standard/code book
a) For existing (meta)data: check if it comes with a codebook or metadata standard. In case it does and it is clear, you can use it for your (meta)data and are done with this step. If the codebook is not helpful, contact the owner of the data to clear up the semantics, so you do not misinterpret the data. If you still need to do additional work to make the data clearer, follow the steps below.
b) For new data: check if there is a codebook or metadata standard you can use. If so, use it; if not, follow the next steps.
If you find a codebook or metadata standard that fits partially, use it for the elements that are included and follow the steps below for the others.
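To make a partial fit explicit, it can help to compare your own fields against those covered by the standard. A small sketch, using the metadata fields from our example and an invented, hypothetical standard:

```python
# Hypothetical example: compare your own (meta)data fields against the
# fields covered by a codebook or metadata standard you found.
our_fields = {
    "Dataset Name", "Date of Upload", "Keywords", "Creator",
    "Description", "Format", "Source", "Rights",
}
standard_fields = {"Dataset Name", "Creator", "Description", "Rights"}  # invented

covered = our_fields & standard_fields
not_covered = our_fields - standard_fields

print("Covered by the standard:", sorted(covered))
print("Still to be clarified (follow the steps below):", sorted(not_covered))
```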
Health-RI, together with domain representatives, aims to develop domain-specific national data standards in the future.
You can find more about metadata standards and ontologies at the following link: https://howtofair.dk/links-additional-reading/#more-on-metadata-standards-and-ontologies-
Step 3 - Check data semantics
Check the data semantics. Is the meaning of the data elements clear and unambiguous? For data elements with ambiguous meaning, try to improve their definitions. For this, it might help to examine a variable's value range to find out whether, besides the intended values, other values could be filled in too.
In the example of collecting data on a patient’s ‘sex’, it might be unclear whether it means ‘biological sex at birth’ or ‘gender’. In another example, the ‘age’ of a subject can be expressed in years, but in some cases (e.g. studies with small children) could also be expressed in months. It should therefore be clearly stated whether age is to be captured in years or months.
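Such checks can also be done programmatically. The sketch below flags values that fall outside the intended value range; the variables, allowed values and records are invented purely for illustration.

```python
# Flag values outside the intended value range for two illustrative variables.
allowed_sex = {"male", "female"}   # intended value range for 'sex'
allowed_age = range(0, 111)        # intended: age in whole years, 0-110

records = [
    {"sex": "male", "age": 42},
    {"sex": "F", "age": 35},       # unexpected coding for 'sex'
    {"sex": "female", "age": 480}, # age possibly recorded in months, not years
]

for i, record in enumerate(records):
    if record["sex"] not in allowed_sex:
        print(f"Record {i}: unexpected 'sex' value {record['sex']!r}")
    if record["age"] not in allowed_age:
        print(f"Record {i}: 'age' value {record['age']} outside 0-110")
```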
The table below shows the issues with our current metadata, along with suggested improvements to make its meaning clearer.
This recipe in the FAIR cookbook gives some additional guidance on specifying the semantics of elements of your data.
| Metadata Field | Value | Issue | Suggested Value | Suggested description |
|---|---|---|---|---|
| Dataset Name | Health Data | Generic and not descriptive. | Patient Health Records 2023 | The name of the dataset. |
| Date of Upload | 01/02/2023 | Ambiguous format. | 2023-01-02 | Date when the dataset was uploaded, in ISO 8601 format (YYYY-MM-DD). |
| Keywords | BP, HR, Conditions | Abbreviations used without context. | Blood Pressure, Heart Rate, Hypertension | Keywords describing the main topics covered by the dataset. |
| Creator | Dr. Smith | Generic name without additional identifying information. | Dr. John Smith, Hospital A, ORCID: 0001-0002-3456-7890 | Full name and affiliation of the dataset creator, as well as ORCID. |
| Description | Patient health data including BP and HR | Lacks detail. | Detailed patient health records including measurements of blood pressure (BP) and heart rate (HR), along with diagnosed medical conditions and prescribed medications. | Extended description providing context and details about the dataset. |
| Format | CSV | Broad category, can be more detailed. | CSV, version 1.0 | Data format and version. |
| Source | Hospital A | Lacks detail, too generic. | Hospital A, Department of Cardiology | Specific department and institution where the data was sourced. |
| Rights | Open | Too broad. | CC BY 4.0 | Licensing terms specifying the rights for data usage. |
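Some of the suggested improvements can be applied mechanically once the intended meaning is confirmed. The sketch below converts the upload date to ISO 8601 (assuming the data owner confirms the original notation is MM/DD/YYYY) and expands the abbreviated keywords using an agreed mapping; the mapping itself is illustrative.

```python
from datetime import datetime

# Convert the upload date to ISO 8601, assuming the data owner confirms the
# original notation is MM/DD/YYYY.
original_date = "01/02/2023"
iso_date = datetime.strptime(original_date, "%m/%d/%Y").date().isoformat()
print(iso_date)  # 2023-01-02

# Expand abbreviated keywords using an agreed mapping (illustrative only).
keyword_map = {"BP": "Blood Pressure", "HR": "Heart Rate", "Conditions": "Hypertension"}
keywords = [keyword_map.get(k.strip(), k.strip()) for k in "BP, HR, Conditions".split(",")]
print(keywords)  # ['Blood Pressure', 'Heart Rate', 'Hypertension']
```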
Step 4 - Check relationships
Compile information about the relationships between data elements. For example, if the dataset is in a relational database, the relational schema provides information about the dataset structure, the types involved (the field names), cardinality, etc.
For example, the variable 'biological sex at birth' in a dataset is an attribute of ‘patient’.
In our metadata example above, the Creator (Dr. John Smith) is an employee of the Source (Hospital A, Department of Cardiology) of the dataset.
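Such relationships can be made explicit, for example as simple subject-predicate-object statements. The sketch below uses plain Python tuples with invented predicate names; mapping these to actual ontology terms is part of a later step.

```python
# Relationships between (meta)data elements written down as simple
# subject-predicate-object statements. Predicate names are placeholders,
# not terms from a specific ontology.
relationships = [
    ("biological sex at birth", "is attribute of", "patient"),
    ("Dr. John Smith", "is employee of", "Hospital A, Department of Cardiology"),
    ("Hospital A, Department of Cardiology", "is source of", "Patient Health Records 2023"),
]

for subject, predicate, obj in relationships:
    print(f"{subject} --{predicate}--> {obj}")
```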
Step 5 - Check for FAIR features
In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).
In our example above, the ORCID of the creator is a unique persistent identifier (F1) for a person.
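A simple, illustrative way to check for such identifiers is to look for recognisable identifier patterns. The sketch below only checks whether the creator field contains something shaped like an ORCID iD (sixteen digits in four groups, the last character possibly an X); it does not verify that the identifier actually resolves.

```python
import re

# Does the creator field contain something shaped like an ORCID iD?
# This only checks the pattern; it does not verify that the iD resolves.
ORCID_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")

creator = "Dr. John Smith, Hospital A, ORCID: 0001-0002-3456-7890"
match = ORCID_PATTERN.search(creator)
print("ORCID-like identifier found:", match.group(0) if match else "none")
```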
Expertise requirements for this step
Below are experts that may need to be involved, as described in Metroline Step: Build the Team.
Semantic data / modelling specialists: Can help with understanding the data’s structure and ambiguity.
Domain expert: Can help with understanding the data’s elements and their potential ambiguity.
FAIR data steward: Can help with identifying FAIR features in the dataset, or in the data to be collected, as well as ambiguity in the data elements.
Practical examples from the community
Examples of how this step is applied in a project (link to demonstrator projects).
Training
Relevant training will be added in the future if available.
Suggestions
Visit our How to contribute page for information on how to get in touch if you have any suggestions about this page.