...
In this step, the aim is to gain more insight into the existing data, or the data that you aim to collect. Clearly defining the meaning (semantics) of the data is an important step for creating the semantic model, as well as for data collection via, for example, electronic case report forms (eCRFs).
To understand the semantics of the data, different aspects of the data elements/variables should be analysed: their meaning, their values, their representation (format), and their structure (i.e. relationships between data elements). In particular, look at:
the definition/description of data elements
the values that can be chosen (e.g., in system A, sex allows for male and female, while in system B, sex also allows for intersex; such a difference reflects a gap in their semantics)
the relationships between data elements (e.g., the ‘sex’ variable is one attribute of the ‘patient’ profile, which may imply that the semantics of this ‘sex’ variable is ‘sex of patient’)
The outcome of this step should be a set of data elements with clear and unambiguous semantics, which reflect the information you want to collect or share.
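The three aspects listed above (definition, allowed values, relationship) can be recorded per data element in a small structure. The following is a minimal sketch, not a prescribed format: the `DataElement` class, its field names, and the `value_gap` helper are hypothetical and only illustrate how the difference between system A and system B could be made explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """Hypothetical record of one data element's semantics."""
    name: str
    definition: str            # human-readable description
    allowed_values: frozenset  # values that can be chosen
    attribute_of: str          # entity the element describes


def value_gap(a: DataElement, b: DataElement) -> frozenset:
    """Return the values permitted by only one of the two elements."""
    return a.allowed_values ^ b.allowed_values


# Illustrative entries for the 'sex' variable in two systems.
sex_a = DataElement("sex", "sex of patient",
                    frozenset({"male", "female"}), "patient")
sex_b = DataElement("sex", "sex of patient",
                    frozenset({"male", "female", "intersex"}), "patient")

gap = value_gap(sex_a, sex_b)
print(sorted(gap))
```

A non-empty result signals that the two systems do not share the same semantics for the variable, even though the variable name is identical.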
Why is this step important
Several of the steps that follow rely on being familiar with your data. For example, in order to create or reuse your semantic (meta)data model, it is crucial to understand the meaning and structure of your existing data, or data to be collected. Furthermore, a good understanding of your data is closely connected to the FAIRification goals, since these can depend on the relationships of variables.
For example, in a dataset a variable to collect sex-related data might be called ‘sex’. If the semantics of such a variable are not provided or not analysed, it would be unclear if it means ‘biological sex at birth’, ‘phenotypic sex’, or ‘gender’.
How to
While performing this step, keep your
...
Compile all the information on data elements, data values, and data structure together. Examine the data in whatever format and structure it is available.
In case you are FAIRifying existing/already collected data, locate all relevant sources in which the data is stored. Compile information about the following:
Which variables are present in the data (e.g. in the eCRFs)?
What are the value ranges for each variable?
In case you are aiming at collecting FAIR data from the start:
Which data elements/variables are you planning to collect? For this, the driving user’s question might provide some guidance.
If possible, determine the value range for each data element (e.g. for ‘biological sex at birth’, values could be ‘male’, ‘female’; while for ‘age’, the value range might be 0-110).
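For existing data, the inventory of variables and value ranges described above can be drafted automatically. The sketch below assumes the data is available as a list of records (for instance rows read from a CSV export); the function names and the sample records are hypothetical.

```python
from collections import defaultdict


def summarise_variables(records):
    """Map each variable name to the set of distinct non-empty values observed."""
    observed = defaultdict(set)
    for record in records:
        for variable, value in record.items():
            if value not in (None, ""):
                observed[variable].add(value)
    return dict(observed)


def numeric_range(values):
    """Return (min, max) if every value is numeric, otherwise None."""
    try:
        numbers = [float(v) for v in values]
    except (TypeError, ValueError):
        return None
    return (min(numbers), max(numbers)) if numbers else None


# Illustrative records; real data would come from the located sources.
records = [
    {"sex": "male", "age": "34"},
    {"sex": "female", "age": "71"},
    {"sex": "female", "age": ""},
]

inventory = summarise_variables(records)
for variable, values in inventory.items():
    print(variable, sorted(values), numeric_range(values))
```

Such a summary only shows what happens to be present in the data. The intended value range still has to be confirmed with the people who know the data.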
Step 2
Check the data semantics. Is the meaning of the data elements clear and unambiguous? For data elements with ambiguous meaning, try to improve their definition. For this, it might help to examine the value range of a variable to find out whether, next to the intended values, other values could be filled in as well.
In the example of collecting data on a patient’s ‘sex’, it might be unclear if it means ‘biological sex at birth’ or ‘gender’.
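Examining the value range can be done by counting the values that fall outside the intended set. This is a minimal sketch with made-up values; the `out_of_range_counts` helper and the sample data are hypothetical.

```python
from collections import Counter


def out_of_range_counts(values, intended):
    """Count every observed value that is not part of the intended value set."""
    return Counter(v for v in values if v not in intended)


# Illustrative values entered for a variable called 'sex'.
intended_sex = {"male", "female"}
entered = ["male", "female", "woman", "female", "non-binary", "woman"]

counts = out_of_range_counts(entered, intended_sex)
print(counts.most_common())
```

In this made-up case, the unexpected values suggest that the variable has been used to record both ‘biological sex at birth’ and ‘gender’, so its definition needs to be made explicit.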
Step 3
Compile information about the relationships between data elements. For example, if the dataset is in a relational database, the relational schema provides information about the dataset structure, the types involved (the field names), cardinality, etc.
In the example of ‘biological sex at birth’, this variable is an attribute of ‘patient’.
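When the dataset sits in a relational database, the structure can be read from the schema itself. The sketch below assumes an SQLite database and uses Python's built-in `sqlite3` module; the tables `patient` and `visit` and the `describe_schema` helper are hypothetical and created in memory for illustration only.

```python
import sqlite3


def describe_schema(conn):
    """Return the columns and foreign keys of every user table in an SQLite database."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(row[1], row[2])
                   for row in conn.execute(f'PRAGMA table_info("{table}")')]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        foreign_keys = [(row[3], row[2], row[4])
                        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')]
        schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return schema


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        sex_at_birth TEXT
    );
    CREATE TABLE visit (
        visit_id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
        visit_date TEXT
    );
""")

schema = describe_schema(conn)
for table, info in schema.items():
    print(table, info)
```

Here the foreign key from `visit.patient_id` to `patient.patient_id` shows that a visit belongs to a patient, and the `NOT NULL` constraint hints at the cardinality (each visit refers to exactly one patient). Other database systems expose the same information through their own catalogue views.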
Step 4
Check whether the data representation is clear and unambiguous. Investigate which types of data are present. For data elements with ambiguous format, try to improve their definition. For this, it might help to examine the value range of a variable to find out whether, next to the intended values, other values could be filled in as well.
For example, ‘age’ of a subject can be expressed in years, but in some cases (e.g. in studies with small children) it could also be expressed in months. It should therefore be clearly stated whether age should be captured in years or months.
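One way to remove this kind of ambiguity is to store the unit together with the value. The following is a minimal sketch under that assumption; the `Age` class and its unit labels are hypothetical.

```python
from dataclasses import dataclass

MONTHS_PER_YEAR = 12


@dataclass(frozen=True)
class Age:
    """An age value that always carries its unit."""
    value: float
    unit: str  # either "years" or "months"

    def in_months(self) -> float:
        """Convert the age to months so that values become comparable."""
        if self.unit == "months":
            return self.value
        if self.unit == "years":
            return self.value * MONTHS_PER_YEAR
        raise ValueError(f"unknown age unit: {self.unit!r}")


toddler = Age(18, "months")
adult = Age(34, "years")
print(toddler.in_months(), adult.in_months())
```

Without the unit, the bare values 18 and 34 could not be compared reliably, which is exactly the ambiguity this step tries to detect.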
Step 5
In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).
...
Below are experts that may need to be involved, as described in Metroline Step: Build the Team.
Semantic data / modelling specialists: Can help with understanding the data’s structure and ambiguity.
Domain expert: Can help with understanding the data’s elements and their potential ambiguity.
FAIR data steward: Can help with identifying FAIR features in the dataset, or data to be collected, as well as ambiguity in the data elements.
Practical examples from the community
...