...
The outcome of this step should be a set of data elements (variables) with clear and unambiguous semantics that reflect the information you want to collect or share. Note that finding machine-actionable terms from ontologies for these data elements is not yet part of this step; it is described in the step Create or reuse a semantic (meta)data model.
Why is this step important
Several of the steps that follow rely on being familiar with your data. For example, in order to create or reuse your semantic (meta)data model, it is crucial to understand the meaning and relationships of variables.
While performing this step, keep your
For example, a variable that collects sex-related data in a dataset might simply be called ‘sex’. If the semantics of such a variable are not provided or not analyzed, it is unclear whether it means ‘biological sex at birth’, ‘phenotypic sex’, or ‘gender’.
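One way to make such gaps visible is to keep a small data dictionary and list the variables that have no definition. The following is a minimal sketch only: the `DataElement` class, the `undocumented` helper, the `sex_at_birth` variable name, and the permitted values are all hypothetical and chosen for illustration, not prescribed by this step.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    """One entry in a hypothetical data dictionary."""
    name: str
    definition: str = ""
    permitted_values: list[str] = field(default_factory=list)


def undocumented(elements):
    """Return the names of elements whose definition is missing or blank."""
    return [e.name for e in elements if not e.definition.strip()]


elements = [
    # Ambiguous: the name alone does not say which notion of sex is meant.
    DataElement(name="sex"),
    # Clearer: the definition states the intended meaning explicitly.
    # The permitted values below are illustrative assumptions.
    DataElement(
        name="sex_at_birth",
        definition="Biological sex at birth.",
        permitted_values=["female", "male", "intersex", "unknown"],
    ),
]

print(undocumented(elements))
```

Variables reported by such a check are candidates for follow-up with the people who collected or designed the data.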
How to
Let’s say we have a dataset containing patient information with the following metadata:
...
The table below lists the issues with our current metadata and suggests improvements that make the meaning of each field clearer.
Metadata Field | Value | Issue | Suggested Value | Suggested description |
---|---|---|---|---|
Dataset Name | Health Data | Generic and not descriptive. | Patient Health Records 2023 | Comprehensive dataset containing health records of patients from Hospital A in the year 2023. |
Date of Upload | 01/02/2023 | Ambiguous format. | 2023-01-02 | Date when the dataset was uploaded, in ISO 8601 format (YYYY-MM-DD). |
Keywords | BP, HR, Conditions | Abbreviations used without context. | Blood Pressure, Heart Rate, Hypertension | Keywords describing the main topics covered by the dataset. |
Creator | Dr. Smith | Generic name without additional identifying information. | Dr. John Smith, Hospital A, ORCID: 0001-0002-3456-7890 | Full name and affiliation of the dataset creator, as well as their ORCID. |
Description | Patient health data including BP and HR | Lacks detail. | Detailed patient health records including measurements of blood pressure (BP) and heart rate (HR), along with diagnosed medical conditions and prescribed medications. | Extended description providing context and details about the dataset. |
Format | CSV | Broad category, can be more detailed. | CSV, version 1.0 | Data format and version. |
Source | Hospital A | Lacks detail, too generic. | Hospital A, Department of Cardiology | Specific department and institution where the data was sourced. |
Rights | Open | Too broad. | CC BY 4.0 | Licensing terms specifying the rights for data usage. |
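
Some of the issues in the table can be detected automatically. The sketch below checks two of them: whether the upload date follows the YYYY-MM-DD format and whether keywords are unexpanded abbreviations. The function names (`is_iso_date`, `find_issues`) and the abbreviation lookup are assumptions made for this illustration; a real check would depend on your own metadata schema.

```python
import re
from datetime import date

# Strict YYYY-MM-DD pattern; the calendar check follows separately.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(value: str) -> bool:
    """Return True if value is a valid calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.match(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def find_issues(metadata: dict, abbreviations: dict) -> list:
    """Collect human-readable issues for a metadata record (hypothetical helper)."""
    issues = []
    if not is_iso_date(metadata.get("Date of Upload", "")):
        issues.append("Date of Upload is not in YYYY-MM-DD format")
    keywords = [k.strip() for k in metadata.get("Keywords", "").split(",")]
    for keyword in keywords:
        if keyword in abbreviations:
            issues.append(
                f"Keyword '{keyword}' is an unexpanded abbreviation "
                f"of '{abbreviations[keyword]}'"
            )
    return issues


# Values taken from the table above; the lookup table is an assumption.
abbreviations = {"BP": "Blood Pressure", "HR": "Heart Rate"}
original = {"Date of Upload": "01/02/2023", "Keywords": "BP, HR, Conditions"}
improved = {
    "Date of Upload": "2023-01-02",
    "Keywords": "Blood Pressure, Heart Rate, Hypertension",
}

for issue in find_issues(original, abbreviations):
    print(issue)
print(find_issues(improved, abbreviations))
```

Issues such as a generic dataset name or a description that lacks detail still need human judgment; a script like this only catches the mechanical cases.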
...
In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).
...