...
While performing this step, keep your
For example, a variable in a dataset that collects sex-related data might be called ‘sex’. If the semantics of that variable are not provided or not analysed, it is unclear whether it means ‘biological sex at birth’, ‘phenotypic sex’, or ‘gender’. Such issues have to be resolved before you start on the semantic (meta)data model.
How to
To see which steps are relevant for your (meta)data, please follow the diagram below. For easier understanding, we will follow an example dataset containing patient information with the following metadata:
Metadata Field | Value |
---|---|
Dataset Name | Health Data |
Date of Upload | 01/02/2023 |
Keywords | BP, HR, Conditions |
Creator | Dr. Smith |
Description | Patient health data including BP and HR |
Format | CSV |
Source | Hospital A |
Rights | Open |
[placeholder diagram]
In the example, we are working with existing (meta)data, which is why we start with Step 2 - Check for an existing standard/code book.
Step 1 - Compile information
Compile all the information of data elements, data values, and data structure. Examine the data in whatever format and structure it is available. This step helps to identify inconsistencies, ambiguities, and errors in the data.
...
Which data elements/variables are you planning to collect? For this, the driving user’s question might provide some guidance.
If possible, determine the value range for each data element (e.g. for ‘biological sex at birth’, the values could be ‘male’ and ‘female’, while for ‘age’ the value range might be 0-110).
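As a minimal sketch of how declared value ranges could be recorded and checked, the snippet below stores them in a small dictionary. The element names, the definitions, and the helper `in_value_range` are hypothetical and only illustrate the idea; they are not part of any standard.

```python
# Hypothetical code book: each data element gets a definition and a value range.
# The definitions below are illustrative assumptions, not agreed semantics.
CODE_BOOK = {
    "biological_sex_at_birth": {
        "definition": "Sex recorded at birth",
        "allowed": {"male", "female"},
    },
    "age": {
        "definition": "Age in completed years",
        "min": 0,
        "max": 110,
    },
}


def in_value_range(element, value):
    """Return True if value lies within the range declared for element."""
    spec = CODE_BOOK[element]
    if "allowed" in spec:
        # Categorical element: the value must be one of the listed codes.
        return value in spec["allowed"]
    # Numeric element: the value must lie between min and max (inclusive).
    return spec["min"] <= value <= spec["max"]
```

Calling `in_value_range("age", 130)` would flag a value outside the declared range, which is a prompt to either correct the data or revisit the declared range.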
Step 2 - Check for an existing standard/code book
a) For existing data: check whether it comes with a code book. If it does and the code book makes the semantics clear, you are done. If not, contact the owner of the data to get the semantics cleared up, so that you do not misinterpret the data. If you still need to do additional work to make the data clearer, follow the steps below.
b) For new data: check whether there is a code book you can use. If there is, use it; if not, follow the steps below.
Check if there is an existing data standard or code book that you can reuse. If there is, use it; otherwise follow the steps below. If you find a code book that fits only partially, use it for the elements it covers and follow the steps below for the others. Health-RI, together with domain representatives, aims to develop domain-specific national data standards in the future.
Step 3 - Check data semantics
Check the data semantics. Is the meaning of the data elements clear and unambiguous? For data elements with ambiguous meaning, try to improve their definition. It might help to examine the value range of a variable to find out whether values outside the intended range could also be entered.
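One way to examine the value range of a variable is to tally the values that actually occur and compare them with the intended range. The sketch below assumes the observed values are available as a plain list; the sample values and the helper name `unexpected_values` are made up for illustration.

```python
from collections import Counter


def unexpected_values(observed, intended):
    """Count observed values that fall outside the intended value range."""
    return Counter(value for value in observed if value not in intended)


# Hypothetical contents of a 'sex' column.
observed_sex = ["male", "female", "F", "female", "unknown", "F"]
extras = unexpected_values(observed_sex, {"male", "female"})
print(extras)
```

Values such as `F` or `unknown` in the result suggest that the definition of the variable, or its coding, needs to be clarified with the data owner.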
...
Metadata Field | Value | Issue | Suggested Value | Suggested description |
---|---|---|---|---|
Dataset Name | Health Data | Generic and not descriptive. | Patient Health Records 2023 | The name of the dataset. |
Date of Upload | 01/02/2023 | Ambiguous format. | 2023-01-02 | Date when the dataset was uploaded, in ISO 8601 format (YYYY-MM-DD). |
Keywords | BP, HR, Conditions | Abbreviations used without context. | Blood Pressure, Heart Rate, Hypertension | Keywords describing the main topics covered by the dataset. |
Creator | Dr. Smith | Generic name without additional identifying information. | Dr. John Smith, Hospital A, ORCID: 0001-0002-3456-7890 | Full name and affiliation of the dataset creator, as well as ORCID. |
Description | Patient health data including BP and HR | Lacks detail. | Detailed patient health records including measurements of blood pressure (BP) and heart rate (HR), along with diagnosed medical conditions and prescribed medications. | Extended description providing context and details about the dataset. |
Format | CSV | Broad category, can be more detailed. | CSV, version 1.0 | Data format and version. |
Source | Hospital A | Lacks detail, too generic. | Hospital A, Department of Cardiology | Specific department and institution where the data was sourced. |
Rights | Open | Too broad. | CC BY 4.0 | Licensing terms specifying the rights for data usage. |
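To illustrate why the original Date of Upload value is flagged as ambiguous, the sketch below lists every ISO 8601 date that a slash-separated date could represent. It assumes only day-first and month-first readings are plausible; the helper name `candidate_iso_dates` is hypothetical.

```python
from datetime import datetime


def candidate_iso_dates(raw):
    """Return every ISO 8601 date that a slash-separated date could mean."""
    candidates = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):  # day-first and month-first readings
        try:
            candidates.add(datetime.strptime(raw, fmt).date().isoformat())
        except ValueError:
            pass  # this reading is not a valid calendar date
    return sorted(candidates)


print(candidate_iso_dates("01/02/2023"))
```

For `01/02/2023` two candidates come back, so the suggested value 2023-01-02 (the month-first reading) should be confirmed with the data owner rather than inferred from the data alone.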
Step 3 - Check relationships
Compile information about the relationships between data elements. For example, if the dataset is in a relational database, the relational schema provides information about the dataset structure, the types involved (the field names), cardinality, etc.
...
In our metadata example above, the Creator (Dr. John Smith) is an employee of the Source (Hospital A, Department of Cardiology) of the dataset.
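If the data lives in a relational database, the declared foreign keys are one place to read off relationships. The sketch below uses an in-memory SQLite database; the `source` and `creator` tables and their columns are hypothetical and merely mirror the Creator/Source example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (
        source_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE creator (
        creator_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        employer_id INTEGER NOT NULL REFERENCES source(source_id)
    );
    """
)


def foreign_keys(connection, table):
    """List (from_column, referenced_table, referenced_column) for a table."""
    rows = connection.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Row layout: id, seq, table, from, to, on_update, on_delete, match
    return [(row[3], row[2], row[4]) for row in rows]


print(foreign_keys(conn, "creator"))
```

Here the `NOT NULL` foreign key expresses that every creator is linked to exactly one source, which is the kind of cardinality information this step asks you to compile.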
Step 4 - Check for FAIR features
In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).
In our example above, the ORCID of the creator is a unique persistent identifier (F1) for a person.
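When checking identifiers such as ORCID iDs, a syntactic check can catch typing errors early. The sketch below checks the four-block layout and the final check character, which ORCID computes with the ISO 7064 MOD 11-2 scheme. The function names are hypothetical, and passing this check says nothing about whether the iD is actually registered or resolves.

```python
import re


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the final check character."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]
```

For instance, `is_well_formed_orcid("0000-0002-1825-0097")` passes the checksum, while changing the last character to `8` fails it.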
Step 5 - Define/align common data elements
Define or align common data elements (CDEs).
...