...

  • Which variables are present in the (meta)data (i.e. in the eCRFs)?

  • What are the value ranges for each variable?

In our example, we are FAIRifying the existing metadata of a dataset. Let's compile and examine the information we have. The variables and value ranges of our metadata are as follows:

| Metadata field / Variable | Description of the field | Value range |
| --- | --- | --- |
| Dataset Name | The name of the dataset. | Text |
| Date of Upload | The date on which the dataset was uploaded | Date values in the format MM/DD/YYYY |
| Keywords | Terms that describe the main topics of the dataset | Text |
| Creator | The person or organisation that created the dataset | Text, in our example title and last name |
| Description | A brief summary of the dataset | Text |
| Format | A file format of the dataset | Text, in our example a short string indicating the file format |
| Source | The origin of the dataset | Text, in our case the name of the institution |
| Rights | The usage rights or licence of the dataset | Text |
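
The inventory of variables and value ranges can also be written down in machine-readable form, so that existing records can be checked against it. The following is only a minimal sketch: the field names come from the table above, while the helper names (`is_text`, `is_us_date`, `check_record`) and the rule that "Text" means a non-empty string are assumptions made for illustration.

```python
from datetime import datetime


def is_text(value):
    """Assumed reading of the 'Text' value range: any non-empty string."""
    return isinstance(value, str) and value.strip() != ""


def is_us_date(value):
    """True if the value parses as MM/DD/YYYY, the format listed for Date of Upload."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except (TypeError, ValueError):
        return False


# One validator per metadata field, mirroring the 'Value range' column.
VALUE_RANGES = {
    "Dataset Name": is_text,
    "Date of Upload": is_us_date,
    "Keywords": is_text,
    "Creator": is_text,
    "Description": is_text,
    "Format": is_text,
    "Source": is_text,
    "Rights": is_text,
}


def check_record(record):
    """Return a list of (field, problem) pairs for one metadata record."""
    problems = []
    for field, is_valid in VALUE_RANGES.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not is_valid(record[field]):
            problems.append((field, "outside value range"))
    return problems
```

Calling `check_record` on a metadata record returns an empty list when every field is present and within its value range. A check like this only confirms the format of a value, not whether its meaning is clear.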

...

  1. In case you are aiming at collecting FAIR data from the start:

...

When collecting data on a patient's 'sex', for example, it might be unclear whether this means 'biological sex at birth' or 'gender'. In another example, the 'age' of a subject can be expressed in years, but in some cases (e.g. studies with small children) it could also be expressed in months. It should therefore be clearly stated whether age should be captured in years or months.
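
One way to make such decisions explicit is to record a definition, a unit and the permitted values alongside each data element. The sketch below is illustrative only: the `DataElement` class, the permitted values for sex and the choice of months as the unit for age are assumptions, not part of the original metadata.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataElement:
    """A data element with an explicit definition, unit and value set."""
    name: str
    definition: str
    unit: Optional[str] = None
    permitted_values: Tuple[str, ...] = ()

    def accepts(self, value):
        """Check a value against the permitted values or the numeric unit."""
        if self.permitted_values:
            return value in self.permitted_values
        if self.unit is not None:
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            return is_number and value >= 0
        return isinstance(value, str)


# Hypothetical definitions: the wording and value sets are assumptions.
SEX_AT_BIRTH = DataElement(
    name="biological sex at birth",
    definition="Sex recorded at birth, as opposed to gender.",
    permitted_values=("female", "male", "intersex", "unknown"),
)
AGE_IN_MONTHS = DataElement(
    name="age",
    definition="Age of the subject, captured in completed months.",
    unit="months",
)


def years_to_months(age_in_years):
    """Convert an age captured in years to the declared unit (months)."""
    return age_in_years * 12
```

With the unit stated in the definition, a value such as `18` is no longer ambiguous: for `AGE_IN_MONTHS` it means eighteen months, and an age captured in years has to be converted first.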

The spreadsheet below lists the issues with our current metadata, together with suggested improvements that make their meaning clearer.

...

| Metadata Field | Value | Issue | Suggested Value | Suggested description |
| --- | --- | --- | --- | --- |
| Dataset Name | Health Data | Generic and not descriptive. | Patient Health Records 2023 | Comprehensive dataset containing health records of patients from Hospital A in the year 2023. |
| Date of Upload | 01/02/2023 | Ambiguous format (MM/DD/YYYY or DD/MM/YYYY). | 2023-01-02 | Date when the dataset was uploaded, in ISO 8601 format (YYYY-MM-DD). |
| Keywords | BP, HR, Conditions | Abbreviations used without context. | Blood Pressure, Heart Rate, Medical Conditions, Hypertension, Diabetes | Keywords describing the main topics covered by the dataset. |
| Creator | Dr. Smith | Generic name without additional identifying information. | Dr. John Smith, Hospital A; ORCID: 0001-0002-3456-7890 | Full name and affiliation of the dataset creator, as well as ORCID. |
| Description | Patient health data including BP and HR | Lacks detail. | Detailed patient health records including measurements of blood pressure (BP) and heart rate (HR), along with diagnosed medical conditions and prescribed medications. | Extended description providing context and details about the dataset. |
| Format | CSV | Broad category, can be more detailed. | CSV, version 1.0 | Data format and version. |
| Source | Hospital A | Lacks detail, too generic. | Hospital A, Department of Cardiology | Specific department and institution where the data was sourced. |
| Rights | Open | Too broad. | CC BY 4.0 | Licensing terms specifying the rights for data usage. |
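
The date correction in the table can be sketched in a few lines of Python. The function name `to_iso_8601` is hypothetical, and the sketch assumes that the source format of the existing values is known, because the ambiguity between MM/DD/YYYY and DD/MM/YYYY cannot be resolved from the value alone.

```python
from datetime import datetime


def to_iso_8601(date_text, source_format="%m/%d/%Y"):
    """Convert a date string in a known source format to ISO 8601 (YYYY-MM-DD).

    The default assumes MM/DD/YYYY, the format listed for Date of Upload.
    """
    return datetime.strptime(date_text, source_format).date().isoformat()


print(to_iso_8601("01/02/2023"))              # read as MM/DD/YYYY: 2023-01-02
print(to_iso_8601("01/02/2023", "%d/%m/%Y"))  # read as DD/MM/YYYY: 2023-02-01
```

The two calls show why the original value is ambiguous: the same string yields two different dates depending on the assumed source format, whereas the ISO 8601 result has only one reading.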

...

Compile information about the relationships between data elements. For example, if the dataset is in a relational database, the relational schema provides information about the dataset structure, the types involved (the field names), cardinality, etc.

For example, the variable 'biological sex at birth' in a dataset is an attribute of 'patient'.
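
If the dataset is held in a relational database, this structural information can be read from the schema itself. The sketch below uses Python's built-in `sqlite3` module with a hypothetical two-table schema: the table and column names (`patient`, `measurement`, `heart_rate_bpm`) are assumptions chosen to match the example, not the actual structure of the dataset.

```python
import sqlite3

# Hypothetical schema: 'biological_sex_at_birth' is an attribute of 'patient',
# and each measurement refers to exactly one patient (many-to-one).
SCHEMA = """
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    biological_sex_at_birth TEXT NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    heart_rate_bpm INTEGER
);
"""


def describe_schema(connection):
    """Collect column types and foreign-key references for every table."""
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    summary = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = {row[1]: row[2]
                   for row in connection.execute(f"PRAGMA table_info({table})")}
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        references = [(row[3], row[2], row[4])
                      for row in connection.execute(f"PRAGMA foreign_key_list({table})")]
        summary[table] = {"columns": columns, "references": references}
    return summary


connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)
summary = describe_schema(connection)
```

Here `summary["patient"]["columns"]` lists the attributes of a patient with their types, and `summary["measurement"]["references"]` shows that measurements point back to patients, which is the kind of relationship information this step asks you to compile.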

In our metadata example above, the Creator (Dr. John Smith) is an employee of the Source (Hospital A, Department of Cardiology) of the dataset.

Step 4

Check whether the data representation is clear and unambiguous. For data elements with an ambiguous format, try to improve their definition. It can help to examine the value range of a variable to find out whether values outside the intended range could also be entered.

For example, the 'age' of a subject can be expressed in years, but in some cases (e.g. studies with small children) it could also be expressed in months. It should therefore be clearly stated whether age should be captured in years or months.

Step 5

In addition, check whether the data already contains FAIR features, such as persistent unique identifiers for data elements (for more information, see pre-FAIR assessment).

In our example above, the ORCID of the creator is a unique persistent identifier (F1) for a person.
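
Whether an identifier is at least structurally valid can be checked automatically. The sketch below validates the ORCID layout (four groups of four characters) and its final check character, which ORCID computes with the ISO 7064 MOD 11-2 algorithm. The function names are hypothetical, and the check does not confirm that the identifier is actually registered to a person.

```python
import re

# Four hyphen-separated groups; the last character may be a digit or 'X'.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the string has the ORCID layout and a matching check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

For instance, `is_valid_orcid("0000-0002-1825-0097")` returns `True` for the sample identifier used in ORCID's own documentation. The ORCID shown in the table above is a placeholder for illustration and is not expected to pass the checksum.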

Step 6

Define or align common data elements (CDEs). For a new data collection: define CDEs whose semantics are clear and unambiguous; for an existing data set, existing data elements can be aligned to CDEs.
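
For an existing dataset, the alignment can be sketched as a simple mapping from local variable names to CDE names. Everything in the example below is hypothetical: the local names, the CDE names and the `align_to_cdes` helper are placeholders, and a real alignment would refer to CDEs from an agreed catalogue or ontology.

```python
# Hypothetical mapping from local variable names (lower-cased) to CDE names.
CDE_ALIGNMENT = {
    "sex": "biological_sex_at_birth",
    "sex_at_birth": "biological_sex_at_birth",
    "age": "age_in_years",
    "age_yrs": "age_in_years",
}


def align_to_cdes(record, alignment=CDE_ALIGNMENT):
    """Rename the keys of a record to their CDE names.

    Returns the aligned record and the list of local names without a CDE,
    which still need to be defined or mapped by hand.
    """
    aligned, unmapped = {}, []
    for local_name, value in record.items():
        cde_name = alignment.get(local_name.strip().lower())
        if cde_name is None:
            unmapped.append(local_name)
        else:
            aligned[cde_name] = value
    return aligned, unmapped
```

Keeping the unmapped names separate makes the remaining work visible: each of them either needs a new, clearly defined CDE or a decision to leave it out of the FAIRified dataset.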

...