Minimal (meta)dataset

DATE: 23-08-2024 STATUS: ADOPTED

The minimal (meta)dataset is built according to a sunflower model. The generic (meta)dataset, which applies to all datasets, is the core of the sunflower. For each condition and domain, a specific minimal (meta)dataset is then defined in collaboration with the parties involved in healthcare: these are the petals of the sunflower.
DCAT (Data Catalog Vocabulary) was chosen as the basis for the generic metadataset.

A "data governance committee" will take care of the creation and management of the minimal (meta)datasets, including the definition of the coding and modeling according to the founder of unity of language.

This will mainly fall under the responsibility of the healthcare sector, with the Health-RI ecosystem participating to ensure that the minimal (meta)dataset is also usable for research and innovation.

Context: Description of datasets versus description of the data

A lot of work has already been done in the field of metadata that describes data. For example, patient data is annotated with demographic data, and image data and omics data can be annotated with the disease data of the person concerned. The metadata needed to describe a dataset in a catalog is of a different order: an individual person has an age and a gender, but a dataset does not. Nevertheless, when searching for suitable existing data for a study, a researcher wants to be able to distinguish a geriatric dataset from a neonatal dataset. It must therefore be agreed for the catalog how, for example, the age distribution of a population is recorded and how it can be searched for in the catalog. The same applies to many other metadata items at the dataset level. There is not much experience with this yet.
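To make the age example concrete, the sketch below records an age range per catalog entry and searches on it. It is only an illustration: the class `CatalogEntry`, the fields `min_age_years` and `max_age_years`, and the function `overlaps_age_range` are hypothetical names, not part of any agreed catalog model, and the two entries are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Dataset-level description; field names are illustrative assumptions."""
    title: str
    min_age_years: float  # youngest participant in the dataset
    max_age_years: float  # oldest participant in the dataset


def overlaps_age_range(entry, low, high):
    """True if the dataset's age range overlaps the searched range [low, high]."""
    return entry.min_age_years <= high and entry.max_age_years >= low


# Made-up catalog entries, for illustration only.
catalog = [
    CatalogEntry("Neonatal intensive care registry", 0, 0.1),
    CatalogEntry("Geriatric fall-risk cohort", 70, 100),
]

geriatric = [e.title for e in catalog if overlaps_age_range(e, 65, 120)]
print(geriatric)  # ['Geriatric fall-risk cohort']
```

A minimum and maximum age is the simplest possible summary. A real agreement might instead record age bands or a histogram, which would change the search logic accordingly.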

Minimal metadataset

Each dataset must have a certain amount of metadata linked to it in order to make the dataset FAIR: findable, accessible, interoperable and reusable. Following the example of the European Health Data Space (EHDS), Health-RI opts for the DCAT-AP standard to describe metadata and provides a growth path for this. For now, DCAT-AP v2 is used; in the future, DCAT-AP v3 will be required.

Initially, it is sufficient to fill in a minimum number of mandatory DCAT-AP fields so that the dataset can be included in the catalog.

Later, this minimal set will be expanded with an additional layer and with specific metadata that may differ per focus area (e.g. domain, disorder, funder).

These three stages of growth are depicted below in the sunflower metaphor, which includes the core and the area-specific petals.

 

[Figure: sunflower model with the generic core and the area-specific petals (image-20240718-111749.png)]

 

The technical specification, containing information about the minimal, mandatory metadata fields, can be found on GitHub.
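As a rough illustration of the growth path described above, the sketch below describes a dataset with a few DCAT properties in a JSON-LD style dictionary and checks which required properties are still missing. The required lists are assumptions: `dct:title` and `dct:description` are mandatory for a dataset in DCAT-AP v2, but the authoritative Health-RI list is the one in the technical specification and may contain more. The `oncology` petal, the `ex:tumourType` property, and the example URLs are hypothetical.

```python
import json

# Assumption: core properties every dataset must fill in (the sunflower core).
CORE_REQUIRED = ("dct:title", "dct:description")

# Hypothetical petal: extra required properties for one focus area.
PETAL_REQUIRED = {"oncology": ("ex:tumourType",)}


def required_for(domain=None):
    """Core properties plus any properties required for the given focus area."""
    return CORE_REQUIRED + PETAL_REQUIRED.get(domain, ())


def missing_fields(record, domain=None):
    """Return the required properties that are absent or empty in the record."""
    return [name for name in required_for(domain) if not record.get(name)]


dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "ex": "https://example.org/ns#",  # hypothetical namespace
    },
    "@id": "https://example.org/dataset/1",  # hypothetical identifier
    "@type": "dcat:Dataset",
    "dct:title": "Example cohort",
    "dct:description": "Hypothetical dataset used only for illustration.",
}

print(missing_fields(dataset))              # []
print(missing_fields(dataset, "oncology"))  # ['ex:tumourType']
print(json.dumps(dataset, indent=2))
```

Keeping the core list and the petal lists separate mirrors the sunflower model: a dataset can enter the catalog with the core alone, and a focus area can add its own requirements later without changing the core.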

Minimal dataset

Each disease-specific domain has its own processes with associated data. Analogous to the metadata sunflower model, a generic dataset will be defined (data that occurs in almost all disorders), complemented by domain-specific datasets.

When a dataset is requested, in principle only the data elements that are requested are delivered (data minimization).
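Data minimization can be sketched as a simple projection: every record is reduced to the requested data elements before delivery. The function name `minimize`, the element names, and the sample records are hypothetical and serve only as an illustration.

```python
def minimize(records, requested_elements):
    """Keep only the requested data elements in every record."""
    wanted = set(requested_elements)
    return [
        {key: value for key, value in record.items() if key in wanted}
        for record in records
    ]


# Made-up records, for illustration only.
records = [
    {"patient_id": "p1", "birth_year": 1950, "diagnosis": "I10", "postcode": "1234"},
    {"patient_id": "p2", "birth_year": 1984, "diagnosis": "E11", "postcode": "5678"},
]

delivered = minimize(records, ["birth_year", "diagnosis"])
print(delivered)
# [{'birth_year': 1950, 'diagnosis': 'I10'}, {'birth_year': 1984, 'diagnosis': 'E11'}]
```

Elements that were not requested, such as the identifier and postcode here, never leave the source. Requested elements that a record does not contain are simply absent from the result.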