Work plan for creating datasets

DATE: 23-08-2024 STATUS: ADOPTED

This article describes work processes to arrive at suitable datasets that can be included in the catalog in the Health-RI ecosystem.

Definition

In the Health-RI wiki (V2.0) the definition of a dataset states:

A dataset is a collection of comparable data relating to a group of data subjects. The collection has a certain uniformity, such as the presence of certain data items or data types, and similar data acquisition and processing techniques, so that it makes sense to view the dataset as a group that can be drawn upon for reuse.

A dataset can be static, which means that the dataset no longer changes after delivery. On the other hand, a dataset can also be dynamic: the dataset is then subject to change and/or can be supplemented. In that case, the list of data subjects described by the dataset may also change.

Properties of a good dataset in the catalog

A good dataset for the catalog makes it possible to select candidate datasets for reuse at the dataset level, so that only a small number of datasets need to be searched further at the subject level. The process is described in Storyline: Search data in metadata.

Ultimately, specific data for reuse will be requested from the selected data subjects. Together we call these requested data a virtual cohort.

It is important for the search and application process:

That the dataset is not too small:
- If the dataset is too small, information about individuals can be derived from the aggregated metadata.
- A reuser does not want to have to combine a large number of datasets, each with a small number of data records. No matter how hard we try, harmonizing the data from different sets will always take work.
That the dataset is not too large:
- If a dataset describes too diverse a set of subjects, it will appear in the search results for almost every reasonable question that can be asked of the catalog. It then has to be searched at subject level again and again to find the data subjects that are actually of interest. For a dataset of the right size, the aggregated metadata is representative of the data.
That the dataset describes a relatively homogeneous set of data subjects:
- The data subjects have one or more essential properties in common.
- The information for each data subject should consist of virtually the same variables.
- The data for the different data subjects is collected in a similar way, and also treated in the same way.
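
The criteria above could be turned into a rough automated check. The sketch below is a minimal illustration in Python, not a Health-RI tool: the threshold values, the function names, and the representation of each data subject as a dictionary of variables are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
MIN_SUBJECTS = 10
MIN_SHARED_FRACTION = 0.8


def shared_variable_fraction(records):
    """Return the fraction of observed variables that every subject has."""
    counts = Counter(var for rec in records for var in rec)
    if not counts:
        return 0.0
    complete = sum(1 for n in counts.values() if n == len(records))
    return complete / len(counts)


def check_dataset(records):
    """Flag a candidate dataset that is too small or too heterogeneous."""
    return {
        "large_enough": len(records) >= MIN_SUBJECTS,
        "homogeneous_variables":
            shared_variable_fraction(records) >= MIN_SHARED_FRACTION,
    }
```

A check like this only covers size and the overlap in variables. Whether the subjects share essential properties and whether the data was collected and treated in a similar way still has to be judged by someone who knows the source.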

Datasets can change over time

A dataset can be a static entity (for example, related to a completed study) but can also change over time (for example, if the source is a care department in a hospital). Different types of changes can occur:

- Additional data subjects may be added.
- (Longitudinal) data can be added per data subject.
- Data subjects can be removed (through exclusion, an objection, or withdrawal of consent).

In order to work with dynamic datasets in the catalog, it is important that the metadata is regularly updated, preferably with an automated process. Versioning is also essential: for the reproducibility of studies, it must be possible to find out what the dataset looked like at an earlier time. The RDA has a useful set of guidelines for this:

- Citing dynamic data: Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC)
- Data Versioning: Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles | Data Science Journal
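
To illustrate the versioning requirement, the sketch below keeps an append-only history of metadata snapshots and looks up the snapshot that was current on a given date. It is a minimal example under stated assumptions: the class name `MetadataHistory`, its methods, and the metadata fields are hypothetical, and the sketch does not implement the RDA recommendations themselves.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MetadataHistory:
    """Append-only history of metadata snapshots for one dataset."""
    dates: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)

    def publish(self, when, metadata):
        """Store a copy of the metadata as the version valid from `when`."""
        if self.dates and when <= self.dates[-1]:
            raise ValueError("snapshots must be published in chronological order")
        self.dates.append(when)
        self.snapshots.append(dict(metadata))

    def as_of(self, when):
        """Return the snapshot that was current on `when`, or None."""
        i = bisect_right(self.dates, when)
        return self.snapshots[i - 1] if i else None


history = MetadataHistory()
history.publish(date(2024, 1, 1), {"subject_count": 40})
history.publish(date(2024, 6, 1), {"subject_count": 50})
earlier = history.as_of(date(2024, 3, 15))  # the January snapshot
```

Because snapshots are never overwritten, a study that cites the dataset together with a date can later recover the description that applied at that time.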

It must be impossible to deduce personal data about newly added subjects from the evolution of a dataset in the catalog! If a dataset had 42 subjects yesterday and has 43 subjects today, and no measures have been taken, then data about the 43rd subject may leak in too much detail through the differences in the metadata. Suitable work processes for this (for example, introducing small variations in the metadata) must be established.
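
One simple measure of this kind is to publish coarsened counts instead of exact ones. The sketch below rounds the subject count to the nearest multiple of ten, so that the versions with 42 and 43 subjects look identical in the catalog. This is an illustration only: rounding is an assumed technique that has not been adopted in this work plan, and the function names and the `subject_count` field are hypothetical.

```python
def coarsen_count(n, base=10):
    """Round a subject count to the nearest multiple of `base`."""
    return base * ((n + base // 2) // base)


def publishable_metadata(metadata, base=10):
    """Return a copy of the metadata with the subject count coarsened."""
    published = dict(metadata)
    published["subject_count"] = coarsen_count(metadata["subject_count"], base)
    return published


yesterday = publishable_metadata({"title": "Example", "subject_count": 42})
today = publishable_metadata({"title": "Example", "subject_count": 43})
print(yesterday == today)  # True: both versions report 40 subjects
```

Rounding alone does not remove the risk. A count that crosses a rounding boundary (from 44 to 45 in this example) still produces a visible change, and other aggregated metadata fields would need similar treatment.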

Sources of datasets

Datasets for Health-RI can come from different sources: we are currently working on including research data, primary care data, and biobanks or collections. For each of these, a method is developed below to arrive at suitable datasets.

Datasets from research data

Data collected for research has already been selected based on an earlier research question and is sufficiently homogeneous. Therefore, no detailed process description is required for creating datasets. Multiple datasets can be derived from a single study:

- Data from different stages of the analysis (raw data, processed data, and results) can lead to different datasets.
- Different data modalities (clinical data, different types of imaging analysis, and different types of molecular data [omics]) can lead to different datasets.
- Datasets that have been obtained for research as a virtual cohort from the Health-RI catalog or other similar sources can also be prepared as a source dataset for other studies.

Different datasets from the same study must be recognizable as such. This can help later to relink the data for reuse. For this purpose, it is useful if research projects are given persistent identifiers (e.g. in a “study catalogue”) that the datasets can refer to.
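
As a small illustration of how such study identifiers help with relinking, the sketch below groups dataset descriptions by the study they refer to. The field names `study_id` and `dataset_id` and the identifier values are hypothetical, and no particular identifier scheme is implied.

```python
from collections import defaultdict


def group_by_study(datasets):
    """Map each study identifier to the datasets that refer to it."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[ds["study_id"]].append(ds["dataset_id"])
    return dict(groups)


# Hypothetical dataset descriptions that each point to a study identifier.
datasets = [
    {"dataset_id": "ds-raw-imaging", "study_id": "study-001"},
    {"dataset_id": "ds-clinical", "study_id": "study-001"},
    {"dataset_id": "ds-omics", "study_id": "study-002"},
]
linked = group_by_study(datasets)
```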

Datasets from healthcare

Data collected during a healthcare process is often stored in a system such as an electronic patient file (EPD), which can form a heterogeneous data source for reuse. Such a data source is often too heterogeneous to be offered as a single dataset (it is “too large”, see above). If that is the case, it is wise to segment the data.

We can think of this as “prospectively building datasets for retrospective research”: it requires that we have an idea of which aspects of the data are important dimensions for segmenting the data for possible future applicants. With the correct subdivision we make the data more visible and findable.

This can be done in various ways:

- By specialty: for example, if only the cardiology patients in an EPD are considered, the dataset will be much more homogeneous.
- By disease: if a specialty is still considered too broad, the data can be segmented further by the diseases that form part of that specialty (e.g. “cardiomyopathy”).
- By treatment method: if necessary, a further breakdown can be made per treatment method; this is especially useful if data collection also takes place in the context of the treatment.
- By treatment result: sometimes researchers (or registries) are looking for patients with a specific treatment result. For such cases it can be useful to segment accordingly.
- Further breakdown options may be added here in the future based on practical experience with the catalog.

One possible way to make the segmentation of a data source such as an EPD available is not to extract the actual data from the EPD, but to periodically run a script that scans the EPD and creates the description of each subset (i.e. the metadata).
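
A script of this kind could look roughly like the sketch below. It scans a list of records, groups them by one segmentation dimension, and returns only aggregated descriptions, never the records themselves. The record layout, the field names `subject_id` and `specialty`, and the function name are assumptions made for the example; a real EPD would be queried through its own interface.

```python
from collections import defaultdict


def describe_segments(records, key="specialty"):
    """Build aggregated metadata per segment without copying the data."""
    segments = defaultdict(lambda: {"subjects": set(), "variables": set()})
    for rec in records:
        seg = segments[rec[key]]
        seg["subjects"].add(rec["subject_id"])
        # Record which variables occur, not their values.
        seg["variables"].update(k for k in rec if k not in (key, "subject_id"))
    return {
        name: {
            "subject_count": len(seg["subjects"]),
            "variables": sorted(seg["variables"]),
        }
        for name, seg in segments.items()
    }


# Hypothetical scan result standing in for an EPD query.
scan = [
    {"subject_id": 1, "specialty": "cardiology", "ecg": "...", "age": 61},
    {"subject_id": 2, "specialty": "cardiology", "age": 58},
    {"subject_id": 3, "specialty": "oncology", "age": 47},
]
metadata = describe_segments(scan)
```

Running the same function with a different `key` (for example a disease code) gives another segmentation of the same source.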

DCAT metadata allows us to indicate that a dataset in the catalog is a subset of another dataset. It is therefore not necessary to do the segmentation “just at the right level”: one can create multiple segmentations of a source and link the datasets together by “subset” declarations. However, the catalog functionality needed to use these declarations still needs to be developed (status March 2024).
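
The sketch below shows what such linked descriptions might look like as JSON-LD built with the Python standard library. It assumes that the “subset” relation is expressed with `dct:isPartOf` from Dublin Core Terms; the text above does not name the property, and the Health-RI metadata profile may prescribe a different one. The `example.org` identifiers and the helper functions are hypothetical.

```python
import json

CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def dataset_entry(identifier, title, part_of=None):
    """Describe a dataset, optionally declared as part of another dataset."""
    entry = {"@id": identifier, "@type": "dcat:Dataset", "dct:title": title}
    if part_of is not None:
        # Assumption: dct:isPartOf carries the "subset" declaration.
        entry["dct:isPartOf"] = {"@id": part_of}
    return entry


def ancestors(identifier, entries):
    """Follow the subset declarations upward and list the parent datasets."""
    by_id = {e["@id"]: e for e in entries}
    chain = []
    current = by_id.get(identifier)
    while current is not None and "dct:isPartOf" in current:
        parent_id = current["dct:isPartOf"]["@id"]
        if parent_id in chain:  # guard against cyclic declarations
            break
        chain.append(parent_id)
        current = by_id.get(parent_id)
    return chain


EPD = "https://example.org/dataset/epd"
CARDIO = "https://example.org/dataset/cardiology"
CMP = "https://example.org/dataset/cardiomyopathy"

graph = [
    dataset_entry(EPD, "Hospital EPD"),
    dataset_entry(CARDIO, "Cardiology patients", part_of=EPD),
    dataset_entry(CMP, "Cardiomyopathy patients", part_of=CARDIO),
]
print(json.dumps({"@context": CONTEXT, "@graph": graph}, indent=2))
```

With declarations like these, a catalog could show the cardiomyopathy dataset first and still let a user navigate up to the broader cardiology or EPD-level descriptions.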

Datasets from biobanks / collections

Data from biobanks and collections often have properties somewhere between those of data from research and data from healthcare. There is greater homogeneity in the data than in healthcare data sources, but it can still be useful to compile datasets that form a subset of the data in a biobank. This can be done along the same dimensions as indicated for datasets from healthcare.