Health-RI wiki v4.0 -> consultation (open until 03-12-2024)


Omics working group

Date: 14-08-2024 Status: FOR REVIEW

This article contains the profile of the Omics working group. The Omics profile contains the specific agreements that apply to the Omics data category.

Title

Omics data

General

Profile metadata

2023-09-26 Version 0.0.1

Release-information

2023-09-26 Author: R.W.W. Hooft
2024-04-30

Law and regulations

Legal basis

Human hereditary data belongs, with specific exceptions, to the special categories of personal data and often cannot be anonymized. This in any case covers both genomics and genetics data, that is, the study of all genes or of specific genes. Processing such data therefore requires not only a legal basis under Art. 6 GDPR, but also an exemption from the processing prohibition of Art. 9 GDPR.

In the Netherlands, the basic principle is that genetic personal data may only be processed if the data subject has given explicit consent, i.e. via an opt-in form of consent. In the event of a compelling medical interest, or for scientific research, genetic data can sometimes be used without consent. This is only permitted if it is impossible to obtain consent, or if requesting consent would require a disproportionate effort.

Other forms of omics data, such as proteomics and metabolomics data, also originate from a person, but with the current state of technology they cannot be traced back to that person without other data. As a consequence, only the combination with other traceable data poses potential risks, and adding this type of omics data to such a combination does not increase the risk. Some people expect that tracing such data back to a person will become possible in some cases in the future, and are therefore more cautious.

Organizational policy

Roles and actors

The article on roles describes the generic roles within the Health-RI ecosystem. Within Omics, additional specifications apply to the following roles.

Data producer

Omics data is collected by three groups:

  1. Scientific researchers: as data to investigate.

  2. Clinical diagnostics: for example for the diagnosis of rare diseases, but also to support a diagnosis, for risk analysis, or for genetically determined pharmacokinetics.

  3. Biobanks: as data derived from the biobank samples, so that it can be issued digitally where appropriate.

In addition, individuals sometimes have their own omics data collected, for example by commercial providers of such services; at this time, these data are not yet in scope.

Data Governance Committee

There is currently little coordination of data governance for omics data. As part of the European 1+MG initiative and the GDI project, an infrastructure for sharing high-quality human genome data is being developed. It is still in an early phase of realization. It is already clear that this infrastructure will set up a central European Data Access Commission that will decide on the issuance of data. The European Health Data Space (EHDS) Regulation will also regulate the exchange of genomics and proteomics data, including for secondary use; countries will be able to impose their own national conditions for the use of this data in addition to the conditions set by the EHDS Regulation.

Inclusion and exclusion criteria for participants

There are different types of omics, each with specific data properties. The first focus is on genomics. For genomics data, the consensus is that anonymization is not possible; re-identification is relatively simple. That is why the GDPR always applies to genomics data, together with the Art. 9 GDPR prohibition (because it concerns special categories of personal data) and the clearly stated principle that DNA data may only be processed on the basis of consent (opt-in).

Genomics data, especially in raw form, is very large: hundreds of gigabytes for a whole-genome sequencing dataset of a single person.
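
As a rough back-of-the-envelope check of this order of magnitude, the sketch below estimates the uncompressed size of the raw reads for one person. The human genome length of roughly 3.1 billion base pairs is well established; the sequencing depth of 30x and the 2 bytes per sequenced base (one for the base call, one for its quality score) are illustrative assumptions, not values taken from this profile.

```python
def estimate_raw_wgs_bytes(genome_length_bp: int = 3_100_000_000,
                           depth: int = 30,
                           bytes_per_base: int = 2) -> int:
    """Estimate the uncompressed size of raw whole-genome reads in bytes.

    All default values are assumptions for illustration only.
    """
    return genome_length_bp * depth * bytes_per_base


if __name__ == "__main__":
    size_gb = estimate_raw_wgs_bytes() / 1e9
    print(f"approx. {size_gb:.0f} GB uncompressed")  # approx. 186 GB uncompressed
```

Under these assumptions the estimate lands in the range of hundreds of gigabytes; compression and different sequencing depths will change the actual figure.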

Genomics data is grouped into “cohorts” based on the measurement technique used and, especially, on the associated phenotypic data. The genome data itself has largely the same form regardless of the purpose of the determination, so it is the other available data about the person that determines the grouping (e.g. all patients from the cardiology department at the UMCU, for whom very comparable other data is also available).

Information

Metadata

The article on the minimal (meta)dataset describes the generic (meta)dataset within the Health-RI ecosystem. The following additions apply within Omics.


The minimum is currently DCAT version 2.0. All domain-specific metadata fields described below will only become part of the metadata model in later plateaus.

For all omics data, the following common metadata fields apply on top of the Health-RI metadata:

  • Metadata describing the sample (e.g. tissue or blood) on which the omics determination was performed. This metadata must be compatible with the data recorded for biobanks.

  • Omics type

The genomics sunflower leaf contains the following additional metadata fields:

  • genome origin: somatic or germline

  • genome coverage (which part of the DNA is described)

  • accuracy (e.g. as number of expected errors per million base pairs)

  • availability of raw data

  • reported genetic variation such as mutations, indels, structural variants

  • frequencies of variants in the dataset.

  • ID of Reference Genome
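
To make the fields listed above concrete, the sketch below models them as a single record. It is a minimal illustration only: the class name, field names and types are hypothetical and do not represent an agreed Health-RI schema, and the coverage values are taken from the coverage types named elsewhere in this profile.

```python
from dataclasses import dataclass, field
from enum import Enum


class GenomeOrigin(Enum):
    SOMATIC = "somatic"
    GERMLINE = "germline"


class Coverage(Enum):
    TES = "Targeted Exome Sequencing"
    WES = "Whole Exome Sequencing"
    WGS = "Whole Genome Sequencing"


@dataclass
class GenomicsMetadata:
    """Hypothetical genomics metadata record (names are illustrative)."""
    omics_type: str
    genome_origin: GenomeOrigin
    coverage: Coverage
    expected_errors_per_million_bp: float
    raw_data_available: bool
    reference_genome_id: str
    reported_variation_types: list[str] = field(default_factory=list)
    variant_frequencies: dict[str, float] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if a numeric field is outside its valid range."""
        if self.expected_errors_per_million_bp < 0:
            raise ValueError("accuracy must not be negative")
        for variant, frequency in self.variant_frequencies.items():
            if not 0.0 <= frequency <= 1.0:
                raise ValueError(f"frequency of {variant} must be in [0, 1]")


if __name__ == "__main__":
    record = GenomicsMetadata(
        omics_type="genomics",
        genome_origin=GenomeOrigin.GERMLINE,
        coverage=Coverage.WGS,
        expected_errors_per_million_bp=10.0,  # made-up example value
        raw_data_available=True,
        reference_genome_id="GRCh38",
        reported_variation_types=["indel", "structural variant"],
    )
    record.validate()
    print(record.coverage.value)
```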

Information standards

File formats for genomics are:

  • VCF (differences from the reference genome, 1 GB)

  • BAM (raw data plus processing, i.e. aligned reads)

  • FASTQ (raw data)

  • CRAM (content-specific compressed version of BAM)

For more information, see the article on omics datatypes and standards.

Coverage:

  • TES Targeted Exome Sequencing

  • WES Whole Exome Sequencing

  • WGS Whole Genome Sequencing

Application / IT-infrastructure

Ways of data exchange

Federated processing (close to where the data holder stores the data) is preferred because of:

  • the size of the files

  • the privacy sensitivity of the data: the data says something not only about a person but also about their close relatives.

There is a dedicated standardized protocol, “htsget”, that provides targeted access to only the necessary parts of genomic data, so that as little copying as possible is required.
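
The idea behind htsget can be sketched as follows: a client first asks for a genomic region and receives a JSON “ticket” listing the URLs of the data blocks to download, rather than the whole file. The sketch below only builds such a request URL and reads the block URLs from a ticket; it performs no network access. The server address and dataset identifier are hypothetical, and the helper function names are introduced here for illustration.

```python
import json
from urllib.parse import quote, urlencode


def build_reads_url(base_url: str, read_id: str, reference_name: str,
                    start: int, end: int, fmt: str = "BAM") -> str:
    """Build an htsget reads request for one region (0-based, end exclusive)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{quote(read_id, safe='')}?{query}"


def ticket_urls(ticket_text: str) -> list[tuple[str, dict]]:
    """Return (url, headers) pairs from an htsget ticket, in download order."""
    ticket = json.loads(ticket_text)
    return [(part["url"], part.get("headers", {}))
            for part in ticket["htsget"]["urls"]]


if __name__ == "__main__":
    # Hypothetical server and dataset identifier.
    print(build_reads_url("https://htsget.example.org", "sample1", "chr1", 100, 200))
    example_ticket = json.dumps({
        "htsget": {
            "format": "BAM",
            "urls": [{"url": "https://data.example.org/block/1",
                      "headers": {"Range": "bytes=0-1023"}}],
        }
    })
    print(ticket_urls(example_ticket))
```

A real client would then download the listed blocks in order and concatenate them into a valid file containing only the requested region.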

This has not yet been worked out for other omics data.

A European infrastructure for exchanging human genome data is being built in the GDI project in which Health-RI participates on behalf of the Netherlands.

Implementation

There are many tools used for Omics data, including:

  • Armadillo (uses DataSHIELD)

  • GA4GH TES (Task Execution Service) and WES (Workflow Execution Service) APIs

  • Beacon v1: querying by genomic variant

  • Beacon v2: querying by genomic variant AND patient information

  • Molgenis EMX2: data management tool that includes FAIR best-practice models for genomics and built-in FDP, Beacon v2 and RDF

  • Galaxy: workflow tool that can be used without bioinformatics expertise, and in which many genome analysis tools are available. There are public instances, but they are not suitable for the analysis of human genome data because the security of the data is not sufficiently guaranteed.

  • cBioPortal: visual gene analysis tool aimed at cancer research

 

Figure: Application view of the GDI Starter Kit

Security

Assessment of anonymization

For this type of data, anonymizing the data of an individual is impossible: the data concerns not only a person but also their close relatives, which makes it easier to obtain enough contextual information to identify a data subject.

Anonymization through aggregation is possible: when it is only reported for a (sufficiently large) group of data subjects which genetic variants have been observed in the group, the variants can no longer be traced back to individuals.
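
A minimal sketch of such an aggregation is shown below. It reports, per variant, the fraction of subjects in the group in which it was observed, and refuses to report anything for groups below a minimum size. The threshold of 100 subjects and the variant notation are assumptions for illustration; this profile does not define what counts as sufficiently large.

```python
from collections import Counter


def aggregate_variants(variants_per_subject: dict[str, set[str]],
                       min_group_size: int = 100) -> dict[str, float]:
    """Return per-variant frequencies for a group, without subject identifiers.

    min_group_size is a hypothetical threshold, not a value from the profile.
    """
    group_size = len(variants_per_subject)
    if group_size < min_group_size:
        raise ValueError("group too small to report aggregated variants")
    counts = Counter()
    for variants in variants_per_subject.values():
        counts.update(set(variants))  # count each variant once per subject
    return {variant: count / group_size for variant, count in counts.items()}


if __name__ == "__main__":
    # Made-up toy data; the small threshold is only for demonstration.
    toy_group = {
        "subject-a": {"1:1000:A>G", "2:500:C>T"},
        "subject-b": {"1:1000:A>G"},
    }
    print(aggregate_variants(toy_group, min_group_size=2))
```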

The “Beacon” protocol is a worldwide standard for querying genome data. For v1 of this protocol it has been determined that once more than approximately 200 queries are made, it becomes possible to re-identify a subject based on the answers. No such analysis is formally known yet for v2 of the protocol, but it is already clear that the number of queries required will be significantly lower, perhaps around 20.
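
To illustrate why the number of queries matters, the sketch below models a yes/no variant lookup in the style of Beacon v1 and counts the queries made per user. Limiting queries per user is one possible mitigation shown here as an assumption; it is not part of the Beacon protocol itself, and the class name, variant notation and default limit are hypothetical (the default of 200 simply mirrors the figure mentioned above).

```python
from collections import defaultdict


class QueryBudgetExceeded(Exception):
    """Raised when a user has used up the allowed number of queries."""


class BudgetedBeacon:
    """Toy yes/no variant lookup with a per-user query budget (illustrative)."""

    def __init__(self, observed_variants, max_queries_per_user: int = 200):
        self._observed = frozenset(observed_variants)
        self._max_queries = max_queries_per_user
        self._used = defaultdict(int)

    def exists(self, user_id: str, variant: str) -> bool:
        """Answer whether the variant was observed, charging one query."""
        if self._used[user_id] >= self._max_queries:
            raise QueryBudgetExceeded(f"query budget exhausted for {user_id}")
        self._used[user_id] += 1
        return variant in self._observed


if __name__ == "__main__":
    beacon = BudgetedBeacon({"1:1000:A>G"}, max_queries_per_user=2)
    print(beacon.exists("researcher-1", "1:1000:A>G"))
    print(beacon.exists("researcher-1", "2:500:C>T"))
```

Each answer reveals one bit about the dataset, so the information available for re-identification grows with every query; a budget caps that growth per user but does not address colluding users.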

Additional privacy measures

For identification, authentication and authorization, the use of GA4GH Passports and Visas is recommended; these must be supported by LS AAI and (within the Netherlands) SRAM.
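
Assuming the passports and visas meant here are GA4GH Passports and Visas, a passport is a JSON Web Token whose `ga4gh_passport_v1` claim holds a list of visa tokens, each carrying a `ga4gh_visa_v1` claim. The sketch below only reads these claims to list the datasets for which controlled access was granted. It deliberately does NOT verify signatures, expiry or issuers, which a real service must do before trusting any claim; the function names are introduced here for illustration.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def controlled_access_grants(passport_token: str) -> list[str]:
    """List the 'value' of every ControlledAccessGrants visa in a passport.

    Illustration only: no signature, expiry or issuer checks are done here.
    """
    passport = decode_jwt_payload(passport_token)
    grants = []
    for visa_token in passport.get("ga4gh_passport_v1", []):
        visa = decode_jwt_payload(visa_token).get("ga4gh_visa_v1", {})
        if visa.get("type") == "ControlledAccessGrants":
            grants.append(visa["value"])
    return grants
```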

Furthermore, the systems for international exchange of genetic data are built around encryption: data is stored encrypted where possible, and granting access mainly consists of temporarily providing a decryption key with which only the necessary parts of the data can be decrypted.