Health-RI wiki v4.0 -> consultation (open until 03-12-2024)


Persistent Identifiers

Date: 13-11-2024 Status: UNDER DEVELOPMENT

This article is a discussion paper about persistent identifiers, in particular as applied to datasets. Health-RI has an opportunity to include sound agreements on the use of persistent identifiers for datasets in its agreement system. It would be good to discuss this with experts and potential suppliers.

What is a persistent identifier?

A persistent identifier, usually abbreviated as PID, is a string of characters that can be used to refer consistently and for a long time to a (digital) object, such as a data file or a dataset (the focus of this page), a person, or an organization. A good PID has several properties:

  • It is worldwide (universally) unique. This means that no one else in the world uses the same PID to indicate anything else, and also that a PID will never be reused to point to another object. Unique does not mean that each object or concept has only 1 PID: different PIDs can refer to the same object.

  • It is possible for both people and machines to immediately identify that it is a PID.

  • There is a system (a resolver) that can convert the PID into a reference to the object or concept. For a digital object, that reference is a URL. The resolver can often also store other metadata about the object (see: kernel metadata).

  • A PID system therefore needs to be updated when the URL changes. We will come back to this later in this document.

  • The string of characters that forms the PID has no deeper meaning: it carries no semantics beyond being recognizable as a PID. Any semantics risks being temporary and therefore becoming outdated. A PID composed of semantic parts is a sign of very bad architecture, unless the composition merely joins an organization-wide prefix and an internal identifier.
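
As a minimal sketch of the last property, an opaque PID can be built from an organization-wide prefix and a random internal identifier. The prefix "21.T99999", the UUID suffix and all function names below are assumptions chosen only for illustration; a real prefix is assigned by the PID provider, which may also prescribe the form of the suffix.

```python
import re
import uuid

# Hypothetical organization-wide prefix; a real one is assigned by the PID provider.
EXAMPLE_PREFIX = "21.T99999"

# Shape assumed for this sketch only: "<prefix>/<suffix>", with a UUID as suffix.
PID_PATTERN = re.compile(r"^[0-9]+(\.[0-9A-Za-z]+)*/[0-9a-f-]{36}$")


def mint_pid(prefix: str = EXAMPLE_PREFIX) -> str:
    """Return a new PID whose suffix is random and carries no meaning."""
    return f"{prefix}/{uuid.uuid4()}"


def looks_like_pid(text: str) -> bool:
    """Check whether a string has the prefix/suffix shape assumed here."""
    return PID_PATTERN.match(text) is not None


def split_pid(pid: str) -> tuple[str, str]:
    """Split a PID into the organization prefix and the opaque suffix."""
    prefix, suffix = pid.split("/", 1)
    return prefix, suffix
```

Because the suffix is random, nothing in the identifier can become outdated when the dataset is renamed, moved or transferred to another department.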

The FAIR principles require a PID for data and a PID for metadata. This is rarely done in practice: often there is only one PID, described as the PID for the data, that actually refers to a metadata page on which a machine cannot automatically find the data.

Also see:

Kernel Metadata

Because the digital objects that PIDs can refer to are quite different from each other, it is sometimes useful to know a little more before looking up the actual object. It is therefore possible to store some metadata about the PID with the resolver. This metadata is called PID Kernel Metadata (also called PID metadata, identifier record or identity record). PID Kernel Metadata follows a profile created for this purpose, which must be coordinated as much as possible within a broad community.

For more information about PID Kernel Metadata, see: https://doi.org/10.15497/rda00031

Within Health-RI we will either have to adopt an existing PID Kernel Metadata profile, or create one for ourselves based on the experiences of others.

The kernel profile should probably include:

  • URLs pointing to the described object

  • URLs that point to the metadata describing the object

  • The type of object the PID refers to. Types can be defined at multiple levels:

    • The format of the file (e.g. “JPEG”) that defines how the bytes in the object encode the content.

    • The type of information: is it, for example, a service, a workflow, or a data file?

    • What the content of the file depicts. For example: is it a photo of a mountain, or of a person?
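
The fields listed above could be captured in a record like the following. This is only a sketch: the class name, the field names and the example values are assumptions, not part of an existing PID Kernel Metadata profile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KernelRecord:
    """Hypothetical PID Kernel Metadata record with the fields listed above."""

    pid: str
    object_urls: list[str]                   # URLs pointing to the described object
    metadata_urls: list[str]                 # URLs pointing to the metadata of the object
    file_format: Optional[str] = None        # how the bytes encode the content, e.g. "JPEG"
    information_type: Optional[str] = None   # e.g. "service", "workflow", "data file"
    content_type: Optional[str] = None       # what the content depicts, e.g. "photo of a person"

    def missing_fields(self) -> list[str]:
        """Return the names of the fields that are empty in this record."""
        return [name for name, value in vars(self).items() if not value]


record = KernelRecord(
    pid="21.T99999/abc",                     # hypothetical PID
    object_urls=["https://example.org/data/abc"],
    metadata_urls=[],
    file_format="JPEG",
)
print(record.missing_fields())
```

A check such as missing_fields() could help a data holder see which parts of the profile still have to be filled in before a PID is published.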

When creating a PID Kernel Metadata profile, it must be taken into account that updating this information is more laborious than updating metadata in your own database. Users of PID Kernel Metadata can therefore never rely fully on that metadata being up to date.

For an example see also: A common PID Kernel Information Profile for the German Helmhol...

Responsibilities of the data holder

A data holder who creates a PID also takes on the obligation to keep the associated metadata up to date as necessary. This is an important reason why we at Health-RI cannot enter into a single contract with a supplier of PID systems: each data holder must manage its own metadata.

Semantics of the resolver

There are often different ways in which the resolver can be addressed. The basic functionality consists of 2 parts: (a) the actual resolving of the PID to the current location of the object, and (b) retrieving the information available about the PID, the Kernel metadata.

To ensure interoperability, it is important that good agreements are made about these processes within an organization such as Health-RI. This includes:

  • When a PID is resolved, where does the redirect lead? Directly to the database? To the metadata of that data file, and if so, in what form is that metadata? Or to an HTML “landing page”? Even more important than the actual behavior is that all PIDs in the organization behave in the same way. When making a choice, it is important to also enable automatic processes; for example, an HTML page is often difficult for a machine to interpret. (FAIR Principle A1)

  • What persistence is guaranteed by a PID? Infinite validity can never be guaranteed, but what term can be promised? And what happens to the PID if the data is no longer available? What information is on the “tombstone”? (FAIR Principle A2)
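
The two basic operations and the questions above can be illustrated with a toy resolver. This is an in-memory sketch under our own assumptions, not the API of a real Handle or DOI service; the class, the method names and the tombstone text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    location: Optional[str]           # current URL, or None once the object is withdrawn
    kernel: dict                      # kernel metadata kept with the resolver
    tombstone: Optional[str] = None   # explanation returned when the object is gone


class ToyResolver:
    """In-memory stand-in for a PID resolver; not a real Handle or DOI API."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def register(self, pid: str, location: str, kernel: dict) -> None:
        # A PID is never reused to point to another object.
        if pid in self._entries:
            raise ValueError(f"PID already in use: {pid}")
        self._entries[pid] = Entry(location, dict(kernel))

    def update_location(self, pid: str, new_location: str) -> None:
        # The data holder has to do this whenever the URL changes.
        self._entries[pid].location = new_location

    def withdraw(self, pid: str, reason: str) -> None:
        # The PID stays registered; only the object becomes unavailable.
        entry = self._entries[pid]
        entry.location = None
        entry.tombstone = reason

    def resolve(self, pid: str) -> str:
        """(a) Resolve the PID to the current location, or to a tombstone text."""
        entry = self._entries[pid]
        if entry.location is None:
            return f"tombstone: {entry.tombstone}"
        return entry.location

    def kernel_metadata(self, pid: str) -> dict:
        """(b) Return a copy of the kernel metadata stored for the PID."""
        return dict(self._entries[pid].kernel)
```

Whatever the resolver returns in each of these cases is exactly what has to be agreed on, so that software can treat every PID in the organization in the same way.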

There may be other semantic aspects to resolving the metadata, details of which we would have to ask experts.

Why do we need persistent identifiers?

Although it sometimes seems like identifiers are only used within an organization or a limited group of organizations, it quickly becomes a good idea to use universal PIDs instead of locally generated and maintained identifiers:

  • References to digital objects and their metadata become independent of the location from which they are made.

  • Standardizing the PID semantics can also simplify the software and make it robust for the future (for example: if all references to a digital object are PIDs, new types of digital objects do not have to be built in everywhere; in many places, storing and transmitting a PID is sufficient).

Used in this way, PIDs are a good fit wherever a digital object needs to be referenced. An example is the request process for access to a dataset: the catalog can insert the result of a search into the data request, without ambiguity, as a list of PIDs. Recording relationships between datasets in the catalog (supported by DCAT) is also best done with PIDs.

PIDs can also be used in the catalog to store the results of a search query, either to retrieve exactly the same results later for repeatability, or to repeat the same query at a later time on a catalog with more datasets (see: Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC)).
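
A stored search could look roughly like the sketch below: the query text, the time of execution, the resulting list of PIDs and a fingerprint of that list are kept together. This is loosely inspired by the idea described above and is not an implementation of the WGDC recommendations; all names and the example PIDs are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredQuery:
    query_text: str
    executed_at: str               # ISO 8601 timestamp in UTC
    result_pids: tuple[str, ...]   # the datasets found, as PIDs
    result_checksum: str           # fingerprint of the result set


def checksum_of(pids) -> str:
    """Fingerprint a result set independently of order and duplicates."""
    joined = "\n".join(sorted(set(pids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def store_query(query_text: str, pids) -> StoredQuery:
    """Record a query together with the PIDs it returned."""
    unique = tuple(sorted(set(pids)))
    return StoredQuery(
        query_text=query_text,
        executed_at=datetime.now(timezone.utc).isoformat(),
        result_pids=unique,
        result_checksum=checksum_of(unique),
    )


def result_changed(stored: StoredQuery, new_pids) -> bool:
    """Tell whether rerunning the query returned a different set of datasets."""
    return checksum_of(new_pids) != stored.result_checksum
```

The stored list of PIDs can be copied into a data request as is, while the checksum shows whether a later run of the same query on a larger catalog found other datasets.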

Where do we get persistent identifiers from?

The quality of PIDs is based on a piece of specific infrastructure, which must run with high availability (redundant) and reliability for a long time (FAIR principle A2). Although it is in principle possible to keep that infrastructure in your own hands, with the availability of professional PID services it is much better to purchase it. Such services are available with different cost structures, with different balances of a fixed subscription price and a price per entered identifier.

Commonly used PID systems are based on the Handle system that was invented in 1995 by Bob Kahn, one of the inventors of TCP/IP. Because HTTP had only just started at that time, the Handle system was set up in such a way that it can use HTTP (the usual resolvers are based on HTTP), but in principle does not rely on it. Even if HTTP ever falls out of favor, the Handle system and all associated PIDs are still usable.

A first overview of some options for obtaining Handles is:

  • ePIC, “Persistent Identifiers for E-Research”. SURF is one of the providers here. See Welcome and Persistent Identifiers. As an organization that purchases this service, you are assigned a fixed prefix and can create as many PIDs as you want within it for a fixed amount of €45 per year.

  • DataCite, Create DOIs - DataCite. An organization pays an amount per year, and a limited number of PIDs can be created for this. Up to 1600 DOIs per year cost €500 per year to participate plus €0.80 per DOI. For very large numbers of DOIs it costs €500+€25500 per year for 10 M DOIs, or €0.0026 per DOI.
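
The prices quoted above can be compared with a small calculation. The sketch below uses only the amounts mentioned in this list and reads the first DataCite tier as a yearly participation fee plus a fee per DOI; the function names are our own, and current prices should be checked with the providers.

```python
from decimal import Decimal

# Amounts in euro, as quoted in the list above.
EPIC_FLAT_FEE = Decimal("45")              # per year, unlimited PIDs within the prefix
DATACITE_PARTICIPATION = Decimal("500")    # per year
DATACITE_FEE_PER_DOI = Decimal("0.80")     # tier of up to 1600 DOIs per year
DATACITE_BULK_FEE = Decimal("25500")       # per year, tier of 10 M DOIs
DATACITE_BULK_SIZE = 10_000_000


def epic_yearly_cost(n_pids: int) -> Decimal:
    """Yearly ePIC cost; independent of the number of PIDs."""
    return EPIC_FLAT_FEE


def datacite_small_tier_cost(n_dois: int) -> Decimal:
    """Yearly DataCite cost in the tier of up to 1600 DOIs."""
    if n_dois > 1600:
        raise ValueError("the quoted per-DOI price applies up to 1600 DOIs per year")
    return DATACITE_PARTICIPATION + DATACITE_FEE_PER_DOI * n_dois


def datacite_bulk_cost_per_doi() -> Decimal:
    """Cost per DOI when the full 10 M tier is used."""
    return (DATACITE_PARTICIPATION + DATACITE_BULK_FEE) / DATACITE_BULK_SIZE


print(datacite_small_tier_cost(1600))   # participation fee plus 1600 DOIs
print(datacite_bulk_cost_per_doi())     # matches the per-DOI figure quoted above
```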

     

Further considerations when choosing are:

  • DataCite identifiers are DOIs and start with 10.xxxx. This form is recognized by more people than other Handles, and DOIs are already widely used for databases in addition to scientific literature. ePIC identifiers are also Handles, but start with a different number.

  • DataCite and other DOI providers offer different kernel metadata options than the other Handle-based systems.
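
Because DOIs are Handles whose prefix starts with "10.", software can tell the two apart from the identifier alone. The sketch below shows this; the function names are hypothetical, and it assumes that the public proxies doi.org and hdl.handle.net are used for resolving, which is a choice Health-RI would still have to make.

```python
def pid_kind(identifier: str) -> str:
    """Classify a Handle-style identifier by its prefix."""
    prefix, sep, suffix = identifier.partition("/")
    if not sep or not prefix or not suffix:
        return "not a Handle"
    if prefix.startswith("10."):
        return "DOI"
    return "other Handle"


def resolver_url(identifier: str) -> str:
    """Build a resolvable URL, assuming the public doi.org and hdl.handle.net proxies."""
    kind = pid_kind(identifier)
    if kind == "DOI":
        return f"https://doi.org/{identifier}"
    if kind == "other Handle":
        return f"https://hdl.handle.net/{identifier}"
    raise ValueError(f"not a Handle-style identifier: {identifier}")


print(pid_kind("10.15497/rda00031"))       # the DOI cited earlier on this page
print(resolver_url("21.T99999/abc"))       # hypothetical non-DOI Handle
```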

Agreements that we must make in Health-RI about Persistent Identifiers

Within Health-RI we are a group of organizations. As part of our system of agreements, we can arrange a number of matters for PIDs.

  • In any case, we must make a choice for the semantics of the resolver, so that the software tools within Health-RI can take this into account.

  • We must also agree on the PID kernel metadata for the same reason.

  • We may also negotiate contract terms with a PID service provider, which are then offered to Health-RI partners.

  • Under no circumstances should different organizations in Health-RI create all PIDs under one central contract with a supplier. Each data holder must enter into a contract with a PID provider (see above the section on the responsibilities of a data holder).

  • We can join a national group (or set up such a group?) for coordination between organizations so that better national interoperability can be achieved.

Other references