Metroline Step: Select identifier scheme
Status: READY FOR REVIEW
'Data that are not discoverable cannot be reused, and data that cannot be reused are not FAIR.' (FAIR Principles paper)
In simple terms, selecting an identifier scheme is like deciding how to label books in a library. If every book has a clear catalogue number, author, and subject, readers can always find what they need. Without a consistent scheme, books would be scattered and hard to find. For datasets, choosing the right identifier scheme ensures that your data can be discovered, cited, and reused, much like well-organised books that are easy to find in a structured library.
Short description
Identifiers are the anchors of FAIR (findable, accessible, interoperable, reusable) data. They give each resource, whether a dataset or an element in a metadata schema, a unique and reliable point of reference. In FAIR settings this applies across datasets, metadata records, people, organisations, licences and even individual data values. Effective identifiers must be globally unique, persistent, machine actionable and resolvable, and they should be used consistently so that resources can be connected without ambiguity. Domains such as chemistry show the value of this clearly because identifiers like InChI or SMILES capture molecular structures in a reproducible way that allows data to be shared and interpreted across tools and disciplines.
In practice, responsibilities for identifier management are shared. Most technical and policy decisions, such as a) selecting identifier schemes, b) registering namespaces, or c) creating mappings between schemes, are typically supported by data stewards or research support staff. For researchers, the most important tasks are to reuse existing identifiers where possible, apply them consistently and consult support services when new identifiers are required. At the same time, researchers remain responsible for the content of their metadata and data, and should be involved in decisions about updates that may require versioning or the minting of a new identifier.
Given the crucial role of persistent identifiers in enabling FAIR data and, in turn, a robust national health-data infrastructure, Health-RI is also developing guidance on PIDs. More details will be added as soon as they become available.
Below are five widely used identifier systems across FAIR and open‑science communities. They illustrate the range of resolvable, persistent identifiers commonly applied in research infrastructures.
| Identifier | What it identifies | Resolvable | Typical use |
|---|---|---|---|
| DOI (DataCite) | Datasets, publications, software | Yes | Citing datasets |
| ORCID | Individual researchers | Yes | Identifying contributors |
| ROR ID | Research organisations | Yes | Standardising affiliations |
| ISSN / ISBN | Journals (ISSN), books (ISBN) | Yes | Identifying publications |
| RAID | Research projects and activities | Yes | Linking outputs to projects |
Why is this step important
Persistent identifiers matter because they provide the stability and clarity that allow data and metadata to function reliably in FAIR ecosystems. This is enabled through the key properties described below.
Globally unique. Prevents collisions so a resource is never confused with another, which supports reliable discovery and stable referencing across systems.
Persistent. Remains valid over time, even when systems or storage locations change, which keeps data citable and usable in the long term.
Machine actionable. Has a defined structure that software can interpret directly, which removes ambiguity and enables automation.
Resolvable. Can be followed through a recognised protocol to metadata or to the resource itself, which strengthens discoverability and accessibility.
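As an illustration of what "machine actionable" means in practice, the sketch below checks an ORCID iD purely from its structure. It assumes the ISO 7064 MOD 11-2 checksum that ORCID documents for the final character of an iD; the function names are hypothetical and chosen only for this example.

```python
import re

# Accepts a bare ORCID iD or its https://orcid.org/ URL form.
ORCID_PATTERN = re.compile(r"^(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Return True if the value is well formed and its checksum matches."""
    match = ORCID_PATTERN.match(value.strip())
    if match is None:
        return False
    characters = match.group(1).replace("-", "")
    return orcid_check_character(characters[:15]) == characters[15]


print(is_valid_orcid("https://orcid.org/0000-0001-5109-3700"))  # True
print(is_valid_orcid("0000-0001-5109-3701"))  # False: the last character is wrong
```

A structural check like this only catches typing errors. It does not confirm that the iD is registered or that it belongs to the intended person.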
How to
Below are nine recommendations for implementing identifiers in FAIR data management. Recommendations 1–8 are researcher-focused steps, whilst Recommendation 9 can be an institutional requirement.
Recommendation 1 - Reuse community identifiers whenever possible
Reusing existing identifiers prevents duplication and immediately connects your data to the wider research ecosystem. Many entities already have authoritative identifiers that you can adopt directly.
Benefit: Strengthens interoperability and aligns your dataset with established knowledge graphs.
How: Check registries for existing identifiers before creating new ones. For example, ORCID is commonly used for creators, ROR for institutions, and FundRef (now the Crossref Funder Registry) for funders. In addition, Identifiers.org (https://identifiers.org) helps locate identifiers for many scientific entities. At Health-RI, internal guidance is being developed to harmonise identifier use.
Example:
# Reusing existing identifiers
creator: "Jane Doe"
# ORCID for researcher
creator.orcid: https://orcid.org/0000-0001-5109-3700
organization: "Freie Universität Berlin"
# ROR identifier
organization.ror: https://ror.org/046ak2485
funder: "National Science Foundation"
# FundRef identifier
funder.fundref: https://doi.org/10.13039/100000001
Recommendation 2 - Mint new identifiers when none exist
When no suitable identifier is available, assigning a persistent identifier to a digital object (minting) ensures the resource has a unique and persistent reference point. This is essential for stable reuse and accurate citation.
Benefit: Ensures the resource has a persistent and widely recognisable identifier to support reliable citation and reuse.
How: Assign a persistent identifier to a digital object when no suitable existing identifier is available. For example, DataCite can be used to mint DOIs, and MINIDs provide lightweight identifiers (see FAIR Cookbook FCB006 and FCB008 for guidance).
Example (MINID):
# Creating a new MINID
minid --register --title "Dataset X" DatasetX.tar.gz --locations http://example.org/DatasetX.tar.gz
Recommendation 3 - Assign an identifier to each dataset
Every dataset needs a stable anchor that supports discovery, citation and long-term accessibility. Assigning a PID formalises the dataset as a citable and traceable resource. This typically occurs when the dataset is published in a repository, rather than during data collection.
Benefit: Catalogue services and automated workflows can reliably reference the dataset over time.
How: Register the dataset metadata with a PID authority such as DataCite or Crossref to obtain a persistent identifier (PID) that functions as the dataset’s primary reference. Other PID systems may also be suitable depending on the repository or community practices.
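To make the registration step more concrete, the sketch below assembles the kind of metadata record a repository might send when requesting a DOI. The field names follow the JSON:API layout of the DataCite REST API as I understand it and should be checked against the current DataCite documentation. The prefix, landing page URL, dataset title and the `build_datacite_payload` helper are hypothetical placeholders. In most cases the repository performs this step on the researcher's behalf.

```python
import json


def build_datacite_payload(prefix, title, creators, publisher, year, landing_page):
    """Assemble a JSON:API-style body describing one dataset for DOI registration."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,  # placeholder; a real prefix is issued to the repository
                "titles": [{"title": title}],
                "creators": [
                    {
                        "name": creator["name"],
                        "nameIdentifiers": [
                            {
                                "nameIdentifier": creator["orcid"],
                                "nameIdentifierScheme": "ORCID",
                            }
                        ],
                    }
                    for creator in creators
                ],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": landing_page,  # where the DOI should resolve to
            },
        }
    }


payload = build_datacite_payload(
    prefix="10.12345",  # hypothetical prefix
    title="Dataset X",
    creators=[{"name": "Doe, Jane", "orcid": "https://orcid.org/0000-0001-5109-3700"}],
    publisher="Example Repository",  # hypothetical publisher
    year=2024,
    landing_page="https://example.org/datasets/dataset-x",  # hypothetical URL
)
print(json.dumps(payload, indent=2))
```

The sketch only builds the record and sends nothing. Submitting it requires repository credentials and an agreement with the PID authority, which is why this step is normally handled by the repository or by support staff.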
Recommendation 4 - Use identifiers consistently throughout the metadata
Metadata should reference people, organisations, licences and related resources using standard identifiers. This removes ambiguity and allows machines to interpret relationships correctly.
Benefit: Removes ambiguity in metadata, improves interoperability and supports automated processing across systems.
How: Reference people, organisations, licences, and related resources using standard identifiers consistently. For example, ORCID can be used for authors, ROR for organisations, and SPDX identifiers for licences. Other identifiers may be used depending on community norms or repository requirements.
Example:
# Metadata snippet with persistent identifiers
creator: "Jane Doe"
# ORCID for researcher
creator.orcid: https://orcid.org/0000-0001-5109-3700
organization: "Freie Universität Berlin"
# ROR identifier
organization.ror: https://ror.org/046ak2485
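Consistency is easier to maintain when every identifier is stored in one canonical form. The sketch below is one possible way to normalise ORCID, ROR and DOI values to their full URL form before they are written to metadata. The function names are hypothetical, the ROR pattern reflects the format ROR documents for its IDs, and lowercasing DOIs is a convention chosen here (DOIs are case-insensitive), not a requirement.

```python
import re

ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
ROR_RE = re.compile(r"^0[a-hj-km-np-tv-z0-9]{6}\d{2}$")


def normalise_orcid(value: str) -> str:
    """Return the https://orcid.org/ URL form of an ORCID iD."""
    bare = re.sub(r"^https?://orcid\.org/", "", value.strip(), flags=re.IGNORECASE).upper()
    if not ORCID_RE.match(bare):
        raise ValueError(f"Not an ORCID iD: {value!r}")
    return "https://orcid.org/" + bare


def normalise_ror(value: str) -> str:
    """Return the https://ror.org/ URL form of a ROR ID."""
    bare = re.sub(r"^https?://ror\.org/", "", value.strip(), flags=re.IGNORECASE).lower()
    if not ROR_RE.match(bare):
        raise ValueError(f"Not a ROR ID: {value!r}")
    return "https://ror.org/" + bare


def normalise_doi(value: str) -> str:
    """Return the https://doi.org/ URL form of a DOI, lowercased."""
    bare = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", value.strip(), flags=re.IGNORECASE)
    if not bare.startswith("10."):
        raise ValueError(f"Not a DOI: {value!r}")
    return "https://doi.org/" + bare.lower()


print(normalise_orcid("0000-0001-5109-3700"))
print(normalise_ror("046ak2485"))
print(normalise_doi("doi:10.13039/100000001"))
```

Applying functions like these at the point of data entry means the same person or organisation is always referenced by exactly one string, which makes later matching across records straightforward.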
Recommendation 5 - Assign identifiers to relevant data entities and values
Internal elements such as records, variables, or samples may benefit from stable identifiers, but not every element requires one. Only assign identifiers where reuse, integration, or automated processing is expected. When in doubt, consult a data steward to decide which elements should receive identifiers.
Benefit: Allows tools to reference and interpret the individual records and values inside a dataset reliably, which supports automation and reuse across datasets.
How: For data values, use controlled vocabularies or domain ontologies where possible. For example, RRIDs can identify biological resources, and InChI or SMILES can represent chemical structures. These are examples; other domain-specific identifiers may be appropriate depending on the dataset and community practices.
Example:
# Identifiers for internal entities and values
sample_id: SAMPLE001  # Local structured ID
sample_rrid: RRID:AB_2783747  # Biological resource
cell_line_rrid: RRID:CVCL_0302
chemical_inchi: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
chemical_smiles: CCO
Recommendation 6 - Make identifiers resolvable
Identifiers become actionable when they lead to a landing page or metadata record. Resolution enables both humans and machines to discover and interpret the resource.
Benefit: Supports findability and accessibility through standard web protocols.
How: Ensure that identifiers resolve to a landing page or metadata record that humans and machines can access. For example, DOIs registered with DataCite or Crossref provide resolvable links. Other persistent identifier systems can also be used. Guidance from FAIR Cookbook FCB077 can support implementation.
Example:
# Unresolvable DOI
dataset_doi: https://doi.org/10.1234/example.dataset
# Resolvable DOI that leads to a dataset landing page
dataset_doi: https://doi.org/10.5281/zenodo.6958051
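Resolution can also be checked automatically. The following is a minimal sketch using only the Python standard library: it sends an HTTP HEAD request, follows redirects, and reports whether the final response indicates success. The `resolves` function and its `opener` parameter are hypothetical names; the `opener` parameter exists only so the network call can be replaced in tests.

```python
import urllib.request


def resolves(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the identifier URL can be followed to a successful response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so a DOI URL is traced to its landing page.
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers HTTP errors such as 404, DNS failures and timeouts.
        return False
```

Calling `resolves("https://doi.org/10.5281/zenodo.6958051")` performs a live request, so the result depends on network access and on the remote service. Some servers reject HEAD requests, in which case a GET request may be needed before concluding that an identifier does not resolve.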
Recommendation 7 - Use namespaces for local identifiers
Local identifiers need scoping so they can safely be shared outside their original system. Namespaces turn local IDs into globally unique ones.
Benefit: Prevents collisions and keeps identifiers reliable as they move across systems.
How: Assign a consistent prefix to local identifiers and document it in metadata to ensure global uniqueness. The prefix can be, for example, a project or organisational code. When possible, register the namespace with services such as Identifiers.org (https://identifiers.org) to make it globally traceable. Other approaches to namespace management may also be used depending on community practices.
Example:
# Local IDs with namespaces
local_sample_id: PROJ123:SAMPLE001
# Resolvable via Identifiers.org if registered
resolved_id: https://identifiers.org/pubmed:22140103
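The prefix-plus-local-ID pattern can be handled with a few small helpers. The sketch below builds and splits compact identifiers of the form `prefix:local_id` and expands them to an Identifiers.org URL. The helper names are hypothetical, and the expansion only leads somewhere if the prefix has actually been registered.

```python
def make_compact_id(prefix: str, local_id: str) -> str:
    """Combine a namespace prefix and a local ID into 'prefix:local_id'."""
    if not prefix or ":" in prefix or any(ch.isspace() for ch in prefix):
        raise ValueError(f"Invalid prefix: {prefix!r}")
    if not local_id:
        raise ValueError("Local ID must not be empty")
    return f"{prefix}:{local_id}"


def split_compact_id(compact_id: str) -> tuple[str, str]:
    """Split 'prefix:local_id' at the first colon."""
    prefix, separator, local_id = compact_id.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"Not a compact identifier: {compact_id!r}")
    return prefix, local_id


def identifiers_org_url(compact_id: str) -> str:
    """Expand a compact identifier to its Identifiers.org URL form."""
    prefix, local_id = split_compact_id(compact_id)
    return f"https://identifiers.org/{prefix}:{local_id}"


print(make_compact_id("PROJ123", "SAMPLE001"))  # PROJ123:SAMPLE001
print(identifiers_org_url("pubmed:22140103"))  # https://identifiers.org/pubmed:22140103
```

Splitting at the first colon keeps local IDs that themselves contain colons intact, which matters for some registered namespaces.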
Recommendation 8 - Map equivalences between identifier systems
Different communities often use different identifiers for the same concept. Mapping these makes your data interoperable across systems.
Benefit: Creates bridges between identifier schemes and supports cross-dataset integration.
How: Map equivalent identifiers across different systems to support interoperability. For example, use BridgeDb or the SSSOM standard, and refer to FAIR Cookbook FCB016 for guidance on identifier mapping. Other mapping frameworks may also be appropriate depending on the community and identifier systems in use.
Example:
# Mapping across identifier systems (SSSOM format)
subject_id: ENSEMBL:ENSG00000139618
subject_label: BRCA2
object_id: NCBIGene:675
object_label: BRCA2
predicate_id: skos:exactMatch
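Once mappings are recorded in a table, they can be used to translate identifiers automatically. The sketch below reads a small tab-separated table with SSSOM-style column names and builds a two-way lookup of exact matches. It is a simplification under stated assumptions: real SSSOM files also carry a metadata header and further columns such as a mapping justification, which are omitted here, and `skos:exactMatch` is assumed as the predicate for equivalence. The `load_exact_matches` helper is a hypothetical name.

```python
import csv
import io

# A one-row mapping table using SSSOM-style column names (simplified).
MAPPINGS_TSV = (
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\n"
    "ENSEMBL:ENSG00000139618\tBRCA2\tskos:exactMatch\tNCBIGene:675\tBRCA2\n"
)


def load_exact_matches(tsv_text: str) -> dict[str, set[str]]:
    """Build a two-way lookup from each identifier to its exact matches."""
    index: dict[str, set[str]] = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["predicate_id"] != "skos:exactMatch":
            continue  # only treat exact matches as equivalent
        index.setdefault(row["subject_id"], set()).add(row["object_id"])
        index.setdefault(row["object_id"], set()).add(row["subject_id"])
    return index


matches = load_exact_matches(MAPPINGS_TSV)
print(sorted(matches["NCBIGene:675"]))  # ['ENSEMBL:ENSG00000139618']
```

Restricting the lookup to exact matches is deliberate: broader or narrower matches describe related concepts, not the same concept, and should not be treated as interchangeable identifiers.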
Recommendation 9 - Ensure governance and infrastructure for persistence
Identifier quality depends on stable services that guarantee resolution and long-term accessibility. Good governance protects identifiers from decay.
Benefit: Maintains trust and usability of identifiers across decades.
How: Establish or rely on stable services that guarantee identifier resolution and long-term accessibility. Infrastructures that provide globally unique, persistent and resolvable identifiers (GUPRI) are examples of such services. Governance arrangements and technical infrastructure are usually managed at the institutional or community level rather than by individual researchers. To make governance concrete, consider questions such as:
Who is responsible for maintaining resolution if a system changes?
What happens to identifiers when a project ends?
Does your institution have a PID policy?
These questions help ensure identifiers remain persistent, resolvable, and trustworthy over time.
Example:
# PID that is maintained via a public resolver
pid: https://n2t.net/hdl:20.500.12633/1HK1DTv1wPt3a
# persistent identifier resolved through the n2t.net/Handle system, ensuring long-term accessibility
Expertise requirements for this step
Based on the expertise described in the Metroline: Build the team step, the following expertise may be relevant:
Data specialist: Understands which identifiers are needed.
Technical specialist: Can generate or integrate identifiers in systems.
Policy specialist: Ensures identifier use follows agreed rules.
Practical examples from the community
See the table of identifier systems in the Short description section above.
Training
The Digital Preservation Coalition training "Novice to Know-How: Online Digital Preservation Training" includes material on metadata and gives specific advice on how to choose PID schemes, including the criteria for choosing an adequate scheme for digital preservation and findability. The training also covers best practices and the range of options available.
Suggestions
This page will be developed in the future. Learn more about the contributors here and explore the development process here. If you have any suggestions, visit our How to contribute page to get in touch.