Metroline Step: Select identifier scheme

Status: In development

 

'Data that are not discoverable cannot be reused, and data that cannot be reused are not FAIR.' (FAIR Principles paper)

In simple terms, selecting an identifier scheme is like deciding how to label books in a library. If every book has a clear catalogue number, author and subject, readers can always find what they need. Without a consistent scheme, books would be scattered and hard to find. For datasets, choosing the right identifier scheme ensures that your data can be discovered, cited and reused, much like well-organised books in a library.

Short description 

Identifiers are the anchors of FAIR (findable, accessible, interoperable, reusable) data. They give each resource, whether a dataset or an element in a metadata schema, a unique and reliable point of reference. In FAIR settings this applies across datasets, metadata records, people, organisations, licences and even individual data values. Effective identifiers must be globally unique, persistent, machine actionable and resolvable, and they should be used consistently so that resources can be connected without ambiguity. Domains such as chemistry show the value of this clearly because identifiers like InChI or SMILES capture molecular structures in a reproducible way that allows data to be shared and interpreted across tools and disciplines.

Why is this step important

Identifiers matter because they provide the stability and clarity that allow data and metadata to function reliably in FAIR ecosystems. Their key properties each support a different aspect of that reliability.

  • Globally unique. Prevents collisions so a resource is never confused with another, which supports reliable discovery and stable referencing across systems.

  • Persistent. Remains valid over time, even when systems or storage locations change, which keeps data citable and usable in the long term.

  • Machine actionable. Has a defined structure that software can interpret directly, which removes ambiguity and enables automation.

  • Resolvable. Can be followed through a recognised protocol to metadata or to the resource itself, which strengthens findability and accessibility.

How to

Step 1 - Reuse community identifiers whenever possible

Reusing existing identifiers prevents duplication and immediately connects your data to the wider research ecosystem. Many entities already have authoritative identifiers that you can adopt directly.

  • Benefit. Strengthens interoperability and aligns your dataset with established knowledge graphs.

  • How. Check registries for existing identifiers. Use ORCID for creators, ROR for institutions and the Crossref Funder Registry (formerly FundRef) for funders. Identifiers.org helps locate identifiers for many scientific entities.

  • Example. An existing ORCID for a researcher or a Funder Registry entry for a funding body.
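
Before reusing an ORCID found in a registry or a spreadsheet, it can help to confirm that it is well formed. The sketch below assumes the ISO 7064 MOD 11-2 check character that ORCID iDs carry in their last position. The function names are illustrative, and the check only catches typing errors: it does not prove that the iD is registered or belongs to the right person.

```python
import re

ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_character(base_digits: str) -> str:
    """Compute the MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD is well formed and its check character matches."""
    # Accept both the bare iD and the full https://orcid.org/ URL form.
    prefix = "https://orcid.org/"
    bare = orcid[len(prefix):] if orcid.startswith(prefix) else orcid
    if not ORCID_PATTERN.fullmatch(bare):
        return False
    compact = bare.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]
```

For example, `is_valid_orcid("0000-0002-1825-0097")` accepts the well-known test iD of the fictitious researcher Josiah Carberry, while changing its last digit makes the check fail.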

Step 2 - Mint new identifiers when none exist

When no suitable identifier is available, minting a new one ensures the resource has a unique and persistent reference point. This is essential for stable reuse and accurate citation.

  • Benefit. Ensures the resource has a persistent and widely recognisable identifier instead of an ad hoc local one, supporting reliable citation and reuse.

  • How. Use DataCite for DOIs. Follow FAIR Cookbook FCB006. Use MINIDs for lightweight identifiers as described in FCB008.

  • Example (MINID).
    minid register --title "Dataset X" --checksum abc123 --metadata metadata.json
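
The command above passes a checksum of the dataset. As a minimal sketch, and assuming a SHA-256 digest is what the registration expects (check this against the tool you use), the checksum could be computed as follows. The helper name `sha256_checksum` and the sample bytes are illustrative only.

```python
import hashlib
import io


def sha256_checksum(stream, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Reading in chunks keeps memory use flat for large dataset files.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Illustrative in-memory content; for a real file use open(path, "rb").
sample = io.BytesIO(b"sample_id,value\nS1,42\n")
checksum = sha256_checksum(sample)
```

Because the digest depends only on the bytes, the same file always yields the same checksum, which lets a consumer verify that the content behind an identifier has not changed.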

Step 3 - Assign a persistent identifier to each dataset

Every dataset needs a stable anchor that supports discovery, citation and long-term accessibility. Assigning a PID formalises the dataset as a citable and traceable resource.

  • Benefit: Catalogue services and automated workflows can reliably reference the dataset over time.

  • How: Register the dataset metadata with a PID authority such as DataCite or Crossref and obtain a DOI that will function as the dataset’s primary reference.

  • Example: Mint a DOI through DataCite.
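
To make the registration step more concrete, the sketch below builds the kind of JSON body that a DOI registration request to DataCite typically carries. It is modelled on the JSON:API style of the DataCite REST API, but the exact field names and required properties should be checked against the current DataCite documentation. The function name, the prefix `10.1234` and all values are placeholders, and nothing is sent over the network.

```python
import json


def build_doi_payload(prefix, title, creator_name, publisher, year, landing_url):
    """Assemble a DataCite-style request body for a dataset DOI (not sent)."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,  # DOI prefix assigned to your repository
                "titles": [{"title": title}],
                "creators": [{"name": creator_name}],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": landing_url,  # landing page the DOI should resolve to
            },
        }
    }


payload = build_doi_payload(
    prefix="10.1234",  # placeholder prefix, not a real allocation
    title="Dataset X",
    creator_name="Carberry, Josiah",
    publisher="Example Repository",
    year=2024,
    landing_url="https://example.org/datasets/x",
)
body = json.dumps(payload, indent=2)
```

Keeping the payload construction separate from the HTTP call makes it easy to review the metadata before anything is registered, since a published DOI cannot simply be deleted.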

Step 4 - Use identifiers consistently throughout the metadata

Metadata should reference people, organisations, licences and related resources using standard identifiers. This removes ambiguity and allows machines to interpret relationships correctly.

  • Benefit: Improves interoperability and supports automated processing across systems, because every person, organisation and licence is referenced unambiguously.

  • How: Use ORCID for authors, ROR for organisations and SPDX identifiers for licences.

  • Example: A metadata record that includes creator ORCIDs and institutional ROR identifiers.
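
As an illustration, the sketch below shows a simplified metadata record that refers to its creator, the creator's institution and the licence by identifier, together with a small check that flags missing identifiers. The record layout and the function `missing_identifiers` are assumptions made for this example rather than a standard schema, and the ROR value is a placeholder, not a real ROR ID.

```python
record = {
    "title": "Dataset X",
    "creators": [
        {
            "name": "Josiah Carberry",  # fictitious researcher used for testing
            "orcid": "https://orcid.org/0000-0002-1825-0097",
            "affiliation": {
                "name": "Example University",
                "ror": "https://ror.org/EXAMPLE",  # placeholder, not a real ROR ID
            },
        }
    ],
    "license": "CC-BY-4.0",  # SPDX licence identifier
}


def missing_identifiers(record: dict) -> list:
    """List the places in the record where a standard identifier is absent."""
    problems = []
    for i, creator in enumerate(record.get("creators", [])):
        if not creator.get("orcid", "").startswith("https://orcid.org/"):
            problems.append(f"creators[{i}].orcid")
        affiliation = creator.get("affiliation", {})
        if not affiliation.get("ror", "").startswith("https://ror.org/"):
            problems.append(f"creators[{i}].affiliation.ror")
    if not record.get("license"):
        problems.append("license")
    return problems
```

A check like this can run before a record is published, so that names entered as free text are caught while the person who knows the correct identifier is still available.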

Step 5 - Assign identifiers to relevant data entities and values

Internal elements such as records, variables or samples also need stable identifiers so they can be referenced unambiguously within and across datasets.

  • Benefit: Allows tools to reference and interpret the individual records and values inside a dataset reliably, which supports automation and reuse across datasets.

  • How: Use controlled vocabularies or domain ontologies to select existing identifiers for values. For internal entities, generate local identifiers in a structured format that remains stable over time.

  • Example: RRID for biological resources, InChI or SMILES for chemical structures.
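
For internal entities, one possible way to generate local identifiers that remain stable over time is to derive them deterministically from a project namespace and a natural key, so that re-running a pipeline reproduces the same identifier. The sketch below uses name-based UUIDs (version 5) for this. The project URL, the entity types and the function name are hypothetical.

```python
import uuid

# Hypothetical project namespace; derive it once and never change it.
PROJECT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/project-x/")


def local_identifier(entity_type: str, natural_key: str) -> str:
    """Derive a stable identifier for an internal entity such as a sample."""
    # The same type and key always produce the same UUID.
    derived = uuid.uuid5(PROJECT_NAMESPACE, f"{entity_type}/{natural_key}")
    return f"{entity_type}-{derived}"


sample_id = local_identifier("sample", "freezer-3/box-12/A4")
```

The trade-off is that the identifier is only as stable as the natural key it is derived from. If keys can change, for instance when a sample is moved, a randomly generated identifier stored alongside the record is the safer choice.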

Step 6 - Make identifiers resolvable

Identifiers become actionable when they lead to a landing page or metadata record. Resolution enables both humans and machines to discover and interpret the resource.

  • Benefit: Supports findability and accessibility through standard web protocols.

  • How: Follow FAIR Cookbook FCB077. Register identifiers with DataCite or Crossref.

  • Example:
    https://doi.org/10.1234/example.dataset
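
Resolution can also serve machines directly: DOI resolvers support content negotiation, where the `Accept` header asks for metadata instead of the landing page. The sketch below normalises a DOI into its resolver URL and prepares such a request without sending it. The function names are illustrative, the media type shown is the one DataCite documents for its JSON metadata, and the DOI is the placeholder from the example above, so it will not resolve.

```python
import urllib.request


def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.x/y', a bare DOI or a doi.org URL into a resolver URL."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return "https://doi.org/" + cleaned


def metadata_request(doi: str, media_type: str = "application/vnd.datacite.datacite+json"):
    """Prepare (but do not send) a content-negotiation request for a DOI."""
    return urllib.request.Request(doi_to_url(doi), headers={"Accept": media_type})


request = metadata_request("doi:10.1234/example.dataset")
# To send it against a real DOI: urllib.request.urlopen(request)
```

DOIs that contain characters such as `#` or `?` would need percent-encoding before being placed in a URL, which this sketch leaves out.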

Step 7 - Use namespaces for local identifiers

Local identifiers need scoping so they can safely be shared outside their original system. Namespaces turn local IDs into globally unique ones.

  • Benefit: Prevents collisions and keeps identifiers reliable as they move across systems.

  • How: Define a consistent prefix for local identifiers and document it in your metadata. If applicable, register the prefix with Identifiers.org so the namespace becomes globally traceable.

  • Example: Prefixing local sample IDs with a project or organisational namespace.
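
A common way to express a namespaced identifier is the compact `prefix:local_id` form, which expands to a full URL through a documented prefix map. The sketch below shows this round trip. The prefix `projx`, its base URL and the function names are hypothetical and would be replaced by the namespace you define for your project.

```python
# Hypothetical prefix map; document the real one in your metadata.
PREFIX_MAP = {"projx": "https://example.org/projx/sample/"}


def to_curie(prefix: str, local_id: str) -> str:
    """Scope a local identifier with a known namespace prefix."""
    if prefix not in PREFIX_MAP:
        raise KeyError(f"Unknown prefix: {prefix}")
    return f"{prefix}:{local_id}"


def expand_curie(curie: str) -> str:
    """Expand 'prefix:local_id' into a full URL using the prefix map."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in PREFIX_MAP:
        raise KeyError(f"Unknown or missing prefix in: {curie}")
    return PREFIX_MAP[prefix] + local_id


curie = to_curie("projx", "S0001")
url = expand_curie(curie)
```

Rejecting unknown prefixes, rather than passing them through, surfaces typos early and prevents two systems from silently using the same local ID for different things.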

Step 8 - Map equivalences between identifier systems

Different communities often use different identifiers for the same concept. Mapping these makes your data interoperable across systems.

  • Benefit: Creates bridges between identifier schemes and supports cross-dataset integration.

  • How: Use BridgeDb or the SSSOM standard. Refer to FAIR Cookbook FCB016 for guidance on identifier mapping.

  • Example (SSSOM TSV):

    subject_id	subject_label	object_id	object_label	predicate_id
    ENSEMBL:ENSG00000139618	BRCA2	NCBIGene:675	BRCA2	skos:exactMatch
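
Once mappings are available as tab-separated text, they can be loaded into a simple lookup. The sketch below parses a small mapping table with the columns from the example above and collects exact matches per subject. It assumes `skos:exactMatch` as the predicate, which is the one commonly used for exact matches in SSSOM files, and it skips the `#`-prefixed metadata lines that such files often start with. The function name is illustrative.

```python
import csv

# Small mapping table built with explicit tab separators.
MAPPING_TSV = "\n".join(
    [
        "\t".join(["subject_id", "subject_label", "object_id", "object_label", "predicate_id"]),
        "\t".join(["ENSEMBL:ENSG00000139618", "BRCA2", "NCBIGene:675", "BRCA2", "skos:exactMatch"]),
    ]
)


def exact_matches(tsv_text: str, predicate: str = "skos:exactMatch") -> dict:
    """Map each subject identifier to the objects it matches exactly."""
    # Drop blank lines and '#' metadata lines before parsing the table.
    lines = [line for line in tsv_text.splitlines() if line and not line.startswith("#")]
    matches = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["predicate_id"] == predicate:
            matches.setdefault(row["subject_id"], []).append(row["object_id"])
    return matches


lookup = exact_matches(MAPPING_TSV)
```

Each subject maps to a list because one identifier can have exact matches in several other schemes. For larger mapping sets, dedicated tooling such as BridgeDb is likely a better fit than a hand-written parser.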

Step 9 - Ensure governance and infrastructure for persistence

Identifier quality depends on stable services that guarantee resolution and long-term accessibility. Good governance protects identifiers from decay.

  • Benefit: Maintains trust and usability of identifiers across decades.

  • Example: PID infrastructures such as GUPrI that manage resolution and persistence.

 

Expertise requirements for this step 

Based on the expertise described in the Metroline: Build the team step, the following expertise may be relevant:

  • FAIR data steward. Defines which identifier schemes are used, when identifiers should be reused or minted and how they must comply with FAIR principles. Oversees consistency, policy alignment and long-term stewardship across datasets and systems.

Practical examples from the community 

Training

Suggestions

This page will be developed in the future. Learn more about the contributors here and explore the development process here. If you have any suggestions, visit our How to contribute page to get in touch.