Metroline Step: Query (use) over resources

status: RELEASED

This page is being moved. The latest version is available on our website and this Confluence page will no longer be updated.

‘Research is formalized curiosity. It is poking and prying with a purpose.’ (Zora Neale Hurston)

Querying FAIR resources is the process of asking structured questions to retrieve data that is easy to find, access, and reuse. When resources follow FAIR principles, queries become more efficient because the data is well-organized and clearly described. In this step, we will explore the tools and platforms that enable effective querying of FAIR-compliant resources.

Short description 

When machine-readable (meta)data is exposed (see Metroline step Transform and Expose FAIR metadata), it becomes an accessible FAIR resource: a dataset or metadata collection that can be found, queried, and reused. Such resources are often hosted or described in catalogues and/or via FAIR Data Points, which expose (meta)data in a standardized way. This ability to discover and reuse data through its metadata is what makes FAIR so powerful: it turns isolated data into actionable knowledge for science.

These catalogues offer different levels of interaction:

  1. Browsing. You can navigate through a FAIR Data Point (FDP) or catalogue to discover available datasets, for example in the National Health Data Catalogue or EBI BioStudies.

  2. Filtering and faceted search. Similar to filtering products in a webshop, results can be narrowed by disease, data type, species, or other metadata attributes, as supported by the European Health Research Data and Sample Catalogue.

  3. Visual query builders. Some platforms provide user-friendly query forms that automatically generate queries behind the scenes, as seen in Wikidata or the European Nucleotide Archive.

  4. Direct querying. For more advanced use, researchers and developers can write and run their own queries using external clients or scripts, for example through database queries with R.

Query results can be displayed in formats like HTML, JSON, XML, or CSV, depending on the tool or user preference.
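For programmatic access, the result format is usually selected with the HTTP Accept header. The sketch below maps format names to the standard media types for SPARQL SELECT results. The helper name accept_header is illustrative, and not every endpoint supports every format; HTML output is typically rendered by the catalogue's own web interface rather than requested this way.

```python
# Standard media types for SPARQL SELECT results.
# Which of these a given endpoint actually supports varies per platform.
RESULT_MEDIA_TYPES = {
    "json": "application/sparql-results+json",
    "xml": "application/sparql-results+xml",
    "csv": "text/csv",
    "tsv": "text/tab-separated-values",
}


def accept_header(fmt: str) -> dict:
    """Return the HTTP header that asks an endpoint for the given result format."""
    try:
        return {"Accept": RESULT_MEDIA_TYPES[fmt.lower()]}
    except KeyError:
        raise ValueError(f"Unsupported result format: {fmt!r}") from None


print(accept_header("csv"))  # {'Accept': 'text/csv'}
```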

This Metroline page focuses on SPARQL as it is the standard query language for RDF-based resources, which are the foundation of the semantic web and linked data. These concepts aim to make data interoperable and machine-readable across domains, enabling powerful integration and reuse. SPARQL’s standardized syntax and ability to retrieve both metadata and data from diverse sources make it uniquely suited for querying structured web resources. While SPARQL is prominent in the semantic web domain, there are many other query languages tailored to different data models, research fields, and application needs (see table below).

| Query language | Purpose | Used in | Example repositories |
| --- | --- | --- | --- |
| Structured Query Language (SQL) | Querying relational databases | Tabular data, metadata | Dryad, Dataverse, OpenAIRE |
| SPARQL (for RDF/Linked Data) | Querying semantic web data | Ontologies, linked datasets | UniProt, OpenPHACTS, ELIXIR, Bio2RDF |
| GraphQL | Flexible API queries | Nested data structures | EMBL-EBI |
| Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) | Metadata harvesting | Repository interoperability | Zenodo, Figshare, institutional repositories |
| JSONPath / XPath | Extracting data from JSON/XML | API responses, metadata | Ensembl, NIH |
| Cypher | Querying graph databases | Networked biological data | Neo4j-based bioinformatics platforms |

Why is this step important 

Querying FAIR data is important because it is how you actually use the data. FAIR data is only valuable if it can be discovered, filtered, combined, and analysed, and querying is what makes this possible.

  • Find exactly what you need. General search and filtering allow you to locate datasets or specific information quickly, without manually checking every record.

  • Explore and understand data. Browsing and faceted search help you see what datasets exist, what they describe and how they are structured.

  • Combine and reuse information efficiently. Advanced queries (e.g. SPARQL) let you combine and analyse data from multiple sources without moving large datasets.

How to 

This how-to gives information about querying FAIR resources, starting with simple browsing and filtering, moving to visual query tools, and advancing to federated multi-source querying with SPARQL.

Step 1 - Start with browsing and filtering

The easiest way to explore FAIR data is through a catalogue or FAIR Data Point interface, such as the National Health Data Portal, FAIRsharing.org, or a local FAIR Data Point. (To learn more about FAIR Data Points, see Metroline Step: Transform and expose FAIR (meta)data.)
Here you can:

  • Browse datasets and read metadata (description, owner, access conditions).

  • Search by keywords, e.g. “muscular dystrophy” or “metabolomics”.

  • Filter results by categories such as data type, disease, measurement, or year.

This helps you discover what exists before performing any (complex) queries.

🧪 Example for Step 1: Browse Wikidata for information about Inflammatory bowel disease

To begin, we search for “Inflammatory bowel disease” in the search bar on Wikidata.org. This leads us to the item Q917447, which represents IBD in Wikidata. This item confirms that IBD is a recognized disease entity with structured metadata (such as classifications, related conditions, and identifiers), providing a solid starting point for further data exploration. We gained insight into what the catalogue contains, what metadata is available, and how we might formulate more specific queries to retrieve related information.
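The same keyword search can also be scripted. The following is a minimal sketch that uses the wbsearchentities action of the Wikidata MediaWiki API; the function names and the User-Agent string are illustrative choices, and the network call is left in a comment so the block can be read and run offline.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term: str, language: str = "en", limit: int = 5) -> str:
    """Build a keyword-search URL for the Wikidata API (wbsearchentities)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"


def search_items(term: str) -> list:
    """Return (id, label) pairs for matching items. Requires network access."""
    request = urllib.request.Request(
        build_search_url(term),
        # Illustrative User-Agent; identify your own script here.
        headers={"User-Agent": "fair-query-example/0.1 (illustrative script)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [(hit["id"], hit.get("label", "")) for hit in payload.get("search", [])]


# Usage (needs network access):
# for item_id, label in search_items("Inflammatory bowel disease"):
#     print(item_id, label)
```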

Step 2 - Use visual or guided tools to construct queries

Some linked-data portals offer visual query builders, such as SPARQL Query builder or Wikidata Query Builder, that help users construct SPARQL queries without needing to learn the syntax. These tools automatically translate your selections (such as ticking checkboxes or choosing from dropdown menus) into SPARQL and run the query in the background.

The results are typically displayed in a table or graph, making it easy to explore data without writing any code. This approach is ideal for users who want to go beyond simple browsing but aren’t yet ready to write SPARQL manually. 

🧪 Example for Step 2: Wikidata Query Builder – finding diseases associated with the IL23R gene

We want to continue our exploration of inflammatory bowel disease. In our first exploration of the Wikidata item page, we saw that IBD has a genetic association with the gene IL23R. We now want to find out which other items have a genetic association with this gene. In the Wikidata Query Builder, we enter “genetic association” under Property and “IL23R” as the value, and run the query. We get 6 results, shown in the table below.

| item | itemLabel |
| --- | --- |
| wd:Q179945 | psoriasis |
| wd:Q917447 | inflammatory bowel diseases |
| wd:Q32144272 | inflammatory bowel disease 17 |
| wd:Q52849 | ankylosing spondylitis |
| wd:Q1472 | Crohn's disease |
| wd:Q1477 | ulcerative colitis |
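Behind the scenes, a query builder produces SPARQL along the following lines. This is a hedged sketch rather than the exact query the tool generates: it assumes that P2293 is the Wikidata property ID for “genetic association” (verify this on the property page), and it looks the gene up by its English label instead of the item ID that the builder would normally insert. Label matching may return additional items that share the same label.

```python
# Assumption: P2293 is the Wikidata property "genetic association".
GENETIC_ASSOCIATION = "P2293"


def build_association_query(gene_label: str, language: str = "en") -> str:
    """Build a SPARQL query listing items genetically associated with a gene label."""
    # Escape backslashes and double quotes so the label is a valid SPARQL string.
    safe_label = gene_label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?gene rdfs:label "{safe_label}"@{language} .
  ?item wdt:{GENETIC_ASSOCIATION} ?gene .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
""".strip()


print(build_association_query("IL23R"))
```

The prefixes wdt:, rdfs:, wikibase: and bd: are predefined in the Wikidata Query Service, so the printed query can be pasted there as-is; other endpoints would need explicit PREFIX declarations.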

 

 

Step 3 - Access the SPARQL endpoint to write and refine SPARQL queries

Note: The following steps are meant specifically for querying catalogues and repositories with a SPARQL endpoint. If you are trying to query a catalogue based on another querying approach (e.g. SQL), these may not be directly applicable.

When you need more flexibility, connect directly to the SPARQL endpoint. Depending on the catalogue, you can use the web-based query editor it provides or an external client or script.

Try simple queries first, such as listing datasets or retrieving specific metadata fields.
As you become more comfortable, you can write more complex queries that join related information, apply filters, or aggregate data using SPARQL syntax.
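As a starting point, a first query against a catalogue's endpoint might simply list datasets and their titles. The sketch below assumes the catalogue describes datasets with DCAT (dcat:Dataset, dct:title), as FAIR Data Points typically do; the endpoint URL in the usage comment is a placeholder, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# A simple first query: list ten datasets with their titles.
# Assumes the catalogue uses DCAT and Dublin Core terms.
FIRST_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
}
LIMIT 10
"""


def run_select(endpoint: str, query: str) -> dict:
    """Send a SELECT query via HTTP GET and return parsed JSON results (needs network)."""
    url = f"{endpoint}?{urllib.parse.urlencode({'query': query})}"
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "fair-query-example/0.1 (illustrative script)",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def bindings_to_rows(results: dict) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dictionaries."""
    variables = results["head"]["vars"]
    return [
        {var: (binding[var]["value"] if var in binding else None) for var in variables}
        for binding in results["results"]["bindings"]
    ]


# Usage (placeholder endpoint, needs network access):
# rows = bindings_to_rows(run_select("https://example.org/sparql", FIRST_QUERY))
```

Unbound variables (for example from OPTIONAL patterns) are simply absent from a binding in the JSON results format, which is why bindings_to_rows fills them with None.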

Helpful tutorials:

🧪 Example for Step 3: Use the Wikidata Query Service

We saw that the gene IL23R is associated with the disease psoriasis. Now, let’s take it a step further and run a more complex SPARQL query to find which genes are associated with both Inflammatory Bowel Disease (IBD) and psoriasis.

See and run the query yourself at this link: https://w.wiki/FsqH

 

The results show all genes linked to both psoriasis and an IBD condition. For each gene, you can also see the specific IBD disease it is associated with (such as Crohn’s disease or ulcerative colitis), providing richer context for analysis.
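A query of this kind could look roughly as follows. This is an illustrative sketch and not necessarily the query behind the short link. It reuses the item IDs from the Step 2 results (Q179945 for psoriasis, Q917447 for IBD) and assumes that P2293 is “genetic association” and that specific IBD conditions are modelled as subclasses (P279) of the IBD item.

```python
PSORIASIS = "Q179945"  # psoriasis, from the Step 2 results
IBD = "Q917447"        # inflammatory bowel diseases, from the Step 2 results

# Assumptions: P2293 = "genetic association", P279 = "subclass of".
# The path P279* also matches the IBD item itself (zero steps).
SHARED_GENES_QUERY = f"""
SELECT DISTINCT ?gene ?geneLabel ?ibdCondition ?ibdConditionLabel WHERE {{
  wd:{PSORIASIS} wdt:P2293 ?gene .
  ?ibdCondition wdt:P279* wd:{IBD} ;
                wdt:P2293 ?gene .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?geneLabel
""".strip()

print(SHARED_GENES_QUERY)
```

Because both triple patterns share the variable ?gene, only genes associated with psoriasis and with at least one IBD condition are returned; this shared variable is what performs the join.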

Step 4 - Combine multiple FAIR sources (federated queries)

When your question spans several data sources, use federated querying. This allows you to connect endpoints across registries, institutions, or countries, combining data without moving it.

In SPARQL, federated queries are implemented using the SERVICE keyword, which lets you call another SPARQL endpoint within your query. This enables seamless integration of data across different FAIR sources. For details, see the W3C SPARQL 1.1 Federated Query specification.
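As an illustration of the SERVICE keyword, the sketch below builds a query meant to be sent to a local catalogue endpoint. The outer pattern selects local datasets, while the SERVICE block asks the Wikidata endpoint whether each dataset's theme is IBD or a subclass of it. It rests on hypothetical assumptions: that the local catalogue annotates datasets with Wikidata URIs via dcat:theme, and that the local endpoint is allowed to call external services.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_federated_query(remote_endpoint: str, disease_qid: str) -> str:
    """Build a federated query joining local DCAT datasets with a remote endpoint."""
    return f"""
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?dataset ?title ?disease WHERE {{
  # Evaluated locally: datasets and their (assumed) Wikidata theme annotations.
  ?dataset a dcat:Dataset ;
           dct:title ?title ;
           dcat:theme ?disease .
  # Evaluated remotely: keep themes that are the disease or one of its subclasses.
  SERVICE <{remote_endpoint}> {{
    ?disease wdt:P279* wd:{disease_qid} .
  }}
}}
""".strip()


print(build_federated_query(WIKIDATA_ENDPOINT, "Q917447"))
```

Only the matching disease identifiers travel between the endpoints; the dataset metadata stays in the local catalogue.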

 

Step 5 - Export and reuse query results

Query results can be downloaded in multiple formats (e.g. CSV, JSON, XML) for reuse in data analysis tools like Python, R, or Excel.
Depending on the query language and platform, it may also be possible to integrate queries directly into your workflow (for example, by calling SPARQL endpoints from Python or R scripts) so that results flow into subsequent analysis steps without the need to download files manually.
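When results arrive in the SPARQL JSON format inside a script, they can be converted to CSV for tools such as R or Excel. The following is a minimal sketch using only the Python standard library; the sample data reuses two rows from the Step 2 results, and the function name is illustrative.

```python
import csv
import io


def results_to_csv(results: dict) -> str:
    """Convert SPARQL JSON results into CSV text with one column per variable."""
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # Unbound variables are missing from the binding; write an empty cell.
        writer.writerow([binding.get(var, {}).get("value", "") for var in variables])
    return buffer.getvalue()


# Sample in the SPARQL JSON results format, based on the Step 2 results.
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1472"},
         "itemLabel": {"type": "literal", "value": "Crohn's disease"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1477"},
         "itemLabel": {"type": "literal", "value": "ulcerative colitis"}},
    ]},
}

print(results_to_csv(sample))
```

Many endpoints can also return CSV directly when asked with the Accept header text/csv, which avoids the conversion step altogether.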

For human users, many catalogue interfaces also provide built-in visualization options, allowing results to be displayed as tables, graphs, or maps directly in the browser without additional tools.

🧪 Example for Step 5: Visualization of query results

In the Wikidata Query Service, you can visualize query results in different ways by switching between result views. Try running the example query from Step 3 (https://w.wiki/FsqH) and experiment with the various visualization and export options.

Expertise requirements for this step 

To successfully perform this step, you may need help from the following experts:

  • Researcher/domain expert. Uses domain knowledge to formulate queries and interpret results.

  • Data scientist. Executes queries, processes results and handles federated queries.

  • Semantic expert. Ensures correct use of metadata, vocabularies and ontologies for queries.

See Metroline: Build the team for more information.

Practical examples from the community 

SPHN Data Exploration and Analysis System (DEAS) - DEAS is a cross-hospital federated query tool developed by the Swiss Personalized Health Network (SPHN) to replace the previous Federated Query System. It enables researchers to securely query aggregated clinical data from multiple Swiss university hospitals without moving patient-level data.

Training

Suggestions

This page will be developed in the future. Learn more about the contributors here and explore the development process here. If you have any suggestions, visit our How to contribute page to get in touch.