🧬 Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table
Closed, Resolved, Public

Assigned To

Authored By

	Tarrow
	Jun 4 2025, 10:32 AM

Description

The goal of this investigation should be to figure out if there is a way to mark requests coming from the QueryService UI to the wikidata blazegraph so that we can identify them in the discovery processed query table.

We should keep in mind that we might also want to record queries as coming from other WMDE-maintained clients in the future, for example other queryservice-ui's and the query builder. Therefore this mark should probably *not* look like queryservice-gui = true but instead more like client-provenance=queryservice-gui-<maybe version>.

See: https://wikitech.wikimedia.org/wiki/Provenance as a possible way to mark these requests with the wprov query parameter. See also on that page the https://wikitech.wikimedia.org/wiki/X-Analytics header for where this comes out in the webrequest log. You can also consider adding stuff to this header client-side, but only a limited set of keys is supported.

Details

Related Changes in Gerrit:

	Subject	Repo	Branch	Lines +/-
	rdf-spark-tools: populate api_user_agent column	wikidata/query/rdf	master	+1 -0
	wikidata-query-gui: Bump query-gui image version	operations/deployment-charts	master	+1 -1


Related Changes in GitLab:

Title	Reference	Author	Source Branch	Dest Branch
search: Update rdf-spark-tools to 0.3.158	repos/data-engineering/airflow-dags!1546	ebernhardson	work/ebernhardson/rdf-158	main
search: add api_user_agent column to process_sparql_query table	repos/data-engineering/airflow-dags!1464	andrew-wmde	search-sparql-api-user-agent-column	main
Track where queries are being sent from	repos/wmde/wikidata-query-gui!19	andrew-wmde	trackable-instance	main


Related Objects

Status	Subtype	Assigned	Task
Open	Goal	None	T394987 🧬 Establish high-level metrics & baselines for measuring query federation in the ecosystem
Declined		None	T394988 🧬 Investigate measuring federation on WDQS/WBQS UI
Resolved		Andrew-WMDE	T396002 🧬 Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table

Event Timeline

Tarrow created this task.Jun 4 2025, 10:32 AM

Andrew-WMDE claimed this task.Jun 4 2025, 10:35 AM

Andrew-WMDE moved this task from To do to Doing on the Wikibase Cloud (Kanban Board Q2 2025) board.

Tarrow mentioned this in T395426: 🧬 Sync queryservice-ui with upstream.Jun 4 2025, 10:36 AM

FYI the repo has now moved from gerrit to gitlab at https://gitlab.wikimedia.org/repos/wmde/wikidata-query-gui

andrew-wmde opened https://gitlab.wikimedia.org/repos/wmde/wikidata-query-gui/-/merge_requests/19

Draft: Track where queries are being sent from

Here are three ways we can achieve this:

(1) Add a query parameter with a custom name (e.g., source).
(2) Set the wprov query parameter (see https://wikitech.wikimedia.org/wiki/Provenance).
(3) Set the api-user-agent request header. It is one of the few headers allowed by WMF's access-control-allow-headers policy.

Each of these is demonstrated in the draft patch on GitLab.

While it is possible to set user-agent, Chrome browsers may silently drop this header.

Note: The User-Agent header used to be forbidden, but no longer is. However, Chrome still silently drops the header from Fetch requests (see Chromium bug 571722).

see https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
see https://issues.chromium.org/issues/40450316

WMF's workaround for this user-agent issue is to use api-user-agent.

Browser-based applications written in JavaScript are typically forced to send the same User-Agent header as the browser that hosts them. This is not a violation of policy, however such applications are encouraged to include the Api-User-Agent header to supply an appropriate agent.

see https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy

While all three approaches work and are logged correctly in the analytics database, there are a few caveats to consider. Going with (1) will likely break the WMF's cache, as it's technically a different resource being requested (from the proxy's perspective). Option (2) has a cache exception, so different wprov values will not break the cache (see https://github.com/wikimedia/operations-puppet/blob/b0175e8d56eb7a8960fbd0d35955334a4f042e1e/modules/varnish/templates/analytics.inc.vcl.erb#L141-L162). However, for what we are planning to collect, we will not be adhering to the WMF's format for wprov (see https://wikitech.wikimedia.org/wiki/Provenance). Using either (1) or (2) will also expose these query parameters in the URL when using the query sharing functionality.
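To illustrate the caching caveat, here is a minimal Python sketch of a cache that keys on the request URL but has an exception for wprov. It is a simplified model under that assumption, not the actual Varnish VCL linked above; cache_key and CACHE_EXEMPT_PARAMS are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the cache keys on the URL after dropping the parameters
# it has an explicit exception for. Only wprov is exempt in this model.
CACHE_EXEMPT_PARAMS = {"wprov"}


def cache_key(url: str) -> str:
    """Return the URL with cache-exempt query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CACHE_EXEMPT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


base = "https://query.wikidata.org/sparql?query=ASK%7B%7D"
plain = cache_key(base)
with_source = cache_key(base + "&source=qs-ui-query.wikidata.org")
with_wprov = cache_key(base + "&wprov=qs-ui-query.wikidata.org")

# Option (1) yields a different key, option (2) does not.
print(plain == with_source, plain == with_wprov)
```

In this model option (3) never touches the key at all, because the header is not part of the URL.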

Regardless of which option we choose, we will need to update the parser for the discovery.processed_external_sparql_query table.
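As a rough idea of what the parser change amounts to for option (3), the Python sketch below pulls the api-user-agent value out of a request-header map and checks it for the qs-ui prefix. This is only an illustration: the function names are hypothetical and it is not the rdf-spark-tools code.

```python
from typing import Mapping, Optional

# Prefix taken from the formatting examples in this comment.
QS_UI_PREFIX = "qs-ui"


def extract_api_user_agent(headers: Mapping[str, str]) -> Optional[str]:
    """Return the api-user-agent header value, or None if it is absent.

    Header names are compared case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("api-user-agent")


def is_from_qs_ui(headers: Mapping[str, str]) -> bool:
    """True if the request identifies itself as a query service UI."""
    value = extract_api_user_agent(headers)
    return value is not None and value.startswith(QS_UI_PREFIX)


tagged = {"Api-User-Agent": "qs-ui (https://query.wikidata.org)"}
untagged = {"User-Agent": "Mozilla/5.0"}
print(is_from_qs_ui(tagged), is_from_qs_ui(untagged))
```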

Formatting examples:

?source=qs-ui-query.wikidata.org
?wprov=qs-ui-query.wikidata.org
api-user-agent: "qs-ui (https://query.wikidata.org)"
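
For comparison, the three formats above can be produced as follows. This is a Python sketch using only the standard library, not the GUI's JavaScript from the draft patch; the ASK {} query is a placeholder, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://query.wikidata.org/sparql"
SPARQL = "ASK {}"  # placeholder query for illustration
TAG = "qs-ui-query.wikidata.org"

# (1) custom query parameter
req_source = Request(f"{ENDPOINT}?{urlencode({'query': SPARQL, 'source': TAG})}")

# (2) wprov query parameter
req_wprov = Request(f"{ENDPOINT}?{urlencode({'query': SPARQL, 'wprov': TAG})}")

# (3) api-user-agent request header; the URL carries only the query
req_header = Request(
    f"{ENDPOINT}?{urlencode({'query': SPARQL})}",
    headers={"Api-User-Agent": "qs-ui (https://query.wikidata.org)"},
)

for req in (req_source, req_wprov, req_header):
    print(req.full_url, dict(req.header_items()))
```

Only (1) and (2) change the URL, which is why they also show up in shared query links.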

Example queries for use on WMF's Superset:

SELECT *
FROM event.wdqs_external_sparql_query
WHERE 
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(params, 'source')
  AND starts_with(element_at(params, 'source'), 'qs-ui')

SELECT *
FROM event.wdqs_external_sparql_query
WHERE 
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(http.request_headers, 'x-wmf-wprov')
  AND starts_with(element_at(http.request_headers, 'x-wmf-wprov'), 'qs-ui')

SELECT *
FROM event.wdqs_external_sparql_query
WHERE 
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(http.request_headers, 'api-user-agent')
  AND starts_with(element_at(http.request_headers, 'api-user-agent'), 'qs-ui')

Andrew-WMDE moved this task from Doing to In Review on the Wikibase Cloud (Kanban Board Q2 2025) board.Jun 6 2025, 2:10 PM

Added an approval comment to the gitlab PR and can confirm filtering the data works based on the different methods

Thanks for working on this @Andrew-WMDE and confirming the three approaches work @dena.

Looking at the options, my vote would be for (3), since it is encouraged by the WMF and doesn't affect caching or query sharing.

tarrow merged https://gitlab.wikimedia.org/repos/wmde/wikidata-query-gui/-/merge_requests/19

Track where queries are being sent from

Strongly agree with the other two people; let's get this thing shipped and we can start collecting data. I asked @karapayneWMDE and checked that we wouldn't be treading on toes to ship this independently. I've approved and merged the patch. Let's look at deploying to Wikidata tomorrow?

@Lucas_Werkmeister_WMDE also asked how this differs from referer and I believe there isn't a huge difference although this is less likely to be suppressed by the browser. It also maps nicely to being used in other places (e.g. querybuilder or cross blazegraph requests)

Do we want a separate ticket for the parser changes?

Tarrow moved this task from In Review to Waiting for Deploy to Staging on the Wikibase Cloud (Kanban Board Q2 2025) board.Jun 10 2025, 4:37 PM

Tarrow moved this task from Waiting for Deploy to Staging to In Review on the Wikibase Cloud (Kanban Board Q2 2025) board.

Maintenance_bot removed a project: Patch-For-Review.Jun 10 2025, 5:30 PM

In case it wasn't obvious there are some deploy docs on the README.md at https://gitlab.wikimedia.org/repos/wmde/wikidata-query-gui#deploy-in-wmf-environment-querywikidataorg

Change #1155643 had a related patch set uploaded (by Andrew-WMDE; author: Andrew-WMDE):

[operations/deployment-charts@master] wikidata-query-gui: Bump query-gui image version

https://gerrit.wikimedia.org/r/1155643

gerritbot added a project: Patch-For-Review.Jun 11 2025, 11:19 AM

Change #1155643 merged by jenkins-bot:

[operations/deployment-charts@master] wikidata-query-gui: Bump query-gui image version

https://gerrit.wikimedia.org/r/1155643

Maintenance_bot removed a project: Patch-For-Review.Jun 11 2025, 2:30 PM

Tarrow moved this task from In Review to Doing on the Wikibase Cloud (Kanban Board Q2 2025) board.Jun 12 2025, 8:02 AM

Change #1163789 had a related patch set uploaded (by Andrew-WMDE; author: Andrew-WMDE):

[wikidata/query/rdf@master] rdf-spark-tools: populate api_user_agent column

https://gerrit.wikimedia.org/r/1163789

gerritbot added a project: Patch-For-Review.Jun 25 2025, 2:04 PM

andrew-wmde opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1464

Draft: search: add api_user_agent column to process_sparql_query table

Tarrow mentioned this in T397528: 🧬 Propagate provenance of scholarly vs main graph queries to processed table.Jun 26 2025, 8:50 AM

Ollie.Shotton_WMDE renamed this task from Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table to 🧬 Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table.Jun 26 2025, 2:15 PM

Ollie.Shotton_WMDE edited parent tasks, added: T394988: 🧬 Investigate measuring federation on WDQS/WBQS UI; removed: T394987: 🧬 Establish high-level metrics & baselines for measuring query federation in the ecosystem.Jun 27 2025, 8:21 AM

Andrew-WMDE moved this task from Doing to In Review on the Wikibase Cloud (Kanban Board Q2 2025) board.Jul 3 2025, 7:53 AM

sowmya.guru moved this task from In Review to PM/UX Verification on the Wikibase Cloud (Kanban Board Q2 2025) board.Jul 4 2025, 8:44 AM

sowmya.guru moved this task from PM/UX Verification to In Review on the Wikibase Cloud (Kanban Board Q2 2025) board.

ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1464

search: add api_user_agent column to process_sparql_query table

Change #1163789 merged by jenkins-bot:

[wikidata/query/rdf@master] rdf-spark-tools: populate api_user_agent column

https://gerrit.wikimedia.org/r/1163789

ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1546

search: Update rdf-spark-tools to 0.3.158

ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1546

search: Update rdf-spark-tools to 0.3.158

Updated bits were deployed and are now running. I re-ran a few recent hours through and we get the following counts; not sure if these are expected. Essentially it sees a bit under 1k/hr in the couple of hours I re-ran.

spark.sql("select hour, count(1), count(api_user_agent) from discovery.processed_external_sparql_query where year=2025 and month=7 and day=8 group by hour").toPandas().set_index('hour').sort_index()

      count(1)  count(api_user_agent)
hour                                 
0       562239                      0
1       546538                      0
2       591988                      0
3       542236                      0
4       548077                      0
5       547726                      0
6       503367                      0
7       541875                      0
8       811956                      0
9       593457                      0
10      585822                      0
11      593838                      0
12      628519                      0
13      920476                    795
14      705369                    986
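
To put the two re-run hours in proportion, this small Python sketch divides the tagged count by the total for each hour. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (total queries, queries with api_user_agent set), from hours 13 and 14 above
counts = {
    13: (920476, 795),
    14: (705369, 986),
}

# Fraction of processed queries that carry an api_user_agent value
shares = {hour: tagged / total for hour, (total, tagged) in counts.items()}

for hour, share in sorted(shares.items()):
    print(f"hour {hour}: {share:.4%} of queries have api_user_agent set")
```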

Maintenance_bot removed a project: Patch-For-Review.Jul 8 2025, 6:30 PM

Anton.Kokh moved this task from In Review to Done on the Wikibase Cloud (Kanban Board Q2 2025) board.Jul 10 2025, 1:18 PM

Anton.Kokh mentioned this in T395044: 🧬Investigate measuring federation on WDQS/WBQS backend.Jul 10 2025, 3:41 PM

Anton.Kokh closed this task as Resolved.Jul 11 2025, 2:31 PM
