
🧬 Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table
Closed, Resolved (Public)

Description

The goal of this investigation should be to figure out if there is a way to mark requests coming from the QueryService UI to the wikidata blazegraph so that we can identify them in the discovery processed query table.

We should keep in mind that we might also want to record queries as coming from other WMDE-maintained clients in the future, for example other queryservice-ui instances and the query builder. Therefore this mark should probably *not* look like queryservice-gui = true but rather like client-provenance=queryservice-gui-<maybe version>.

See https://wikitech.wikimedia.org/wiki/Provenance for a possible way to mark these requests with the wprov query parameter. That page also links to https://wikitech.wikimedia.org/wiki/X-Analytics, the header through which this value shows up in the webrequest log. You can also consider adding values to this header client-side, but only a limited set of keys is supported.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
search: Update rdf-spark-tools to 0.3.158 | repos/data-engineering/airflow-dags!1546 | ebernhardson | work/ebernhardson/rdf-158 | main
search: add api_user_agent column to process_sparql_query table | repos/data-engineering/airflow-dags!1464 | andrew-wmde | search-sparql-api-user-agent-column | main
Track where queries are being sent from | repos/wmde/wikidata-query-gui!19 | andrew-wmde | trackable-instance | main

Event Timeline

Here are three ways we can achieve this:

(1) Add a query parameter with a custom name (e.g., source).
(2) Set the wprov query parameter (see https://wikitech.wikimedia.org/wiki/Provenance).
(3) Set the api-user-agent request header. It is one of the few headers allowed by WMF's access-control-allow-headers policy.

Each of these is demonstrated in the draft patch on GitLab.

While it is possible to set user-agent, Chrome browsers may silently drop this header.

Note: The User-Agent header used to be forbidden, but no longer is. However, Chrome still silently drops the header from Fetch requests (see Chromium bug 571722).

see https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
see https://issues.chromium.org/issues/40450316

WMF's workaround for this user-agent issue is to use api-user-agent.

Browser-based applications written in JavaScript are typically forced to send the same User-Agent header as the browser that hosts them. This is not a violation of policy, however such applications are encouraged to include the Api-User-Agent header to supply an appropriate agent.

see https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy

While all three approaches work and are logged correctly in the analytics database, there are a few caveats to consider. Going with (1) will likely hurt the WMF's cache hit rate, because from the proxy's perspective a URL with an extra query parameter is a different resource. Option (2) has a cache exception, so different wprov values will not break the cache (see https://github.com/wikimedia/operations-puppet/blob/b0175e8d56eb7a8960fbd0d35955334a4f042e1e/modules/varnish/templates/analytics.inc.vcl.erb#L141-L162). However, the values we plan to collect would not adhere to the WMF's format for wprov (see https://wikitech.wikimedia.org/wiki/Provenance). Using either (1) or (2) will also expose these query parameters in the URL when using the query sharing functionality.
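To illustrate the caching caveat, here is a toy model of a cache key, assuming a proxy that keys on path plus query string and ignores wprov. This is not the actual Varnish configuration; `CACHE_EXEMPT_PARAMS` and `cache_key` are hypothetical names used only for this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption for illustration: the proxy drops these parameters before
# computing the cache key (the thread says wprov has such an exception).
CACHE_EXEMPT_PARAMS = {"wprov"}


def cache_key(url: str) -> str:
    """Return a simplified cache key: path plus the non-exempt query parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in CACHE_EXEMPT_PARAMS
    ]
    return f"{parts.path}?{urlencode(kept)}"


base = "https://query.wikidata.org/sparql?query=ASK%7B%7D"
same_key = cache_key(base + "&wprov=qs-ui") == cache_key(base)        # wprov is ignored
different_key = cache_key(base + "&source=qs-ui") != cache_key(base)  # source splits the cache
```

Under this model a request header such as api-user-agent never enters the key at all, which is why option (3) leaves caching untouched.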

Regardless of which option we choose, we will need to update the parser for the discovery.processed_external_sparql_query table.

Formatting examples:

?source=qs-ui-query.wikidata.org
?wprov=qs-ui-query.wikidata.org
api-user-agent: "qs-ui (https://query.wikidata.org)"
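As a rough sketch of how a client could produce each of these three forms, the helper below builds the URL and headers for one request without sending it. The endpoint URL, the sample query, and the function name `build_request` are assumptions for illustration; the actual GUI code lives in the GitLab patch and may differ.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"   # assumed endpoint
SPARQL = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"   # placeholder query


def build_request(option: int) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) marked according to option 1, 2 or 3."""
    params = {"query": SPARQL}
    headers: Dict[str, str] = {}
    if option == 1:
        # (1) custom query parameter
        params["source"] = "qs-ui-query.wikidata.org"
    elif option == 2:
        # (2) wprov query parameter
        params["wprov"] = "qs-ui-query.wikidata.org"
    elif option == 3:
        # (3) request header; the URL stays unchanged
        headers["Api-User-Agent"] = "qs-ui (https://query.wikidata.org)"
    else:
        raise ValueError(f"unknown option: {option}")
    return f"{ENDPOINT}?{urlencode(params)}", headers
```

Only options (1) and (2) change the URL, so only they would show up in a shared query link.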

Example queries for use on WMF's Superset:

-- (1) custom "source" query parameter
SELECT *
FROM event.wdqs_external_sparql_query
WHERE
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(params, 'source')
  AND starts_with(element_at(params, 'source'), 'qs-ui')

-- (2) wprov query parameter, surfaced as the x-wmf-wprov request header
SELECT *
FROM event.wdqs_external_sparql_query
WHERE
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(http.request_headers, 'x-wmf-wprov')
  AND starts_with(element_at(http.request_headers, 'x-wmf-wprov'), 'qs-ui')

-- (3) api-user-agent request header
SELECT *
FROM event.wdqs_external_sparql_query
WHERE
  year = 2025 AND month = 6 AND day = 4
  AND map_key_exists(http.request_headers, 'api-user-agent')
  AND starts_with(element_at(http.request_headers, 'api-user-agent'), 'qs-ui')
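The same three filters can be sketched in Python for a single event record, which is roughly the check a parser for the processed table would need. The record layout (a `params` map and an `http.request_headers` map with lower-case keys) is taken from the queries above; the function name `find_provenance` is hypothetical, and this is not the rdf-spark-tools implementation.

```python
from typing import Mapping, Optional, Tuple

PREFIX = "qs-ui"  # marker prefix used in the formatting examples


def find_provenance(event: Mapping) -> Optional[Tuple[str, str]]:
    """Return (method, value) for the first marker starting with PREFIX, else None."""
    params = event.get("params") or {}
    headers = (event.get("http") or {}).get("request_headers") or {}
    candidates = (
        ("source", params.get("source")),                   # option (1)
        ("wprov", headers.get("x-wmf-wprov")),              # option (2)
        ("api-user-agent", headers.get("api-user-agent")),  # option (3)
    )
    for method, value in candidates:
        if value is not None and value.startswith(PREFIX):
            return method, value
    return None
```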

Added an approval comment to the GitLab merge request. I can confirm that filtering the data works with each of the three methods.

Thanks for working on this @Andrew-WMDE and confirming the three approaches work @dena.

Looking at the options, my vote would be for (3), because it is encouraged by the WMF and doesn't affect caching or query sharing.

Strongly agree with the other two people; let's get this thing shipped and we can start collecting data. I asked @karapayneWMDE and checked that we wouldn't be treading on toes to ship this independently. I've approved and merged the patch. Let's look at deploying to Wikidata tomorrow?

@Lucas_Werkmeister_WMDE also asked how this differs from referer. I believe there isn't a huge difference, although this header is less likely to be suppressed by the browser. It also maps nicely to being used in other places (e.g. querybuilder or cross-blazegraph requests).

Do we want a separate ticket for the parser changes?

Change #1155643 had a related patch set uploaded (by Andrew-WMDE; author: Andrew-WMDE):

[operations/deployment-charts@master] wikidata-query-gui: Bump query-gui image version

https://gerrit.wikimedia.org/r/1155643

Change #1155643 merged by jenkins-bot:

[operations/deployment-charts@master] wikidata-query-gui: Bump query-gui image version

https://gerrit.wikimedia.org/r/1155643

Change #1163789 had a related patch set uploaded (by Andrew-WMDE; author: Andrew-WMDE):

[wikidata/query/rdf@master] rdf-spark-tools: populate api_user_agent column

https://gerrit.wikimedia.org/r/1163789

Ollie.Shotton_WMDE renamed this task from Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table to 🧬 Investigate a way to propagate the provenance of queryservice-ui source data into the processed query table. Jun 26 2025, 2:15 PM

Change #1163789 merged by jenkins-bot:

[wikidata/query/rdf@master] rdf-spark-tools: populate api_user_agent column

https://gerrit.wikimedia.org/r/1163789

Updated bits were deployed and are now running. I re-ran a few recent hours and we get the following counts; I'm not sure if these are expected. Essentially it sees a bit under 1k/hr of queries with api_user_agent set in the couple of hours I re-ran.

spark.sql("select hour, count(1), count(api_user_agent) from discovery.processed_external_sparql_query where year=2025 and month=7 and day=8 group by hour").toPandas().set_index('hour').sort_index()
      count(1)  count(api_user_agent)
hour                                 
0       562239                      0
1       546538                      0
2       591988                      0
3       542236                      0
4       548077                      0
5       547726                      0
6       503367                      0
7       541875                      0
8       811956                      0
9       593457                      0
10      585822                      0
11      593838                      0
12      628519                      0
13      920476                    795
14      705369                    986
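To put the non-zero rows in proportion, a quick calculation on the two re-run hours, using only the counts from the table above (the variable and function names are illustrative):

```python
# hour -> (total rows, rows with api_user_agent), copied from the output above
hourly_counts = {13: (920476, 795), 14: (705369, 986)}


def tagged_share(total: int, tagged: int) -> float:
    """Fraction of rows in an hour that carry an api_user_agent value."""
    return tagged / total


shares = {hour: tagged_share(*counts) for hour, counts in hourly_counts.items()}
for hour, share in sorted(shares.items()):
    print(f"hour {hour}: {share:.4%} of queries have api_user_agent set")
```

Both hours come out on the order of 0.1% of all processed queries.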