Data Platform/Data Lake
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
Data available
- Traffic data
- Webrequest, pageviews, and unique devices
- Edits data
- Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
- Wikitext (latest & historical) and wikidata-entities
- Events data
- EventLogging, EventBus and event streams data (raw, refined, sanitized)
- Commons Impact Metrics
- Contributions to Wikimedia Commons focused on GLAMs
Some of these datasets (such as webrequest) are only available in the Data Lake, while others (such as pageviews) are also available in more aggregated form as Druid data cubes.
Access
The main way to access the data in the Data Lake is to run SQL queries using either Presto or Spark (Hive is also available, but inferior and deprecated).
You can access these engines through several different routes:
- Superset has a graphical SQL editor where you can run Presto queries
- Custom code on one of the stat hosts (the easiest way to do this is to use our Jupyter service)
- for Python, use the Wmfdata-Python package
- for R, use the wmfdata-r package
- Both engines also support ad-hoc querying via command-line programs, which you can run on one of the stat hosts; consult each engine's documentation page for more info.
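Whichever route you use, the queries themselves are ordinary SQL. As a minimal sketch, the following aggregates one hour of pageview data; it assumes the wmf.pageview_hourly columns project, agent_type, and view_count together with the year/month/day/hour partition fields, and filters on the partition fields so only the needed data is read:

```sql
-- Top projects by pageviews for a single hour (runs in both Presto and Spark).
-- Assumes the wmf.pageview_hourly columns project, agent_type, view_count
-- and the year/month/day/hour partition fields.
SELECT
  project,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2024
  AND month = 1
  AND day = 1
  AND hour = 0
  AND agent_type = 'user'   -- exclude automated traffic
GROUP BY project
ORDER BY views DESC
LIMIT 10;
```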
Syntax differences between the SQL engines
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax.
| use case | Spark | Presto | Hive |
|---|---|---|---|
| keyword for the string data type | STRING | VARCHAR | STRING |
| string literal | 'foo', "foo" | 'foo' | 'foo', "foo" |
| keyword for 32-bit float data type | FLOAT, REAL | REAL | FLOAT |
| keyword for 64-bit float data type | DOUBLE | DOUBLE | DOUBLE |
| select a column named with a reserved word (e.g. DATE) | `date` | "date" | `date` |
| get the length of an array | SIZE(a) | CARDINALITY(a) | SIZE(a) |
| concatenate strings with a separator | CONCAT_WS | not available | CONCAT_WS |
| count rows which match a condition | COUNT_IF(x = y) | COUNT_IF(x = y) | SUM(CAST(x = y AS INT)) |
| transform integer year/month/day fields to a date string | MAKE_DATE(year, month, day) | CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) | CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) |
| convert an ISO 8601 timestamp string (e.g. "2021-11-01T01:23:02Z") to an SQL timestamp | TO_TIMESTAMP(dt) | FROM_ISO8601_TIMESTAMP(dt) | FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) |
| divide integers, returning a float if necessary | x / y | CAST(x AS DOUBLE) / y | x / y |
| select the first or last rows in a group | FIRST, LAST | not available | not available |
- It's useful to get in the habit of using single-quoted text ('foo') for strings, since all three engines interpret it the same way. Double-quoted text ("foo") is interpreted as a string in Spark and Hive, but as a column name in Presto.
- Escaping special characters in string literals works differently in Spark and Presto. See this notebook for more details.
- See also: Presto's guide to migrating from Hive
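Putting a few of these differences together, here is the same hypothetical aggregation written for each engine (the my_table and action names are made up for the example):

```sql
-- Spark: backticks quote the reserved column name; COUNT_IF is available
SELECT `date`, COUNT_IF(action = 'click') AS clicks FROM my_table GROUP BY `date`;

-- Presto: double quotes quote the reserved column name; string literals must use single quotes
SELECT "date", COUNT_IF(action = 'click') AS clicks FROM my_table GROUP BY "date";

-- Hive: backticks again, but no COUNT_IF, so sum a casted boolean instead
SELECT `date`, SUM(CAST(action = 'click' AS INT)) AS clicks FROM my_table GROUP BY `date`;
```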
Integer division in Presto
If you divide integers, Hive and Spark will return a floating-point number if necessary (e.g. 1 / 3 returns 0.333333). However, Presto performs integer division and returns an integer (e.g. 1 / 3 returns 0). Use CAST(x AS DOUBLE) on one of the operands to work around this. DOUBLE is a 64-bit floating-point number, while REAL is a 32-bit floating-point number.
There are some quirks to be aware of with this behavior:
SELECT
2/5 AS "none",
CAST(2 AS DOUBLE)/5 AS "numerator",
2/CAST(5 AS DOUBLE) AS "denominator",
CAST(2/5 AS DOUBLE) AS "outer",
2/5 * CAST(100 AS DOUBLE) AS "percentage (a)",
CAST(2/5 AS DOUBLE) * 100 AS "percentage (b)",
CAST(2 AS DOUBLE) / 5 * 100 AS "percentage (c)",
1.0 * 2 / 5 AS "percentage (d)"
These produce:
- none: 0 (because 2/5 is rounded towards 0 to keep the output data type integer, same as input)
- numerator, denominator: 0.4
- outer: 0 (because 2/5 is implicitly cast to integer BEFORE being explicitly cast as double)
- percentage
  - (a): 0 (same as "none" – 2/5 is cast to int and rounded towards 0 before it reaches the double-typed 100)
  - (b): 0 (same as outer)
  - (c): 40
  - (d): 0.4 (1.0 * 2 yields a non-integer, so the division works, although this expression is not multiplied by 100)
So let's say your query has SUM(IF(event.action = 'click', 1, 0)) / COUNT(1) to calculate clickthrough rate. It'll be 0 unless you:
- explicitly cast either the denominator or the numerator to double, or
- implicitly cast by multiplying by 1.0 (in the example above, following the order of operations, 1.0 * 2 becomes 2.0, which is then divided by 5); a corrected version of the clickthrough-rate query is sketched below
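For example, applying the first workaround to that clickthrough-rate calculation in Presto (the event table and field names here are just illustrative):

```sql
-- Without the CAST, integer division would make this 0 in Presto.
SELECT
  CAST(SUM(IF(event.action = 'click', 1, 0)) AS DOUBLE) / COUNT(1) AS clickthrough_rate
FROM my_event_table;
```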
Table and file formats
Data Lake tables can be created using either Hive format or Iceberg format. Iceberg is the successor to Hive and is highly recommended for new tables. As of Feb 2024, the existing tables in the wmf database are being slowly migrated to Iceberg (task T333013).
Both table formats can store data using a variety of underlying file formats; we normally use Parquet with both Hive and Iceberg.
Specifying the table format
When writing tables with Spark, you specify the table format with the USING clause in a CREATE TABLE statement or using the format option in one of the APIs. The proper names of each format are hive and iceberg. Note that you can pass other values for the format (the data source, in Spark's terminology), but these are the two recommended ones.
Hive is the default table format. With either table format, Parquet is the default file format, with Snappy compression for Hive tables and Gzip compression for Iceberg tables.
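For example, a new Iceberg table could be created from Spark SQL like this (the database, table, and column names are hypothetical):

```sql
-- Spark SQL: USING iceberg selects the Iceberg table format;
-- omitting it (or writing USING hive) creates a Hive-format table instead.
-- Parquet remains the underlying file format by default either way.
CREATE TABLE my_database.my_daily_counts (
  dt      TIMESTAMP,
  project STRING,
  views   BIGINT
)
USING iceberg;
```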
Checking the table format
You can check the format of an existing table using DESCRIBE TABLE EXTENDED. The table format is shown in the Provider field.[1]
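For example, in Spark SQL the following lists the table's metadata, including a Provider row whose value is hive or iceberg:

```sql
-- Spark SQL: inspect table metadata; look for the "Provider" row in the output
DESCRIBE TABLE EXTENDED wmf_traffic.session_length;
```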
Production databases
As Data Platform/Systems/Iceberg#Changes to database names mentions, many Hive tables are being split across functionally distinct databases as part of the migration to Iceberg. Other databases have also been created since the original proposal.
The location of individual tables in each database may differ from the location of the database itself. For example, the HDFS location of wmf_traffic.session_length is /wmf/data/wmf_traffic/session_length.
| Database | Functional use | Location | Sample of tables | Table formats |
|---|---|---|---|---|
| wmf_raw (DataHub) | Raw data (e.g. unrefined webrequests, sqooped MediaWiki tables) | wmf_raw.db | mediawiki_*, mediawiki_private_*, webrequest, cirrussearchrequestset, wikibase_*, wikilambda_* | Hive |
| wmf (DataHub) | Production datasets, including refined webrequests | wmf.db | webrequest, mediawiki_history, pageview_hourly, pageview_actor, virtualpageview_hourly, geoeditors_monthly | Hive |
| wmf_content (DataHub) | Data and metrics related to Content (e.g. wikitext) | wmf_content.db | mediawiki_content_history_v1 | Iceberg |
| wmf_contributors (DataHub) | Data and metrics related to Contributors (e.g. editors, moderators) | wmf_contributors.db | editor_month, commons_edits | Iceberg |
| wmf_data_ops | Metadata about other tables, such as data quality | wmf_data_ops.db | data_quality_metrics, data_quality_alerts | Iceberg |
| wmf_experiments | For experiments conducted with Experimentation Lab | wmf_experiments.db | experiment_results_v1, experiment_configs_v1 | Iceberg |
| wmf_product (DataHub) | Product health metrics | wmf_product.db | trust_safety_admin_action_monthly, automoderator_potential_vandalism_reverted, cx_suggestions_menu_interactions_daily | Hive & Iceberg |
| wmf_readership (DataHub) | Data and metrics related to Readership | wmf_readership.db | unique_devices_per_domain_monthly, unique_devices_per_project_family_daily | Iceberg |
| wmf_traffic (DataHub) | Metrics related to traffic to Wikimedia properties | wmf_traffic.db | referrer_daily, browser_general, session_length | Iceberg |
Unless stated otherwise, each database is located inside hdfs://analytics-hadoop/user/hive/warehouse/
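To see what a given database currently contains, you can list its tables from any of the engines, for example:

```sql
-- Spark SQL / Hive: list the tables in one of the production databases
SHOW TABLES IN wmf_traffic;

-- Presto uses FROM instead of IN:
-- SHOW TABLES FROM wmf_traffic;
```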
Technical architecture
Data Lake datasets are stored in the Hadoop Distributed File System (HDFS). The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
Some Data Lake datasets are available in Druid, which is separate from the Data Lake and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
All Subpages of Data Platform/Data Lake
- Content
- Content/Mediawiki content current v1
- Content/Mediawiki content history v1
- Content/Wikidata entity
- Content/Wikidata item page link
- Data Issues
- Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Issues/2021-06-04 Traffic Data Loss
- Data Issues/2023-01-08 Webrequest Data Loss
- Data Issues/2023-11 eventgate-analytics-external Data Loss
- Data Issues/2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic
- Data Issues/2024-10-10 Webrequest Data Loss - Clobbered Hadoop Temporary Dir
- Data Issues/2025-04-11 Mediawiki History duplicate revisions and excess reverts
- Data Issues/2025-06-03 May 2025 spike in bot traffic
- Data Issues/2025-06-30 Haproxykafka silently stopped sending request data to Kafka
- Edits
- Edits/Edit hourly
- Edits/Geoeditors
- Edits/Geoeditors/Public
- Edits/MediaWiki history
- Edits/MediaWiki history/Revision identity reverts
- Edits/MediaWiki history dumps
- Edits/MediaWiki history dumps/FAQ
- Edits/MediaWiki history dumps/Python spark examples
- Edits/MediaWiki history dumps/Scala spark examples
- Edits/Mediawiki history dumps/Python Dask examples
- Edits/Mediawiki history dumps/Python Pandas examples
- Edits/Mediawiki history reduced
- Edits/Mediawiki page history
- Edits/Mediawiki project namespace map
- Edits/Mediawiki user history
- Edits/Metrics
- Edits/Public
- Edits/Structured data/Commons entity
- Events
- Project History
- Public Data Lake
- Traffic
- Traffic/Banner activity
- Traffic/BotDetection
- Traffic/Browser general
- Traffic/Caching
- Traffic/Interlanguage
- Traffic/Mediacounts
- Traffic/Pagecounts-ez
- Traffic/Pageview actor
- Traffic/Pageview hourly
- Traffic/Pageview hourly/Fingerprinting Over Time
- Traffic/Pageview hourly/Identity reconstruction analysis
- Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Traffic/Pageview hourly/Sanitization
- Traffic/Pageview hourly/Sanitization algorithm proposal
- Traffic/Pageviews
- Traffic/Pageviews/Bots
- Traffic/Pageviews/Bots Research
- Traffic/Pageviews/Redirects
- Traffic/Projectview hourly
- Traffic/ReaderCounts
- Traffic/SessionLength
- Traffic/Unique Devices
- Traffic/Unique Devices/Automated traffic correction
- Traffic/Unique Devices/Last access solution
- Traffic/Unique Devices/Last access solution/Validation
- Traffic/UserRetention
- Traffic/Virtualpageview hourly
- Traffic/Webrequest
- Traffic/Webrequest/April 2025 Varnish to HAProxy Migration
- Traffic/Webrequest/RawIPUsage
- Traffic/Webrequest/Tagging
- Traffic/mediawiki api request
- Traffic/mobile apps session metrics
- Traffic/mobile apps uniques
- Traffic/referrer daily
- Traffic/referrer daily/Dashboard
- ↑ Weirdly, the Provider field is unreliable if you get it from SHOW TABLE EXTENDED. In that case, Iceberg tables show hive.