The Wikimedia Foundation’s Editing team is working on a set of improvements for the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.
This work is guided by the Wikimedia Foundation Annual Plan, specifically by Wiki Experiences 1.2: “Widespread deployment of interventions shown to collectively cause a 10% relative increase (y-o-y) on mobile web and a 25% relative increase (y-o-y) on iOS of newcomers who publish ≥1 constructive edit in the main namespace on a mobile device, as measured by controlled experiments.”
In this AB test, we are evaluating the impact of showing multiple Reference Checks within a single editing session. An editing session is defined as a period of activity starting with a contributor clicking an edit button and ending when they publish or abandon the edit. The Reference Check invites users who have added more than 50 new characters to a page in the article namespace to include a reference with the edit they’re making if they have not already done so at the time they indicate their intent to save.
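As a rough illustration, the trigger condition described above can be sketched as a simple predicate. The function and argument names below are hypothetical, not the actual VisualEditor/Edit Check implementation:

```r
# Illustrative sketch only -- names are hypothetical, not the real implementation.
should_show_reference_check <- function(new_chars_added,
                                        already_added_reference,
                                        is_article_namespace) {
  is_article_namespace &&
    new_chars_added > 50 &&          # more than 50 new characters added
    !already_added_reference         # no reference added by the user themselves
}

should_show_reference_check(120, FALSE, TRUE)  # TRUE: check is shown
should_show_reference_check(30, FALSE, TRUE)   # FALSE: too little new content
```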
In the current default experience, a single Reference Check is presented even in cases when the edit someone is attempting may warrant multiple references (e.g. adding new sentences in separate sections). The Multi-Check references experience removes this constraint and allows multiple Reference Checks to be presented in a single edit when the edit they are attempting warrants them.
The findings from this A/B test will be relevant for the near-term future where multiple Edit Checks of the same and/or different types (e.g. Peacock Check, Paste Check, etc.) have the potential to become activated within a single edit session.
You can find more information about features of this tool and project updates on the project page.
Methodology
The team ran an AB test from 25 March 2025 through 15 May 2025 to determine the impact of presenting multiple Reference Checks within a single session.
Specifically, we wanted to learn what changes, if any, in edit quality and completion we observe when people have the potential to see multiple Reference Checks within a single edit. More details on the measurement plan and decision scenarios are documented in the task description.
During this experiment, 50% of users editing a desktop or mobile main namespace page using Visual Editor were randomly assigned to the test group and could be shown multiple Reference Checks if their edit met the specified requirements during their edit, and 50% were randomly assigned to the control group and only shown one Reference Check in an editing session even if their edit warranted multiple references (the default editing experience at partner wikis).
The test included all mobile web and desktop contributors (both registered and unregistered) to the 12 participating wikis that started an edit with Visual Editor (see the full list of participating Wikipedias in the task description). Users remained in the same test group for the duration of the test. We also limited the analysis to edits completed by unregistered users and users with 100 or fewer edits, as those are the users that would be shown Reference Check under the default configuration settings. Edits completed with Visual Editor account for about 48% of all main namespace edits by these users at the partner wikis.
Figure 1: Multi-Check AB Test Bucketing Overview
As shown in Figure 1, not all edits bucketed in the AB test met the requirements for being shown one or more Reference Checks. A Reference Check was only shown if the contributor met the specified requirements at the time they indicated their intent to save by clicking the pre-publish button. If the user was in the test group, they could be shown multiple checks if their edit warranted multiple references.
Reference Check was presented at least once in 6,435 published new content edits in the control group and 6,277 published new content edits in the test group.
In the test group, multiple Reference Checks were presented in a single editing session in 27% (1,697) of all published new content edits where Reference Check was activated.
Of the edits shown multiple checks, the majority (72.8%) were shown between 2 and 5 reference checks within a single session. 5% of these multi-check edits (87 new content edits) were shown over 16 reference checks within a single session.
For each key performance metric and secondary metric, we reviewed the following dimensions:
overall by experiment group (test and control),
by platform (mobile web or desktop),
by user experience and status,
and by partner Wikipedia.
We also compared edits shown multiple Reference Checks in a single session in the test group to edits that were only presented a single Reference Check. For edits presented more than one Reference Check, we reviewed a split by the number of checks shown to determine if there was a significant metric change at a certain number of checks presented.
Summary of Results
KPI 1: Proportion of new content edits with a reference. Users are more likely to include at least one reference with their new content edits when multi-check (references) is available. We confirmed a statistically significant 5.9% increase in the proportion of all new content edits with a reference in the test group where multi-check was available compared to all new content edits completed in the control group. We observed similar increases for edits published on desktop and mobile web. Edits that were shown multiple Reference Checks in a session are 1.3 times more likely to include at least one new reference in the final published edit compared to sessions shown a single Reference Check.
KPI 2: Revert Rate: Overall, we did not identify any significant changes in new content edit revert rate between the control and test groups. However, there was a 34.7% decrease in revert rate when directly comparing edits presented multiple checks to edits presented a single Reference Check. This decrease is likely in part because the types of edits that warrant multiple Reference Checks are less likely to be reverted than the types of edits that warrant only a single check.
Secondary Metric: Constructive Activation: We did not identify any significant changes in overall constructive activation rates on desktop or mobile web due to the introduction of multi-check. Overall, the constructive activation rate was 17.7% in the control group and 17.6% in the test group.
Secondary Metric: Proportion of users that publish at least one new content edit with a reference: Overall, there was a 5.5% increase in the proportion of distinct users who published a new content edit with a reference when multi-check was available. Users were also more likely to publish a new content edit with a reference on desktop than on mobile web when multi-check is available: we observed a 6.2% increase in the proportion of distinct users that published a new content edit with a reference on desktop, compared to no statistically significant increase on mobile web.
Guardrail Summary:
We did not observe any significant changes in the identified guardrails to indicate that the introduction of multi-check is negatively impacting the user’s editing experience. There were no decreases in edit completion rate for up to 5 reference checks being presented in a single session (which accounts for the majority of multi-check edits). We also did not identify any increases in revert rate at any number of reference checks presented.
Key Performance Indicator 1: Proportion of published new content edits that include a reference
Hypothesis: The quality of new content edits users make in the main namespace will increase because a greater percentage of these edits will include a reference.
Methodology: We reviewed the proportion of published new content edits (editcheck-newcontent) where people were shown at least one Reference Check and included at least one net new reference (editcheck-newreference). Please see the edit tag mediawiki page for more details on how these tags are applied.
# Set fields and factor levels to assess number of checks shown.
# Note: limited to 1 sidebar open as we're looking for cases where multiple
# checks were presented in a single sidebar at save attempt (vs the user going
# back and forth between save attempt moments).
edit_check_publish_data <- edit_check_publish_data %>%
  mutate(
    multiple_checks_shown = ifelse(
      n_checks_shown > 1 & n_sidebar_opens < 2,
      "multiple checks shown", "single check shown"
    ),
    multiple_checks_shown = factor(
      multiple_checks_shown,
      levels = c("single check shown", "multiple checks shown")
    )
  )

# Note: these buckets can be adjusted as needed based on the distribution of the data.
edit_check_publish_data <- edit_check_publish_data %>%
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ "0",
      n_checks_shown == 1 | (n_checks_shown > 1 & n_sidebar_opens >= 2) ~ "1",
      n_checks_shown == 2 & n_sidebar_opens < 2 ~ "2",
      n_checks_shown > 2 & n_checks_shown <= 5 & n_sidebar_opens < 2 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 & n_sidebar_opens < 2 ~ "6-10",
      n_checks_shown > 10 & n_checks_shown <= 15 & n_sidebar_opens < 2 ~ "11-15",
      n_checks_shown > 15 & n_checks_shown <= 20 & n_sidebar_opens < 2 ~ "16-20",
      n_checks_shown > 20 & n_sidebar_opens < 2 ~ "over 20"
    ),
    checks_shown_bucket = factor(
      checks_shown_bucket,
      levels = c("0", "1", "2", "3-5", "6-10", "11-15", "16-20", "over 20")
    )
  )
Overall by Experiment Group
Code
published_edits_reference_overall <- edit_check_publish_data %>%
  # limit to new content edits where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%
  group_by(test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%"))
Code
published_edits_reference_overall_table <- published_edits_reference_overall %>%
  gt() %>%
  tab_header(
    title = "New content edits where reference check was shown and that include a new reference"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_edits = "Number of new content edits shown reference check",
    n_edits_wref = "Number of new content edits with new reference",
    prop_edits = "Proportion of new content edits with a new reference"
  ) %>%
  tab_source_note(
    gt::md("Limited to new content edits where at least one reference check was shown")
  )

display_html(as_raw_html(published_edits_reference_overall_table))
New content edits where reference check was shown and that include a new reference

| Experiment Group | Number of new content edits shown reference check | Number of new content edits with new reference | Proportion of new content edits with a new reference |
|---|---|---|---|
| control (single check) | 6,435 | 2,623 | 40.8% |
| test (multiple checks) | 6,277 | 2,712 | 43.2% |

Limited to new content edits where at least one reference check was shown
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reference_overall %>%
  ggplot(aes(x = test_group, y = n_edits_wref / n_edits, fill = test_group)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_edits_wref, "edits"), fontface = 2),
    vjust = 1.2, size = 10, color = "white"
  ) +
  labs(
    y = "Percent of new content edits",
    x = "Experiment Group",
    title = "New content edits that include a new reference",
    caption = "Limited to new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
There was a statistically significant 5.9% increase (2.4 percentage points) in the proportion of new content edits with a reference in the test group, where multi-check was available to users. This includes all new content edits where at least one reference check was shown.
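As a sanity check, the headline comparison can be reproduced from the counts in the table above with a standard two-sample proportion test (stats::prop.test); the exact test used for the report may differ:

```r
# Counts from the table above (control vs test)
with_ref <- c(control = 2623, test = 2712)
shown    <- c(control = 6435, test = 6277)

rates <- with_ref / shown                     # ~40.8% vs ~43.2%
relative_increase <- (rates[["test"]] - rates[["control"]]) / rates[["control"]]
relative_increase                             # ~0.06 (5.9% when computed from rounded percentages)

prop.test(with_ref, shown)$p.value < 0.05     # significant at the 0.05 level
```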
By whether multiple reference checks were shown
Code
published_edits_reference_ifmultiple <- edit_check_publish_data %>%
  # limit to new content edits where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%
  group_by(test_group, multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%"))
Code
published_edits_reference_ifmultiple_table <- published_edits_reference_ifmultiple %>%
  gt() %>%
  tab_header(title = "New content edits that include a new reference") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    multiple_checks_shown = "Multiple checks shown",
    n_edits = "Number of new content edits shown reference check",
    n_edits_wref = "Number of new content edits with new reference",
    prop_edits = "Proportion of new content edits with a new reference"
  ) %>%
  tab_source_note(
    gt::md("Limited to new content edits where at least one reference check was shown")
  )

display_html(as_raw_html(published_edits_reference_ifmultiple_table))
New content edits that include a new reference

| Experiment Group | Multiple checks shown | Number of new content edits shown reference check | Number of new content edits with new reference | Proportion of new content edits with a new reference |
|---|---|---|---|---|
| control (single check) | single check shown | 6,435 | 2,623 | 40.8% |
| test (multiple checks) | single check shown | 4,580 | 1,788 | 39% |
| test (multiple checks) | multiple checks shown | 1,697 | 924 | 54.4% |

Limited to new content edits where at least one reference check was shown
Code
p <- published_edits_reference_ifmultiple %>%
  ggplot(aes(x = multiple_checks_shown, y = n_edits_wref / n_edits, fill = test_group)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  # facet_grid(~multiple_checks_shown) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_edits_wref, "edits"), fontface = 2),
    position = position_dodge(width = 1), vjust = 1.2, size = 8, color = "white"
  ) +
  labs(
    y = "Percent of new content edits",
    x = "Experiment Group",
    title = "New content edits that include a new reference \n by if multiple reference checks were shown",
    caption = "Limited to new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
We also compared editing sessions where multiple Reference Checks were shown to sessions where only a single Reference Check was presented. Edits presented a single Reference Check in the control and test groups add references at the same rate, as those experiences are identical.
Edits shown multiple checks are 1.3 times more likely to include at least one new reference in the final published edit compared to sessions shown a single reference check.
By number of checks shown
Code
published_edits_reference_nchecks <- edit_check_publish_data %>%
  # limit to new content edits in the test group where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1 &
           test_group == "test (multiple checks)") %>%
  group_by(checks_shown_bucket) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%")) %>%
  # sanitize small counts per data publication guidelines
  mutate(
    n_edits_sanitized = ifelse(n_edits < 50, "<50", n_edits),
    n_edits_wref_sanitized = ifelse(n_edits_wref < 50, "<50", n_edits_wref)
  )
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reference_nchecks %>%
  ggplot(aes(x = checks_shown_bucket, y = n_edits_wref / n_edits)) +
  geom_col(position = "dodge", fill = "dodgerblue4") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_edits_wref_sanitized, "edits"), fontface = 2),
    vjust = 1.2, size = 7, color = "white"
  ) +
  labs(
    y = "Percent of new content edits",
    x = "Number of reference checks shown",
    title = "New content edits that include a new reference \n by number of reference checks shown",
    caption = "Limited to new content edits where at least one new reference check was shown"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 20),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
The proportion of edits with a new reference generally increases with an increasing number of checks shown. 57% of edits presented between 6 to 10 Reference Checks included a new reference compared to 39% of edits presented a single Reference Check.
The rate of increase appears to start to diminish around 11 to 15 checks; however, there was also a limited number of edits (107 edits) where more than 10 Reference Checks were presented in a single editing session.
By Platform
Code
published_edits_reference_platform <- edit_check_publish_data %>%
  # limit to new content edits where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%
  group_by(platform, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%"))
Code
published_edits_reference_platform_table <- published_edits_reference_platform %>%
  gt() %>%
  tab_header(title = "New content edits that include a new reference by platform") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of new content edits shown reference check",
    n_edits_wref = "Number of new content edits with new reference",
    prop_edits = "Proportion of new content edits with a new reference"
  ) %>%
  tab_source_note(
    gt::md("Limited to new content edits where at least one reference check was shown")
  )

display_html(as_raw_html(published_edits_reference_platform_table))
New content edits that include a new reference by platform

| Platform | Experiment Group | Number of new content edits shown reference check | Number of new content edits with new reference | Proportion of new content edits with a new reference |
|---|---|---|---|---|
| Desktop | control (single check) | 3,911 | 1,902 | 48.6% |
| Desktop | test (multiple checks) | 3,821 | 1,971 | 51.6% |
| Mobile Web | control (single check) | 2,524 | 721 | 28.6% |
| Mobile Web | test (multiple checks) | 2,456 | 741 | 30.2% |

Limited to new content edits where at least one reference check was shown
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reference_platform %>%
  ggplot(aes(x = test_group, y = n_edits_wref / n_edits, fill = test_group)) +
  geom_col(position = "dodge") +
  facet_grid(~platform) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_edits_wref, "edits"), fontface = 2),
    vjust = 1.2, size = 8, color = "white"
  ) +
  labs(
    y = "Percent of new content edits",
    x = "Experiment Group",
    title = "New content edits that include a new reference \n by platform",
    caption = "Limited to new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
There were similar increases in new content edits with a reference on desktop and mobile web:
Desktop: 6.2% increase in the proportion of new content edits with a reference.
Mobile Web: 5.6% increase in the proportion of new content edits with a reference.
Overall, edits completed on mobile web are less likely to include a new reference compared to edits completed on desktop.
By User Experience
Code
published_edits_reference_userexp <- edit_check_publish_data %>%
  # limit to new content edits where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%"))
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reference_userexp %>%
  ggplot(aes(x = test_group, y = n_edits_wref / n_edits, fill = test_group)) +
  geom_col(position = "dodge") +
  facet_grid(~experience_level_group) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_edits_wref, "edits"), fontface = 2),
    vjust = 1.2, size = 7, color = "white"
  ) +
  labs(
    y = "Percent of new content edits",
    x = "Experiment Group",
    title = "New content edits that include a new reference \n by user experience",
    caption = "Limited to new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    text = element_text(size = 20),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
We also observed similar increases across all reviewed user types (unregistered contributors, newcomers, and Junior Contributors).
By Partner Wikipedia
Code
published_edits_reference_wiki <- edit_check_publish_data %>%
  # limit to new content edits where reference check was shown
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # count new content edits that included a new reference
    n_edits_wref = n_distinct(editing_session[included_new_reference == 1])
  ) %>%
  mutate(prop_edits = paste0(round(n_edits_wref / n_edits * 100, 1), "%"))
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reference_wiki %>%
  # remove wikis with insufficient events
  filter(!wiki %in% c("Afrikaans Wikipedia", "Igbo Wikipedia",
                      "Swahili Wikipedia", "Yoruba Wikipedia")) %>%
  ggplot(aes(x = test_group, y = n_edits_wref / n_edits, fill = test_group)) +
  geom_col(position = "dodge") +
  facet_wrap(~wiki, nrow = 2) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_edits), fontface = 2), vjust = 1.2, size = 6, color = "white") +
  labs(
    y = "Percent of new content edits",
    x = "Experiment Group",
    title = "New content edits that include a new reference \n by partner Wikipedia",
    caption = "Includes all new content edits where at least one reference check was shown. \n Excludes smaller wikis where insufficient events were logged."
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    text = element_text(size = 24),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
Results vary by partner Wikipedia. We observed increases in the proportion of new content edits that include a reference across half the partner Wikipedias with the highest increase observed at Spanish Wikipedia (24.5% increase [8 percentage points]).
Modeling the impact of multi-check on whether a new content edit includes a reference
We next explored different models to infer the impact of offering multi-check on the likelihood that a new content edit will include a reference, while also accounting for variability across different wikis and users. This allows us to confirm whether the observed increase above is statistically significant (i.e., unlikely to have occurred by random chance).
We used a Bayesian hierarchical regression model to capture this structure. For this model, we used whether at least one new reference was included as the response variable, the user’s assigned test group as the predictor variable, and the user and Wikipedia as random effects.
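The model structure described above can be written as an R mixed-effects formula: a fixed effect for the assigned test group, plus random intercepts for user and wiki. The variable names below are assumptions based on the fields used elsewhere in this analysis, not necessarily those in the fitted model:

```r
# Sketch of the model structure: response, fixed effect, and random intercepts.
# Variable names are assumptions, not necessarily those used in the fitted model.
model_formula <- included_new_reference ~ test_group + (1 | user) + (1 | wiki)

inherits(model_formula, "formula")  # TRUE
```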
Code
# limit to new content edits where reference check was shown
edit_check_publish_data_model <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1)
Code
# redefine including a reference as a factor for use in the model
edit_check_publish_data_model$included_new_reference <- factor(
  edit_check_publish_data_model$included_new_reference,
  levels = c(0, 1)
)
Code
priors <- c(
  set_prior(prior = "std_normal()", class = "b"),
  set_prior("cauchy(0, 5)", class = "sd")
)
Since the model parameters are on the log-odds scale, we needed to apply the following transformations to make sense of them.
We used the “divide-by-4” rule suggested by Gelman, Hill, and Vehtari 2021 1 to approximate the maximum increase in the probability of success corresponding to which experience (single check or multi-check) was presented. Using the Bayesian model, we can also directly calculate the average lift.
Since the model parameters are on the log-odds scale, we need to exponentiate the effect (exp(β1)) to determine the multiplicative effect on the odds of an edit including at least one new reference.
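Both transformations are simple arithmetic on the fitted coefficient. Below, beta is a hypothetical log-odds coefficient chosen only to illustrate the scale of the reported estimates (the actual posterior estimate is not reproduced here):

```r
beta <- 0.215  # hypothetical log-odds effect of the test group (illustrative only)

exp(beta)  # multiplicative effect on the odds: ~1.24 (report: ~1.2 times)
beta / 4   # "divide-by-4" upper bound on the probability increase: ~0.054 (5.4%)
```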
Based on estimates from the model, we found that edits where multi-check is available are 1.2 times more likely to include a new reference in their new content edit.
We also found there is an average 5.1% increase (maximum 5.4% increase) in the probability of an edit including a new reference when switching from the single check experience to the multi-check experience. We can confirm statistical significance at the 0.05 level for all of these estimates (as indicated by credible intervals that exclude the null effect).
Key Insights
There was a 5.9% increase in the proportion of new content edits with a reference in the test group, where multi-check was available to users.
Sessions shown multiple checks are more likely to include at least one new reference in the final published edit compared to sessions shown a single reference check. For edits where multiple checks were presented, there was a 36% increase in the proportion of new content edits with a reference compared to edits where only one check was presented. It’s worth noting that edits that would warrant multiple reference checks are likely larger and also more likely to include at least one new reference.
The proportion of edits with a new reference generally increases with the number of checks shown. 57% of edits presented between 6 and 10 Reference Checks included a new reference, compared to 39% of edits presented a single Reference Check.
Increases were observed across all reviewed user types and platforms, and at half of the partner wikis. We observed similar increases on desktop and mobile web.
KPI 2: Proportion of published edits that add new content and are reverted within 48 hours
We also reviewed revert rate to determine the impact of introducing multi-check on the quality of edits being published.
Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will include a reference or an explicit acknowledgement as to why these edits lack references.
Methodology: We reviewed the proportion of all new content edits in the control and test groups that were reverted within 48 hours. We limited the analysis to new content edits where at least one reference check was shown.
Code
published_edits_reverted_overall_table <- published_edits_reverted_overall %>%
  gt() %>%
  tab_header(title = "New content edit revert rate of edits shown Reference Check") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    n_content_edits = "Number of new content edits",
    n_reverted_edits = "Number of new content edits reverted",
    prop_edits = "Proportion of new content edits reverted"
  ) %>%
  tab_source_note(
    gt::md("Limited to published new content edits where at least one reference check was shown")
  )

display_html(as_raw_html(published_edits_reverted_overall_table))
New content edit revert rate of edits shown reference check

| Experiment group | Number of new content edits | Number of new content edits reverted | Proportion of new content edits reverted |
|---|---|---|---|
| control (single check) | 6,435 | 1,448 | 22.5% |
| test (multiple checks) | 6,277 | 1,481 | 23.6% |

Limited to published new content edits where at least one reference check was shown
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reverted_overall %>%
  ggplot(aes(x = test_group, y = n_reverted_edits / n_content_edits, fill = test_group)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_reverted_edits, "reverted edits"), fontface = 2),
    vjust = 1.2, size = 8, color = "white"
  ) +
  labs(
    y = "Percent of new content edits reverted",
    x = "Experiment Group",
    title = "New content edit revert rate of edits shown reference check",
    caption = "Limited to published new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 20),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Overall, across all new content edits where Reference Check was shown, we observed a slight 5% increase (1 percentage point) in the revert rate of new content edits shown at least one reference check when multiple reference checks were available to eligible edits (test group). These results are not statistically significant (p-value of 0.0719).
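For reference, a standard proportion test on the counts in the table above gives a similar picture. The reported p-value (0.0719) suggests a one-sided test was used; the exact method in the report may differ:

```r
# Counts from the revert-rate table above
reverted <- c(control = 1448, test = 1481)
edits    <- c(control = 6435, test = 6277)

# Two-sided test: the difference is not significant at the 0.05 level
prop.test(reverted, edits)$p.value > 0.05                 # TRUE

# One-sided test of an increase in the test group (control rate < test rate)
prop.test(reverted, edits, alternative = "less")$p.value
```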
We observed slight increases for edits published both with and without a new reference, suggesting that the observed overall increase was due to chance.
Code
published_edits_reverted_ifmultiple_table <- published_edits_reverted_ifmultiple %>%
  gt() %>%
  tab_header(
    title = "New content edit revert rate of edits shown reference check by if multiple checks shown"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple checks shown",
    n_content_edits = "Number of new content edits",
    n_reverted_edits = "Number of new content edits reverted",
    prop_edits = "Proportion of new content edits reverted"
  ) %>%
  tab_source_note(
    gt::md("Limited to published new content edits where at least one reference check was shown")
  )

display_html(as_raw_html(published_edits_reverted_ifmultiple_table))
New content edit revert rate of edits shown reference check by if multiple checks shown

| Experiment group | Multiple checks shown | Number of new content edits | Number of new content edits reverted | Proportion of new content edits reverted |
|---|---|---|---|---|
| control (single check) | single check shown | 6,435 | 1,448 | 22.5% |
| test (multiple checks) | single check shown | 4,580 | 1,213 | 26.5% |
| test (multiple checks) | multiple checks shown | 1,697 | 268 | 15.8% |

Limited to published new content edits where at least one reference check was shown
Code
dodge <- position_dodge(width = 0.9)

p <- published_edits_reverted_ifmultiple %>%
  ggplot(aes(x = multiple_checks_shown, y = n_reverted_edits / n_content_edits, fill = test_group)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(prop_edits, "\n", n_reverted_edits, "\n reverted edits"), fontface = 2),
    position = position_dodge(width = 1), vjust = 1.2, size = 6.5, color = "white"
  ) +
  labs(
    y = "Percent of new content edits reverted",
    x = "Experiment Group",
    title = "New content edit revert rate by \n if multiple reference checks were shown",
    caption = "Limited to new content edits where at least one reference check was shown"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
There was a 34.7% relative decrease in revert rate when directly comparing edits presented multiple checks with edits presented a single reference check. This decrease is likely in part because the types of edits that warrant multiple reference checks are less likely to be reverted than the types of edits that warrant only a single check.
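As a check on the arithmetic (an illustrative sketch in Python, not the report's own R code), comparing the revert rate of multiple-check edits (268 of 1697) against the pooled single-check edits from both experiment groups reproduces a roughly 35% relative decrease:

```python
# Hypothetical recomputation of the relative revert-rate change from the table above.
multi_reverted, multi_edits = 268, 1697
single_reverted = 1448 + 1213   # reverted single-check edits, control + test
single_edits = 6435 + 4580      # all single-check edits, control + test

multi_rate = multi_reverted / multi_edits       # ~15.8%
single_rate = single_reverted / single_edits    # ~24.2%
relative_change = multi_rate / single_rate - 1  # ~ -0.35

print(f"{relative_change:+.1%}")
```

The exact figure depends on whether the single-check baseline pools both groups or uses the test group alone; the pooled version shown here matches the reported number most closely.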
By number of reference checks shown
Code
published_edits_reverted_nchecks <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%  # limit to new content edits
  group_by(checks_shown_bucket) %>%
  summarise(n_content_edits = n_distinct(editing_session),
            n_reverted_edits = n_distinct(editing_session[was_reverted == 1])) %>%  # look at reverts
  mutate(prop_edits = paste0(round(n_reverted_edits / n_content_edits * 100, 1), "%")) %>%
  mutate(n_content_edits_sanitized = ifelse(n_content_edits < 50, "<50", n_content_edits),
         n_reverted_edits_sanitized = ifelse(n_reverted_edits < 50, "<50", n_reverted_edits))  # sanitizing per data publication guidelines
Code
dodge <- position_dodge(width = 0.9)
p <- published_edits_reverted_nchecks %>%
  ggplot(aes(x = checks_shown_bucket, y = n_reverted_edits / n_content_edits)) +
  geom_col(position = 'dodge', fill = 'dodgerblue4') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_edits, "\n", n_reverted_edits_sanitized, "\n reverted edits"), fontface = 2),
            vjust = 1.2, size = 5, color = "white") +
  labs(y = "Percent of new content edits reverted",
       x = "Number of reference checks shown",
       title = "New content edit revert rate by number of Reference Checks shown",
       caption = "Limited to published new content edits where at least one reference check was shown") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 24),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
We observed that the revert rate decreases as the number of checks presented increases. The revert rate of edits presented 6 to 10 reference checks is 11%, compared with a revert rate of 24% for edits presented a single check.
We did not identify any significant increases in the revert rate at any number of checks presented; however, there is a limited sample of edits presented over 10 reference checks, so more data would be needed to confirm these trends.
published_edits_reverted_platform_table <- published_edits_reverted_platform %>%
  gt() %>%
  tab_header(title = "New content edit revert rate of edits shown reference check by platform") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    platform = "Platform",
    n_content_edits = "Number of new content edits",
    n_reverted_edits = "Number of new content edits reverted",
    prop_edits = "Proportion of new content edits reverted"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(published_edits_reverted_platform_table))
New content edit revert rate of edits shown reference check by platform

| Platform | Experiment group | Number of new content edits | Number of new content edits reverted | Proportion of new content edits reverted |
|---|---|---|---|---|
| Desktop | control (single check) | 3911 | 594 | 15.2% |
| Desktop | test (multiple checks) | 3821 | 620 | 16.2% |
| Mobile Web | control (single check) | 2524 | 854 | 33.8% |
| Mobile Web | test (multiple checks) | 2456 | 861 | 35.1% |

Limited to published new content edits where at least one reference check was shown
Code
dodge <- position_dodge(width = 0.9)
p <- published_edits_reverted_platform %>%
  ggplot(aes(x = test_group, y = n_reverted_edits / n_content_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  facet_grid(~platform) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_edits, "\n", n_reverted_edits, "\n reverted edits"), fontface = 2),
            vjust = 1.2, size = 7, color = "white") +
  labs(y = "Percent of new content edits reverted",
       x = "Experiment Group",
       title = "New content edit revert rate by platform",
       caption = "Limited to published new content edits where at least one reference check was shown") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 20),
        axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
There were no statistically significant increases in revert rate by experiment group on either desktop or mobile web.
dodge <- position_dodge(width = 0.9)
p <- published_edits_reverted_userexp %>%
  ggplot(aes(x = test_group, y = n_reverted_edits / n_content_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  facet_grid(~experience_level_group) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_edits, "\n", n_reverted_edits, "\n reverted edits"), fontface = 2),
            vjust = 1.2, size = 6, color = "white") +
  labs(y = "Percent of new content edits reverted",
       x = "Experiment Group",
       title = "New content edit revert rate by user experience",
       caption = "Limited to published new content edits where at least one reference check was shown") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 24),
        axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
Results vary slightly based on the type of user completing the edit, but none of the observed changes were statistically significant.
published_edits_reverted_wiki_table <- published_edits_reverted_wiki %>%
  ungroup() %>%
  mutate(n_content_edits = ifelse(n_content_edits < 50, "<50", n_content_edits),
         n_reverted_edits = ifelse(n_reverted_edits < 50, "<50", n_reverted_edits)) %>%  # sanitizing per data publication guidelines
  group_by(wiki) %>%
  gt() %>%
  tab_header(title = "New content edit revert rate of edits shown reference check by partner Wikipedia") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    wiki = "Wikipedia",
    n_content_edits = "Number of new content edits",
    n_reverted_edits = "Number of new content edits reverted",
    prop_edits = "Proportion of new content edits reverted"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown. Excludes wikis where sufficient events were not logged'))

display_html(as_raw_html(published_edits_reverted_wiki_table))
New content edit revert rate of edits shown reference check by partner Wikipedia

| Wikipedia | Experiment group | Number of new content edits | Number of new content edits reverted | Proportion of new content edits reverted |
|---|---|---|---|---|
| Arabic Wikipedia | control (single check) | 444 | 68 | 15.3% |
| Arabic Wikipedia | test (multiple checks) | 365 | 70 | 19.2% |
| Chinese Wikipedia | control (single check) | 243 | <50 | 16% |
| Chinese Wikipedia | test (multiple checks) | 224 | <50 | 18.8% |
| French Wikipedia | control (single check) | 1848 | 384 | 20.8% |
| French Wikipedia | test (multiple checks) | 1769 | 390 | 22% |
| Italian Wikipedia | control (single check) | 1419 | 329 | 23.2% |
| Italian Wikipedia | test (multiple checks) | 1413 | 354 | 25.1% |
| Japanese Wikipedia | control (single check) | 543 | <50 | 7% |
| Japanese Wikipedia | test (multiple checks) | 560 | <50 | 6.8% |
| Portuguese Wikipedia | control (single check) | 474 | 83 | 17.5% |
| Portuguese Wikipedia | test (multiple checks) | 468 | 99 | 21.2% |
| Spanish Wikipedia | control (single check) | 1299 | 475 | 36.6% |
| Spanish Wikipedia | test (multiple checks) | 1303 | 457 | 35.1% |
| Vietnamese Wikipedia | control (single check) | 141 | <50 | 21.3% |
| Vietnamese Wikipedia | test (multiple checks) | 139 | <50 | 21.6% |

Limited to published new content edits where at least one reference check was shown. Excludes wikis where sufficient events were not logged
Modeling the impact of multi-check on whether a new content edit is reverted
As our second KPI, we also used a Bayesian hierarchical regression model to infer the impact of offering multi-check on the likelihood of a new content edit being reverted within 48 hours.
Code
# redefine revert status as a factor for use in the model
edit_check_publish_data_model$was_reverted <- factor(
  edit_check_publish_data_model$was_reverted,
  levels = c(0, 1)
)
Code
priors <- c(
  set_prior(prior = "std_normal()", class = "b"),
  set_prior("cauchy(0, 5)", class = "sd")
)
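The full model specification is not shown in the report. A plausible brms-style hierarchical logistic regression consistent with the priors above (standard normal on coefficients, half-Cauchy on group-level standard deviations) would be the following sketch; the multi-check indicator and per-wiki intercepts are assumptions, not confirmed details:

```latex
\text{was\_reverted}_i \sim \mathrm{Bernoulli}(p_i), \qquad
\operatorname{logit}(p_i) = \beta_0 + \beta_1\,\text{multicheck}_i + u_{\text{wiki}[i]}
```

```latex
u_j \sim \mathcal{N}(0, \sigma^2), \qquad
\beta \sim \mathcal{N}(0, 1), \qquad
\sigma \sim \mathrm{Cauchy}^{+}(0, 5)
```

Under this structure, the posterior for $\beta_1$ concentrating around zero corresponds to the report's conclusion that an impact of multi-check on revert likelihood could not be confirmed.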
Based on estimates from the model, we are not able to confirm the impact of multi-check on the overall revert rate of new content edits.
Key Insights
Overall, there was no significant difference in the revert rate of new content edits between the control and the test group for editing sessions where at least one reference check was shown.
In the test group, there was a 34.7% relative decrease in revert rate when directly comparing edits presented multiple checks with edits presented a single reference check. This decrease is likely in part because the types of edits that warrant multiple reference checks are less likely to be reverted than the types of edits that warrant only a single check.
Revert rate decreases as the number of checks presented increases. The revert rate of edits presented 6 to 10 reference checks is 11%, compared with a revert rate of 24% for edits presented a single check. We did not identify any significant increases in the revert rate at any number of checks presented.
There were no statistically significant changes in revert rate by platform, user type, or at any of the partner Wikipedias.
Secondary Metric 1: Constructive Activation
Hypothesis: New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware of the need to accompany new text they’re attempting to publish with a reference, when they don’t first think/know to do so themselves.
For WE 1.2 KR, we defined constructive activation as: “The percentage of newcomers making at least one edit to an article in the main namespace of a Wikipedia project on a mobile device within 24 hours of registration (also on a mobile device) and that edit not being reverted within 48 hours of being published.”
There were 53,137 users that created an account on either desktop or mobile web at one of the partner Wikipedias during the AB test timeframe. While we don’t assign a user to a bucket until they begin an edit, we calculated the number of accounts available to be activated within each group as half the total number of users that created an account at one of the partner wikis while the AB test was deployed (based on the 50/50 split used in the AB test).
Code
# load data for assessing activations
all_users_edit_data <- read.csv(
  file = 'data/all_users_edit_data_final.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# calculate the total number of new account holders for each test group
experiment_group_n_accounts <- round(length(unique(all_users_edit_data$user_id)) * 0.50, 0)
There were no significant changes to the overall constructive activation rates.
Reference Check is not presented to newcomers until they attempt to save an edit, requiring them to move through several stages of the editing funnel after creating an account before reaching that point. During the reviewed timeframe, Reference Check was shown to about 6.7% of all newcomers that created an account.
Constructive edits by newcomers
To help isolate the impact of this intervention on newcomers, we also reviewed changes in overall constructive edit rates. This limits the analysis to newcomers that successfully published an edit where reference check was shown.
For this analysis, we’re defining constructive edits as the proportion of all edits completed by newcomers within 24 hours that are not reverted within 48 hours. This is limited to users that were shown at least one reference check within 24 hours after registering.
Code
# constructive edits
constructive_edits_editcheck <- all_users_edit_data %>%
  filter(num_article_edits_24hrs_editcheck > 0) %>%  # limit to users where ref check was shown at least once
  group_by(test_group) %>%
  summarise(num_article_edits_total = sum(num_article_edits_24hrs_all),
            num_article_reverts_total = sum(num_article_reverts_24hrs_all)) %>%
  mutate(pct_const = paste0(round((num_article_edits_total - num_article_reverts_total) / num_article_edits_total * 100, 1), "%")) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Constructive edits completed by newcomers shown Reference Check at least once") %>%
  cols_label(
    test_group = "Experiment Group",
    num_article_edits_total = "Total number of edits published",
    num_article_reverts_total = "Total number of edits reverted",
    pct_const = "Constructive Edit Rate"
  ) %>%
  tab_footnote(footnote = "Defined as the proportion of all published edits that are not reverted within 48 hours",
               locations = cells_column_labels(columns = "pct_const"))

display_html(as_raw_html(constructive_edits_editcheck))
Constructive edits completed by newcomers shown Reference Check at least once

| Experiment Group | Total number of edits published | Total number of edits reverted | Constructive Edit Rate¹ |
|---|---|---|---|
| control (single check) | 65485 | 17766 | 72.9% |
| test (multiple checks) | 118100 | 23407 | 80.2% |

¹ Defined as the proportion of all published edits that are not reverted within 48 hours
We observed a +10% increase in the proportion of constructive edits by users in the test group; however, this change is not statistically significant.
Note: There is also a significant increase in the number of edits published by newcomers in the test group. This trend needs to be investigated further to be confirmed.
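As an illustrative recomputation (Python rather than the report's R): the constructive edit rate is 1 minus the reverted share of published edits, and the relative change between groups works out to roughly +10%.

```python
# Hypothetical recomputation of the constructive edit rates from the table above.
control_published, control_reverted = 65485, 17766
test_published, test_reverted = 118100, 23407

control_rate = 1 - control_reverted / control_published  # ~72.9%
test_rate = 1 - test_reverted / test_published           # ~80.2%
relative_change = test_rate / control_rate - 1           # ~ +10%

print(f"control {control_rate:.1%}, test {test_rate:.1%}, change {relative_change:+.1%}")
```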
Mobile Web Constructive Activation Rates
There were 22,996 users that created an account on mobile web at one of the partner Wikipedias during the AB test timeframe.
Code
# load mobile web data for assessing activations
mobile_users_edit_data <- read.csv(
  file = 'data/mobile_users_edit_data_final.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# calculate number of account holders for each test group
mobile_experiment_group_n_accounts <- round(length(unique(mobile_users_edit_data$user_id)) * 0.50, 0)
Desktop Constructive Activation Rates By Experiment Group

| Test Group | Number of newcomers | Number of users constructively activated | Constructive Activation Rates |
|---|---|---|---|
| control (single check) | 15070 | 2744 | 18.2% |
| test (multiple checks) | 15070 | 2791 | 18.5% |
There were no significant changes in constructive activation rates on desktop.
Key Insights
There were no significant changes in constructive activation rates when reviewing overall edits or by platform. Overall, the constructive activation rate was 17.7% in the control group and 17.6% in the test group. Note: Activation rates for both mobile web and desktop seem slightly lower than typical rates observed on each platform during the AB test timeframe. This might be due to the required join to EditAttemptStep, which may have caused a loss of some edits that were not instrumented correctly. See T394961.
Reference Check is not presented to newcomers until they attempt to save an edit, requiring them to move through several stages of the editing funnel after creating an account before reaching that point. During the reviewed timeframe, Reference Check was shown to about 6% of all newcomers that created an account.
To help isolate the impact of this intervention on newcomers, we also reviewed changes in overall constructive edit rates, defined as the proportion of edits published by newcomers that are not reverted within 48 hours. We observed a +10% increase in the proportion of constructive edits by newcomers in the test group; however, this change was not statistically significant.
Secondary Metric 2: Increase in the proportion of users that publish at least one new content edit that includes a reference.
Hypothesis: Unregistered users and users with 100 or fewer edits will be more aware of the need to add a reference when contributing new content because the visual editor will prompt them to do so in cases where they have not done so themselves.
Methodology:
This metric is similar to KPI 1 except that it looks at the proportion of distinct editors rather than distinct edits. There were no significant differences from the results reported in KPI 1, as the majority of users posted just one new content edit during the reviewed time period. See overall results below.
Overall by experiment group
Code
published_users_reference_overall <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%  # limit to new content edits where reference check shown
  group_by(test_group) %>%
  summarise(n_users = n_distinct(user_id),
            n_users_wref = n_distinct(user_id[included_new_reference == 1])) %>%  # users that included a new reference
  mutate(prop_users = paste0(round(n_users_wref / n_users * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Proportion of users that publish at least one new content edit with a reference") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_users = "Number of distinct users",
    n_users_wref = "Number of users that include a new reference",
    prop_users = "Proportion of users that include a new reference"
  ) %>%
  tab_source_note(gt::md('Limited to users shown reference check and that published at least one new content edit'))

display_html(as_raw_html(published_users_reference_overall))
Proportion of users that publish at least one new content edit with a reference

| Experiment Group | Number of distinct users | Number of users that include a new reference | Proportion of users that include a new reference |
|---|---|---|---|
| control (single check) | 4960 | 2120 | 42.7% |
| test (multiple checks) | 4899 | 2200 | 44.9% |

Limited to users shown reference check and that published at least one new content edit
There was a 5.5% increase in the proportion of distinct users that published a new content edit with a reference when multi-check was available. This increase is statistically significant (p-value = 0.0151).
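The report does not show the test it used; a standard one-sided two-proportion z-test on the user counts above (an illustrative sketch, not the report's own code) yields a p-value close to the one reported:

```python
# Hypothetical one-sided two-proportion z-test on the user counts in the table above.
from math import erfc, sqrt

x1, n1 = 2200, 4899   # test (multiple checks): users who included a reference
x2, n2 = 2120, 4960   # control (single check)

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_one_sided = 0.5 * erfc(z / sqrt(2))  # upper-tail p-value

print(f"z = {z:.2f}, one-sided p = {p_one_sided:.4f}")
```

The match to the reported 0.0151 suggests a one-sided test was used; the two-sided p-value would be roughly double.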
By Platform
Code
published_users_reference_byplatform <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%  # limit to new content edits where reference check shown
  group_by(platform, test_group) %>%
  summarise(n_users = n_distinct(user_id),
            n_users_wref = n_distinct(user_id[included_new_reference == 1])) %>%  # users that included a new reference
  mutate(prop_users = paste0(round(n_users_wref / n_users * 100, 1), "%"))
Code
dodge <- position_dodge(width = 0.9)
p <- published_users_reference_byplatform %>%
  ggplot(aes(x = test_group, y = n_users_wref / n_users, fill = test_group)) +
  geom_col(position = 'dodge') +
  facet_grid(~platform) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_users, "\n", n_users_wref, "\n users"), fontface = 2),
            vjust = 1.2, size = 7, color = "white") +
  labs(y = "Percent of users",
       title = "Users that published a new content edit with a reference by platform",
       caption = "Limited to users shown reference check and that published at least one new content edit") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 20),
        axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
Distinct users were more likely to publish a new content edit with a reference on desktop than on mobile web. We observed a 6.2% increase on desktop in the proportion of distinct users that published a new content edit with a reference, compared to a 1% increase on mobile web. The slight increase on mobile web is not statistically significant.
By User Experience
Code
published_users_reference_byuserexp <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%  # limit to new content edits where reference check shown
  group_by(experience_level_group, test_group) %>%
  summarise(n_users = n_distinct(user_id),
            n_users_wref = n_distinct(user_id[included_new_reference == 1])) %>%  # users that included a new reference
  mutate(prop_users = paste0(round(n_users_wref / n_users * 100, 1), "%"))
Code
dodge <- position_dodge(width = 0.9)
p <- published_users_reference_byuserexp %>%
  ggplot(aes(x = test_group, y = n_users_wref / n_users, fill = test_group)) +
  geom_col(position = 'dodge') +
  facet_grid(~experience_level_group) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_users, "\n", n_users_wref, "\n users"), fontface = 2),
            vjust = 1.2, size = 7, color = "white") +
  labs(y = "Percent of users",
       title = "Users that published a new content edit with a reference by user type",
       caption = "Limited to users shown reference check and that published at least one new content edit") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 20),
        axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
By Partner Wikipedia
Code
published_users_reference_bywiki <- edit_check_publish_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>%  # limit to new content edits where reference check shown
  group_by(wiki, test_group) %>%
  summarise(n_users = n_distinct(user_id),
            n_users_wref = n_distinct(user_id[included_new_reference == 1 & was_reverted == 0])) %>%  # users that included a new reference in an unreverted edit
  mutate(prop_users = paste0(round(n_users_wref / n_users * 100, 1), "%")) %>%
  filter(!wiki %in% c('Afrikaans Wikipedia', 'Igbo Wikipedia', 'Swahili Wikipedia', 'Yoruba Wikipedia'))  # remove wikis with insufficient events
Code
dodge <- position_dodge(width = 0.9)
p <- published_users_reference_bywiki %>%
  ggplot(aes(x = test_group, y = n_users_wref / n_users, fill = test_group)) +
  geom_col(position = 'dodge') +
  facet_wrap(~wiki, nrow = 2) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(prop_users, "\n", n_users_wref, "\n users"), fontface = 2),
            vjust = 1.2, size = 7, color = "white") +
  labs(y = "Percent of users",
       x = "Experiment Group",
       title = "Users that published a new content edit \n with a reference by partner Wikipedia",
       caption = "Limited to users shown reference check and that published at least one new content edit. \n Excludes wikis where sufficient events were not logged") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment group") +
  theme(panel.grid.minor = element_blank(), panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5), text = element_text(size = 20),
        axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        legend.position = "bottom", axis.line = element_line(colour = "black"))
p
Key Insights
There was a 5.5% increase in the proportion of distinct users who published a new content edit with a reference when multi-check was available.
Distinct users were more likely to publish a new content edit with a reference on desktop than on mobile web: a 6.2% increase on desktop in the proportion of distinct users that published a new content edit with a reference, compared to a 1% increase on mobile web.
Increases were observed in the test group across all user types. Results vary by partner Wikipedia.
Secondary Metric 3a: Constructive Retention Rate
Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that includes a reference because Edit Check will have caused them to realize references are required when contributing new content to Wikipedia.
First, we reviewed the proportion of newcomers and Junior Contributors that publish an edit where Reference Check was shown and then return to make an unreverted edit to a main namespace. We reviewed the following retention timeframes: returns between 2 to 7 days (7-day retention) and 2 to 30 days (30-day retention).
seven_day_retention_overall_table <- seven_day_retention_overall %>%
  gt() %>%
  tab_header(title = "Constructive seven day retention rate") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned",
    editors = "Number of eligible editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown at least one reference check and that made an unreverted edit",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(seven_day_retention_overall_table))

Constructive seven day retention rate

| Experiment group | Number of editors that returned | Number of eligible editors | Retention rate¹ |
|---|---|---|---|
| control (single check) | 170 | 6445 | 2.6% |
| test (multiple checks) | 164 | 6329 | 2.6% |

¹ Limited to users shown at least one reference check and that made an unreverted edit
thirty_day_retention_overall_table <- thirty_day_retention_overall %>%
  gt() %>%
  tab_header(title = "Constructive thirty day retention rate") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned",
    editors = "Number of eligible editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown at least one reference check and that made an unreverted edit",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(thirty_day_retention_overall_table))

Constructive thirty day retention rate

| Experiment group | Number of editors that returned | Number of eligible editors | Retention rate¹ |
|---|---|---|---|
| control (single check) | 138 | 6445 | 2.1% |
| test (multiple checks) | 129 | 6329 | 2% |

¹ Limited to users shown at least one reference check and that made an unreverted edit
Secondary Metric 3b: Constructive Retention Rate with Reference Included
We also reviewed the proportion of users that publish an edit where reference check was shown and return to make a new content edit with a reference to a main namespace.
seven_day_retention_overall_table_wref <- seven_day_retention_overall_ref %>%
  gt() %>%
  tab_header(title = "Constructive seven day retention rate with reference included") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned",
    editors = "Number of eligible editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown at least one reference check and that made an unreverted edit",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(seven_day_retention_overall_table_wref))

Constructive seven day retention rate with reference included

| Experiment group | Number of editors that returned | Number of eligible editors | Retention rate¹ |
|---|---|---|---|
| control (single check) | 170 | 6445 | 2.6% |
| test (multiple checks) | 164 | 6329 | 2.6% |

¹ Limited to users shown at least one reference check and that made an unreverted edit
thirty_day_retention_overall_table_wref <- thirty_day_retention_overall_wref %>%
  gt() %>%
  tab_header(title = "Constructive thirty day retention rate with reference included") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned",
    editors = "Number of eligible editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown at least one reference check and that made an unreverted edit",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(thirty_day_retention_overall_table_wref))

Constructive thirty day retention rate with reference included

| Experiment group | Number of editors that returned | Number of eligible editors | Retention rate¹ |
|---|---|---|---|
| control (single check) | 138 | 6445 | 2.1% |
| test (multiple checks) | 129 | 6329 | 2% |

¹ Limited to users shown at least one reference check and that made an unreverted edit
Key Insights
We did not observe any statistically significant changes in seven-day or thirty-day constructive retention rates during the reviewed timeframe.
To review impacts to this metric in future experiments, we could extend experiment durations to obtain a larger sample size and review longer retention timeframes, such as second-month retention.
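To make the "larger sample size" point concrete, here is a rough per-group sample-size sketch (Python; the 20% relative lift is an assumed minimum detectable effect for illustration, not a figure from the report), using the standard two-proportion formula:

```python
# Hypothetical power calculation: per-group sample size needed to detect an
# assumed 20% relative lift in a 2.6% retention rate (alpha = 0.05 two-sided, 80% power).
from math import sqrt

p1 = 0.026                      # baseline retention rate, from the tables above
p2 = p1 * 1.20                  # assumed minimum detectable effect: +20% relative
z_alpha, z_beta = 1.96, 0.8416  # standard normal quantiles for alpha/2 and power

n_per_group = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(round(n_per_group))
```

With roughly 6,400 eligible editors per group in this test, detecting a lift of that size in a ~2.6% retention rate would require a substantially larger sample, consistent with the suggestion to extend experiment durations.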
Appendix: Guardrails
We reviewed a set of metrics to make sure that the introduction of multi-check was not negatively impacting users’ editing experience. Identified guardrails include: edit completion rate, user block rate after being shown reference check, and false positive rates.
Note: We also monitored edit revert rate and confirmed there were no significant increases in revert rate overall or at any number of reference checks presented. Please see the KPI 2 section above for the revert rate results.
Edit Completion Rate
While introducing multiple reference checks adds extra steps to the publishing workflow that may cause some decrease in edit completion rate, we want to ensure it does not cause significant disruption to contributors.
Methodology: We reviewed the proportion of edits by users that were shown Reference Check during their edit session and successfully published their edit (action = saveSuccess). The analysis is limited to edits that reached the point where Reference Check was presented at least once after the contributor indicated their intent to save (action = saveIntent).
Code
# load data for assessing edit completion rate
edit_completion_rates_data <- read.csv(
  file = 'data/edit_completion_rates_data_final.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# Set fields and factor levels to assess number of checks shown
# Note: limited to 1 sidebar open as we're looking for cases where multiple checks
# were presented in a single sidebar (vs the user going back and forth)
edit_completion_rates_data <- edit_completion_rates_data %>%
  mutate(multiple_checks_shown = ifelse(n_checks_shown > 1 & n_sidebar_opens < 2, "multiple checks shown", "one check shown"),
         multiple_checks_shown = factor(multiple_checks_shown, levels = c("one check shown", "multiple checks shown")))

# note these buckets can be adjusted as needed based on distribution of data
edit_completion_rates_data <- edit_completion_rates_data %>%
  mutate(checks_shown_bucket = case_when(
           is.na(n_checks_shown) ~ '0',
           n_checks_shown == 1 | (n_checks_shown > 1 & n_sidebar_opens >= 2) ~ '1',
           n_checks_shown == 2 & n_sidebar_opens < 2 ~ '2',
           n_checks_shown > 2 & n_checks_shown <= 5 & n_sidebar_opens < 2 ~ "3-5",
           n_checks_shown > 5 & n_checks_shown <= 10 & n_sidebar_opens < 2 ~ "6-10",
           n_checks_shown > 10 & n_checks_shown <= 15 & n_sidebar_opens < 2 ~ "11-15",
           n_checks_shown > 15 & n_checks_shown <= 20 & n_sidebar_opens < 2 ~ "16-20",
           n_checks_shown > 20 & n_sidebar_opens < 2 ~ "over 20"
         ),
         checks_shown_bucket = factor(checks_shown_bucket,
                                      levels = c("0", "1", "2", "3-5", "6-10", "11-15", "16-20", "over 20")))
Code
# Remove two abnormal instances of multiple checks being shown within the control group
edit_completion_rates_data <- edit_completion_rates_data %>%
  filter(!(test_group == 'control (single check)' & multiple_checks_shown == "multiple checks shown"))

# two abnormal instances of ref checks being shown but no sidebar being logged as opened
edit_completion_rates_data <- edit_completion_rates_data %>%
  filter(!(ref_check_shown == 1 & is.na(multiple_checks_shown)))
Overall by experiment group
Code
edit_completion_rate_overall <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%  # limit to sessions where reference check was shown
  group_by(test_group) %>%
  summarise(n_edits = n_distinct(editing_session),
            n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by experiment group") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_edits = "Number of edit attempts shown reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edit attempts shown at least one reference check'))

display_html(as_raw_html(edit_completion_rate_overall))
Edit completion rate by experiment group

| Experiment Group | Number of edit attempts shown reference check | Number of published edits | Proportion of edits saved |
|---|---|---|---|
| control (single check) | 11317 | 8559 | 75.6% |
| test (multiple checks) | 11244 | 8372 | 74.5% |

Limited to edit attempts shown at least one reference check
By whether multiple checks were shown
Code
edit_completion_rate_bymulti <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%
  group_by(test_group, multiple_checks_shown) %>%
  summarise(n_edits = n_distinct(editing_session),
            n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by if multiple checks were shown") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple checks shown",
    n_edits = "Number of edit attempts shown reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edit attempts shown at least one reference check'))

display_html(as_raw_html(edit_completion_rate_bymulti))
Edit completion rate by whether multiple checks were shown
Multiple checks shown
Number of edit attempts shown reference check
Number of published edits
Proportion of edits saved
control (single check)
one check shown
11317
8559
75.6%
test (multiple checks)
one check shown
8139
6062
74.5%
multiple checks shown
3105
2310
74.4%
Limited to edit attempts shown at least one reference check
By Number of Checks Shown
Code
edit_completion_rate_bynchecks <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%
  group_by(checks_shown_bucket) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_saves = ifelse(n_saves < 50, "<50", n_saves)
  ) %>% # sanitizing per data publication guidelines
  select(-2) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by the number of reference checks shown") %>%
  opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of checks shown",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown at least one reference check'))

display_html(as_raw_html(edit_completion_rate_bynchecks))
Edit completion rate by the number of reference checks shown
Number of checks shown
Number of published edits
Proportion of edits saved
1
14621
75.1%
2
821
80.7%
3-5
844
76.4%
6-10
406
69%
11-15
129
69.7%
16-20
<50
54.3%
over 20
66
50.8%
Limited to edits shown at least one reference check
By Platform
Code
edit_completion_rate_byplatform <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%
  group_by(platform, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by experiment group and platform") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of edit attempts shown reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edit attempts shown at least one reference check'))

display_html(as_raw_html(edit_completion_rate_byplatform))
Edit completion rate by experiment group and platform
Experiment Group
Number of edit attempts shown reference check
Number of published edits
Proportion of edits saved
Desktop
control (single check)
6542
5239
80.1%
test (multiple checks)
6608
5154
78%
Mobile Web
control (single check)
4775
3320
69.5%
test (multiple checks)
4636
3218
69.4%
Limited to edit attempts shown at least one reference check
By User Experience
Code
edit_completion_rate_byuserstatus <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by experiment group and editor experience") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "Experience Level",
    n_edits = "Number of edit attempts shown reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edit attempts shown at least one reference check'))

display_html(as_raw_html(edit_completion_rate_byuserstatus))
Edit completion rate by experiment group and editor experience
Experiment Group
Number of edit attempts shown reference check
Number of published edits
Proportion of edits saved
Unregistered
control (single check)
6337
4533
71.5%
test (multiple checks)
6303
4424
70.2%
Newcomer
control (single check)
1530
1128
73.7%
test (multiple checks)
1493
1100
73.7%
Junior Contributor
control (single check)
3450
2898
84%
test (multiple checks)
3448
2848
82.6%
Limited to edit attempts shown at least one reference check
By Partner Wikipedia
Code
edit_completion_rate_bywiki <- edit_completion_rates_data %>%
  filter(ref_check_shown == 1) %>%
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  filter(n_saves >= 100) %>%
  gt() %>%
  tab_header(title = "Edit completion rate by partner Wikipedia") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    wiki = "Wikipedia",
    n_edits = "Number of edit attempts shown reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to wikis with at least 100 published edits'))

display_html(as_raw_html(edit_completion_rate_bywiki))
Edit completion rate by partner Wikipedia
Test Group
Number of edit attempts shown reference check
Number of published edits
Proportion of edits saved
Arabic Wikipedia
control (single check)
1029
598
58.1%
test (multiple checks)
903
489
54.2%
Chinese Wikipedia
control (single check)
404
337
83.4%
test (multiple checks)
375
285
76%
French Wikipedia
control (single check)
2966
2449
82.6%
test (multiple checks)
3069
2491
81.2%
Italian Wikipedia
control (single check)
2439
1887
77.4%
test (multiple checks)
2419
1850
76.5%
Japanese Wikipedia
control (single check)
1004
733
73%
test (multiple checks)
940
718
76.4%
Portuguese Wikipedia
control (single check)
894
623
69.7%
test (multiple checks)
881
616
69.9%
Spanish Wikipedia
control (single check)
2329
1726
74.1%
test (multiple checks)
2359
1706
72.3%
Vietnamese Wikipedia
control (single check)
220
177
80.5%
test (multiple checks)
253
177
70%
Limited to wikis with at least 100 published edits
Key Insights
We did not observe any significant decreases in edit completion rate for users presented with multiple reference checks in a session. The edit completion rate for users presented multiple reference checks was 74% compared to 75% for users presented a single reference check.
Edit completion rates stay around 75% or higher for up to 5 checks shown within a single session. Beyond that, the edit completion rate decreases to about 69% for editing sessions shown between 6 and 15 checks. There were about 200 edit attempts where 16 or more reference checks were logged. Further investigation of these edits would help provide insight into the types of edits that cause a high number of checks to be presented.
We also did not observe any significant differences in edit completion rate by platform, editor experience level, or partner Wikipedia.
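As a quick illustration of this kind of comparison, a two-proportion test can be run directly on the counts reported in the platform table above. Note that prop.test is our illustrative choice here; the report does not state which test was used for its significance statements.

```r
# Two-proportion test of save rates on mobile web, using the counts from the
# platform table above (control: 3320 of 4775 saved; test: 3218 of 4636 saved).
# prop.test is an illustrative choice, not necessarily the test behind the
# significance statements in this report.
res <- prop.test(x = c(3320, 3218), n = c(4775, 4636))
res$p.value  # well above 0.05, consistent with no detectable platform difference
```

The same call can be repeated for any pair of rows in the tables above, since each comparison reduces to two save counts and two attempt counts.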
False Positive Rate
Methodology:
As an indicator of false positive rates, we reviewed the proportion of published new content edits that met all of the following requirements:
* People elected to dismiss adding a new reference. This was determined by edits where the user explicitly declined to add a reference at least once in a session (event.feature = 'editCheck-addReference' AND event.action = 'action-reject').
* No new reference was included in the final published new content edit (edits with the revision tag editcheck-newreference).
* The edit was not reverted within 48 hours.
Note: It’s possible that these edits should be reverted due to lack of citation but were not within 48 hours.
We also reviewed the proportion of checks presented (event.feature = 'editCheck-addReference' AND event.action = 'check-shown-presave') that were dismissed by the user, to understand the rate of reference check dismissal. This was determined by edits where the user declined to add a reference by explicitly selecting the decline option (event.feature = 'editCheck-addReference' AND event.action = 'action-reject').
Code
# load data for assessing edit reject frequency
edit_check_reject_data <- read.csv(
  file = 'data/edit_check_reject_data_final.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# Set fields and factor levels to assess the number of checks shown.
# Note: limited to 1 sidebar open as we're looking for cases where multiple checks
# were presented in a single sidebar (vs the user going back and forth).
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    multiple_checks_shown = ifelse(
      n_checks_shown > 1 & n_sidebar_opens < 2,
      "multiple checks shown", "single check shown"
    ),
    multiple_checks_shown = factor(
      multiple_checks_shown,
      levels = c("single check shown", "multiple checks shown")
    )
  )

# Note: these buckets can be adjusted as needed based on the distribution of the data
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 | (n_checks_shown > 1 & n_sidebar_opens >= 2) ~ '1',
      n_checks_shown == 2 & n_sidebar_opens < 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 & n_sidebar_opens < 2 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 & n_sidebar_opens < 2 ~ "6-10",
      n_checks_shown > 10 & n_checks_shown <= 15 & n_sidebar_opens < 2 ~ "11-15",
      n_checks_shown > 15 & n_checks_shown <= 20 & n_sidebar_opens < 2 ~ "16-20",
      n_checks_shown > 20 & n_sidebar_opens < 2 ~ "over 20"
    ),
    checks_shown_bucket = factor(
      checks_shown_bucket,
      levels = c("0", "1", "2", "3-5", "6-10", "11-15", "16-20", "over 20")
    )
  )
Code
# Remove some small occurrences of abnormal data. Will investigate, but this is
# <0.001% of the data at the moment so it won't impact results.

# Remove one abnormal instance of multiple checks being shown within the control group
edit_check_reject_data <- edit_check_reject_data %>%
  filter(!(test_group == 'control (single check)' & multiple_checks_shown == "multiple checks shown"))

# Remove one abnormal instance of multiple reject actions being logged with no
# instances of checks being shown; relabel the n_rejects option
edit_check_reject_data <- edit_check_reject_data %>%
  filter(!(is.na(n_checks_shown) & n_rejects > 0)) %>%
  mutate(n_rejects = ifelse(n_checks_shown > 0 & is.na(n_rejects), 0, n_rejects))

# Two abnormal instances of ref checks being shown but no sidebar being logged as opened
edit_check_reject_data <- edit_check_reject_data %>%
  filter(!(was_edit_check_shown == 1 & is.na(multiple_checks_shown)))
Overall by experiment group
Proportion of all reference checks that are dismissed
Code
edit_check_dismissal_overall <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(test_group) %>%
  summarise(
    # Note: there are NAs for sessions that don't select; need to replace with 0
    n_checks_shown = sum(n_checks_shown),
    n_rejects = sum(n_rejects)
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_checks_shown * 100, 1), "%")) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Overall Reference Check dismissal rate") %>%
  cols_label(
    test_group = "Experiment Group",
    n_checks_shown = "Number of checks shown",
    n_rejects = "Number of reference checks dismissed",
    dismissal_rate = "Proportion of reference checks dismissed"
  ) %>%
  tab_source_note(gt::md('Limited to published edits'))

display_html(as_raw_html(edit_check_dismissal_overall))
Overall Reference Check dismissal rate
Experiment Group
Number of checks shown
Number of reference checks dismissed
Proportion of reference checks dismissed
control (single check)
9804
6496
66.3%
test (multiple checks)
24333
13343
54.8%
Limited to published edits
Proportion of new content edits where no reference is included and are not reverted
Code
# Method 2
edit_check_fp_overall <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(edit_check_fp_overall))
Unreverted new content edits where no reference was added
Experiment Group
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
control (single check)
8448
3501
41.4%
test (multiple checks)
8258
3160
38.3%
Limited to published new content edits where at least one reference check was shown
By whether multiple checks were shown
Proportion of all reference checks that are dismissed
Code
edit_check_dismissal_ifmultiple <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(test_group, multiple_checks_shown) %>%
  summarise(
    # Note: there are NAs for sessions that don't select; need to replace with 0
    n_checks_shown = sum(n_checks_shown),
    n_rejects = sum(n_rejects)
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_checks_shown * 100, 1), "%")) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Overall Reference Check dismissal rate by whether multiple checks were shown") %>%
  cols_label(
    multiple_checks_shown = "Multiple checks shown",
    n_checks_shown = "Number of checks shown",
    n_rejects = "Number of reference checks dismissed",
    dismissal_rate = "Proportion of reference checks dismissed"
  ) %>%
  tab_source_note(gt::md('Limited to published edits'))

display_html(as_raw_html(edit_check_dismissal_ifmultiple))
Overall Reference Check dismissal rate by whether multiple checks were shown
Multiple checks shown
Number of checks shown
Number of reference checks dismissed
Proportion of reference checks dismissed
control (single check)
single check shown
9804
6496
66.3%
test (multiple checks)
single check shown
11737
5754
49%
multiple checks shown
12596
7589
60.2%
Limited to published edits
Proportion of new content edits where no reference is included and are not reverted
Code
edit_check_fp_bymultiple <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(test_group, multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added by whether multiple checks were shown") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(edit_check_fp_bymultiple))
Unreverted new content edits without a reference by whether multiple checks were shown
Multiple Checks
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
control (single check)
single check shown
8448
3501
41.4%
test (multiple checks)
single check shown
5988
2388
39.9%
multiple checks shown
2270
772
34%
Limited to published new content edits where at least one reference check was shown
By number of Reference Checks Shown
Proportion of all reference checks that are dismissed
Code
edit_check_dismissal_nchecks <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(checks_shown_bucket) %>%
  summarise(
    # Note: there are NAs for sessions that don't select; need to replace with 0
    n_checks_shown = sum(n_checks_shown),
    n_rejects = sum(n_rejects)
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_checks_shown * 100, 1), "%")) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Reference Check dismissal rate by number of checks shown") %>%
  cols_label(
    checks_shown_bucket = "Number of checks shown per edit",
    n_checks_shown = "Number of checks shown",
    n_rejects = "Number of reference checks dismissed",
    dismissal_rate = "Proportion of reference checks dismissed"
  ) %>%
  tab_source_note(gt::md('Limited to published edits'))

display_html(as_raw_html(edit_check_dismissal_nchecks))
Overall Reference Check dismissal rate by number of checks shown
Number of checks shown per edit
Number of checks shown
Number of reference checks dismissed
Proportion of reference checks dismissed
1
21541
12250
56.9%
2
1618
1013
62.6%
3-5
3108
1858
59.8%
6-10
2946
1546
52.5%
11-15
1598
916
57.3%
16-20
788
479
60.8%
over 20
2538
1777
70%
Limited to published edits
Proportion of new content edits where no reference is included and are not reverted
Code
edit_check_fp_bynchecks <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1 & n_sidebar_opens < 2) %>% # limit to where shown
  group_by(checks_shown_bucket) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)
  ) %>% # sanitizing per data publication guidelines
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added by number of checks shown") %>%
  opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of reference checks shown",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(edit_check_fp_bynchecks))
Unreverted new content edits without a reference by number of checks shown
Number of reference checks shown
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
1
12924
5393
41.7%
2
809
299
37%
3-5
831
278
33.5%
6-10
396
133
33.6%
11-15
125
<50
24%
16-20
<50
<50
29.5%
over 20
65
<50
29.2%
Limited to published new content edits where at least one reference check was shown
By Platform
Proportion of all reference checks that are dismissed
Code
edit_check_dismissal_byplatform <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(platform, test_group) %>%
  summarise(
    # Note: there are NAs for sessions that don't select; need to replace with 0
    n_checks_shown = sum(n_checks_shown),
    n_rejects = sum(n_rejects)
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_checks_shown * 100, 1), "%")) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Reference Check dismissal rate by platform") %>%
  cols_label(
    test_group = "Experiment Group",
    n_checks_shown = "Number of checks shown",
    n_rejects = "Number of reference checks dismissed",
    dismissal_rate = "Proportion of reference checks dismissed"
  ) %>%
  tab_source_note(gt::md('Limited to published edits'))

display_html(as_raw_html(edit_check_dismissal_byplatform))
Reference Check dismissal rate by platform
Experiment Group
Number of checks shown
Number of reference checks dismissed
Proportion of reference checks dismissed
desktop
control (single check)
6068
3866
63.7%
test (multiple checks)
17191
9660
56.2%
phone
control (single check)
3736
2630
70.4%
test (multiple checks)
7142
3683
51.6%
Limited to published edits
Proportion of new content edits where no reference is included and are not reverted
Code
edit_check_fp_byplatform <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(platform, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added by platform") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(edit_check_fp_byplatform))
Unreverted new content edits where no reference was added by platform
Experiment Group
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
desktop
control (single check)
5165
2085
40.4%
test (multiple checks)
5075
1884
37.1%
phone
control (single check)
3283
1416
43.1%
test (multiple checks)
3183
1276
40.1%
Limited to published new content edits where at least one reference check was shown
By User Experience
Proportion of new content edits where no reference is included and are not reverted
Code
edit_check_fp_byuserstatus <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added by user experience") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User Status",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published new content edits where at least one reference check was shown'))

display_html(as_raw_html(edit_check_fp_byuserstatus))
Unreverted new content edits without a reference by user experience
Experiment Group
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
Unregistered
control (single check)
4498
2058
45.8%
test (multiple checks)
4385
1877
42.8%
Newcomer
control (single check)
1104
382
34.6%
test (multiple checks)
1081
350
32.4%
Junior Contributor
control (single check)
2846
1061
37.3%
test (multiple checks)
2792
933
33.4%
Limited to published new content edits where at least one reference check was shown
By Partner Wikipedia
Proportion of new content edits where no reference is included and are not reverted
Code
edit_check_dismissal_bywiki <- edit_check_reject_data %>%
  filter(was_edit_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # limit to new content edits without a reference
    n_rejects = n_distinct(editing_session[n_rejects > 0 & included_new_reference == 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  # remove wikis with insufficient events
  filter(!wiki %in% c('Afrikaans Wikipedia', 'Igbo Wikipedia', 'Swahili Wikipedia', 'Yoruba Wikipedia')) %>%
  filter(n_rejects > 65) %>% # remove wikis with too few edits
  gt() %>%
  tab_header(title = "Unreverted new content edits where no reference was added by partner Wikipedia") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    n_edits = "Number of edits shown reference check",
    n_rejects = "Number of edits that did not add at least one new reference",
    dismissal_rate = "Proportion of edits where people elected to not add a reference"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one reference check was shown. Excludes wikis where insufficient events were logged'))

display_html(as_raw_html(edit_check_dismissal_bywiki))
Unreverted new content edits where no reference was added by partner Wikipedia
Experiment Group
Number of edits shown reference check
Number of edits that did not add at least one new reference
Proportion of edits where people elected to not add a reference
Arabic Wikipedia
control (single check)
586
214
36.5%
test (multiple checks)
474
140
29.5%
Chinese Wikipedia
control (single check)
333
150
45%
test (multiple checks)
279
125
44.8%
French Wikipedia
control (single check)
2417
1050
43.4%
test (multiple checks)
2469
955
38.7%
Italian Wikipedia
control (single check)
1875
861
45.9%
test (multiple checks)
1834
813
44.3%
Japanese Wikipedia
control (single check)
725
378
52.1%
test (multiple checks)
703
344
48.9%
Portuguese Wikipedia
control (single check)
609
155
25.5%
test (multiple checks)
603
163
27%
Spanish Wikipedia
control (single check)
1698
605
35.6%
test (multiple checks)
1682
528
31.4%
Vietnamese Wikipedia
control (single check)
176
70
39.8%
test (multiple checks)
175
75
42.9%
Limited to published edits where at least one reference check was shown. Excludes wikis where insufficient events were logged
Key Insights
We did not identify any increases in false positive rate.
In the test group, users declined to add a reference in 38% of unreverted new content edits, compared to 41% in the control group. Additionally, there was no increase in the proportion of reference checks dismissed by users who published a new content edit: 66% of reference checks in the control group were dismissed compared to 55% in the test group.
When limited to edits presented multiple checks, the proportion of individual checks dismissed is higher: in the test group, 60% of reference checks presented in multi-check sessions were dismissed, compared to 49% of checks presented in single-check sessions. However, we observed a decrease in the proportion of unreverted new content edits where users declined to add a reference among users shown multiple checks.
There were no significant increases in false positive rates or reference check dismissal rates by user type, platform, or partner Wikipedia.
Block Rates
Methodology: We gathered all edits where an edit check was shown from the mediawiki_revision_change_tag table and joined them with mediawiki_private_cu_changes to gather user name information. We then reviewed both global and local blocks made within 6 hours of a published edit where a reference check was shown, as identified in the logging table.
Note: At the time of this analysis, block data for May was unavailable, so the analysis is limited to blocks that occurred between 25 March 2025 and 30 April 2025.
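The 6-hour block window described above can be sketched as follows. The published code only loads a pre-computed extract, so the toy data frames and column names here (user, edit_ts, block_ts) are hypothetical stand-ins; the real analysis queries mediawiki_revision_change_tag and mediawiki_private_cu_changes.

```r
# Illustrative sketch of the block-rate join described above, on toy data.
# All column names are hypothetical; the real analysis runs against the
# mediawiki_revision_change_tag and mediawiki_private_cu_changes tables.
edits <- data.frame(
  user    = c("A", "B"),
  edit_ts = as.POSIXct(c("2025-04-01 10:00", "2025-04-01 11:00"), tz = "UTC")
)
blocks <- data.frame(
  user     = c("A", "B"),
  block_ts = as.POSIXct(c("2025-04-01 12:00", "2025-04-02 11:00"), tz = "UTC")
)

# A user counts as blocked if a block lands within 6 hours after their edit
joined <- merge(edits, blocks, by = "user")
blocked_within_6h <- subset(
  joined,
  block_ts >= edit_ts &
    as.numeric(difftime(block_ts, edit_ts, units = "hours")) <= 6
)

blocked_within_6h$user  # only user A's block falls inside the 6-hour window
```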
Code
# load data for assessing blocks
edit_check_blocks <- read.csv(
  file = 'data/edit_check_eligible_users_blocked.csv',
  header = TRUE,
  sep = ",",
  stringsAsFactors = FALSE
)
Code
# rename the experiment field to clarify
edit_check_blocks <- edit_check_blocks %>%
  mutate(
    test_group = factor(
      bucket,
      levels = c("2025-03-editcheck-multicheck-reference-control", "2025-03-editcheck-multicheck-reference-test"),
      labels = c("control (single check)", "test (multiple checks)")
    )
  )
Code
edit_check_local_blocks_overall <- edit_check_blocks %>%
  group_by(test_group) %>%
  summarise( # look at blocks
    blocked_users = n_distinct(cuc_ip[is_local_blocked == 'True' | is_global_blocked == 'True']),
    all_users = n_distinct(cuc_ip)
  ) %>%
  mutate(prop_blocks = paste0(round(blocked_users / all_users * 100, 1), "%")) %>%
  select(-c(2, 3)) %>% # removing granular data columns
  gt() %>%
  tab_header(title = "Proportion of users blocked by experiment group") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    prop_blocks = "Proportion of users blocked"
  ) %>%
  tab_source_note(gt::md('Limited to users blocked within 6 hours of publishing an edit where a reference check was shown'))

display_html(as_raw_html(edit_check_local_blocks_overall))
Proportion of users blocked by experiment group
Test Group
Proportion of users blocked
control (single check)
2.9%
test (multiple checks)
3.7%
Limited to users blocked within 6 hours of publishing an edit where a reference check was shown
Key Insights
3.3% of users were blocked within 6 hours of publishing an edit where at least one reference check was shown. By experiment group, 3.7% of users were blocked in the test group compared to 2.9% in the control group. This difference is not statistically significant, and in both groups the blocks were limited to edits by unregistered users.
No global blocks were issued to any users that published an edit where at least one reference check was shown.