R-bloggers https://www.r-bloggers.com R news and tutorials contributed by hundreds of R bloggers

What’s new in R 4.6.0? https://www.r-bloggers.com/2026/04/whats-new-in-r-4-6-0/ Thu, 16 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/whats-new-r46/

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R 4.6.0 (“Because it was There”) is set for release on April 24th 2026. Here we summarise some of the more interesting changes that have been introduced. In previous blog posts, we have discussed the new features introduced in R 4.5.0 and earlier versions (see the links at the end of this post).

Once R 4.6.0 is released, the full changelog will be available at the r-release ‘NEWS’ page. If you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

! (values %in% collection)

Code should be readable and easily understood. And although it isn’t a natural language, there’s something off about code that reads like:

If not a blog post is readable, I close the browser tab.

To check if one (or more) value is in some collection, R has the %in% operator:

"a" %in% letters
[1] TRUE

"a" %in% LETTERS
[1] FALSE

This is different from the in keyword, which you use when iterating over a collection:

for (x in letters[1:3]) {
 message(x)
}
# a
# b
# c

Sometimes you want to know whether a value is absent from a collection. The standard way to do this is to invert results from %in%:

! "a" %in% LETTERS
[1] TRUE

It’s unambiguous to the R interpreter, but it can be hard to read and understand – on scanning that statement, you might forget that ! applies after the %in%. As such, we often wrap the %in% expression in parentheses to make the code clearer:

! ("a" %in% LETTERS)
[1] TRUE

For the sake of clarity, many developers have implemented their own absence-checking operator. Writing a custom operator in R uses similar syntax to that used when writing a function:

`%NOTIN%` = function(x, y) {
 ! (x %in% y)
}

"a" %NOTIN% LETTERS
[1] TRUE

Were you to write the same code multiple times in the same project, you would write a function. Similarly, if you (or your team) wrote the same function in multiple files or projects, you might add it to a package and import it. So if lots of package developers have implemented the same operator or function, across their CRAN packages, maybe it should be pushed to a higher plane…

That is what has happened with the introduction of %notin% in R 4.6.0. An operator that was found across lots of separate packages has been moved up into base R:

"a" %notin% LETTERS
[1] TRUE

"a" %notin% letters
[1] FALSE

DOI citations

If you use R in your publications or your projects, you may need to provide a citation for it. rOpenSci has a blog post about citing R and R packages – why, when and how to do it.

For the R project as a whole, there is a simple function citation() that provides the information you need:

citation()

To cite R in publications use:

 R Core Team (2026). _R: A Language and Environment for Statistical
 Computing_. R Foundation for Statistical Computing, Vienna, Austria.
 doi:10.32614/R.manuals <https://doi.org/10.32614/R.manuals>.
 <https://www.R-project.org/>.

A BibTeX entry for LaTeX users is

 @Manual{,
 ...
 }
...

In R 4.6.0, a DOI (Digital Object Identifier) has been added to the citation entry to make it easier to reference R in your published work.

summary(character_vector, character.method = "factor")

str() and summary() are two of the first functions I reach for when exploring a dataset. For a data-frame, these tell me

  • what type of columns are present (str(): chr, num, Date, …); and
  • what is present in each column (summary(): gives the min, max, mean of each numeric column, for example).
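As a quick illustration of that first bullet (using the penguins data that has shipped with base R since 4.5.0), str() reports each column’s type at a glance:

```r
# Compact structural overview of the first three penguins columns:
# column types (factor, factor, numeric, ...) plus a preview of the values
str(penguins[1:3])
```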

summary() works with data-structures other than just data-frames.

For factors, summary() tells you how many observations of the factor levels were observed:

# 'species' and 'island' are factor columns in `penguins`
summary(penguins[1:3])

 species island bill_len
 Adelie :152 Biscoe :168 Min. :32.10
 Chinstrap: 68 Dream :124 1st Qu.:39.23
 Gentoo :124 Torgersen: 52 Median :44.45
 Mean :43.92
 3rd Qu.:48.50
 Max. :59.60
 NA's :2

For character columns, the output from summary() has been a little opaque:

# R 4.5.0

# 'studyName' and 'Species' are character columns in `penguins_raw`
summary(penguins_raw[1:3])

 studyName Sample Number Species
 Length:344 Min. : 1.00 Length:344
 Class :character 1st Qu.: 29.00 Class :character
 Mode :character Median : 58.00 Mode :character
 Mean : 63.15
 3rd Qu.: 95.25
 Max. :152.00

R 4.6.0 adds a neater way to summarise character vectors/columns to summary():

# R 4.6.0
summary(penguins_raw[1:3])

 studyName Sample Number Species
 Length :344 Min. : 1.00 Length :344
 N.unique : 3 1st Qu.: 29.00 N.unique : 3
 N.blank : 0 Median : 58.00 N.blank : 0
 Min.nchar: 7 Mean : 63.15 Min.nchar: 33
 Max.nchar: 7 3rd Qu.: 95.25 Max.nchar: 41
 Max. :152.00

We can also summarise character vectors/columns as if they were factors:

# R 4.6.0
summary(penguins_raw[1:3], character.method = "factor")

 studyName Sample Number Species
 PAL0708:110 Min. : 1.00 Adelie Penguin (Pygoscelis adeliae) :152
 PAL0809:114 1st Qu.: 29.00 Chinstrap penguin (Pygoscelis antarctica): 68
 PAL0910:120 Median : 58.00 Gentoo penguin (Pygoscelis papua) :124
 Mean : 63.15
 3rd Qu.: 95.25
 Max. :152.00

list.files(..., fixed = TRUE)

Suppose there are three files in my working directory: abc.Rmd, CONTRIBUTING.md and README.md. If I want to obtain the filenames of the “.md” files from an R script, I can list the files that match a pattern:

# R 4.5.0
list.files(pattern = ".md")
[1] "abc.Rmd" "CONTRIBUTING.md" "README.md"

Hmmm.

In the pattern, . actually matches any character. In R 4.5.0, if I want to match the . character explicitly, I can escape it in the pattern. But pattern matching can lead to complicated code in R, because some characters are treated specially by the pattern matcher, and some are treated specially by R’s string parser.

To tell R to find files with a literal ‘.md’ in the filename, we escape the . character twice: once for the pattern matcher (to match a literal ., rather than any character), and once for R’s string parser (to tell R that the subsequent \ is really a backslash):

# R 4.5.0
list.files(pattern = "\\.md")
[1] "CONTRIBUTING.md" "README.md"

R 4.0.0 cleaned that up a bit by introducing ‘raw strings’. Everything between the parentheses in the next pattern is passed directly to the pattern matcher:

# R 4.5.0
list.files(pattern = r"(\.md)")

Now, in R 4.6.0, we can indicate to list.files() (and the synonym dir()) that our pattern is a fixed string (rather than a regular expression). With this, we don’t need to escape the . character to match the “.md” suffix.

list.files(pattern = ".md", fixed = TRUE)
[1] "CONTRIBUTING.md" "README.md"

Other matters

  • read.dcf() now allows comment lines, which means you can annotate your config files (e.g., for {lintr}).
  • df |> plot(col2 ~ col1) can now be used for base plotting; this is a little neater for exploratory work than df |> plot(col2 ~ col1, data = _)
  • C++20 is now the default C++ standard
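The piped-plot change in the second bullet can be sketched as follows; the data frame and column names here are invented for illustration:

```r
# Toy data frame for illustrating the R 4.6.0 pipe-to-plot change
df <- data.frame(col1 = 1:10, col2 = (1:10)^2)

# R 4.5.0: the formula interface needed an explicit data placeholder
# df |> plot(col2 ~ col1, data = _)

# R 4.6.0: the data frame piped in on the left-hand side is used directly
df |> plot(col2 ~ col1)
```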

Trying out R 4.6.0

To take away the pain of installing the latest development version of R, you can use Docker. To try out the devel version of R, use the following commands:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.6 is the released version of R and the r-docker repository has been updated, you should use the following commands to test out R 4.6:

docker pull rstudio/r-base:4.6-jammy
docker run --rm -it rstudio/r-base:4.6-jammy

An alternative way to install multiple versions of R on the same machine is using rig.

See also

The R 4.x versions have introduced a wealth of interesting changes. These have been summarised in our earlier blog posts:

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
My Domain: proteome-wide scanning of TMDs https://www.r-bloggers.com/2026/04/my-domain-proteome-wide-scanning-of-tmds/ Thu, 16 Apr 2026 12:04:00 +0000 https://quantixed.org/?p=3753
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

I wanted to know:

  1. how many proteins in the human proteome have transmembrane domains?
  2. of those that do, how many have 1 or 2 or n transmembrane domains?

After a little bit of searching, I couldn’t find any answers. So I decided to use R to retrieve the necessary info from Uniprot and calculate it myself. I thought I’d post it here in case it’s useful for others.

Human

We’ll start with the info I wanted. According to Uniprot there are 20,659 proteins in the human proteome. One quarter of these have one or more TMD. The majority have one TMD, and there are almost 1,000 7TM proteins (all those GPCRs, I guess). There are 413 4TM and 327 2TM proteins. We can find examples of 1 through 17 TMDs; there are no proteins with 18 TMDs, 4 proteins with 19, 21 with 24 and 2 with 38.

The analysis is done simply by counting how many TRANSMEM features Uniprot lists for each ID in the reference proteome. I have not distinguished between helical and partial entries, and of course it’s possible that the annotations are not quite correct.

TMDs   Count   Frequency (proteome as %)   Frequency (TMDPs as %)
1      2402    11.6                        45.8
2      327     1.6                         6.2
3      159     0.8                         3.0
4      413     2.0                         7.9
5      77      0.4                         1.5
6      276     1.3                         5.3
7      947     4.6                         18.0
8      83      0.4                         1.6
9      63      0.3                         1.2
10     123     0.6                         2.3
11     75      0.4                         1.4
12     202     1.0                         3.8
13     24      0.1                         0.5
14     26      0.1                         0.5
15     13      0.1                         0.2
16     1       0.0                         0.0
17     9       0.0                         0.2
19     4       0.0                         0.1
24     21      0.1                         0.4
38     2       0.0                         0.0
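As a sanity check, the two percentage columns can be recomputed from the raw counts and the 20,659-protein proteome size quoted above:

```r
# Rebuild the table's percentage columns from the raw TMD counts
tmds   <- c(1:17, 19, 24, 38)
counts <- c(2402, 327, 159, 413, 77, 276, 947, 83, 63, 123,
            75, 202, 24, 26, 13, 1, 9, 4, 21, 2)
proteome_pct <- round(100 * counts / 20659, 1)        # % of whole proteome
tmdp_pct     <- round(100 * counts / sum(counts), 1)  # % of TM proteins only
data.frame(tmds, counts, proteome_pct, tmdp_pct)
```

Note that sum(counts) / 20659 comes to about 25%, matching the “one quarter” figure above.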

Having written this code, I decided to run some other proteomes to see how they compare.

Model organisms

In these four organisms, between 18% and 29% of the proteome is made up of proteins with TMDs. The pattern of TMD counts is broadly similar, although yeast has no 7TM peak, and the peaks at 2, 4, 6 and 12 TMs differ from human.

Maybe this information is out there in some database or other; as I said, I couldn’t find it easily. Even if there are more precise ways of determining TMDs, I think these data are good enough to show roughly what the proportions are.

The code

I manually downloaded the fasta.gz files for the reference proteomes. They are currently linked here.

To extract all the Uniprot IDs, I used a shell one-liner:

awk '/^>sp\|.*\|/{gsub(/^>sp\|/,""); gsub(/\|.*/,""); print ">" $0; next} {print}' file.fasta | grep "^>" | sed 's/>//g' > species_uniprot.txt
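The same extraction could also be done without leaving R; a minimal sketch, assuming Swiss-Prot headers of the form >sp|ACCESSION|NAME (the file names are placeholders):

```r
# Hypothetical base-R equivalent of the shell one-liner above:
# pull accessions out of ">sp|P12345|..." FASTA headers.
fasta   <- readLines("file.fasta")
headers <- grep("^>sp\\|", fasta, value = TRUE)
ids     <- sub("^>sp\\|([^|]+)\\|.*$", "\\1", headers)
writeLines(ids, "species_uniprot.txt")
```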

Then I used this R script. The main function can probably be simplified. I had to add several checks to make sure I got all the data back from the API. Before posting, I tried to cut it back to make it easier to read, but I only succeeded in breaking the script! This is the working version.

# if (!require("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager") 
# }
# BiocManager::install("biomaRt")
library(httr)
library(stringr)
library(ggplot2)
library(biomaRt)
library(dplyr)
library(tidyr)
library(cowplot)

## FUNCTIONS ----

isJobReady <- function(jobId, pollingInterval = 5, maxWaitSeconds = 3600) {
  if (is.null(jobId) || length(jobId) == 0 || is.na(jobId) || !nzchar(jobId)) {
    return(FALSE)
  }
  nTries <- ceiling(maxWaitSeconds / pollingInterval)
  for (i in 1:nTries) {
    url <- paste("https://rest.uniprot.org/idmapping/status/", jobId, sep = "")
    r <- GET(url = url, accept_json())
    status <- content(r, as = "parsed")
    if (!is.null(status[["results"]]) || !is.null(status[["failedIds"]])) {
      return(TRUE)
    }
    if (!is.null(status[["messages"]])) {
      print(status[["messages"]])
      return(FALSE)
    }
    Sys.sleep(pollingInterval)
  }
  return(FALSE)
}

retrieveUniprotInfo <- function(x,
                                chunk_size = 5000,
                                maxWaitSeconds = 3600,
                                taxId = "9606",
                                progress = TRUE) {
  normalize_uniprot_ids <- function(values) {
    values <- trimws(values)
    # Accept FASTA-style headers like: sp|P12345|... or tr|A0A...|...
    m <- str_match(values, regex("^>?\\s*(?:sp|tr)\\|([^|]+)\\|", ignore_case = TRUE))
    values <- ifelse(!is.na(m[, 2]), m[, 2], values)
    values
  }
  
  ids <- unique(normalize_uniprot_ids(x))
  ids <- ids[!is.na(ids) & nzchar(ids)]
  if (length(ids) == 0) {
    stop("No valid identifiers were provided to retrieveUniprotInfo().")
  }
  
  fields <- "accession,id,protein_name,gene_names,ft_transmem,length,cc_function,cc_subcellular_location,go_p,go_c"
  acc_pattern <- "^[OPQ][0-9][A-Z0-9]{3}[0-9](-[0-9]+)?$|^[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}(-[0-9]+)?$"
  is_accession <- str_detect(ids, acc_pattern)
  
  split_into_chunks <- function(values, chunk_size) {
    split(values, ceiling(seq_along(values) / chunk_size))
  }
  
  get_next_link <- function(link_header) {
    if (is.null(link_header)) {
      return(NULL)
    }
    links <- unlist(strsplit(link_header, ",\\s*"))
    next_link <- links[str_detect(links, "rel=\\\"next\\\"")]
    if (length(next_link) == 0) {
      return(NULL)
    }
    next_url <- str_extract(next_link[1], "(?<=<).+?(?=>)")
    if (is.na(next_url) || !nzchar(next_url)) {
      return(NULL)
    }
    next_url
  }
  
  read_tsv_response <- function(resp) {
    read.table(
      text = content(resp, as = "text", encoding = "UTF-8"),
      sep = "\t",
      header = TRUE,
      fill = TRUE,
      quote = "",
      comment.char = "",
      check.names = FALSE
    )
  }
  
  fetch_from_redirect <- function(redirect_url) {
    if (is.null(redirect_url) || length(redirect_url) == 0 ||
        is.na(redirect_url) || !nzchar(redirect_url)) {
      return(NULL)
    }
    
    # The paged idmapping results endpoint is capped at size <= 500.
    # Use stream endpoint to retrieve the full chunk in one response.
    stream_url <- gsub("/results/", "/results/stream/", redirect_url)
    sep <- ifelse(str_detect(stream_url, "\\?"), "&", "?")
    stream_url <- paste0(
      stream_url,
      sep,
      "fields=", URLencode(fields, reserved = TRUE),
      "&format=tsv"
    )
    
    r <- GET(url = stream_url)
    if (status_code(r) < 400) {
      return(read_tsv_response(r))
    }
    
    # Fallback to paged endpoint if stream is unavailable.
    sep <- ifelse(str_detect(redirect_url, "\\?"), "&", "?")
    url <- paste0(
      redirect_url,
      sep,
      "fields=", URLencode(fields, reserved = TRUE),
      "&format=tsv&size=500"
    )
    
    r <- GET(url = url)
    stop_for_status(r)
    resultsTable <- read_tsv_response(r)
    
    next_url <- get_next_link(headers(r)[["link"]])
    while (!is.null(next_url) && !is.na(next_url) && nzchar(next_url)) {
      r <- GET(url = next_url)
      stop_for_status(r)
      resultsTable <- rbind(resultsTable, read_tsv_response(r))
      next_url <- get_next_link(headers(r)[["link"]])
    }
    
    resultsTable
  }
  
  map_ids <- function(values, from_db, to_db, chunk_size, taxId = NULL,
                      label = "ids") {
    if (length(values) == 0) {
      return(NULL)
    }
    
    results_list <- list()
    chunks <- split_into_chunks(values, chunk_size = chunk_size)
    n_chunks <- length(chunks)
    for (i in seq_along(chunks)) {
      chunk <- chunks[[i]]
      if (isTRUE(progress)) {
        cat(sprintf("[UniProt] %s chunk %d/%d (%d ids)\n",
                    label, i, n_chunks, length(chunk)))
      }
      
      files <- list(
        ids = paste0(chunk, collapse = ","),
        from = from_db,
        to = to_db
      )
      if (!is.null(taxId)) {
        files$taxId <- taxId
      }
      
      r <- POST(url = "https://rest.uniprot.org/idmapping/run", body = files,
                encode = "multipart", accept_json())
      stop_for_status(r)
      submission <- content(r, as = "parsed", encoding = "UTF-8")
      
      job_id <- submission[["jobId"]]
      if (is.null(job_id) || length(job_id) == 0 || is.na(job_id) || !nzchar(job_id)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: no jobId returned\n",
                      label, i, n_chunks))
        }
        next
      }
      if (!isJobReady(job_id, maxWaitSeconds = maxWaitSeconds)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: timeout/not ready\n",
                      label, i, n_chunks))
        }
        next
      }
      
      details_url <- paste("https://rest.uniprot.org/idmapping/details/",
                           job_id, sep = "")
      r <- GET(url = details_url, accept_json())
      stop_for_status(r)
      details <- content(r, as = "parsed", encoding = "UTF-8")
      
      redirect_url <- details[["redirectURL"]]
      if (is.null(redirect_url) || length(redirect_url) == 0 ||
          is.na(redirect_url) || !nzchar(redirect_url)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: missing redirectURL\n",
                      label, i, n_chunks))
        }
        next
      }
      chunk_result <- fetch_from_redirect(redirect_url)
      if (is.null(chunk_result)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: invalid redirectURL\n",
                      label, i, n_chunks))
        }
        next
      }
      
      results_list[[length(results_list) + 1]] <- chunk_result
      if (isTRUE(progress)) {
        cat(sprintf("[UniProt] %s chunk %d/%d: completed\n",
                    label, i, n_chunks))
      }
    }
    
    if (length(results_list) == 0) {
      return(NULL)
    }
    do.call(rbind, results_list)
  }
  
  accession_ids <- ids[is_accession]
  gene_like_ids <- ids[!is_accession]
  
  acc_results <- map_ids(
    values = accession_ids,
    from_db = "UniProtKB_AC-ID",
    to_db = "UniProtKB",
    chunk_size = chunk_size,
    label = "accessions"
  )
  
  gene_results <- map_ids(
    values = gene_like_ids,
    from_db = "Gene_Name",
    to_db = "UniProtKB-Swiss-Prot",
    chunk_size = chunk_size,
    taxId = taxId,
    label = "gene_names"
  )
  
  results_list <- list()
  if (!is.null(acc_results)) {
    results_list[[length(results_list) + 1]] <- acc_results
  }
  if (!is.null(gene_results)) {
    results_list[[length(results_list) + 1]] <- gene_results
  }
  
  if (length(results_list) == 0) {
    stop("No UniProt results were returned. Check identifiers and taxId.")
  }
  
  resultsTable <- do.call(rbind, results_list)
  if ("Entry" %in% colnames(resultsTable)) {
    resultsTable <- resultsTable[!duplicated(resultsTable$Entry), ]
  }
  return(resultsTable)
}



## SCRIPT ----

species <- c("human", "zebrafish", "drosophila", "worm", "yeast")
sci_names <- c("human" = "Homo sapiens", "zebrafish" = "Danio rerio", "drosophila" = "Drosophila melanogaster",
               "worm" = "Caenorhabditis elegans", "yeast" = "Saccharomyces cerevisiae")

for (org in species) {
  # look up scientific name of org
  sci_name <- sci_names[org]
  output_path <- paste0("Output/Data/", org, "_uniprot.csv")
  if(file.exists(output_path)) {
    message(paste("File", output_path, "already exists. Loading", org))
    df <- read.csv(output_path)
  } else {
    message(paste("Retrieving UniProt info for", org))
    species_ids <- read.delim(paste0("Data/", org, "_uniprot.txt"), header = FALSE)
    names(species_ids) <- c("uniprot_id")
    df <- retrieveUniprotInfo(species_ids$uniprot_id)
    # save this result
    write.csv(df, output_path, row.names = FALSE)
  }
  
  df$tms <- str_count(df$Transmembrane, "TRANSMEM")
  tm_counts <- df %>%
    group_by(tms) %>%
    summarise(count = n()) %>% 
    filter(tms > 0)
  
  p1 <- ggplot(tm_counts, aes(x = tms, y = count)) +
    geom_col(fill = "#009988") +
    labs(x = "Transmembrane domains", y = "Count",
         title = sci_name) +
    lims(x = c(0.5,NA), y = c(0,NA)) +
    theme_bw(10)
  
  p2 <- SuperPlotR::pieplot(x1 = c(sum(df$tms == 0), sum(df$tms > 0)),
                            cols = c("#bbbbbb", "#009988")) +
    # blank background and no legend
    theme_void() +
    theme(legend.position = "none")
  
  # inset p2 in p1 and add information about the percentages
  p <- ggdraw() +
    draw_plot(p1) +
    # top right
    draw_plot(p2, x = 0.9, y = 0.9, hjust = 1, vjust = 1, width = 0.4, height = 0.4) +
    draw_label(paste0("Total Proteins: ", nrow(df),
                      "\nNo TM: ", round(sum(df$tms == 0) / nrow(df) * 100, 1),
                      "%\nWith TM(s): ",
                      round(sum(df$tms > 0) / nrow(df) * 100, 1), "%"),
               x = 0.97, y = 0.85, hjust = 1, vjust = 1, size = 8)
  plot_path <- paste0("Output/Plots/", org, "_uniprot_tm_counts.png")
  ggsave(plot_path, p, width = 1600, height = 900, units = "px", dpi = 300)
}

Note that I am using {cowplot} at the end to inset the pie chart and to add the text onto the main plot.

The post title comes from “My Domain” by Bernard Butler from his People Move On album.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Expanding the Editorial Team: Alec Robitaille and Lucy D’Agostino McGowan Join as Editors https://www.r-bloggers.com/2026/04/expanding-the-editorial-team-alec-robitaille-and-lucy-dagostino-mcgowan-join-as-editors/ Thu, 16 Apr 2026 00:00:00 +0000 https://ropensci.org/blog/2026/04/16/editors2026/
[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

At rOpenSci, we’re continually grateful for the support and engagement of our community, who help make research open-source stronger, more inclusive, and more collaborative. The software peer review program continues to grow, and today we announce that our editorial team keeps expanding:

We’re excited to welcome Alec Robitaille and Lucy D’Agostino McGowan as new editors. Alec joins our general review team, and Lucy our statistical software review team. Their expertise and dedication will help sustain and strengthen software peer review, ensuring that software reviews continue to meet the highest standards of quality, transparency, and impact.

Meet our new editors!

Alec Robitaille

headshot of Alec Robitaille

Alec is a graduate student at Memorial University of Newfoundland and Labrador (Canada) studying foraging ecology, habitat selection and social networks in caribou and other ungulates. He has been involved in many projects along the way, from measuring muskrat habitat and lake ice dynamics at McGill University to estimating drought sensitivity in Canada’s forests with the Canadian Forestry Service. Passionate about teaching open science and programming, he regularly mentors peers, runs workshops, develops example repositories, and organizes a Bayesian stats colearning group. He maintains the package spatsoc (reviewed by rOpenSci in 2018) and has developed smaller packages with diverse applications including remote sensing (irg), social networks (hwig), camera trap monitoring (camtrapmonitoring), and animal movement (distanceto). He reviewed the ohun, chopin, and emodnet.wfs packages for rOpenSci, guest edited for the rredlist package and is currently handling the reviews for the ActiGlobe and saperlipopette packages.

Alec on GitHub, Website.

I first connected with the rOpenSci community through the review process for the package spatsoc in 2018 as part of our manuscript submission at Methods in Ecology and Evolution. During the review, I was thoroughly impressed by how welcoming the community was, and how effective the process was in helping me learn how to improve the package. rOpenSci is a landmark in the R and open science ecosystems with an ever evolving community to learn from and to be a part of. I am very grateful to be given the opportunity to continue contributing to rOpenSci in this new role as editor.

Alec L. Robitaille

Lucy D’Agostino McGowan

headshot of Lucy D'Agostino McGowan

Lucy D’Agostino McGowan is an associate professor in the Department of Statistical Sciences at Wake Forest University. She received her PhD in Biostatistics from Vanderbilt University and completed her postdoctoral training at Johns Hopkins University Bloomberg School of Public Health. Her research focuses on causal inference, statistical communication, analytic design theory, and data science pedagogy. Lucy can be found blogging at livefreeordichotomize.com, on Blue Sky @LucyStats.bsky.social, and podcasting on Casual Inference.

Lucy on GitHub, Website.

I am so thrilled to join the rOpenSci editorial team! I love the rOpenSci community and mission and am grateful for the opportunity to contribute.

Lucy D’Agostino McGowan

About the Software Peer Review Program

rOpenSci’s software peer review program brings together volunteers to collaboratively review scientific and statistical software according to transparent, constructive, and open standards. Editors manage submissions, coordinate reviewers, and help guide packages through review to improve code quality, documentation, and usability.

This program is possible thanks to the many community members: authors submitting their packages, reviewers volunteering their time and expertise, and editors like Alec and Lucy who help manage reviews and maintain a supportive process.

Get Involved

Are you considering submitting your package for review? These resources will help:

Would you like to review packages? Fill out the rOpenSci Reviewer Sign-Up Form to volunteer to review.

Welcome again Alec and Lucy! We’re thrilled to have you join the editorial team.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Stage II OSCC — Health Economics Model https://www.r-bloggers.com/2026/04/stage-ii-oscc-health-economics-model/ Thu, 16 Apr 2026 00:00:00 +0000 https://rworks.dev/posts/oscc-patient-model/

[This article was first published on R Works, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

Most health care economics models are constructed from the perspective of a managed health care system such as those offered in Canada and several European countries, or from the perspective of some other third party such as an insurance company. Although the benefits of constructing models from the patient’s perspective have been discussed in the literature, see for example Ioannidis & Garber (2011) and Tai et al. (2016), few models have been published. Obvious reasons for this situation would include the privacy issues associated with obtaining patient specific data and the apparent lack of economic incentives to commit resources required to abstract individual patient trajectories from patient level data.

This post explores the hypothesis that detailed, data-driven, patient-specific models may not be necessary in order to achieve most of the benefits of a health economics model from the patient’s perspective. It presents a lightweight cohort model that captures the majority of the benefits to both physicians and patients that might result from a patient-focused model.

The model is intended to be a proof of concept, a minimal viable model that may be useful to physicians both in deciding on treatment options and in informing patients about the potential outcomes and how they may experience these outcomes. To see how the medical literature may be helpful for this kind of modeling I explore the specific case of choosing between surgery and definitive radiation therapy for patients with stage II oral squamous cell carcinoma (OSCC).

Note that the model developed here is not being presented as a medical analysis. No medical experts have been consulted in its construction. It is merely offered as an example of what could be done to build a health care economics model from a patient’s perspective with today’s modeling tools. Also note that I made significant use of Posit Assistant for the RStudio IDE configured with Claude Sonnet 4.6, which I found immensely helpful both for code construction and for literature searches, even though the utility of the latter was limited by Claude’s vexing proclivity to hallucinate.

Overview of the Model

The recovery of patients undergoing treatment for OSCC, either surgery or radiation, is conceived as a stochastic journey through various health states. The states visited, the sequence in which they are visited and the length of stay in each health state are modeled as random variables within the framework of an eight-state, continuous-time Markov chain. Estimates of the transition probabilities among states and of the mean time patients remain in each state drive the Markov chain. In the model below, these estimates are derived from the medical literature; however, it is conceivable that clinicians may feel comfortable making their own estimates, or modifying the literature-derived estimates according to their experience and their evaluations of their patients.

The next step uses the theory of continuous-time Markov chains (CTMCs) to construct a synthetic data set for a cohort of patients who vary in age and tumor size. The key insight here is that the synthetic data set is the underlying statistical model. Empirical survival curves, the time spent in each state, and other useful quantities are then calculated from the synthetic data. The patient-specific healthcare model is then constructed by estimating the utility of each health state to the patient. This is accomplished by constructing quality-adjusted life years (QALYs) based on EQ-5D values reported in the literature.
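The QALY construction described above reduces to a weighted sum: expected years spent in each state multiplied by that state's utility. The sketch below illustrates the arithmetic; the occupancy times and utility weights here are illustrative placeholders, not the literature-derived EQ-5D values used in the model itself.

```r
# Minimal QALY sketch: expected months in each state weighted by a
# per-state utility. All numbers below are hypothetical placeholders,
# NOT literature-derived EQ-5D estimates.
months_in_state <- c(Surgery = 1.5, AdjuvantRT = 3.0,
                     NED = 45.0, Recurrence = 6.0)
utility         <- c(Surgery = 0.70, AdjuvantRT = 0.65,
                     NED = 0.85, Recurrence = 0.55)

# QALYs = sum over states of (years in state x utility of state)
qalys <- sum(months_in_state / 12 * utility)
round(qalys, 2)
# 3.71
```

With real inputs, `months_in_state` would come from the CTMC occupancy calculation later in the post and `utility` from published EQ-5D values.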

State Diagram for a Continuous Time Markov Chain

Packages used throughout the post
library(ggplot2)
library(grid)
library(dplyr)
library(msm)
library(gt)

It is my understanding that although most patients with stage II OSCC are treated with surgery, there are times when surgery is either not possible or not preferable, and in these exceptional cases radiation treatment is the primary alternative. The following state diagram is an attempt to abstract both the health states a patient will experience following treatment and the probable paths taken among them. Eight states seemed to me to be the minimum number required to represent the complexities of treatment.

Show code to build state diagram
# library(ggplot2)
# library(grid)

BG <- "#F0F3F7"

# ── 1. Nodes ─────────────────────────────────────────────────────────────────
# Layout: left=initial treatment, center=adjuvant/NED, right=outcomes/death
nodes <- data.frame(
  id    = 1:8,
  label = c(
    "S1\nSurgery\n± Neck Dissection",
    "S2\nDefinitive\nRadiation",
    "S3\nPost-op\nSurveillance",
    "S4\nAdjuvant RT\n(PORT)",
    "S5\nAdjuvant\nChemoRT (POCRT)",
    "S6\nNED\n(Health)",
    "S7\nLocoregional\nRecurrence",
    "S8\nDeath"
  ),
  x    = c(1.5,  1.5,  4.5,  4.5,  4.5,  7.0,  9.5,  11.5),
  y    = c(8.5,  3.0,  9.5,  6.5,  3.5,  6.5,  7.5,   5.5),
  fill = c("#E67E22","#8E44AD","#F39C12","#2980B9","#16A085",
           "#27AE60","#C0392B","#2C3E50"),
  stringsAsFactors = FALSE
)

# ── 2. Edge definitions: from, to, label, curvature, nudge_x, nudge_y ────────
edge_defs <- list(
  # Surgery → downstream
  c(1, 3,  "No high-risk\nfeatures",    0.00,  0.10,  0.30),
  c(1, 4,  "High-risk:\nmargins/PNI/\nLVI/nodes",  0.10, -0.20,  0.20),
  c(1, 5,  "ECE/pos\nmargins",          0.18,  0.00, -0.30),
  c(1, 8,  "Peri-op\ndeath",           -0.22,  0.10, -0.20),
  # Definitive RT → downstream
  c(2, 6,  "NED",                       0.18,  0.00,  0.25),
  c(2, 7,  "Treatment\nfailure",        0.00,  0.00,  0.25),
  c(2, 8,  "Death",                     0.14, -0.10, -0.20),
  # Post-op Surveillance → downstream
  c(3, 6,  "",                          0.00,  0.10,  0.25),
  c(3, 7,  "",                          0.22,  0.20, -0.10),
  c(3, 8,  "",                         -0.18,  0.10, -0.18),
  # PORT → downstream
  c(4, 6,  "",                          0.00,  0.10,  0.25),
  c(4, 7,  "",                          0.12,  0.10,  0.18),
  c(4, 8,  "",                          0.14,  0.10, -0.18),
  # POCRT → downstream
  c(5, 6,  "",                          0.00,  0.10,  0.25),
  c(5, 7,  "",                          0.00,  0.10,  0.22),
  c(5, 8,  "",                          0.12,  0.10, -0.18),
  # NED → outcomes
  c(6, 7,  "LR recurrence",             0.00, -0.10,  0.25),
  c(6, 8,  "Death",                     0.18,  0.10, -0.14),
  # Locoregional Recurrence → outcomes
  c(7, 8,  "Death",                     0.00,  0.00,  0.25),
  c(7, 6,  "Salvage\nsuccess",         -0.28, -0.28,  0.00)
)

edges <- do.call(rbind, lapply(edge_defs, function(r) {
  data.frame(from=as.integer(r[[1]]), to=as.integer(r[[2]]),
             prob=r[[3]], curvature=as.numeric(r[[4]]),
             nudge_x=as.numeric(r[[5]]), nudge_y=as.numeric(r[[6]]),
             stringsAsFactors=FALSE)
}))

# Attach node coordinates
edges <- merge(edges, nodes[, c("id","x","y")], by.x="from", by.y="id", sort=FALSE)
names(edges)[names(edges)=="x"] <- "x0"; names(edges)[names(edges)=="y"] <- "y0"
edges <- merge(edges, nodes[, c("id","x","y","fill")], by.x="to", by.y="id", sort=FALSE)
names(edges)[names(edges)=="x"] <- "x1"; names(edges)[names(edges)=="y"] <- "y1"
names(edges)[names(edges)=="fill"] <- "dest_col"

# Shorten endpoints to node perimeter
NODE_R <- 0.58
shorten <- function(x0,y0,x1,y1,r) {
  dx<-x1-x0; dy<-y1-y0; d<-sqrt(dx^2+dy^2)
  list(xs=x0+dx*r/d, ys=y0+dy*r/d, xe=x1-dx*r/d, ye=y1-dy*r/d)
}
segs     <- Map(shorten, edges$x0, edges$y0, edges$x1, edges$y1, MoreArgs=list(r=NODE_R))
edges$xs <- sapply(segs,`[[`,"xs"); edges$ys <- sapply(segs,`[[`,"ys")
edges$xe <- sapply(segs,`[[`,"xe"); edges$ye <- sapply(segs,`[[`,"ye")
edges$lx <- (edges$xs+edges$xe)/2 + edges$nudge_x
edges$ly <- (edges$ys+edges$ye)/2 + edges$nudge_y

# ── 3. Build layers ───────────────────────────────────────────────────────────
arrow_layers <- lapply(seq_len(nrow(edges)), function(i) {
  e <- edges[i,,drop=FALSE]
  geom_curve(data=e, mapping=aes(x=xs,y=ys,xend=xe,yend=ye),
             curvature=e$curvature[[1]], colour=e$dest_col[[1]],
             linewidth=1.0, alpha=0.80,
             arrow=arrow(length=unit(9,"pt"), type="closed", ends="last"),
             show.legend=FALSE)
})

label_layers <- lapply(seq_len(nrow(edges)), function(i) {
  e <- edges[i,,drop=FALSE]
  if (nchar(trimws(e$prob[[1]])) == 0) return(NULL)
  geom_label(data=e, mapping=aes(x=lx,y=ly,label=prob),
             colour=e$dest_col[[1]], fill=BG, size=2.6, fontface="bold",
             label.size=0.22, label.r=unit(0.10,"lines"),
             label.padding=unit(0.15,"lines"), show.legend=FALSE)
})
label_layers <- Filter(Negate(is.null), label_layers)

# ── 4. Base plot ──────────────────────────────────────────────────────────────
base_plot <- ggplot() +
  theme_void(base_size=10) +
  theme(
    plot.background  = element_rect(fill=BG, colour=NA),
    panel.background = element_rect(fill=BG, colour=NA),
    plot.title  = element_text(family="serif", face="bold", size=16,
                               colour="#1A2535", hjust=0.5, margin=margin(b=4)),
    plot.subtitle = element_text(size=9, colour="#555555", hjust=0.5,
                                 margin=margin(b=10))
  ) +
  coord_fixed(xlim=c(0, 13), ylim=c(1.5, 11), clip="off") +
  labs(title="Stage II Oral Squamous Cell Carcinoma",
       subtitle="CTMC State Transition Diagram")

# ── 5. Assemble ───────────────────────────────────────────────────────────────
all_layers <- c(arrow_layers, label_layers,
  list(
    # Column labels
    annotate("text", x=1.5,  y=10.8, label="Initial\nTreatment",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=4.5,  y=10.8, label="Post-surgical\nPathway",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=7.0,  y=10.8, label="No Evidence\nof Disease",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=9.5,  y=10.8, label="Disease\nProgression",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=11.5, y=10.8, label="Absorbing\nState",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    # Nodes
    geom_point(data=nodes, aes(x=x, y=y, colour=fill),
               size=28, show.legend=FALSE),
    scale_colour_identity(),
    geom_text(data=nodes, aes(x=x, y=y, label=label),
              colour="white", size=2.8, fontface="bold", lineheight=0.9),
    # Surgical candidate bracket
    annotate("segment", x=0.25, xend=0.25, y=2.2, yend=9.3,
             colour="#888", linewidth=0.5),
    annotate("segment", x=0.25, xend=0.45, y=9.3, yend=9.3,
             colour="#888", linewidth=0.5),
    annotate("segment", x=0.25, xend=0.45, y=2.2, yend=2.2,
             colour="#888", linewidth=0.5),
    annotate("text", x=0.0, y=5.75, label="Surgical\ncandidate?\nYes ↑\nNo ↓",
             size=2.8, colour="#666", hjust=0.5, lineheight=0.9)
  )
)

Reduce("+", all_layers, init=base_plot)

State Label Role
S1 Surgery Initial surgical treatment (days–weeks)
S2 DefinitiveRT Non-surgical candidates; 6–7 week RT course
S3 PostOpSurveillance Surgery, no high-risk pathological features
S4 AdjuvantRT (PORT) High-risk: margins, PNI, LVI, multiple nodes
S5 AdjuvantChemoRT (POCRT) Highest-risk: ECE or positive margins
S6 NED No evidence of disease; long-term surveillance
S7 LR Recurrence Locoregional recurrence at primary or nodes
S8 Death Absorbing state

Synthetic Data Simulation

This section builds a synthetic patient cohort for Stage II (T2 N0 M0) oral squamous cell carcinoma using a continuous-time Markov chain (CTMC). The first code block sets up the simulation by defining the covariates: patient age and tumor size. Mean patient age is set at 62 years. Tumor size ranges from 2-4 cm, with depth of invasion (DOI) 5-10 mm.

Simulation setup
set.seed(6413)

N_PATIENTS <- 1000
MAX_TIME   <- 60      # months (5-year follow-up)
obs_base   <- c(0, 1, 2, 3, 6, 9, 12, 18, 24, 36, 48, 60)

state_labels_ii <- c(
  "Surgery", "DefinitiveRT", "PostOpSurveillance",
  "AdjuvantRT_PORT", "AdjuvantChemoRT_POCRT",
  "NED", "LR_Recurrence", "Death"
)

# ── Patient covariates ────────────────────────────────────────────────────────
# Stage II OSCC (T2 N0 M0): tumor 2-4 cm, DOI 5-10 mm
age           <- as.integer(round(rnorm(N_PATIENTS, mean = 62, sd = 11)))
age           <- pmax(32L, pmin(85L, age))
tumor_size_cm <- round(runif(N_PATIENTS, 2.0, 4.0), 1)
DOI_mm        <- round(runif(N_PATIENTS, 5.0, 10.0), 1)

# Surgical candidacy: logistic model, ~78% surgery at age 60
surgery_prob <- plogis(1.5 - 0.04 * (age - 60))
treatment    <- ifelse(runif(N_PATIENTS) < surgery_prob, "Surgery", "DefinitiveRT")

Clinician Inputs

The simulation is driven by two sets of clinician-interpretable inputs that are entered at the beginning of this section of code:

  • Jump chain probabilities jump_P[i,j]: given that a patient leaves state i, the probability they transition directly to state j.
  • Mean sojourn times mean_sojourn[i]: the average time (months) a patient spends in state i before transitioning.

The jump chain may be thought of as an embedded discrete-time Markov chain that gives the probabilities of the state the chain will jump to next when it is time to jump. The sojourn times are inverted to obtain the transition rates. The values for these parameters have been derived from the literature; however, the model has been set up so that they can be easily changed if the physician thinks that a particular class of patients would be better represented by other values. The justifications for the default parameter values driving the model are provided in the comments in the code and in the references below.
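Because the generator is derived mechanically from these two inputs, it is worth validating them before simulating: each transient row of the jump chain must sum to 1, and each transient state needs a positive, finite mean sojourn time. A minimal, self-contained checker (the toy three-state chain below is illustrative, not part of the OSCC model):

```r
# Validate clinician inputs for a CTMC specified as a jump chain plus
# mean sojourn times. `transient` indexes the non-absorbing states.
check_inputs <- function(jump_P, mean_sojourn, transient) {
  stopifnot(
    all(abs(rowSums(jump_P)[transient] - 1) < 1e-8),  # rows sum to 1
    all(is.finite(mean_sojourn[transient])),          # sojourns defined
    all(mean_sojourn[transient] > 0)                  # and positive
  )
  invisible(TRUE)
}

# Toy example: healthy -> ill -> dead, with dead (state 3) absorbing
P <- matrix(c(0, 0.8, 0.2,
              0, 0,   1,
              0, 0,   0), nrow = 3, byrow = TRUE)
check_inputs(P, c(24, 6, NA), transient = 1:2)
```

For the model in this post the same call would be `check_inputs(jump_P, mean_sojourn, transient = 1:7)`, run just before `Q_base` is derived.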

Show code for model inputs
# == CLINICIAN INPUTS ==========================================================
#
# jump_P[i, j]  : probability of moving to state j *given* a transition occurs
#                 from state i.  Rows must sum to 1 for states 1-7.
#
# mean_sojourn  : average time (months) in each transient state.
#
# Together these fully specify the CTMC generator:
#   Q[i, j] = jump_P[i, j] / mean_sojourn[i]   (off-diagonal)
# All downstream simulation code derives Q_base automatically from these inputs.

# ── Jump chain transition probabilities ───────────────────────────────────────


jump_P <- matrix(0, 8, 8,
  dimnames = list(paste0("S", 1:8), paste0("S", 1:8)))

# S1 (Surgery) -> {S3=no high-risk, S4=PORT, S5=POCRT, S8=peri-op death}
# ~20-30% high-risk features -> PORT (Tassone et al. 2023, NCDB n=53,503; DOI:10.1002/ohn.205)
# ~10-15% ECE/pos margins -> POCRT; peri-operative mortality ~1-2% (Nathan et al. 2025, NSQIP n=866; DOI:10.18203/issn.2454-5929.ijohns20252980)
# NB: S8 revised from 0.05 to 0.015 — the original 5% exceeded the literature range of 1-2%;
# freed mass (0.035) reallocated to S3 (residual no-adjuvant group), S4 and S5 unchanged.
jump_P[1, c(3, 4, 5, 8)] <- c(0.585, 0.25, 0.15, 0.015)

# S2 (DefinitiveRT) -> {S6=NED, S7=treatment failure, S8=death}
# Definitive RT alone locoregional control ~65% for T2N0M0 (Dana-Farber Group 2011 PMID 21531515, 2-yr LRC 64%; Studer et al. 2007 T2-4 LC ~50-60%)
jump_P[2, c(6, 7, 8)] <- c(0.65, 0.28, 0.07)

# S3 (PostOpSurveillance) -> {S6=NED, S7=LR recurrence, S8=death}
# 5-yr LR recurrence ~12-18% after margin-negative R0 surgery alone (Ord 2006 15.5%; Luryi 2014 NCDB early-stage series)
jump_P[3, c(6, 7, 8)] <- c(0.74, 0.18, 0.08)

# S4 (PORT) -> {S6=NED, S7=LR recurrence, S8=death}
# After PORT, 3-yr OS ~71% (95% CI 67-75%) in adjuvant PORT cohort (Hosni et al. 2019, PMID: 30244160)
jump_P[4, c(6, 7, 8)] <- c(0.67, 0.23, 0.10)

# S5 (POCRT) -> {S6=NED, S7=LR recurrence, S8=death}
# After POCRT, locoregional control ~60%; Stage II mortality revised down to 10% (Bernier 2004, Cooper 2004; Stage II subgroup)
jump_P[5, c(6, 7, 8)] <- c(0.60, 0.30, 0.10)

# S6 (NED) -> {S7=LR recurrence, S8=death}
# ~75% leave NED via recurrence vs other-cause death (SEER Stage II split)
jump_P[6, c(7, 8)] <- c(0.75, 0.25)

# S7 (LR Recurrence) -> {S6=salvage success, S8=death}
# Salvage success 0.43 / mortality 0.57: Lee et al. 2024 pooled 5-yr OS = 43% (n=2069); revised from c(0.25, 0.75) — previous value over-estimated post-recurrence mortality
jump_P[7, c(6, 8)] <- c(0.43, 0.57)

# ── Mean sojourn times (months) ───────────────────────────────────────────────
mean_sojourn <- c(
  1.5,   # S1  Surgery: weighted mean accounting for real-world S-PORT delays ~8-9 wks (Correia 2026; Dayan 2023; Graboyes 2017)
  3.0,   # S2  DefinitiveRT: RT course (1.5 mo) + post-RT response assessment window >=12 wks (NCCN HNC Guidelines; Mehanna et al. 2016 [PET-NECK] PMID 26958921)
  22.0,  # S3  PostOpSurveillance: weighted mean to event (NED@24mo x0.74 + recur@18mo x0.18 + death@14mo x0.08) = 22 mo (Blatt 2022; Brands 2019; Ord 2006)
  3.0,   # S4  PORT: ~6-week adjuvant RT + recovery — supported, unchanged
  4.0,   # S5  POCRT: ~6-week concurrent chemoRT + extended recovery — supported, unchanged
  120.0, # S6  NED: calibrated to SEER ~20% 5-yr recurrence; exit rate 0.0083/mo -> mean 120 mo (SEER localized OCC; Brands 2019; Luryi 2014)
  12.5,  # S7  LR Recurrence: median OS ~12-18 mo all-comers (Liu 2007; Contrera 2022) — within supported range, unchanged
  NA     # S8  Death (absorbing)
)
names(mean_sojourn) <- paste0("S", 1:8)

# == END CLINICIAN INPUTS ======================================================

# ── Derive Q_base from clinician inputs ───────────────────────────────────────
# Q[i,j] = jump_P[i,j] / mean_sojourn[i]
Q_base <- matrix(0, 8, 8)
for (i in 1:7) {
  for (j in which(jump_P[i, ] > 0)) {
    Q_base[i, j] <- jump_P[i, j] / mean_sojourn[i]
  }
}
diag(Q_base) <- -rowSums(Q_base)
Jump Chain Transition Probabilities
Show code
as.data.frame(jump_P)
   S1 S2    S3   S4   S5   S6   S7    S8
S1  0  0 0.585 0.25 0.15 0.00 0.00 0.015
S2  0  0 0.000 0.00 0.00 0.65 0.28 0.070
S3  0  0 0.000 0.00 0.00 0.74 0.18 0.080
S4  0  0 0.000 0.00 0.00 0.67 0.23 0.100
S5  0  0 0.000 0.00 0.00 0.60 0.30 0.100
S6  0  0 0.000 0.00 0.00 0.00 0.75 0.250
S7  0  0 0.000 0.00 0.00 0.43 0.00 0.570
S8  0  0 0.000 0.00 0.00 0.00 0.00 0.000
Mean Sojourn Times and Implied Rates
Show code
data.frame(
  #State        = paste0("S", 1:7),
  Label        = state_labels_ii[1:7],
  Mean_sojourn = mean_sojourn[1:7],
  Total_rate   = round(-diag(Q_base)[1:7], 4)
)
                   Label Mean_sojourn Total_rate
S1               Surgery          1.5     0.6667
S2          DefinitiveRT          3.0     0.3333
S3    PostOpSurveillance         22.0     0.0455
S4       AdjuvantRT_PORT          3.0     0.3333
S5 AdjuvantChemoRT_POCRT          4.0     0.2500
S6                   NED        120.0     0.0083
S7         LR_Recurrence         12.5     0.0800

Synthetic Data

This code block constructs the synthetic data and prints out the first few rows of the synthetic data set.

Show code
# ── Patient-specific Q (covariates scale recurrence & mortality rates) ─────────
make_Q <- function(age_i, size_i) {
  Q <- Q_base
  age_eff  <- exp(0.025 * (age_i  - 60))   # older  -> higher mortality
  size_eff <- exp(0.180 * (size_i -  3.0))  # larger -> higher recurrence
  Q[3, 7] <- Q_base[3, 7] * size_eff        # PostOpSurv -> LR Recurrence
  Q[6, 7] <- Q_base[6, 7] * size_eff * age_eff
  Q[6, 8] <- Q_base[6, 8] * age_eff
  Q[7, 8] <- Q_base[7, 8] * age_eff
  diag(Q) <- 0
  diag(Q) <- -rowSums(Q)
  Q
}

# ── Simulate one patient's full CTMC path ─────────────────────────────────────
sim_ctmc <- function(init_s, Q_i, max_t) {
  t_vec <- 0; s_vec <- init_s; s <- init_s; t <- 0
  while (t < max_t && s != 8L) {
    total_rate <- -Q_i[s, s]
    if (total_rate == 0) break
    t <- t + rexp(1L, total_rate)
    if (t > max_t) break
    probs    <- pmax(Q_i[s, ], 0)
    probs[s] <- 0
    s        <- sample.int(8L, 1L, prob = probs)
    t_vec    <- c(t_vec, t); s_vec <- c(s_vec, s)
  }
  list(times = t_vec, states = s_vec)
}

# State at a given observation time (last known state)
state_at <- function(traj, obs_t) {
  traj$states[max(which(traj$times <= obs_t))]
}

# ── Generate panel data ────────────────────────────────────────────────────────
records <- vector("list", N_PATIENTS)

for (i in seq_len(N_PATIENTS)) {
  init_s  <- if (treatment[i] == "Surgery") 1L else 2L
  Q_i     <- make_Q(age[i], tumor_size_cm[i])
  traj    <- sim_ctmc(init_s, Q_i, MAX_TIME)

  jitter  <- c(0, runif(length(obs_base) - 1L, -0.5, 0.5))
  obs_t   <- sort(unique(pmax(0, obs_base + jitter)))

  death_t <- if (8L %in% traj$states)
    traj$times[which(traj$states == 8L)[1L]]
  else Inf

  obs_t <- obs_t[obs_t < death_t]
  if (is.finite(death_t) && death_t <= MAX_TIME)
    obs_t <- c(obs_t, death_t)

  obs_s <- sapply(obs_t, state_at, traj = traj)

  records[[i]] <- data.frame(
    id            = i,
    time          = round(obs_t, 2),
    state         = obs_s,
    age           = age[i],
    tumor_size_cm = tumor_size_cm[i],
    DOI_mm        = DOI_mm[i],
    treatment     = treatment[i],
    stringsAsFactors = FALSE
  )
}

oscc2_data <- do.call(rbind, records)
rownames(oscc2_data) <- NULL
oscc2_data$state_label <- state_labels_ii[oscc2_data$state]

# Save synthetic data
write.csv(oscc2_data, "OSCC_HE_Synthetic_Data.csv", row.names = FALSE)

# First 10 rows
head(oscc2_data, 10)
   id  time state age tumor_size_cm DOI_mm treatment        state_label
1   1  0.00     1  59           2.2    7.8   Surgery            Surgery
2   1  1.38     1  59           2.2    7.8   Surgery            Surgery
3   1  1.82     1  59           2.2    7.8   Surgery            Surgery
4   1  3.01     1  59           2.2    7.8   Surgery            Surgery
5   1  6.04     3  59           2.2    7.8   Surgery PostOpSurveillance
6   1  8.50     3  59           2.2    7.8   Surgery PostOpSurveillance
7   1 11.61     3  59           2.2    7.8   Surgery PostOpSurveillance
8   1 17.91     3  59           2.2    7.8   Surgery PostOpSurveillance
9   1 24.01     3  59           2.2    7.8   Surgery PostOpSurveillance
10  1 35.86     6  59           2.2    7.8   Surgery                NED

Note that two covariates, patient age and tumor size, are included in the data and encoded in the make_Q function used to create the synthetic data as follows:

  • age_eff <- exp(0.025 * (age_i - 60)) # older -> higher mortality
  • size_eff <- exp(0.180 * (size_i - 3.0)) # larger -> higher recurrence
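To make these effects concrete, the hazard multipliers implied by the two formulas can be evaluated at a few covariate values (this simply restates the scaling inside make_Q; no new model content):

```r
# Hazard multipliers implied by the covariate effects in make_Q()
age_mult  <- function(age)  exp(0.025 * (age  - 60))   # scales mortality rates
size_mult <- function(size) exp(0.180 * (size - 3.0))  # scales recurrence rates

round(age_mult(c(50, 60, 70, 80)), 3)
# 0.779 1.000 1.284 1.649  -> a 70-year-old's mortality rates are ~28% higher
round(size_mult(c(2.0, 3.0, 4.0)), 3)
# 0.835 1.000 1.197        -> a 4 cm tumour raises recurrence rates by ~20%
```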

Summary of Data Set

This table shows some of the meta-data describing the synthetic data set.

Show code for cohort summary
# Collect components for footnotes
trt_counts <- table(treatment)
died <- oscc2_data |>
  dplyr::filter(state == 8) |>
  dplyr::distinct(id) |>
  nrow()

# Main table: state visit counts
as.data.frame(table(State = oscc2_data$state_label)) |>
  gt() |>
  tab_options(table.font.size = px(12)) |>
  cols_label(State = "State", Freq = "Observations") |>
  tab_source_note(md(paste0(
    "**Dataset:** ", nrow(oscc2_data), " observations; ",
    N_PATIENTS, " patients"
  ))) |>
  tab_source_note(md(paste0(
    "**Treatment:** Surgery = ", trt_counts["Surgery"],
    "; DefinitiveRT = ", trt_counts["DefinitiveRT"]
  ))) |>
  tab_source_note(md(paste0(
    "**Mortality:** ", died, " / ", N_PATIENTS,
    " patients reached Death (",
    round(100 * died / N_PATIENTS, 1), "%)"
  )))
State Observations
AdjuvantChemoRT_POCRT 228
AdjuvantRT_PORT 333
Death 370
DefinitiveRT 570
LR_Recurrence 767
NED 4407
PostOpSurveillance 2381
Surgery 1578
Dataset: 10634 observations; 1000 patients
Treatment: Surgery = 794; DefinitiveRT = 206
Mortality: 370 / 1000 patients reached Death (37%)

First Patient Trajectory

This data frame shows the complete trajectory of states visited by the first synthetic patient.

Show code for trajectory
oscc2_data[oscc2_data$id == 1,
           c("id", "time", "state", "state_label", "treatment",
             "age", "tumor_size_cm", "DOI_mm")]
   id  time state        state_label treatment age tumor_size_cm DOI_mm
1   1  0.00     1            Surgery   Surgery  59           2.2    7.8
2   1  1.38     1            Surgery   Surgery  59           2.2    7.8
3   1  1.82     1            Surgery   Surgery  59           2.2    7.8
4   1  3.01     1            Surgery   Surgery  59           2.2    7.8
5   1  6.04     3 PostOpSurveillance   Surgery  59           2.2    7.8
6   1  8.50     3 PostOpSurveillance   Surgery  59           2.2    7.8
7   1 11.61     3 PostOpSurveillance   Surgery  59           2.2    7.8
8   1 17.91     3 PostOpSurveillance   Surgery  59           2.2    7.8
9   1 24.01     3 PostOpSurveillance   Surgery  59           2.2    7.8
10  1 35.86     6                NED   Surgery  59           2.2    7.8
11  1 47.91     6                NED   Surgery  59           2.2    7.8
12  1 59.89     6                NED   Surgery  59           2.2    7.8

Survival Analysis

Overall Survival

This next plot shows overall survival (Kaplan-Meier) curves for patients in both treatment arms. Note that surgery exhibits a slightly better overall survival curve (approximately 73% vs approximately 65% at 5 years). However, both curves show very good five-year survival probabilities.

Three external benchmark points are overlaid on the Kaplan–Meier curves as filled diamonds.

  • Siegel et al. (2024) report approximately 84% five-year relative survival for localised oral cavity cancer in SEER (diagnoses 2013–2019); adjusting for background mortality at a median age of approximately 62 years using the SSA 2021 Period Life Table yields an estimated absolute five-year OS of ~77%, shown here against the Surgery arm curve. See the 5 year adjustment note below.
  • Hosni et al. (2019) observed a three-year OS of 71% (95% CI 67–75%) in 601 OSCC patients receiving adjuvant PORT at Princess Margaret Cancer Centre, also plotted against the Surgery arm.
  • Sher et al. (2011) reported a two-year OS of 63% in 12 oral cavity patients treated with definitive IMRT at Dana-Farber Cancer Institute, shown against the Definitive RT arm.

While the benchmark points do not validate the survival curves, they indicate that the curves are a plausible starting point for analysis.

Kaplan-Meier survival curves
library(survival)
library(survminer)

# One row per patient: time to death or end of follow-up
km_input <- oscc2_data |>
  dplyr::group_by(id, treatment) |>
  dplyr::summarise(
    time  = max(time),
    event = as.integer(any(state == 8)),
    .groups = "drop"
  )

km_fit_2arm <- survfit(Surv(time, event) ~ treatment, data = km_input)

km_plot <- ggsurvplot(
  km_fit_2arm,
  data        = km_input,
  conf.int    = FALSE,
  risk.table  = TRUE,
  palette     = c("#D55E00", "#0072B2"),   # DefinitiveRT=orange, Surgery=blue (alphabetical strata order)
  legend.labs = c("Definitive RT", "Surgery"),
  xlab        = "Time (months)",
  ylab        = "Survival probability",
  title       = "Overall Survival — Stage II OSCC",
  ylim        = c(0.25, 1),
  ggtheme     = theme_bw()
)

# Overlay external benchmark points (diamonds) on the KM plot
km_plot$plot <- km_plot$plot +
  annotate("point", x = 60, y = 0.77, shape = 23, size = 4,
           fill = "#0072B2", color = "black") +
  annotate("text",  x = 60, y = 0.77,
           label = "Siegel 2024\n(SEER ~77%)", vjust = -0.35, hjust = 0.5, size = 2.8) +
  annotate("point", x = 36, y = 0.71, shape = 23, size = 4,
           fill = "#0072B2", color = "black") +
  annotate("text",  x = 36, y = 0.71,
           label = "Hosni 2019\n(71% at 3 yr)", vjust = -0.35, hjust = 0.5, size = 2.8) +
  annotate("point", x = 24, y = 0.63, shape = 23, size = 4,
           fill = "#D55E00", color = "black") +
  annotate("text",  x = 24, y = 0.63,
           label = "Sher 2011\n(63% at 2 yr)", vjust = 1.6, hjust = 0.5, size = 2.8)

km_plot

Survival by Age and Tumour Size

This next panel of plots shows overall survival broken out by the covariates age and tumor size. Each panel shows curves for the four possible combinations of the covariate categories.

KM curves stratified by age × tumour size
library(broom)
library(ggplot2)
library(dplyr)

# ── One row per patient with covariate categories ─────────────────────────────
# Age split at 62 (simulation mean); tumour size split at 3 cm (Stage II midpoint)
km_cov <- oscc2_data |>
  dplyr::group_by(id, treatment) |>
  dplyr::summarise(
    time          = max(time),
    event         = as.integer(any(state == 8)),
    age           = first(age),
    tumor_size_cm = first(tumor_size_cm),
    .groups       = "drop"
  ) |>
  dplyr::mutate(
    age_grp  = factor(ifelse(age  < 62,  "Age < 62",   "Age ≥ 62"),
                      levels = c("Age < 62", "Age ≥ 62")),
    size_grp = factor(ifelse(tumor_size_cm < 3.0, "Tumour < 3 cm", "Tumour ≥ 3 cm"),
                      levels = c("Tumour < 3 cm", "Tumour ≥ 3 cm")),
    covariate_group = factor(
      interaction(age_grp, size_grp, sep = ", "),
      levels = c(
        "Age < 62, Tumour < 3 cm",
        "Age < 62, Tumour ≥ 3 cm",
        "Age ≥ 62, Tumour < 3 cm",
        "Age ≥ 62, Tumour ≥ 3 cm"
      )
    )
  )

pal4  <- c("#2166AC", "#92C5DE", "#D6604D", "#B2182B")
labs4 <- c("Age < 62, Tumour < 3 cm",
           "Age < 62, Tumour ≥ 3 cm",
           "Age ≥ 62, Tumour < 3 cm",
           "Age ≥ 62, Tumour ≥ 3 cm")

# ── Tidy survival estimates per arm ──────────────────────────────────────────
tidy_arm <- function(arm, data) {
  df  <- dplyr::filter(data, treatment == arm)
  fit <- survfit(Surv(time, event) ~ covariate_group, data = df)
  broom::tidy(fit, conf.int = FALSE) |>
    dplyr::mutate(
      treatment = arm,
      covariate_group = factor(
        sub("covariate_group=", "", strata),
        levels = levels(data$covariate_group)
      )
    )
}

surv_tidy <- dplyr::bind_rows(
  tidy_arm("Surgery",      km_cov),
  tidy_arm("DefinitiveRT", km_cov)
) |>
  dplyr::mutate(treatment = factor(treatment,
    levels = c("Surgery", "DefinitiveRT"),
    labels = c("Surgery Arm", "Definitive RT Arm")))

# ── Two-panel KM plot ─────────────────────────────────────────────────────────
ggplot(surv_tidy, aes(x = time, y = estimate,
                      colour = covariate_group,
                      group  = covariate_group)) +
  geom_step(linewidth = 0.8) +
  facet_wrap(~treatment, ncol = 2) +
  scale_colour_manual(values = pal4, labels = labs4) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1), breaks = seq(0, 1, 0.25)) +
  scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, 12)) +
  labs(
    x      = "Time (months)",
    y      = "Survival probability",
    colour = "Age / Tumour Size",
    title  = "Overall Survival by Age and Tumour Size — Stage II OSCC"
  ) +
  theme_bw(base_size = 11) +
  theme(
    legend.position  = "bottom",
    legend.direction = "horizontal",
    legend.text      = element_text(size = 9),
    legend.title     = element_text(size = 10, face = "bold"),
    strip.text       = element_text(size = 11, face = "bold"),
    plot.title       = element_text(size = 12, face = "bold")
  ) +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))


What the subgroup curves show:

  • Surgery arm (n = 794): age is the dominant predictor of survival. The youngest, smallest-tumour subgroup (Age < 62, Tumour < 3 cm; dark blue) reaches 74% at 5 years, while the youngest with larger tumours (Age < 62, Tumour ≥ 3 cm; light blue) reaches 71%. Among older patients (Age ≥ 62), those with smaller tumours (Tumour < 3 cm; orange) fare worst at 58%, while those with larger tumours (dark red) reach 51%. Even if these counterintuitive orderings are due to sampling variation rather than a more fundamental underlying cause, they may indicate a limitation of this particular synthetic data set.

  • Definitive RT arm (n = 206): the four subgroups together contain only 206 patients, yielding wide confidence bands and coarse staircase steps; point estimates (69%, 68%, 55%, 51% for the dark-blue, light-blue, orange, and dark-red subgroups respectively) should be interpreted with caution. The broad pattern mirrors the Surgery arm in that older age is associated with lower survival, but the small-n instability makes subgroup comparisons across treatment arms unreliable.

Progression-Free Survival

Progression-free survival (PFS) is defined here as the time from treatment start to the first occurrence of either locoregional recurrence (S7) or death (S8), whichever comes first. Patients who reached neither endpoint by month 60 are censored at their last observed time.

Progression-free survival — Kaplan-Meier with at-risk table
library(survival)
library(survminer)

# ── Build per-patient PFS endpoint ───────────────────────────────────────────
# Event = first entry into LR_Recurrence (state 7) or Death (state 8)
pfs_events <- oscc2_data |>
  dplyr::filter(state %in% c(7L, 8L)) |>
  dplyr::group_by(id) |>
  dplyr::slice_min(time, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(id, pfs_time = time, pfs_event_type = state_label)

last_obs <- oscc2_data |>
  dplyr::group_by(id) |>
  dplyr::slice_max(time, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(id, last_time = time, treatment)

pfs_df <- last_obs |>
  dplyr::left_join(pfs_events, by = "id") |>
  dplyr::mutate(
    pfs_time  = dplyr::coalesce(pfs_time, last_time),
    pfs_event = as.integer(!is.na(pfs_event_type))
  )

# ── Fit KM curves ─────────────────────────────────────────────────────────────
km_pfs <- survfit(Surv(pfs_time, pfs_event) ~ treatment, data = pfs_df)
km_pfs_named <- km_pfs
names(km_pfs_named$strata) <- c("Definitive RT", "Surgery")

# ── Plot with at-risk table ───────────────────────────────────────────────────
ggsurvplot(
  km_pfs_named,
  data              = pfs_df,
  palette           = c("#D55E00", "#0072B2"),
  conf.int          = TRUE,
  conf.int.alpha    = 0.15,
  size              = 0.9,
  risk.table        = TRUE,
  risk.table.col    = "strata",
  risk.table.height = 0.25,
  tables.theme      = theme_cleantable(),
  xlab              = "Time (months)",
  ylab              = "Progression-free probability",
  title             = "Progression-Free Survival — Stage II OSCC",
  legend.title      = "Treatment",
  legend.labs       = c("Definitive RT", "Surgery"),
  legend            = "bottom",
  surv.median.line  = "hv",
  break.time.by     = 12,
  xlim              = c(0, 60),
  ylim              = c(0, 1),
  ggtheme           = theme_minimal(base_size = 13),
  fontsize          = 4
)

PFS at key timepoints

PFS estimates at 12, 24, 36, 48, 60 months
tbl_fun <- function(s) {
  data.frame(
    Treatment = as.character(s$strata),
    Time      = s$time,
    PFS       = round(s$surv,  3),
    Lower95   = round(s$lower, 3),
    Upper95   = round(s$upper, 3)
  )
}
tbl_fun(summary(km_pfs_named, times = c(12, 24, 36, 48, 60))) |>
  knitr::kable(
    col.names = c("Treatment", "Month", "PFS", "95% CI lower", "95% CI upper"),
    caption   = "Kaplan-Meier PFS estimates at annual checkpoints"
  )
Kaplan-Meier PFS estimates at annual checkpoints
Treatment Month PFS 95% CI lower 95% CI upper
Definitive RT 12 0.631 0.569 0.701
Definitive RT 24 0.573 0.509 0.645
Definitive RT 36 0.544 0.480 0.616
Definitive RT 48 0.519 0.455 0.592
Definitive RT 60 0.449 0.386 0.523
Surgery 12 0.783 0.755 0.813
Surgery 24 0.683 0.651 0.716
Surgery 36 0.617 0.584 0.652
Surgery 48 0.564 0.531 0.600
Surgery 60 0.511 0.477 0.548

What the PFS curves show

  • Surgery achieves notably higher early PFS: approximately 78% at 12 months versus 63% for Definitive RT — a 15 percentage point difference driven by the higher probability of early locoregional treatment failure in the RT arm.
  • The gap narrows progressively over the 60-month horizon, with Surgery at 51% and Definitive RT at 45% at month 60.
  • The wider confidence band on the Definitive RT curve reflects the smaller sample size in that arm.

Time Patients Spend in Each Health State

The total time that patients are expected to spend in each state over a 60-month horizon drives the economic analysis. The ability to make this calculation is the primary motivation for modeling the treatment regimen as a continuous-time Markov chain.

State occupancy function
library(expm)

# Adapted from plot_msm_state_occupancy() (Breast_Cancer_v3.qmd).
# Accepts Q_base directly instead of a fitted msm model, and accepts
# a start_state argument so the two treatment arms can be compared.
plot_state_occupancy <- function(Q,
                                 start_state  = 1,
                                 tmax         = 60,
                                 tstep        = 0.5,
                                 state_labels = NULL,
                                 title        = "State Occupancy Probabilities") {
  n_states    <- nrow(Q)
  start_probs <- rep(0, n_states)
  start_probs[start_state] <- 1
  time <- seq(0, tmax, by = tstep)

  occ <- matrix(NA, nrow = length(time), ncol = n_states)
  colnames(occ) <- paste0("S", seq_len(n_states))
  for (i in seq_along(time))
    occ[i, ] <- start_probs %*% expm(Q * time[i])

  df_long <- as.data.frame(occ) |>
    dplyr::mutate(time = time) |>
    tidyr::pivot_longer(-time, names_to = "state", values_to = "prob")

  if (!is.null(state_labels))
    df_long$state <- factor(df_long$state,
      levels = paste0("S", seq_len(n_states)),
      labels = state_labels)

  ggplot(df_long, aes(x = time, y = prob, colour = state)) +
    geom_line(linewidth = 1.0) +
    labs(x = "Time (months)", y = "Probability",
         title = title, colour = "State") +
    theme_bw(base_size = 12)
}

# Expected time in each state: trapezoid integral of P(t) over 0-60 months
compute_state_time <- function(Q, start_state, tmax = 60, tstep = 0.5) {
  n_states    <- nrow(Q)
  start_probs <- rep(0, n_states)
  start_probs[start_state] <- 1
  time <- seq(0, tmax, by = tstep)
  occ  <- matrix(NA, nrow = length(time), ncol = n_states)
  for (i in seq_along(time))
    occ[i, ] <- start_probs %*% expm(Q * time[i])
  apply(occ, 2, function(p)
    sum(diff(time) * (p[-1] + p[-length(p)]) / 2))
}
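As a quick sanity check on the trapezoid integration above (a toy example, not part of the analysis): for a two-state chain S1 → S2 with constant rate λ, the expected time in S1 over [0, T] has the closed form (1 − e^(−λT))/λ, and because the occupancy probabilities always sum to 1, the per-state times must sum to the horizon T. The sketch below uses a small eigen-based matrix exponential (`mexp`, a helper introduced here) in place of `expm::expm()` so it runs in base R.

```r
# Toy 2-state generator: leave S1 at rate lambda; S2 is absorbing.
lambda <- 0.1
Q_toy  <- matrix(c(-lambda, lambda,
                    0,      0), nrow = 2, byrow = TRUE)

# Minimal matrix exponential via eigendecomposition (fine for this
# diagonalisable toy Q; the post uses expm::expm() instead).
mexp <- function(M) {
  e <- eigen(M)
  Re(e$vectors %*% diag(exp(e$values), nrow(M)) %*% solve(e$vectors))
}

tmax <- 60; tstep <- 0.5
time <- seq(0, tmax, by = tstep)
occ  <- t(sapply(time, function(t) c(1, 0) %*% mexp(Q_toy * t)))

# Trapezoid integral of each occupancy curve, as in compute_state_time()
state_time <- apply(occ, 2, function(p)
  sum(diff(time) * (p[-1] + p[-length(p)]) / 2))

state_time[1]    # close to the closed form (1 - exp(-lambda * tmax)) / lambda
sum(state_time)  # equals tmax, since occupancy rows sum to 1
```

The closed-form comparison catches sign and indexing mistakes in the trapezoid weights before the function is trusted with the full eight-state generator.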
Expected state time by arm
time_surg <- compute_state_time(Q_base, start_state = 1)
time_def  <- compute_state_time(Q_base, start_state = 2)

data.frame(
  State      = paste0("S", 1:8),
  Label      = state_labels_ii,
  Surgery_mo = round(time_surg, 2),
  DefRT_mo   = round(time_def,  2)
)
  State                 Label Surgery_mo DefRT_mo
1    S1               Surgery       1.51     0.00
2    S2          DefinitiveRT       0.00     3.01
3    S3    PostOpSurveillance      11.96     0.00
4    S4       AdjuvantRT_PORT       0.75     0.00
5    S5 AdjuvantChemoRT_POCRT       0.60     0.00
6    S6                   NED      28.83    35.54
7    S7         LR_Recurrence       4.09     5.67
8    S8                 Death      12.26    15.78

The state occupancy plots show the probability of being in each state as time progresses. For both plots, focusing on a particular time point gives an estimate of a patient’s probable health state.

State Occupancy: Surgery

Surgery occupancy plot
plot_state_occupancy(
  Q            = Q_base,
  start_state  = 1,
  state_labels = state_labels_ii,
  title        = "State Occupancy Probabilities — Surgery Arm"
)

State Occupancy: Definitive RT

Definitive RT occupancy plot
plot_state_occupancy(
  Q            = Q_base,
  start_state  = 2,
  state_labels = state_labels_ii,
  title        = "State Occupancy Probabilities — Definitive RT Arm"
)


Health Economic Evaluation — Patient Perspective

The health economics model presented uses QALYs calculated with EQ-5D utilities. Although these utilities are abstractions, they do provide a measure, however imperfect, of how patients treated for OSCC have rated the ensemble of experiences associated with each possible health state.

The model does not calculate financial costs, for the following reasons:

  1. In the United States, there is no practical way to estimate the final direct cost to patients who are covered by private insurance or Medicare. The U.S. health care system is set up so that the final cost of the treatments borne by patients is only made known after the treatments have been received.
  2. For patients with adequate insurance, cost is unlikely to be directly relevant to treatment decisions.
  3. For patients without insurance, the financial burden of treatment would dominate all other considerations.
  4. From a patient’s point of view, there is merit in clarifying the quality of life to be expected after undergoing a particular treatment. For example, it is not difficult to imagine that elderly patients would choose palliative care over a course of chemotherapy.

Quality-Adjusted Life Years (QALYs)

A Quality-Adjusted Life Year (QALY) combines the quantity and quality of life into a single metric. One QALY represents one year lived in perfect health; a year spent in a health state with diminished quality counts for less than one QALY in proportion to that state’s utility weight.

In this model, QALYs are calculated in three steps:

  1. Assign utility weights: Each of the eight health states is assigned an EQ-5D utility value between 0 (death) and 1 (perfect health), drawn from peer-reviewed literature on head-and-neck cancer patients.

  2. Estimate time in each state: The continuous-time Markov chain (CTMC) simulation provides the expected number of months the cohort spends in each state over a 60-month horizon, separately for the Surgery arm and the Definitive Radiation Therapy arm.

  3. Compute utility-weighted life years: For each state, the time (in months) is multiplied by its utility weight and the products are summed across all states. Dividing by 12 converts the result to years:

QALYs = (1/12) × Σ_s (t_s × u_s)

where t_s is the expected number of months in state s and u_s is its EQ-5D utility. The two treatment arms are evaluated on the same 60-month horizon, allowing a direct QALY comparison from the patient’s perspective.
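To make the arithmetic concrete, here is a minimal worked example (the numbers are invented for illustration, not model output): a single year split evenly between NED and LR recurrence.

```r
# Illustrative only: 6 months in NED (utility 0.82), 6 months in
# LR recurrence (utility 0.55). QALYs = sum(months * utility) / 12.
months  <- c(NED = 6, LR = 6)
utility <- c(NED = 0.82, LR = 0.55)
sum(months * utility) / 12   # 0.685 QALYs for that year
```

A year in perfect health would score 1.0; the utility weights discount each month by the quality of life in its state.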

Utility Values

Health state utility values map each model state to a preference-based quality-of-life weight on the 0–1 scale (0 = death, 1 = perfect health). The weights below are EQ-5D-based defaults drawn from the published literature. S6 (NED), the long-term disease-free state, serves as the reference; a NED-normalized column (NED = 1.00) is included for sensitivity work.

The sources for the utility values are provided as comments in the code below. Complete references are provided in the Reference section at the end.

Define and display utility weights
# Health state utility values (EQ-5D scale, 0 = death, 1 = perfect health).
# S6 (NED) is the reference (highest) living state.
# Sources: 
# S1 de Almeida et al. 2014; Noel et al. 2015
# S2 Sprave et al. 2022 (during-RT EQ-5D ~0.83, n=366); Truong/RTOG 0522 2017 retained for S5
# S3 Govers et al. 2016
# S4 Sprave et al. 2022 (during-RT EQ-5D ~0.83, n=366); Govers et al. 2016
# S5 Truong et al. / RTOG 0522 2017 (primary); Sprave et al. 2022 adjuvant CRT data support ~0.83 but POCRT toxicity justifies retaining 0.775 — see inline note
# S6 Noel et al. 2015 (reference state)
# S7 Meregaglia & Cairns 2017 (systematic review confirming evidence gap; only patient EQ-5D for recurrence found is median 0.70, del Barco et al. 2016, palliative/metastatic context); 0.55 is modeller's assumption
# S8 Convention (absorbing state)
utility <- c(
  S1 = 0.60,   # Surgery: acute peri-operative period
  S2 = 0.83,   # DefinitiveRT: during 6-7 week RT course (Sprave et al. 2022, mean EQ-5D at RT completion = 0.830, n=366)
  S3 = 0.72,   # PostOpSurveillance: recovery phase
  S4 = 0.83,   # PORT: adjuvant RT toxicity (Sprave et al. 2022, mean EQ-5D at RT completion = 0.830, n=366)
  S5 = 0.775, # POCRT: concurrent chemoRT (revised from 0.60).
              # Primary source: Truong/RTOG 0522 3-month EQ-5D proxy
              # (CIS arm 0.78, CET/CIS arm 0.77); end-of-treatment value
              # collected but not reported in paper.
              # Comparison: Sprave et al. (2022) adjuvant CRT cohort had
              # baseline HI = 0.849 with no significant within-group change
              # to RT completion, and CRT vs RT-alone did not differ at RT
              # completion (p = 0.624); the adjuvant cohort data would
              # support ~0.83. Value retained at 0.775 as a deliberate
              # conservative estimate: POCRT involves high-dose cisplatin
              # concurrent with post-surgical RT (higher toxicity than the
              # Sprave 2022 mixed RT cohort), and RTOG 0522 provides the
              # only patient-reported utility from an actual CRT trial.
              # A sensitivity analysis using 0.83 would narrow the Surgery
              # vs. Def RT QALY difference by ~0.009 QALYs (0.598 months in S5).
  S6 = 0.82,   # NED: reference state (Noel 2015)
  S7 = 0.55,   # LR Recurrence: modeller's assumption (no directly reported mean EQ-5D for LR recurrence; Meregaglia & Cairns 2017 confirms evidence gap; only available patient EQ-5D is median 0.70 from del Barco 2016 in palliative recurrent/metastatic HNC)
  S8 = 0.00    # Death: absorbing state
)

df_utility <- data.frame(
  State        = paste0("S", 1:8),
  Label        = state_labels_ii,
  Primary_Source = c(
    "de Almeida (2014); Noel (2015)",
    "Sprave (2022)",
    "Govers (2016)",
    "Sprave (2022)",
    "Truong / RTOG 0522 (2017)",
    "Noel (2015)",
    "Modeller's assumption",
    "Convention"
  ),
  Utility_EQ5D = utility,
  Utility_NED1 = round(utility / utility["S6"], 3),
  QALY_Surgery = round(time_surg * utility / 12, 3),
  QALY_DefRT   = round(time_def  * utility / 12, 3),
  QALY_Diff    = round((time_surg - time_def) * utility / 12, 3)
)
rownames(df_utility) <- NULL

df_utility |>
  gt() |>
  tab_options(table.font.size = px(10)) |>
  cols_label(
    State        = "State",
    Label        = "Health State",
    Primary_Source = "Primary Source",
    Utility_EQ5D = "EQ-5D Utility",
    Utility_NED1 = "NED-Normalised",
    QALY_Surgery = "QALYs — Surgery",
    QALY_DefRT   = "QALYs — Definitive RT",
    QALY_Diff    = "Difference (Surg − RT)"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body(rows = State == "S6")
  ) |>
  tab_style(
    style = cell_text(color = "#27AE60", weight = "bold"),
    locations = cells_body(columns = QALY_Diff, rows = QALY_Diff > 0)
  ) |>
  tab_style(
    style = cell_text(color = "#C0392B", weight = "bold"),
    locations = cells_body(columns = QALY_Diff, rows = QALY_Diff < 0)
  ) |>
  tab_spanner(
    label   = "Utility",
    columns = c(Utility_EQ5D, Utility_NED1)
  ) |>
  tab_spanner(
    label   = "QALYs by Treatment Arm",
    columns = c(QALY_Surgery, QALY_DefRT, QALY_Diff)
  ) |>
  grand_summary_rows(
    columns  = c(QALY_Surgery, QALY_DefRT, QALY_Diff),
    fns      = list(Total ~ sum(.)),
    fmt      = ~ fmt_number(., decimals = 3)
  ) |>
  tab_footnote(
    footnote  = md("Modeller's assumption. No directly reported patient EQ-5D for LR recurrence exists; best available evidence is median 0.70 (del Barco et al. 2016, palliative/metastatic context). See Meregaglia & Cairns (2017) in References."),
    locations = cells_body(columns = Primary_Source, rows = State == "S7")
  ) |>
  tab_source_note("NED-Normalised: utility relative to S6 NED = 1.00. QALYs = utility-weighted months / 12 over a 60-month horizon. Difference = Surgery − Definitive RT; green = Surgery advantage, red = RT advantage.")
State Health State Primary Source EQ-5D Utility NED-Normalised QALYs — Surgery QALYs — Definitive RT Difference (Surg − RT)
S1 Surgery de Almeida (2014); Noel (2015) 0.600 0.732 0.076 0.000 0.076
S2 DefinitiveRT Sprave (2022) 0.830 1.012 0.000 0.208 -0.208
S3 PostOpSurveillance Govers (2016) 0.720 0.878 0.718 0.000 0.718
S4 AdjuvantRT_PORT Sprave (2022) 0.830 1.012 0.052 0.000 0.052
S5 AdjuvantChemoRT_POCRT Truong / RTOG 0522 (2017) 0.775 0.945 0.039 0.000 0.039
S6 NED Noel (2015) 0.820 1.000 1.970 2.429 -0.459
S7 LR_Recurrence Modeller's assumption1 0.550 0.671 0.188 0.260 -0.072
S8 Death Convention 0.000 0.000 0.000 0.000 0.000
Total 3.043 2.897 0.146
1 Modeller’s assumption. No directly reported patient EQ-5D for LR recurrence exists; best available evidence is median 0.70 (del Barco et al. 2016, palliative/metastatic context). See Meregaglia & Cairns (2017) in References.
NED-Normalised: utility relative to S6 NED = 1.00. QALYs = utility-weighted months / 12 over a 60-month horizon. Difference = Surgery − Definitive RT; green = Surgery advantage, red = RT advantage.

QALY Comparison by Treatment Arm

With utility weights and state occupancy times established, QALYs can be compared across the two treatment arms. The stacked bar chart below shows the utility-weighted life years accrued in each health state over the 60-month horizon. Each bar is broken down by health state, so the chart reveals not just the total QALY difference between arms but where that difference arises: which states contribute most to each arm's advantage and which contribute similarly under both treatments.

QALY comparison chart
# Expected utility-weighted months per state per arm,
# using state occupancy integrals and EQ-5D utility weights.
# Total QALYs = sum(time_in_state_months * utility) / 12
qaly_surg <- time_surg * utility
qaly_def  <- time_def  * utility

state_palette <- c(
  S1 = "#E67E22", S2 = "#8E44AD", S3 = "#F39C12",
  S4 = "#2980B9", S5 = "#16A085", S6 = "#27AE60", S7 = "#C0392B"
)

df_qaly <- data.frame(
  State   = rep(paste0("S", 1:8), 2),
  Label   = rep(state_labels_ii, 2),
  Arm     = rep(c("Surgery", "Definitive RT"), each = 8),
  QALY_mo = c(qaly_surg, qaly_def)
) |>
  dplyr::filter(State != "S8") |>
  dplyr::mutate(
    State = factor(State, levels = paste0("S", 1:7)),
    Arm   = factor(Arm,   levels = c("Surgery", "Definitive RT"))
  )

totals_label <- data.frame(
  Arm   = factor(c("Surgery", "Definitive RT"),
                 levels = c("Surgery", "Definitive RT")),
  total = c(sum(qaly_surg), sum(qaly_def)) / 12,
  label = paste0(round(c(sum(qaly_surg), sum(qaly_def)) / 12, 2), " QALYs")
)

ggplot(df_qaly, aes(x = Arm, y = QALY_mo / 12, fill = State)) +
  geom_col(width = 0.55) +
  geom_text(
    data        = totals_label,
    aes(x = Arm, y = total + 0.06, label = label),
    inherit.aes = FALSE,
    fontface    = "bold", size = 4
  ) +
  scale_fill_manual(
    values = state_palette,
    labels = setNames(state_labels_ii[1:7], paste0("S", 1:7))
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.08)),
    breaks = seq(0, 3, 0.5)
  ) +
  labs(
    x        = NULL,
    y        = "Quality-Adjusted Life Years (QALYs)",
    fill     = "Health State",
    title    = "Expected QALYs by Health State and Treatment Arm",
    subtitle = "60-month horizon; utility weights from EQ-5D literature defaults",
    caption  = "State S8 (Death, utility = 0) excluded; contributes 0 QALYs"
  ) +
  theme_bw(base_size = 12)

A few things worth noting from the chart:

  • Surgery accumulates 3.04 QALYs vs. 2.90 QALYs for Definitive RT over the 60-month horizon, a difference of roughly 0.15 QALYs.
  • The dominant contributor in both arms is NED (green), which is expected given its high utility (0.82) and long sojourn time.
  • Surgery’s advantage comes largely from the PostOpSurveillance (orange) segment, which has a better utility (0.72) than Definitive RT’s LR Recurrence (dark red) burden.
  • The LR Recurrence segment is visibly larger for the Definitive RT arm (0.26 QALYs vs. 0.19), consistent with the higher recurrence probability built into that arm’s jump chain.
  • These QALY estimates are model-based (from Q_base × utility) and do not yet incorporate cost.

Discussion

The intended users of the class of models I am proposing are physicians and their support teams who are making treatment decisions and explaining the possible consequences of those treatments to their patients. In the first use case, physicians may find the formality of the model and its baseline estimates useful in finalizing their decisions and documenting their decision-making process. Having a model to inform treatment decisions should be especially helpful where multiple physicians collaborate on deciding treatment options. At the very least, a model would be useful in verifying shared assumptions.

The second use of the model, helping physicians and their care teams communicate with patients about the consequences of various treatment options, is more speculative. To be truly helpful to patients, physicians and their care teams must generate a compelling narrative for each health state, explaining the side effects of the treatments experienced in that state along with the strategies that previous patients have employed to cope with them. At the very least, the exercise of explaining a journey through the various states and interpreting the QALYs may be useful in structuring patient conversations and setting expectations.

Technical Notes

Five-year adjustment process

Five-year background survival for the general US population at age 62, S_bg(62), is computed from the 2021 SSA Period Life Table as the product of annual survival probabilities across ages 62 to 66:

S_bg(62) = ∏_{x=62}^{66} (1 − q_x)

where q_x is the one-year death probability at exact age x (SSA notation). Using the 2021 table values for ages 62–66: q_x = 0.01648, 0.01762, 0.01876, 0.01991, 0.02110 (male) and q_x = 0.01014, 0.01085, 0.01155, 0.01222, 0.01295 (female), giving S_bg(62) = 0.9096 (male) and S_bg(62) = 0.9436 (female). The sex-weighted background survival is:

S_bg = 0.70 × 0.9096 + 0.30 × 0.9436 = 0.920

where the 70:30 sex split is from SEER oral cavity incidence data. Absolute five-year OS is then obtained as:

OS_abs = RS_5yr × S_bg = 0.84 × 0.920 ≈ 0.773

where RS_5yr = 0.84 is the SEER five-year relative survival for localized oral cavity cancer (Siegel et al. 2024, CA Cancer J Clin, Figure 5). The code below details this calculation and provides a sensitivity check at age 65 (note: age-65 values are pending verification).

Siegel 2024: relative → absolute OS derivation
# US Social Security Administration period life tables (2021 release, used for 2019 SEER cohort).
# 5-year survival probabilities by sex at age 62 (closest to OSCC median diagnosis age).
# Computed as prod(1 - q_x) for x = 62:66 from Table 4c6 (2021 Trustees Report cycle).
# Source: Social Security Administration. Period Life Table, 2021 (2024 Trustees Report).
# https://www.ssa.gov/oact/STATS/table4c6.html
surv_male_62   <- 0.9096  # 5-yr background survival, males age 62;   SSA 2021 prod(1-qx), x=62:66
surv_female_62 <- 0.9436  # 5-yr background survival, females age 62; SSA 2021 prod(1-qx), x=62:66

# OSCC sex distribution: ~70% male, ~30% female (SEER, oral cavity)
prop_male <- 0.70
expected_survival_5yr <- prop_male * surv_male_62 + (1 - prop_male) * surv_female_62

# SEER 5-year relative survival for localized oral cavity (Siegel 2024, Fig. 5)
relative_survival_5yr <- 0.84

# Absolute OS = relative survival × expected (background) survival
absolute_os_5yr <- relative_survival_5yr * expected_survival_5yr

# Sensitivity: 5-yr background survival at age 65 (upper end of median diagnosis age range).
# Source: Social Security Administration. Period Life Table, 2021 (2024 Trustees Report).
# Computed as prod(1 - q_x) for x = 65:69. https://www.ssa.gov/oact/STATS/table4c6.html
surv_male_65   <- 0.8923  # 5-yr background survival, males age 65;   SSA 2021 prod(1-qx), x=65:69
surv_female_65 <- 0.9320  # 5-yr background survival, females age 65; SSA 2021 prod(1-qx), x=65:69
expected_surv_65 <- prop_male * surv_male_65 + (1 - prop_male) * surv_female_65
absolute_os_65   <- relative_survival_5yr * expected_surv_65

data.frame(
  Assumption              = c("Age 62 (lower bound)", "Age 65 (upper bound)"),
  Background_5yr_survival = round(c(expected_survival_5yr, expected_surv_65), 3),
  Absolute_5yr_OS         = round(c(absolute_os_5yr, absolute_os_65), 3)
)
            Assumption Background_5yr_survival Absolute_5yr_OS
1 Age 62 (lower bound)                   0.920           0.773
2 Age 65 (upper bound)                   0.904           0.760
Siegel 2024: relative → absolute OS derivation
# Age 62: background survival 0.920 × relative survival 0.84 = 0.773 (~77%).
# Age 65 sensitivity: background survival 0.904 × relative survival 0.84 = 0.760 (~76%).

Using the Code with Real Data

There is no doubt that models based on real patient data would be more convincing. To this end, I have structured the format of the generated synthetic data file so that it is suitable for fitting a continuous-time Markov chain with the R package msm. I have verified that msm is able to fit the eight-state sample model from the synthetic data. It is reasonable to expect that models of similar complexity could be fit from real patient data.
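A minimal sketch of what such a fit could look like with msm. The column names (id, time, state), the allowed-transition skeleton, and patient_data are assumptions for illustration, not the post's actual data layout: the real skeleton must mirror the model's jump chain, and since the data file is not available here the fitting calls are left commented.

```r
# Illustrative allowed-transition skeleton (1 = transition permitted).
# Entries are guided by the jump probabilities named in the References
# (e.g. jump_P[1,4], jump_P[2,6], jump_P[3,7], jump_P[7,6]) plus
# plausible companions; treat it as a placeholder, not the model.
Q_skeleton <- matrix(0, 8, 8,
                     dimnames = list(paste0("S", 1:8), paste0("S", 1:8)))
Q_skeleton[1, c(3, 4, 5, 8)] <- 1   # Surgery -> surveillance/PORT/POCRT/death
Q_skeleton[2, c(6, 7, 8)]    <- 1   # Definitive RT -> NED/recurrence/death
Q_skeleton[3, c(6, 7)]       <- 1   # Surveillance -> NED or LR recurrence
Q_skeleton[4, c(6, 7, 8)]    <- 1   # PORT -> NED/recurrence/death
Q_skeleton[5, c(6, 7, 8)]    <- 1   # POCRT -> NED/recurrence/death
Q_skeleton[6, c(7, 8)]       <- 1   # NED -> recurrence or death
Q_skeleton[7, c(6, 8)]       <- 1   # Recurrence -> salvage success or death

# library(msm)
# Q_init <- crudeinits.msm(state ~ time, subject = id,
#                          data = patient_data, qmatrix = Q_skeleton)
# fit <- msm(state ~ time, subject = id, data = patient_data,
#            qmatrix = Q_init, deathexact = 8)  # S8 death times known exactly
```

`crudeinits.msm()` turns observed transition counts into starting intensities, which usually makes the likelihood optimisation in `msm()` far more stable than hand-picked initial values.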

References

The references in this section are neither complete nor definitive. The literature is enormous. They have been selected to support the default values for the user inputs.

References for Background Mortality

  • Social Security Administration. Period Life Table, 2021 (as used in the 2024 Trustees Report). Available at: https://www.ssa.gov/oact/STATS/table4c6.html. Accessed 2026-03-18. Five-year background survival probabilities for the general US population at ages 62 and 65, computed as ∏(1 − q_x) over the five ages, where q_x is the one-year death probability at exact age x. Values at age 62: male 0.9096, female 0.9436. Used to convert SEER relative survival (Siegel 2024) to absolute OS for the calibration benchmark. Values at age 65: male 0.8923, female 0.9320 (computed from q_x = 0.01991, 0.02110, 0.02242, 0.02385, 0.02536 male; 0.01222, 0.01295, 0.01384, 0.01486, 0.01603 female).

References for Transition Probabilities and Sojourn Times

  • Bernier et al. (2004) “Postoperative Irradiation with or without Concomitant Chemotherapy for Locally Advanced Head and Neck Cancer”, N Engl J Med. 350:1945–1952. PMID: 15128894. Landmark EORTC 22931 RCT establishing concurrent chemoRT as standard of care for high-risk post-operative HNSCC. Supports jump_P[5,6] and the locoregional control rates in the POCRT arm (S5).

  • Blatt et al. (2022) “Tumor Recurrence and Follow-Up Intervals in Oral Squamous Cell Carcinoma”, J Clin Med. 11(23):7061. PMID: 36498636. PMC: PMC9740063. DOI: 10.3390/jcm11237061. University Medical Centre Mainz, n = 760 OSCC patients. Supports S3 NED declaration timing ~24 months (mean recurrence interval 24 ± 26 months; 50% of recurrences by 24 months) and S3 mean sojourn revision to 22 months.

  • Brands et al. (2019) “Time Patterns of Recurrence and Second Primary Tumors in a Large Cohort of Patients Treated for Oral Cavity Cancer”, Cancer Med. 8(12):5810–5819. DOI: 10.1002/cam4.2124. Retrospective cohort of 594 OSCC patients; the great majority of recurrences occur in the first 2 years; 5-year cumulative second-event risk ~30%. Supports S3 sojourn revision to 22 months and S6 NED calibration.

  • Contrera et al. (2022) “Outcomes for Recurrent Oral Cavity Squamous Cell Carcinoma”, Oral Oncol. 134:106127. PMID: 36155359. DOI: 10.1016/j.oraloncology.2022.106127. MD Anderson, n = 259 salvage surgeries (1997-2011); 5-year OS 44.2% for surgical candidates vs. 1.5% for nonsurgical therapy; 51% second recurrence at median 17 months. Provides contextual support for the salvage surgery framework; primary source for current values jump_P[7,6] = 0.43 and jump_P[7,8] = 0.57 is Lee et al. (2024).

  • Cooper et al. (2004) “Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck”, N Engl J Med. 350:1937–1944. PMID: 15084618. RTOG 9501 RCT; locoregional control ~60% at 5 years in the concurrent chemoRT arm. Supports S5 jump probabilities.

  • Correia et al. (2026) “Timely Matters: Predictors of Delay in Oral Cavity Cancer Patients Across the Care Continuum”. Laryngoscope Investig Otolaryngol. 11(2):e70363. DOI: 10.1002/lio2.70363. (Also available via eScholarship.) n = 93 OCSCC patients. Median surgery-to-adjuvant RT interval 8.4 weeks (same institution) and 9.3 weeks (different facility). Supports S1 mean sojourn revision to 1.5 months.

  • Dayan et al. (2023) “Predictors of prolonged treatment time intervals in oral cavity cancer”, Oral Oncol. 106622. DOI: 10.1016/j.oraloncology.2023.106622. CHUM Montreal, n = 145 multimodal OCSCC patients. Median surgery-to-PORT interval 64 days = 2.1 months. Supports S1 mean sojourn revision.

  • Goodwin (2000) “Salvage Surgery for Patients with Recurrent Squamous Cell Carcinoma of the Upper Aerodigestive Tract: When Do the Ends Justify the Means?”, Laryngoscope. 110(suppl 93):1–18. Cited in v2 for salvage success rate 15–25%. Superseded in v3 by Lee et al. 2024 for jump_P[7,6]; retained for historical context.

  • Graboyes et al. (2017) “Adherence to National Comprehensive Cancer Network Guidelines for Time to Initiation of Postoperative Radiation Therapy for Patients with Head and Neck Cancer”, Cancer. 123(14):2651–2660. DOI: 10.1002/cncr.30651. NCDB analysis, n = 47,273 HNSCC patients; over 50% failed to initiate PORT within the NCCN-recommended 6 weeks. Supports real-world S1 mean sojourn estimate of 1.5 months.

  • Hosni et al. (2019) “Predictors of Early Recurrence Prior to Planned Postoperative Radiation Therapy for Oral Cavity Squamous Cell Carcinoma and Outcomes Following Salvage Intensified Radiation Therapy”, Int J Radiat Oncol Biol Phys. 103(2):363–373. PMID: 30244160. Princess Margaret Cancer Centre, n = 601 OSCC patients (2003–2015); 3-year OS ~71% (95% CI 67–75%) after adjuvant PORT. Supports S4 jump probabilities (jump_P[4, 6:8]).

  • Katsoulakis et al. (2018) “Long-term outcomes in oral cavity squamous cell carcinoma with adjuvant and salvage radiotherapy after surgery”, Laryngoscope. PMID: 29637571. DOI: 10.1002/lary.27191. Memorial Sloan Kettering / VA Tampa; Evangelia Katsoulakis (first author). Provides contextual reference for the adjuvant RT pathway; jump_P[2,6] is now sourced from Dana-Farber Group 2011 and Studer et al. 2007 above.

  • Lee et al. (2024) “Clinical Outcome of Salvage Surgery in Patients with Recurrent Oral Cavity Cancer: A Systematic Review and Meta-Analysis”, Head Neck. 46(11):2901–2909. PMID: 39243149. DOI: 10.1002/hed.27928. 14 retrospective cohort studies, n = 2,069; pooled 5-year OS after salvage surgery = 43.0%; late-relapse subgroup 63.8% vs. 30.0% for early relapse. Primary source for jump_P[7,6] = 0.43 (salvage success) and jump_P[7,8] = 0.57 (mortality); revised from 0.25/0.75 — previous values over-estimated post-recurrence mortality relative to this pooled estimate.

  • Liu et al. (2007) “Impact of Recurrence Interval on Survival of Oral Cavity Squamous Cell Carcinoma Patients after Local Relapse”, Otolaryngol Head Neck Surg. 136(1):112–118. PMID: 17210345. DOI: 10.1016/j.otohns.2006.07.002. n = 1,687 oral cancer patients; 5-year OS after local recurrence 31.56%; recurrence within 18 months predicted higher mortality. Supports S7 sojourn of 12.5 months.

  • Mehanna et al. (2016) “PET-CT Surveillance versus Neck Dissection in Advanced Head and Neck Cancer”, N Engl J Med. 374(15):1444–1454. PMID: 26958921. DOI: 10.1056/NEJMoa1514493. PET-NECK RCT, n = 564 patients; response imaging performed at 12 weeks post-chemoradiotherapy; established ≥12-week post-RT interval as standard for response assessment in HNSCC. Supports S2 mean sojourn of 3.0 months (RT course 1.5 mo + ≥12-week response assessment window).

  • Luryi et al. (2014) “Positive Surgical Margins in Early Stage Oral Cavity Cancer: An Analysis of 20,602 Cases”, Otolaryngol Head Neck Surg. 151(6):984–990. PMID: 25210849. Large NCDB analysis of surgical margin rates and downstream outcomes in early-stage oral cavity cancer; early-stage N0 patients with negative margins have materially lower recurrence rates than all-stage series. Supports S3 transition probability framework, jump_P[3,7] = 0.18, and S6 NED calibration to 120 months.

  • Nathan et al. (2025) “The Influence of Reconstruction following Hemiglossectomy on Perioperative Outcomes”, Int J Otorhinolaryngol Head Neck Surg. DOI: 10.18203/issn.2454-5929.ijohns20252980. NSQIP database study, n = 866 hemiglossectomy patients (2008–2022); consistent with ~1–2% 30-day mortality in modern series. Supports jump_P[1,8] = 0.015.

  • Ord, Kolokythas & Reynolds (2006) “Surgical Salvage for Local and Regional Recurrence in Oral Cancer”, J Oral Maxillofac Surg. 64(9):1409–1414. PMID: 16916677. DOI: 10.1016/j.joms.2006.05.026. Reports outcomes of surgical salvage in recurrent oral cancer; data on local and regional recurrence rates and survival after primary surgery. Supports jump_P[3,7] = 0.18 (LR recurrence 12–18% for Stage II) and S3 mean sojourn revision to 22 months.

  • Sher et al. (2011) “Treatment of Oral Cavity Squamous Cell Carcinoma with Adjuvant or Definitive Intensity-Modulated Radiation Therapy”, Int J Radiat Oncol Biol Phys. 2011. PMID: 21531515. Dana-Farber Cancer Institute; n = 42 OCSCC (30 adjuvant IMRT, 12 definitive IMRT); Stage I–IV (24% Stage II); 2-yr LRC for definitive IMRT = 64%, 2-yr OS = 63%. Corroborated by Studer G, Zwahlen RA, Graetz KW et al. (2007), “IMRT in oral cavity cancer”, Radiat Oncol. 2:16. PMID: 17430599. DOI: 10.1186/1748-717X-2-16. University Hospital Zurich; n = 58 OCC IMRT patients; T1 LC = 95%; T2–4 and recurred stages LC ~50–60% at 2 years; definitive IMRT LC = 43% (but cohort was 69% T3/4 or recurred). Primary reference pair for jump_P[2,6] = 0.65 — definitive RT arm locoregional control for T2N0M0 Stage II; value is consistent with the early-stage subgroup in both series.

  • Shetty et al. (2022) “Salvage Surgery in Recurrent Oral Squamous Cell Carcinoma”, Front Oral Health. PMC8831824. Review of salvage surgery outcomes; radiation-naive recurrent OCSCC after salvage surgery: 5-year OS 59%, recurrence-free survival 60%. Provides supporting context for the salvage surgery framework; current value jump_P[7,6] = 0.43 is sourced from Lee et al. (2024).

  • Siegel et al. (2024) “Cancer Statistics, 2024”, CA Cancer J Clin. 74(1):12–49. PMID: 38230766. DOI: 10.3322/caac.21820. Annual American Cancer Society SEER-based statistics report. Figure 5 reports 5-year relative survival for localized oral cavity cancer (SEER localised stage, encompassing Stage I–II) as approximately 84% (diagnoses 2013–2019, follow-up through 2020). Note: relative survival overstates absolute OS — adjusting for sex-weighted background mortality from the SSA 2021 Period Life Table at median age 62 (background survival 0.920) yields an estimated absolute 5-yr OS of approximately 77% (0.84 × 0.920 ≈ 0.773) for Stage II localised oral cavity. Used as the primary population-level calibration benchmark for the Surgery arm.

  • Szturz et al. (2017) “Weekly Low-Dose Versus Three-Weekly High-Dose Cisplatin for Concurrent Chemoradiation in Locoregionally Advanced Non-Nasopharyngeal Head and Neck Cancer: A Systematic Review and Meta-Analysis”, Oncologist. 22(9):1056–1066. PMID: 28533474.

  • Tam et al. (2017) “Estimating Survival After Salvage Surgery for Recurrent Oral Cavity Cancer”, JAMA Otolaryngol Head Neck Surg. 143(7):685–690. PMID: 28448645. Reports survival outcomes following salvage surgery for recurrent oral cavity SCC. Supports S7 transition structure.

  • Tassone et al. (2023) “Going Off Guidelines: An NCDB Analysis of Missed Adjuvant Therapy Among Surgically Treated Oral Cavity Cancer”, Otolaryngol Head Neck Surg. 169(3):632–641. PMID: 36939392. DOI: 10.1002/ohn.205. NCDB analysis, n = 53,503; establishes PORT vs. POCRT indications and allocation rates. Supports jump_P[1,4] = 0.25 and jump_P[1,5] = 0.15.

  • Temam et al. (2005) “Treatment of the N0 Neck during Salvage Surgery after Radiotherapy of Head and Neck Squamous Cell Carcinoma”, Head Neck. 27(8):653–658. Cited in v2 for salvage success rates. Superseded in v3 by Lee et al. 2024; retained for historical context.

References for EQ-5D Utilities

  • de Almeida et al. (2014) “Preferences and Utilities for Health States after Treatment for Oropharyngeal Cancer: Transoral Robotic Surgery versus Definitive (Chemo)radiotherapy”, Head Neck. 36(4):529–539. Reports EQ-5D utility values across treatment modalities; informs utility weights for treatment and NED states.

  • Govers et al. (2016) “Quality of Life after Different Procedures for Regional Control in Oral Cancer Patients: Cross-Sectional Survey”, Clin Otolaryngol. 41(3):228–235. EQ-5D-3L measurement in oral cavity OSCC patients following different regional treatment approaches.

  • Meregaglia & Cairns (2017) “A Systematic Literature Review of Health State Utility Values in Head and Neck Cancer”, Health Qual Life Outcomes. 15(1):174. DOI: 10.1186/s12955-017-0748-z. PMID: 28865475. Systematic review of 28 studies and 346 health state utility values in HNC; recommends EQ-5D as the preferred instrument. Confirms that evidence for recurrent and metastatic HNC states is sparse: the only patient-reported EQ-5D for recurrence found in the review is a median of 0.70 (del Barco et al. 2016), from a palliative-intent recurrent/metastatic cohort — a different clinical context from S7 (locoregional recurrence with potential salvage intent) in this model. The S7 utility of 0.55 is a modeller’s assumption informed by this evidence gap, not a directly reported value from this review.

  • Noel et al. (2015) “Comparison of Health State Utility Measures in Patients With Head and Neck Cancer”, JAMA Otolaryngol Head Neck Surg. 141(8):696–703. Prospective, cross-sectional, and longitudinal study of 100 patients with squamous cell carcinoma of the upper aerodigestive tract. Mean EQ-5D-5L = 0.82 three months post-treatment with no evidence of recurrence. Primary source for NED (S6) utility value.

  • Ramaekers et al. (2011) “The Impact of Late Treatment-Toxicity on Generic Health-Related Quality of Life in Head and Neck Cancer Patients after Radiotherapy”, Oral Oncol. 47(8):768–774. DOI: 10.1016/j.oraloncology.2011.05.012. Multi-centre cross-sectional survey; EQ-5D measured in HNC patients at least 6 months post-RT (late-effects survivorship cohort, not active treatment). Reports xerostomia and dysphagia as independent predictors of reduced utility post-RT. Retained as contextual reference for late RT toxicity; not the primary source for S4 (PORT) utility — that cohort does not represent patients during active adjuvant radiotherapy.

  • Sprave et al. (2022) “Patient Reported Outcomes Based on EQ-5D-5L Questionnaires in Head and Neck Cancer Patients: A Real-World Study”, BMC Cancer 22:1236. DOI: 10.1186/s12885-022-10346-4. PMID: 36447175. Freiburg University; n = 366 H&N cancer patients undergoing modern RT; prospective real-world PRO study; mean EQ-5D-5L at end of RT = 0.830 (SD not reported for full cohort). Primary source for S2 (Definitive RT) and S4 (PORT) utility weights = 0.83. Corroborated by Sprave et al. (2020, n=49): mean EQ-5D at RT completion = 0.828.

  • Sprave et al. (2020) “Characterization of Health-Related Quality of Life Based on the EQ-5D-5L Questionnaire in Head-and-Neck Cancer Patients Undergoing Modern Radiotherapy”, Expert Rev Pharmacoecon Outcomes Res. 20(6):673–682. DOI: 10.1080/14737167.2020.1823220. PMID: 32912005. Freiburg University Medical Center; n = 49 H&N cancer patients (57% definitive, 41% adjuvant RT); mean EQ-5D-5L at RT completion = 0.828 (SD 0.16). Corroborates Sprave 2022; supports S4 utility revision. Note: no significant permanent HRQOL deterioration observed at 6-month follow-up.

  • Truong et al. / RTOG 0522 (2017) “Quality of Life and Performance Status from a Substudy Conducted Within a Prospective Phase 3 Randomized Trial of Concurrent Accelerated Radiation Plus Cisplatin With or Without Cetuximab for Locally Advanced Head and Neck Carcinoma: NRG Oncology RTOG 0522”, Int J Radiat Oncol Biol Phys. 97(4):687–699. PMC5303682. PMID: 27727066. DOI: 10.1016/j.ijrobp.2016.08.003. Prospective phase III RCT, n = 818 analyzable patients (oropharynx, hypopharynx, larynx; Stage III–IV); EQ-5D-3L collected at pretreatment, last 2 weeks of treatment, 3 months, and annually. Reported EQ-5D values: baseline ~0.79 (CIS 0.78, CET/CIS 0.80); at 3 months from treatment start ~0.775 (CIS 0.78, CET/CIS 0.77); at 1 year ~0.84. Note: the EQ-5D value for the ‘within last 2 weeks of treatment’ time point was collected but not reported in the paper — the authors identified omission of an acute FACT-HN assessment as a study limitation. The 3-month value (0.775) is used as the closest available proxy for active-treatment utility in S5 (POCRT). Secondary limitation: cohort is definitively treated oropharynx/larynx (not postoperative oral cavity POCRT); adopted as best available CRT utility source. Comparison with Sprave et al. (2022): the Sprave 2022 adjuvant CRT cohort (baseline HI = 0.849, stable to RT completion; CRT vs RT-alone not significantly different at RT completion, p = 0.624) would support a value of ~0.83 for S5. The RTOG 0522-derived value of 0.775 is retained as a conservative estimate reflecting the higher toxicity of POCRT (high-dose cisplatin, post-surgical) relative to the mixed Sprave 2022 cohort. A sensitivity analysis using 0.83 would reduce the Surgery vs. Def RT QALY difference by ~0.009 QALYs. Primary source for S5 (POCRT) utility = 0.775. Not the primary source for S2 (RT alone); see Sprave et al. (2022).

To leave a comment for the author, please follow the link and comment on their blog: R Works.

Continue reading: Stage II OSCC — Health Economics Model
Why Most Time Series Models Fail Before They Start https://www.r-bloggers.com/2026/04/why-most-time-series-models-fail-before-they-start/ Wed, 15 Apr 2026 21:00:00 +0000 https://mfatihtuzen.netlify.app/posts/2026-04-16_timeseries_stationary/

[This article was first published on A Statistician's R Notebook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

1 A model can run and still be fundamentally wrong

Many time series models fail before they even begin. Not because the software crashes. Not because the code is wrong. But because the data entering the model violate one of the most important assumptions in time series analysis: stationarity.

This is where many analyses quietly go off the rails. A model is estimated, forecasts are produced, coefficients look serious, and the graphs appear convincing. But the model may be chasing a moving target rather than learning a stable data-generating mechanism.

In this post, we will work with a real macroeconomic series rather than a toy example. The data come from the Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL), published by the U.S. Bureau of Labor Statistics and distributed through FRED. FRED describes CPIAUCSL as a monthly, seasonally adjusted price index and notes that percent changes in the index are commonly used to measure inflation.

Because live API access may fail in some institutional or offline environments, this workflow uses a locally downloaded CSV file instead of fetching the series on the fly. You can download the file directly from the CPIAUCSL page on FRED.

The goal is simple: show why raw time series levels often mislead us, what stationarity really means, and why transformations such as differencing and log-differencing are not cosmetic tricks but conceptual necessities.

2 What stationarity really means

In informal language, a stationary series is one whose behavior does not drift in a systematic way over time. More formally, a weakly stationary process (X_t) satisfies three conditions:

  • E[X_t] = μ for all t,
  • Var(X_t) = σ² for all t,
  • Cov(X_t, X_{t+k}) = γ(k) for all t and every lag k.

The first condition says the mean does not change over time. The second says the variance is constant. The third says the covariance between observations depends only on the lag k, not on calendar time itself.

This matters because a large part of classical time series modeling is built on the idea that the stochastic structure is stable. When that structure is drifting, many familiar tools become unreliable or at least much harder to interpret. A trending series can generate strong autocorrelation even when the underlying dynamic structure is weak. A persistent upward path can trick the analyst into seeing “model fit” where the model is merely inheriting inertia from the level of the series.

Put differently: without stationarity, a model may explain movement without actually explaining the mechanism.
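A short simulation (added here for illustration, not part of the original post) makes the contrast concrete: an AR(1) process with |φ| < 1 is weakly stationary, while a random walk is the canonical unit-root process.

```r
# Stationary AR(1) vs. a random walk: compare half-sample means.
# For the AR(1), both halves hover near the same level; for the
# random walk, the local mean typically drifts between halves.
set.seed(123)
n <- 500
ar1       <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
rand_walk <- cumsum(rnorm(n))

half_means <- function(x) c(first = mean(x[1:(n/2)]), second = mean(x[(n/2 + 1):n]))
half_means(ar1)        # both halves close to 0
half_means(rand_walk)  # halves usually far apart
```

Because `arima.sim()` and `cumsum(rnorm())` come from base R's stats package, no extra packages are needed for this check.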

3 Load the CPI data from a CSV file

Download the CSV file for CPIAUCSL from the official FRED series page and save it in your working directory with the name CPIAUCSL.csv. The file typically includes the columns observation_date and CPIAUCSL. FRED is the distribution platform, while the source agency for the series is the U.S. Bureau of Labor Statistics.

library(readr)
library(dplyr)
library(ggplot2)
library(tibble)
library(zoo)
library(scales)
library(patchwork)
library(tseries)

cpi_tbl <- read_csv("CPIAUCSL.csv", show_col_types = FALSE) %>%
  transmute(
    date = as.Date(observation_date),
    cpi  = as.numeric(CPIAUCSL)
  ) %>%
  arrange(date) %>%
  filter(!is.na(date), !is.na(cpi))

cpi_tbl %>% slice_head(n = 5)
# A tibble: 5 × 2
  date         cpi
  <date>     <dbl>
1 1947-01-01  21.5
2 1947-02-01  21.6
3 1947-03-01  22  
4 1947-04-01  22  
5 1947-05-01  22.0

The line filter(!is.na(date), !is.na(cpi)) is important. If your CSV has an NA for a month such as October 2025, that observation is safely excluded from the analysis instead of silently breaking the workflow.
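As a tiny made-up illustration (the values below are invented, not taken from the CPI file), the row carrying an NA is simply dropped:

```r
library(dplyr)
library(tibble)

demo <- tibble(
  date = as.Date(c("2025-09-01", "2025-10-01")),
  cpi  = c(322.1, NA)  # hypothetical values; October is missing
)

demo %>% filter(!is.na(date), !is.na(cpi))
# Only the 2025-09-01 row remains
```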

4 Start with the visual story, not the test statistic

In time series analysis, the first serious diagnostic is often visual rather than formal. That is not because tests are unimportant. It is because plots let us see the basic character of the data before we start compressing everything into a p-value.

If a series has a visible trend, changing volatility, sudden level shifts, or unusual gaps, that already tells us something about whether a stationary model is likely to behave well.

4.1 The raw CPI level

p_level <- ggplot(cpi_tbl, aes(x = date, y = cpi)) +
  geom_line(linewidth = 0.9, color = "#1B4965") +
  labs(
    title = "U.S. CPI (CPIAUCSL): level series",
    subtitle = "Monthly, seasonally adjusted index from FRED",
    x = NULL,
    y = "Index"
  ) +
  scale_y_continuous(labels = label_number()) +
  theme_minimal(base_size = 12)

p_level

Even before applying a formal statistical test, the visual pattern already tells us something important. The CPI level series does not oscillate around a stable mean; instead, it follows a persistent upward path over time. This alone raises an immediate warning against modeling the raw level series as if it were stationary.

The graph also suggests that the increase is not perfectly uniform across the entire sample. In some periods, the slope becomes steeper, indicating faster price growth, while in others the series evolves more gradually. In other words, the series appears to contain not only a long-run trend but also changes in inflation dynamics over time.

This is precisely why visual inspection should be the first step in time series analysis. Before looking at test statistics or fitting a model, we should ask a simpler question: does the series look like it fluctuates around a constant level? In this case, the answer is clearly no.

A smooth and steadily rising curve may look statistically innocent at first glance, but in practice it is often a sign that the raw series is carrying trend information that must be addressed before modeling.

4.2 Rolling summaries to deepen the visual diagnosis

A single line plot is useful, but local summaries make the visual argument sharper. Below, I compute a 24-month rolling mean and rolling standard deviation.

cpi_roll <- cpi_tbl %>%
  mutate(
    roll_mean_24 = zoo::rollmean(cpi, k = 24, fill = NA, align = "right"),
    roll_sd_24   = zoo::rollapply(cpi, width = 24, FUN = sd, fill = NA, align = "right")
  )

p_roll_mean <- ggplot(cpi_roll, aes(date, roll_mean_24)) +
  geom_line(linewidth = 0.9, color = "#2A9D8F") +
  labs(
    title = "24-month rolling mean of CPI",
    x = NULL,
    y = "Rolling mean"
  ) +
  theme_minimal(base_size = 12)

p_roll_sd <- ggplot(cpi_roll, aes(date, roll_sd_24)) +
  geom_line(linewidth = 0.9, color = "#E76F51") +
  labs(
    title = "24-month rolling standard deviation of CPI",
    x = NULL,
    y = "Rolling SD"
  ) +
  theme_minimal(base_size = 12)

p_roll_mean / p_roll_sd

If the series were approximately stationary, we would expect these rolling statistics to fluctuate around relatively stable levels over time. In particular, the rolling mean should remain close to a constant value, and the rolling standard deviation should not exhibit systematic shifts.

However, the evidence here points in the opposite direction. The rolling mean shows a clear and persistent upward drift, reinforcing what we observed in the raw series: the central tendency is not stable, but evolving over time.

The rolling standard deviation tells a more nuanced story. While it remains relatively moderate for long periods, there are noticeable fluctuations and, more importantly, a pronounced spike in recent years. This indicates that the variability of the series is not constant and may respond to underlying economic conditions or shocks.

Taken together, these two plots suggest that the series violates the key assumptions of stationarity—both in terms of mean and variance. While rolling statistics alone do not formally prove non-stationarity, they provide strong visual evidence that the raw series is not suitable for direct modeling without transformation.

5 Why raw CPI levels are a good example

CPI is ideal for illustrating this problem because the level series typically trends upward over time. That is not a defect in the data; it is what a price index often does. But from a modeling perspective, it creates trouble.

If the level keeps drifting upward, then the mean is not constant. If the size of movements changes as the level rises, the variance may also appear unstable. In such a setting, fitting a model directly to the raw series can mix long-run inflationary drift with short-run dynamic behavior.

Economically, analysts are usually not interested in the index level itself as much as they are interested in inflation, that is, the rate at which the price level changes. Statistically, this is convenient too, because transforming the series from levels to changes often brings it closer to stationarity.

6 A statistical check: the Augmented Dickey-Fuller test

Visual diagnosis matters, but it is usually not enough. A commonly used statistical tool is the Augmented Dickey-Fuller (ADF) test, which tests for the presence of a unit root. In practical terms, the test is often used to assess whether a series behaves like a non-stationary process with persistent stochastic trend.

The null hypothesis of the ADF test is that the series has a unit root. That means the burden of proof is asymmetric:

  • a large p-value means we do not have strong evidence against non-stationarity,
  • a small p-value means the data are more consistent with stationarity.

That distinction is easy to say and easy to misuse. Failing to reject the null is not the same thing as proving a series is non-stationary beyond all doubt. It simply means the test did not find enough evidence against the unit-root view.
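A quick sanity check on simulated data (an illustration added here, not part of the original analysis) shows the asymmetry in action:

```r
library(tseries)

set.seed(1)
wn <- rnorm(300)          # white noise: stationary by construction
rw <- cumsum(rnorm(300))  # random walk: has a unit root

adf.test(wn)  # typically a small p-value: evidence against a unit root
adf.test(rw)  # typically a large p-value: fails to reject the unit root
```

Note that `adf.test()` may warn that the p-value is smaller (or greater) than what it can print; the reported value is truncated at tabulated bounds.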

Let us start with the raw CPI level.

adf_level <- tseries::adf.test(cpi_tbl$cpi)
adf_level
    Augmented Dickey-Fuller Test

data:  cpi_tbl$cpi
Dickey-Fuller = -0.1813, Lag order = 9, p-value = 0.99
alternative hypothesis: stationary

The Augmented Dickey–Fuller (ADF) test provides a formal way to assess whether the series contains a unit root. The null hypothesis of the test is that the series is non-stationary (i.e., it has a unit root), while the alternative hypothesis is stationarity.

In this case, the p-value is extremely high (p ≈ 0.99), meaning that we fail to reject the null hypothesis. In other words, there is no statistical evidence to support that the CPI level series is stationary.

However, this result should not be interpreted in isolation. Statistical tests and visual diagnostics should complement each other. The high p-value is entirely consistent with what we observed earlier: the series exhibits a strong upward trend and does not fluctuate around a constant mean.

Taken together, both the visual evidence and the ADF test point to the same conclusion — the raw CPI level behaves more like a drifting (unit root) process than a stationary one. This reinforces the need for transforming the series before attempting any meaningful modeling.

7 The first rescue: differencing

One of the oldest and most important ideas in time series analysis is that differencing can remove certain forms of trend. The first difference is

ΔX_t = X_t − X_{t−1}

This transformation asks a different question. Instead of modeling the level, we model the change from one period to the next.

cpi_diff_tbl <- cpi_tbl %>%
  mutate(diff_cpi = c(NA, diff(cpi))) %>%
  filter(!is.na(diff_cpi))

p_diff <- ggplot(cpi_diff_tbl, aes(x = date, y = diff_cpi)) +
  geom_line(linewidth = 0.8, color = "#6D597A") +
  labs(
    title = "First difference of CPI",
    subtitle = "Absolute month-to-month change in the index",
    x = NULL,
    y = expression(Delta*CPI)
  ) +
  theme_minimal(base_size = 12)

p_diff

Taking the first difference removes a large part of the visible trend in the series. Compared to the raw CPI level, the differenced series fluctuates much more around a relatively stable center, which is an encouraging sign from a modeling perspective.

However, differencing does not fully solve the problem. While it helps stabilize the mean, the variability of the series still appears to change over time, particularly in more recent periods where larger fluctuations are observed. This suggests that the series may still violate the constant variance assumption.

There is also a more subtle but important issue: interpretation. The first difference represents absolute changes in the index, not relative ones. In macroeconomic data, a one-point increase in CPI does not carry the same meaning when the index is around 100 versus when it exceeds 300. As the scale of the series grows, the same absolute change reflects a smaller proportional movement.

In other words, differencing improves the statistical properties of the series, but it does not yet provide a fully consistent or interpretable measure of change. This is why we often go one step further and consider transformations based on relative (percentage) changes.
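A line of arithmetic (added for illustration) makes the scale-dependence concrete:

```r
# The same one-point change in the index, expressed as a percentage:
100 * 1 / 100  # at an index level of 100: a 1% move
100 * 1 / 300  # at an index level of 300: only ~0.33%
```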

8 The more meaningful rescue: log differences

This is where the log transformation becomes more than a technical detail. Consider

Δ log X_t = log X_t − log X_{t−1} = log(X_t / X_{t−1})

For moderate changes, this is approximately the proportional growth rate. In the CPI context, it moves us from the language of index levels toward the language of inflation.

That shift is both statistical and economic.
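A quick numeric check of the approximation (added here for illustration):

```r
# log(X_t / X_{t-1}) is close to the exact growth rate for small changes:
log(102 / 100)  # ~0.0198, vs. exact growth of 0.02 (good approximation)
log(300 / 100)  # ~1.0986, vs. exact growth of 2.00 (approximation breaks down)
```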

cpi_log_tbl <- cpi_tbl %>%
  mutate(
    log_cpi = log(cpi),
    dlog_cpi = c(NA, diff(log_cpi)),
    annualized_inflation_pct = 1200 * dlog_cpi,
    yoy_inflation_pct = 100 * (cpi / lag(cpi, 12) - 1)
  )

p_dlog <- cpi_log_tbl %>%
  filter(!is.na(annualized_inflation_pct)) %>%
  ggplot(aes(x = date, y = annualized_inflation_pct)) +
  geom_line(linewidth = 0.8, color = "#D62828") +
  labs(
    title = "Monthly log-difference of CPI (annualized)",
    subtitle = "A close cousin of short-run inflation",
    x = NULL,
    y = "Percent"
  ) +
  theme_minimal(base_size = 12)

p_yoy <- cpi_log_tbl %>%
  filter(!is.na(yoy_inflation_pct)) %>%
  ggplot(aes(x = date, y = yoy_inflation_pct)) +
  geom_line(linewidth = 0.8, color = "#F4A261") +
  labs(
    title = "Year-over-year CPI inflation",
    subtitle = "A slower-moving inflation measure",
    x = NULL,
    y = "Percent"
  ) +
  theme_minimal(base_size = 12)

p_dlog / p_yoy

Two key insights emerge from these transformations.

First, moving from levels to rates of change fundamentally improves interpretability. The log-difference series represents approximate percentage changes — in this context, a close proxy for short-run inflation. This is the quantity economists actually care about. A 1% increase has the same meaning regardless of whether the index is at 100 or 300, making comparisons over time much more meaningful.

Second, the transformation has a clear impact on the statistical properties of the series. Compared to the raw level and even the first difference, the log-differenced series fluctuates more consistently around a stable mean. While it still exhibits volatility spikes and occasional outliers, the overall behavior is much closer to what we would expect from a stationary process.

The comparison between the two plots is also instructive. The monthly log-difference captures short-term fluctuations and reacts quickly to shocks, while the year-over-year inflation series smooths out this noise and highlights longer-term inflation dynamics. Both are useful, but they answer different questions.

To put it bluntly: you did not just transform the data — you changed the question.

9 Re-test after transformation

Let us apply the ADF test again, this time to the log-differenced series.

adf_dlog <- cpi_log_tbl %>%
  filter(!is.na(dlog_cpi)) %>%
  pull(dlog_cpi) %>%
  tseries::adf.test()

adf_dlog
    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -4.3862, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary

The contrast between the two ADF test results is striking and highly informative.

For the raw CPI level, we failed to reject the null hypothesis of a unit root, indicating that the series behaves as a non-stationary process. In contrast, for the log-differenced series, the p-value drops to around 0.01, allowing us to reject the null hypothesis and conclude that the transformed series is consistent with stationarity.

This shift is not just a technical detail — it reflects a fundamental change in how the data behaves. The transformation has effectively removed the persistent trend component and brought the series closer to a stable statistical structure.

That said, the test result should always be interpreted alongside the visual evidence. The ADF test provides formal confirmation, but the intuition comes from the plots. What we saw visually — a drifting level series versus a mean-reverting transformed series — is now supported by statistical testing.

In essence, the workflow comes full circle:
we start with a problematic series, diagnose the issue visually, apply a transformation, and then verify the improvement formally.

This is the core of time series thinking.

10 A subtle but crucial point: transformation changes interpretation

This is the point where many explanations remain superficial.

When you difference a series, you are not merely “cleaning” it — you are redefining the object of analysis.

  • Modeling CPI levels asks how the price index evolves over time.
  • Modeling first differences asks how much the index changes from one period to the next.
  • Modeling log differences asks about proportional change, which is directly linked to inflation.

These are not equivalent statistical questions, and they are certainly not equivalent economic questions.

This is why time series preprocessing should never be treated as a mechanical step. Every transformation involves a trade-off: it improves certain statistical properties while simultaneously altering the meaning of the data.

Understanding that trade-off is not optional — it is central to sound time series analysis.

11 Why this matters for ARIMA-style modeling

ARIMA models are often presented as if the workflow were mechanical: inspect the series, difference if needed, identify orders, estimate parameters, check residuals, and forecast. While this workflow is useful, it can create the misleading impression that differencing is simply a procedural step — a box to tick.

It is not.

Differencing is a deliberate modeling choice. Its purpose is to separate persistent, trend-like behavior from shorter-run dynamics. If you skip it when it is needed, your model may inherit non-stationarity and produce unreliable or misleading inference. If you apply it excessively, you risk removing meaningful structure and end up modeling noise.

The real question, therefore, is not “Should I difference?” but rather:
What feature of the data am I trying to stabilize, and what question do I want the model to answer?

12 A compact comparison


Series version        | What it represents               | Typical issue                    | When it helps (and when it does not)
CPI level             | The price index itself           | Strong trend, likely unit root   | Poor starting point for stationary modeling
First difference      | Absolute period-to-period change | Still scale-dependent            | Reduces trend, but interpretation remains limited
Log difference        | Approximate proportional change  | May still show volatility bursts | More suitable for modeling inflation-type dynamics
Year-over-year change | Annual percentage change         | Smoother, less responsive        | Useful for communication, less suited for short-run analysis

14 Common mistakes

Most mistakes in time series analysis are not computational — they are conceptual.

Mistake 1: fitting models directly to raw levels because the plot “looks smooth.”
Smoothness is not stationarity. A strong trend can produce visually smooth series that are statistically problematic.

Mistake 2: treating differencing as a harmless default.
Differencing changes the meaning of the data. It may improve statistical properties while quietly reducing interpretability if applied without care.

Mistake 3: relying on a single test result as final truth.
The ADF test is useful, but it is only one piece of evidence. Visual inspection, domain knowledge, structural breaks, and alternative tests all matter.

Mistake 4: forgetting the economics.
In the case of CPI, the focus is typically on inflation, not the index level itself. A good transformation is one that improves statistical validity while remaining aligned with the economic question.

Taken together, these mistakes point to a simple lesson:
time series analysis is not about applying steps — it is about making informed choices.

15 Final thoughts

Most time series models do not fail because we cannot estimate them. They fail because we model the wrong object.

The raw CPI series is a clear reminder that not every observed series is ready for modeling. A trending index is rarely an appropriate input for a stationary model. Once we difference — and especially log-difference — the data, the series becomes more interpretable, more stable, and much closer to the type of process that classical time series methods are designed to handle.

So before asking whether your model is sophisticated enough, ask a more fundamental question:

Am I modeling a stable process — or just chasing drift?

In many cases, the answer to this question matters far more than whether you choose AR(1), ARIMA(1,1,1), or any other fashionable specification.

16 References and further reading

16.1 Data sources

  • U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL), monthly, seasonally adjusted; distributed through FRED (Federal Reserve Bank of St. Louis).

16.2 Core time series references

  • Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. Wiley.

  • Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.).
    https://otexts.com/fpp3/

  • Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.


16.3 Stationarity and unit root testing

  • Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association.

  • Said, S. E., & Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika.


16.4 Transformations and interpretation

  • Stock, J. H., & Watson, M. W. (2019). Introduction to Econometrics. Pearson.

  • Tsay, R. S. (2010). Analysis of Financial Time Series. Wiley.


16.5 Practical R resources

  • Documentation for the R packages used in this post: readr, dplyr, ggplot2, zoo, patchwork, and tseries.

16.6 Suggested next steps for readers

If you want to go deeper, consider exploring:

  • Unit root tests beyond ADF (KPSS, Phillips–Perron)
  • Structural breaks and regime changes
  • Seasonal differencing and SARIMA models
  • Volatility modeling (ARCH/GARCH)

These topics build directly on the ideas discussed in this article and will deepen your understanding of time series behavior.

To leave a comment for the author, please follow the link and comment on their blog: A Statistician's R Notebook.

Continue reading: Why Most Time Series Models Fail Before They Start
logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines/ Wed, 15 Apr 2026 08:05:50 +0000 https://guillaumepressiat.github.io/blog/2026/04/logrittr

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


dplyr verbs are descriptive: let’s make them more verbose!

Yet another pipe for R.






Motivation

In SAS, every DATA step prints a log:

NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 7153 observations were deleted.
NOTE: The data set WORK.RESULT has 112847 observations and 11 variables.

R’s dplyr pipelines are silent. logrittr fills that gap with %>=%, a drop-in pipe that logs row counts, column counts, added/dropped columns, and timing at every step, with no function masking.

With Fira Code ligatures, %>=% renders as a single wide arrow that looks like %>% with an underline added: like a subtitle, it lets you read between the lines of a pipeline and see what happened.

Multiple contexts

Things happen:

NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 120000 observations were deleted.
NOTE: The data set WORK.RESULT has 0 observations and 11 variables.

“This is where we lost all the rows during script execution.”

Pros

Reading this log long after a script has run helps you to:

  • see what happened at each stage of data processing without rerunning the code, for example in a production environment where the input data is constantly changing
  • monitor key processes
  • explain what happened (during an audit, for example)

In professional contexts this is often needed.

Educational

A console log also makes pipelines clearer for those with little experience of the tidyverse: people taking their first steps in programming by following a tutorial or teaching themselves.

Installation

install.packages('logrittr', repos = 'https://guillaumepressiat.r-universe.dev')

# or from github
# devtools::install_github("GuillaumePressiat/logrittr")

See github or r-universe.

Usage

library(logrittr)
library(dplyr)

iris %>=%
  as_tibble() %>=%
  filter(Sepal.Length < 5)  %>=%
  mutate(rn = row_number()) %>=%
  semi_join(
    iris %>% as_tibble() %>=%
      filter(Species == "setosa"),
    by = "Species"
  )  %>=%
  group_by(Species) %>=%
  summarise(n = n_distinct(rn))
── iris  [rows:       150  cols:    5] ─────────────────────────────────────────────────────
ℹ as_tibble()                            rows:       150 +0        cols:    5 +0    [   0.0 ms]
ℹ filter(Sepal.Length < 5)               rows:        22 -128      cols:    5 +0    [   3.0 ms]
ℹ mutate(rn = row_number())              rows:        22 +0        cols:    6 +1    [   1.0 ms]
  added: rn
ℹ > filter(Species == "setosa")          rows:        50 -100      cols:    5 +0    [   1.0 ms]
ℹ semi_join(iris %>% as_tibble() %>=%    rows:        20 -2        cols:    6 +0    [   5.0 ms]
  filter(Species == "setosa"), by =
  "Species")
ℹ group_by(Species)                      rows:        20 +0        cols:    6 +0    [   3.0 ms]
ℹ summarise(n = n_distinct(rn))          rows:         1 -19       cols:    2 -4    [   2.0 ms]
  dropped: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, rn
  added: n

Screenshot


library(dplyr)
library(logrittr)

logrittr_options(lang = "en", big_mark = ",", wrap_width = NULL, max_cols = 3)

nycflights13::flights %>=% 
  as_tibble() %>=%
  group_by(year, month, day) %>=% 
  count() %>=% 
  tidyr::pivot_wider(values_from = "n", names_from = "day") %>=% 
  glimpse()

tidylog is a really neat package that motivated this one. However, tidylog works by masking dplyr functions, which is not ideal to me.

This was also an opportunity for me to test one of the new programming tools that are widely used at the moment.

logrittr uses a custom pipe operator and never touches the dplyr namespace. Its console output is colorful and informative thanks to the cli package.

Working with lumberjack

If you already know the lumberjack package, logrittr is compatible with it (timings are approximate).

Calling logrittr_logger$new():

library(lumberjack)
library(dplyr)

l <- logrittr_logger$new(verbose = TRUE)
logfile <- tempfile(fileext=".-r.log.csv")

iris %L>%
   start_log(log = l, label = "iris step") %L>%
   as_tibble() %L>%
   filter(Sepal.Length < 5) %L>%
   mutate(rn = row_number()) %L>%
   group_by(Species) %L>%
   summarise(n = n_distinct(rn)) %L>%
   dump_log(file=logfile, stop = FALSE)
   

mtcars %>% 
  start_log(log = l, label = "mtcars step") %L>%
   count() %L>%
   dump_log(file=logfile, stop = TRUE)


logdata <- read.csv(logfile)

This writes the logrittr log content of multiple data steps to the same CSV file.

Limitations

  • Like tidylog, logrittr only works with dplyr pipelines on in-memory R data.frames; it cannot handle dbplyr pipelines on databases (remote/lazy tables).

  • The join cardinalities that tidylog reports so nicely are difficult to obtain from the pipe, since the join has already happened by the time we see it; for now we only show how row and column counts evolve (before/after).

  • Yes, it is another pipe, which is not ideal. We can dream of a with_logging(TRUE) context that would activate logrittr’s behaviour inside |> or %>%.

Take another pipe for a spin

logrittr prioritizes the user experience with a structured and colorful display in the console.

For now, this package is just a proof of concept that gave me a chance to experiment a bit with the cli package and a few other things. But I think there’s a need for this in R, in a specific area where SAS outputs are so informative.

To leave a comment for the author, please follow the link and comment on their blog: Guillaume Pressiat.

Dealing with correlation in designed field experiments: part II https://www.r-bloggers.com/2026/04/dealing-with-correlation-in-designed-field-experiments-part-ii-4/ Wed, 15 Apr 2026 00:00:00 +0000 https://www.statforbiology.com/2026/correlation/

[This article was first published on R on Fixing the bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

With field experiments, studying the correlation between the observed traits may not be an easy task. For example, we can consider a genotype experiment, laid out in randomised complete blocks, with 27 wheat genotypes and three replicates, where several traits were recorded, including yield (Yield) and weight of thousand kernels (TKW). We might be interested in studying the correlation between those two traits, but we would need to face two fundamental problems:

  1. the concept of correlation in such a setting is not unique, as we might either consider the correlation between the plot measurements, or the correlation between the residuals or the correlation between genotype means or the correlation between block means;
  2. the experimental units are not independent, but are grouped by genotype and block, which invalidates all inferences based on the independence assumption.

I have dealt with these two problems (particularly the first one) in a previous post, where I gave a solution based on traditional methods of data analyses.

In this post, I would like to present a more advanced solution, advocated by Hans-Peter Piepho in a relatively recent manuscript (Piepho, 2018). Such a solution is based on mixed models and was implemented in SAS, using PROC MIXED. I spent a few hours ‘transporting’ those models to R, which turned out to be a difficult task, although, in the end, I seem to have come to an acceptable solution, which I would like to share here.

First of all, we can load the ‘WheatQuality’ dataset, that is available in the ‘statforbiology’ package; it consists of 81 records (plots) of 6 variables, i.e. the Genotype and Block factors, as well as the four responses height, TKW, weight per hectolitre and yield. The code below loads the necessary packages, the data and transforms the numeric variable ‘Block’ into a factor.

rm(list = ls())
library(statforbiology)
library(dplyr)
library(tidyr)
library(sommer)
library(nlme)
#
# Loading data
dataset <- getAgroData("WheatQuality") |>
  mutate(Block = factor(Block),
         Genotype = factor(Genotype))
head(dataset)
##     Genotype Block Height  TKW Whectol Yield
## 1 arcobaleno     1     90 44.5    83.2 64.40
## 2 arcobaleno     2     90 42.8    82.2 60.58
## 3 arcobaleno     3     88 42.7    83.1 59.42
## 4       baio     1     80 40.6    81.8 51.93
## 5       baio     2     75 42.7    81.3 51.34
## 6       baio     3     76 41.1    81.1 47.78

Piepho (2018) showed that, for an experiment like this one, all the correlation coefficients can be estimated by coding a multi-response mixed model, as follows:

\[ Y_{ijk} = \mu_i + \beta_{ik} + \tau_{ij} + \epsilon_{ijk}\]

where \(Y_{ijk}\) is the response for the trait \(i\), the genotype \(j\) and the block \(k\), \(\mu_i\) is the mean for the trait \(i\), \(\beta_{ik}\) is the effect of the block \(k\) and trait \(i\), \(\tau_{ij}\) is the effect of genotype \(j\) for the trait \(i\) and \(\epsilon_{ijk}\) is the residual for the trait \(i\), the genotype \(j\) and the block \(k\).

In the above model, the residuals \(\epsilon_{ijk}\) need to be normally distributed and heteroscedastic, with trait-specific variances. Furthermore, residuals belonging to the same plot (the two observed values for the two traits) need to be correlated (correlation of errors).

Hans-Peter Piepho, in his paper, put forward the idea that the ‘genotype’ and ‘block’ effects for the two traits can be taken as random, which makes sense, because, for this application, we are mainly interested in variances and covariances. Both random effects (for the genotype and for the block terms) need to be heteroscedastic (trait-specific variance components) and there must be a correlation between the two traits.

It should be noted that, for other applications, the genotype and block effects (especially the former) might be better regarded as fixed, but we will not pursue such an idea in this post.

Fitting a bivariate model

To the best of my knowledge, there is no way to fit such a complex model with the ‘nlme’ package and the related ‘lme()’ function (I’ll give a hint later on for a simpler model). In a previous post at this link, I gave a solution based on the ‘asreml’ package (Butler et al., 2018), but this is a paid option. More recently I discovered the ‘sommer’ package (Covarrubias-Pazaran, 2016), which seems very useful and suitable for the requirements of this post. The key function of ‘sommer’ is mmer() and, in order to fit the above model, we need to specify the following components.

  1. The response variables. When we set a bivariate model with ‘sommer’, we can ‘cbind()’ Yield and TKW.
  2. The fixed model, which contains no effects but the intercept (by default, the means for the two traits are estimated separately, as in Piepho, 2018).
  3. The random model, which is composed of the ‘genotype’ and ‘block’ effects. For both, I specified a general unstructured variance-covariance matrix, so that we can estimate two different variance components (one per trait) and one covariance component. The resulting coding is ‘~ vsr(usr(Genotype)) + vsr(usr(Block))’.
  4. The residual structure, where the two observations in the same plot are heteroscedastic and correlated. This structure is fitted by default and it does not require any specific coding.

The model call is:

mod.bimix <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(usr(Genotype)) + vsr(usr(Block)),
                   rcov = ~ vsr(units),  
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
bimix.obj <- summary(mod.bimix)
coefs <- bimix.obj$varcomp
coefs
##                           VarComp  VarCompSE     Zratio Constraint
## u:Genotype.Yield-Yield 77.6342608 22.0978545  3.5132035   Positive
## u:Genotype.Yield-TKW   38.8320973 15.0930429  2.5728475   Unconstr
## u:Genotype.TKW-TKW     53.8613303 15.3539585  3.5079768   Positive
## u:Block.Yield-Yield     3.7104682  3.9363372  0.9426195   Positive
## u:Block.Yield-TKW      -0.2428322  1.9074202 -0.1273092   Unconstr
## u:Block.TKW-TKW         1.6684549  1.8343512  0.9095613   Positive
## u:units.Yield-Yield     6.0939217  1.1951482  5.0988836   Positive
## u:units.Yield-TKW       0.1635821  0.7242898  0.2258518   Unconstr
## u:units.TKW-TKW         4.4718011  0.8770118  5.0989065   Positive

The box above shows the results about the variance-covariance parameters. In order to get the correlations, I used the delta method, as implemented in the gnlht() function of the ‘statforbiology’ package (the accompanying package for this blog). First of all I extracted the variance parameters together with the covariance matrix for the variance parameters from the mixed model object. For simplicity, I assigned simple names to the coefficients (V1, V2, … Vn), according to their ordering in model output.

# Correlation between genotype means
coefsVec <- coefs[,1]
vcovMat <- mod.bimix$sigmaSE # Variance-covariance for varcomp
names(coefsVec) <- paste("V", 1:9, sep = "")
gnlht(coefsVec, func = list(~ V2 / (sqrt(V1)*sqrt(V3) )),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form  Estimate        SE  Z-value      p-value
## 1 V2/(sqrt(V1) * sqrt(V3)) 0.6005174 0.1306699 4.595684 4.313326e-06
#
# Correlation between block means
gnlht(coefsVec, func = list(~ V5 / (sqrt(V4)*sqrt(V6) ) ),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form    Estimate        SE   Z-value   p-value
## 1 V5/(sqrt(V4) * sqrt(V6)) -0.09759658 0.7571256 0.1289041 0.8974335
#
# Correlation of residuals
gnlht(coefsVec, func = list(~ V8 / (sqrt(V7)*sqrt(V9) )),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form   Estimate        SE   Z-value   p-value
## 1 V8/(sqrt(V7) * sqrt(V9)) 0.03133619 0.1385421 0.2261854 0.8210572

We see that the estimates are very close to those obtained by using the Pearson’s correlation coefficients (see my previous post). The advantage of this mixed model solution is that we can also test hypotheses in a relatively reliable way. For example, we can look at the Wald tests in the output above to judge about the significance of correlations and conclude that only the genotype means are significantly correlated to one another.
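To make the Wald test explicit: it is simply the estimate divided by its delta-method standard error, compared against a standard normal. Using the values from the gnlht() output above for the genotype-level correlation:

```r
# Wald z-test for the genotype-level correlation
# (estimate and SE taken from the gnlht() output above)
est <- 0.6005174   # estimated correlation
se  <- 0.1306699   # delta-method standard error
z   <- est / se              # ~4.60, matching the reported Z-value
p   <- 2 * pnorm(-abs(z))    # two-sided p-value, ~4.3e-06
```

The block-level and residual correlations fail this test in the same way, which is why only the genotype-level correlation is declared significant.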

A solution with ‘lme()’

Considering that the block means are not correlated, if we were willing to take the ‘block’ effect as fixed, we could fit a bivariate mixed model also with the ‘nlme’ package and the function lme() (Pinheiro and Bates, 2000). However, we should cast the model as a ‘univariate’ model.

To this aim, the two variables (Yield and TKW) need to be piled up and a new factor (‘Trait’) needs to be introduced to identify the observations for the two traits. Another factor is also necessary to identify the different plots, i.e. the observational units. To perform such a restructuring, I used the pivot_longer() function in the ‘tidyr’ package (Wickham et al., 2024) and assigned the name ‘Y’ to the response variable, that is now composed by the two original variables Yield and TKW, one on top of the other.

dataset$Plot <- 1:81
mdataset <- dataset |>
  select(-Whectol, -Height) |>
  pivot_longer(names_to = "Trait", values_to = "Y", cols = c("Yield", "TKW")) |>
  mutate(Trait = factor(Trait))
head(mdataset)
## # A tibble: 6 × 5
##   Genotype   Block  Plot Trait     Y
##   <fct>      <fct> <int> <fct> <dbl>
## 1 arcobaleno 1         1 Yield  64.4
## 2 arcobaleno 1         1 TKW    44.5
## 3 arcobaleno 2         2 Yield  60.6
## 4 arcobaleno 2         2 TKW    42.8
## 5 arcobaleno 3         3 Yield  59.4
## 6 arcobaleno 3         3 TKW    42.7
tail(mdataset)
## # A tibble: 6 × 5
##   Genotype Block  Plot Trait     Y
##   <fct>    <fct> <int> <fct> <dbl>
## 1 vitromax 1        79 Yield  54.4
## 2 vitromax 1        79 TKW    41.6
## 3 vitromax 2        80 Yield  51.0
## 4 vitromax 2        80 TKW    43.6
## 5 vitromax 3        81 Yield  48.8
## 6 vitromax 3        81 TKW    43.1

The fixed model is:

Y ~ Trait*Block

In order to introduce a totally unstructured variance-covariance matrix for the random effect, I used the ‘pdMat’ construct:

random = list(Genotype = pdSymm(~Trait - 1))

Regarding the residuals, heteroscedasticity can be included by using the ‘weight’ argument and the ‘varIdent’ variance function, which allows a different standard deviation per trait:

weight = varIdent(form = ~1|Trait)

I also allowed the residuals to be correlated within each plot, by using the ‘correlation’ argument and specifying a general ‘corSymm()’ correlation form (with only two observations per plot, this is equivalent to the compound-symmetry form ‘corCompSymm()’ used in the final call). Plots are nested within genotypes, thus I used a nesting operator, as follows:

correlation = corSymm(form = ~1|Genotype/Plot)

The final model call is:

mod <- lme(Y ~ Trait*Block, 
                 random = list(Genotype = pdSymm(~Trait - 1)),
                 weight = varIdent(form=~1|Trait), 
                 correlation = corCompSymm(form=~1|Genotype/Plot),
                 data = mdataset)

Retrieving the variance-covariance matrices needs some effort, as the function ‘getVarCov()’ does not appear to work properly with this multi-stratum model. First of all, we can retrieve the variance-covariance matrix for the genotype random effect (G) and the corresponding correlation matrix.

#Variance structure for random effects
obj <- mod
G <- matrix( as.numeric(getVarCov(obj)), 2, 2 )
G
##          [,1]     [,2]
## [1,] 53.86053 38.83124
## [2,] 38.83124 77.63402
cov2cor(G)
##           [,1]      [,2]
## [1,] 1.0000000 0.6005096
## [2,] 0.6005096 1.0000000

Second, we can retrieve the ‘conditional’ variance-covariance matrix (R), that describes the correlation of errors:

#Conditional variance-covariance matrix (residual error)
V <- corMatrix(obj$modelStruct$corStruct)[[1]] #Correlation for residuals
sds <- 1/varWeights(obj$modelStruct$varStruct)[1:2]
sds <- obj$sigma * sds #Standard deviations for residuals (one per trait)
R <- t(V * sds) * sds #Going from correlation to covariance
R
##           [,1]      [,2]
## [1,] 6.0939152 0.1634968
## [2,] 0.1634968 4.4718077
cov2cor(R)
##            [,1]       [,2]
## [1,] 1.00000000 0.03131984
## [2,] 0.03131984 1.00000000

The total correlation matrix is simply obtained as the sum of G and R:

Tr <- G + R
cov2cor(Tr)
##          [,1]     [,2]
## [1,] 1.000000 0.555787
## [2,] 0.555787 1.000000

We see that the same results can be obtained by using ‘sommer’ and regarding the block effect as fixed, although the coding below is much neater!

mod.bimix5 <- mmer(cbind(Yield, TKW) ~ Block,
                   random = ~ vsr(usr(Genotype)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
mod.bimix5$sigma
## $`u:Genotype`
##          Yield      TKW
## Yield 77.63425 38.83209
## TKW   38.83209 53.86133
## 
## $units
##           Yield       TKW
## Yield 6.0939217 0.1635824
## TKW   0.1635824 4.4718011

Hope this was useful… should you have any better solutions, I’d be happy to learn them; please, drop me a line at the address below. Thanks for reading and happy coding!

And … don’t forget to check out my new book!

Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it

Book cover


References

  1. Butler, D., Cullis, B.R., Gilmour, A., Gogel, B., Thomson, R., 2018. ASReml-r reference manual - version 4. UK.
  2. Covarrubias-Pazaran, G., 2016. Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer. PLOS ONE 11, e0156744. https://doi.org/10.1371/journal.pone.0156744
  3. Piepho, H.-P., 2018. Allowing for the structure of a designed experiment when estimating and testing trait correlations. The Journal of Agricultural Science 156, 59–70.
  4. Pinheiro, J.C., Bates, D.M., 2000. Mixed-effects models in s and s-plus. Springer-Verlag Inc., New York.
  5. Wickham, H., Vaughan, D., Girlich, M., 2024. tidyr: Tidy Messy Data. R package version 1.3.1. https://doi.org/10.32614/CRAN.package.tidyr, https://CRAN.R-project.org/package=tidyr

This post was originally published on 2025-02-10 and updated on 2026-04-15

To leave a comment for the author, please follow the link and comment on their blog: R on Fixing the bridge between biologists and statisticians.

Programming with LLMs in R & Python https://www.r-bloggers.com/2026/04/programming-with-llms-in-r-python/ Tue, 14 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/programming-llms-r-python/

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Working with LLMs in Practice

Large Language Models are becoming part of everyday data science work. But using them through chat interfaces is only one part of the picture.

In this upcoming webinar, we focus on how to work with LLMs programmatically, using R and Python to integrate them into real workflows and applications.

Secure your place by registering through the webinar registration form

What We’ll Cover

We begin with a short introduction to how LLMs work, including how they are priced, where they perform well, and where they can fall short.

From there, the session moves into practical examples of working with LLMs in code:

  • Sending prompts to an LLM API from R using the {ellmer} package
  • Including additional instructions through system prompts
  • Structuring prompts to return clean, tabular outputs
  • Summarising images and PDFs using LLMs

While the examples will focus primarily on R, we will also briefly explore the {chatlas} package for Python, which offers similar functionality.
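As a flavour of what programmatic access looks like, here is a minimal sketch using {ellmer}. It assumes the package is installed, that an OpenAI API key is set in your environment, and the model name is only an example; the exact prompts and models covered in the webinar may differ.

```r
library(ellmer)

# Create a chat object; the system prompt steers every subsequent response
chat <- chat_openai(
  model = "gpt-4o-mini",  # example model name; substitute your own
  system_prompt = "You are a concise R tutor. Answer in one sentence."
)

# Send a prompt and print the model's reply to the console
chat$chat("What does the %in% operator do in R?")
```

The same chat object keeps the conversation state, so repeated `chat$chat()` calls build on earlier turns, which is what makes it usable inside scripts and pipelines.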

Why This Matters

Using LLMs through chat tools is useful for exploration, but it has limits.

For data scientists and developers, the value comes from:

  • Automating repetitive tasks

  • Embedding LLMs into applications and pipelines

  • Generating structured outputs that can be reused downstream

This webinar focuses on that shift, from interactive use to integration in code.

Who Should Attend

This webinar is suitable for:

  • Data scientists working with R or Python

  • Developers interested in integrating AI into applications

  • Teams exploring how to move from experimentation to production

No prior experience with LLM APIs is required, but familiarity with R or Python will be helpful.

Webinar Details

  • Date: 23rd April 2026
  • Time: 1:15 PM (UK time)
  • Location: Online
  • Cost: Free

Speaker

The session will be led by Myles Mitchell, Principal Data Scientist at Jumping Rivers.

If you would like to explore these topics further, our 6-hour course, LLM-Driven Applications with R and Python, covers:

  • Building LLM-powered dashboards
  • Setting up retrieval-augmented generation (RAG) pipelines
  • Responsible use of AI

Join Us

LLMs are quickly becoming part of the standard toolkit for data science.

Understanding how to use them programmatically opens up far more possibilities than using them through chat alone.

This session is designed to give you a clear starting point.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Marathon Man II: how to pace a marathon https://www.r-bloggers.com/2026/04/marathon-man-ii-how-to-pace-a-marathon/ Tue, 14 Apr 2026 12:30:00 +0000 https://quantixed.org/?p=3760
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

It’s often the way. I posted recently about how to pace a marathon and very quickly received feedback that would’ve improved the original post. Oh well, no going back. This is take two.

So, we have a dataset of all runners from the 2025 New York City Marathon. We want to know how you should pace a marathon. What is the best strategy?

Determining your optimal pace is complex. There’s the theoretical pace that you can achieve – a mix of biomechanics, physiology and training – but it can be very hard to know what this pace is. Anyway, this theoretical pace is what you could achieve when all goes well. You need to factor in the conditions on the day – how you slept, how you fuel, mental attitude, is it windy? can you get in a group and work with others? and so on. A runner may toe the line in the shape to run a sub 3 h marathon, but by the 30 km mark, the story may be very different.

In the last post, we saw that positive splitting (otherwise known as slowing down) is inevitable. So it seems the best strategy is to start out faster than your goal pace and bank some time to account for the fade.

A reader responded with this insightful comment:

What I question, though, is whether a (very thorough) analysis of how marathons get run tells us much about how they should be run? This seems to be saying, “Forget about an optimal pace, here’s how to compensate for the sub-optimal pace you’re going to run despite your plans.”

This is correct. Any post hoc analysis like this can only tell us how the marathon was run, not how it should be run. This is because we don’t know the intention of any runner in the dataset. If we did, then we would know how a runner intended to run the race (i.e. what their pacing strategy was) and then we could ask: did that work out for them?

If only we knew their intention… hmm…

The idea

The sub-3 marathon is one of the big goals in running. That is, trying to run it in less than 3 hours. So we know that there are a bunch of runners in the dataset trying to do just that. We know the finish times too. So by definition, the runners finishing between 02:55:00 and 03:00:00 were the folks shooting for sub-3 and who achieved it, while those finishing between 03:00:00 and 03:05:00 were those who didn’t make it. Sure, there will be some in this window who were hoping for 02:50:00 and failed and some who were hoping to do 03:10:00 and ran amazingly well. But by narrowing the window to 5 min either side of 3 h, we have fewer of those than if we took 10 min either side.

If we assume that runners in the 02:55:00 to 03:05:00 finishing window intended to run for a finish time of 3 h, we can analyse how they paced the marathon and how it worked out for them.
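Carving out this window can be sketched with dplyr. The `runners` tibble and its `finish_s` column (finish time in seconds) are placeholders, not the actual columns of the NYC Marathon dataset:

```r
library(dplyr)

# Placeholder data: one row per runner, finish time in seconds
runners <- tibble(
  runner_id = 1:4,
  finish_s  = c(10380, 10620, 10980, 11220)  # 02:53, 02:57, 03:03, 03:07
)

# Keep the 02:55:00-03:05:00 window and label the outcome
window <- runners |>
  filter(finish_s >= 2 * 3600 + 55 * 60,    # at or after 02:55:00
         finish_s <  3 * 3600 + 5 * 60) |>  # before 03:05:00
  mutate(outcome = if_else(finish_s < 3 * 3600, "Sub 3", "Over 3"))
```

Here the first and last runners fall outside the window and are dropped; the remaining two are labelled "Sub 3" and "Over 3" respectively.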

Analysing this window also has the advantage that runners of this calibre know how to pace well, compared to those trying for 03:30:00 or 04:00:00. There are also plenty of them, given that it is a popular goal.

So let’s take a look. Plots first and then the code below.

Going for sub-3

We’ll use the difference from goal pace to visualise runners’ progress. The goal pace here is ~04:15/km. Below 0 means running ahead of pace (banking time) and above 0 means being behind schedule.

We colour the runners by whether they made it, sub-3 (red), or failed and went over 3 (blue).

770 runners were sub 3 whereas 628 were over 3. This can be difficult to see, so let’s take a different view.

For both outcomes we have several runners who were clearly shooting for a faster time, but something went wrong and they ended up in our window. They appear as U-shapes in the difference plots. Rather than remove them, we’ll accept these contaminants and assume that most folks in this window are shooting for a 3 h finish.

We can see different pacing as the race progresses for different runners. Some folks are behind schedule but end up making sub-3, others are ahead of time and fail. To answer our question we need to know: what is the best strategy?

On-pace, positive split, negative split?

We’ll take 10 km as our marker point. It’s almost one-quarter of the way through. Any excitement of the start, with all the crowds messing up pacing, is done, and we can see at this point who is intending to run at what pace. Let’s say that “on pace” means within 2 s/km of goal pace. So at 10 km, an “on pace” runner could be ± 20 s from where they should be (00:42:40). If they are more than 20 s slower than that, we’ll say they are behind pace; if more than 20 s faster, they are ahead of pace.
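That classification can be sketched as follows. The `split_10k_s` column (each runner’s elapsed seconds at 10 km) is an assumed name standing in for the real dataset:

```r
library(dplyr)

goal_10k <- 10 * (3 * 3600 / 42.195)  # goal 10 km split: ~2560 s (~00:42:40)
tol <- 2 * 10                         # 2 s/km over 10 km = 20 s

# Placeholder splits at 10 km, in seconds
runners <- tibble(runner_id = 1:3, split_10k_s = c(2500, 2555, 2620))

runners <- runners |>
  mutate(pace_group = case_when(
    split_10k_s < goal_10k - tol ~ "Ahead",
    split_10k_s > goal_10k + tol ~ "Behind",
    TRUE                         ~ "Goal Pace"
  ))
```

The same `case_when()` works at any waypoint by scaling the goal split and tolerance to the distance covered.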

Knowing this, we can look at the outcome. Of the runners going for 3 h, what was the best strategy?

We can see that most people do not negative split a sub-3 marathon. The majority of people making the goal run the first 10 km (and indeed most of the race) ahead of goal pace.

There’s a risk here though: going out faster than goal pace means that you might fail. The yellow traces really show how, at 30-35 km, the race gets very tough and people can slow down significantly. Anyone who has run a marathon will tell you that “the race only starts at the 30 km mark”. It’s where people start to hit the wall and this plot really shows that. These folks could have misjudged their theoretical best pace or just struggled on this occasion.

I find the strategy for success interesting. A lot of advice out there is to start out a marathon at an easier pace and speed up if you can. While it’s true you shouldn’t go too fast and blow up, the advice should be to train to run at more than 2 s ahead of goal pace and try to maintain that.

Tell me the odds

With all the caveats in place, let’s try and get some individual-level probabilities from our population-level data.

We looked at the 10 km point, applying a ± 2 s/km threshold for goal pace, and the behind/ahead classifications. We can do this for every waypoint that we have data for. Now, for a given waypoint, we can ask: of the runners that were, say, ahead of pace, how many finished sub-3 (succeeded) and how many were over 3 (failed)? This gives us a probability of success for that strategy at that waypoint. We can then plot these probabilities out.
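The tallying step can be sketched like this. The long-format `splits` tibble and its column names are placeholders for the real per-waypoint data:

```r
library(dplyr)

# Placeholder long-format data: one row per runner per waypoint, with the
# pacing class at that point and the final race outcome
splits <- tibble(
  waypoint   = rep(c(5, 10), each = 4),
  pace_group = rep(c("Ahead", "Ahead", "Behind", "Goal Pace"), 2),
  outcome    = c("Sub 3", "Over 3", "Over 3", "Sub 3",
                 "Sub 3", "Sub 3", "Over 3", "Over 3")
)

# Share of sub-3 finishers for each waypoint x pacing-class combination
probs <- splits |>
  group_by(waypoint, pace_group) |>
  summarise(n = n(), p_sub3 = mean(outcome == "Sub 3"), .groups = "drop")
```

Because `outcome == "Sub 3"` is logical, its mean is exactly the proportion of successes in each group, which is the probability plotted below for each strategy.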

If your strategy was to go ahead of pace and you were ahead of pace at 5 km, you have a 65% chance of going sub-3. If you are ahead of pace at 30 km, it climbs to a 72% chance. Obviously it keeps climbing to certain success the further the race progresses.

Running at goal pace gives a 50/50 chance of making it if you’re on pace at 5 km. But if you are only on-pace at the halfway point, your chance of success drops to 37%.

If you are behind pace at 10 km, you have a 19% chance of success and this probability drops as the race continues. Eventually, we hit the point where it is not possible to make up the time that’s lost and it is 100% likely that you will fail.

The best strategy?

The best strategy is to go out faster than goal pace and this is what you should train for.

Negative splitting is rare. Slowing down after 30 km is highly likely. Failing to account for this means potentially missing out on your goal.

This message is not too different from the previous post, but we now have some probabilities to back up advice on how the race should be run.

Caveats

Most people running a marathon are first-timers who will run this one race and their goal is to simply finish. Let’s face it, most non-runners have no idea whether your finish time was good/bad/whatever. They will just be impressed that you finished! This post is intended for repeat offenders who strive to improve their time. Maybe the best advice is to just go out there, enjoy running your marathon and not worry about pacing. It’s the best feeling in the world to have achieved it whether it’s your first or fifth.

This analysis is obviously limited to one dataset, the 2025 New York City Marathon. It has a flat profile, so any of the probabilities will likely only apply over a similarly flat course in similar conditions. I also mentioned that we assume a 3 h goal for the runners in the window and we saw how this is not perfect, but it is the best we can do. Obviously, the pacing for other goal times may be different, but we saw in the previous analysis that positive splitting is the most likely scenario regardless of pace.

The code

library(ggplot2)
library(ggtext)
library(dplyr)
library(hms)

## plot styling ----

# Custom qBrand plot styling is used below via theme_q(); if you don't have it,
# substitute e.g. theme_minimal() and the code should run fine

my_colours <- c("Sub 3 - Behind" = "#003d5c",
                "Sub 3 - Goal Pace" = "#954e9b",
                "Sub 3 - Ahead" = "#ff6b59",
                "Over 3 - Behind" = "#464c89",
                "Over 3 - Goal Pace" = "#dd4d88",
                "Over 3 - Ahead" = "#ffa600")

my_levels <- c("Sub 3 - Behind",
               "Sub 3 - Goal Pace",
               "Sub 3 - Ahead",
               "Over 3 - Behind",
               "Over 3 - Goal Pace",
               "Over 3 - Ahead")

## data wrangling ----

# load csv file from url
# url <- paste0("https://huggingface.co/datasets/donaldye8812/",
#               "nyc-2025-marathon-splits/resolve/main/",
#               "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
# df <- read.csv(url)

# save locally
# write.csv(df, "Output/Data/nyc_marathon_2025_splits.csv", row.names = FALSE)          

## main script ----

df <- read.csv("Output/Data/nyc_marathon_2025_splits.csv")

times_df <- df %>%
  select(RunnerID, splitCode, time)
runners_df <- df %>%
  select(RunnerName, RunnerID, OverallTime, OverallPlace, Gender,
         Age, City, Country, Bib) %>% 
  unique()
runners_df$OverallTime <- as_hms(runners_df$OverallTime)

# unique pairs of splitCode and distance -- and add distance in km
split_distances <- df %>%
  select(splitCode, distance) %>%
  unique()
split_distances$distance <- c(4.83,5.00,6.44,8.05,9.66,10.00,11.27,12.87,14.48,
                              15.00,16.09,17.70,19.31,20.00,20.92,21.08,22.53,
                              24.14,25.00,25.75,27.36,28.97,30.00,30.58,32.19,
                              33.80,35.00,35.41,37.01,38.62,40.00,40.23,41.84,
                              42.20)

# merge split distances with times_df
times_df <- merge(times_df, split_distances, by = "splitCode", sort = FALSE)

# order the table by RunnerID and then by distance
times_df <- times_df[order(times_df$RunnerID, times_df$distance), ]
row.names(times_df) <- NULL
# time is character, change it
times_df$time <- as_hms(times_df$time)

# make a df of RunnerID, OverallTime, and a new column called Category which is
# "Sub 3" or "Over 3"
category_df <- runners_df %>%
  select(RunnerID, OverallTime) %>% 
  filter(OverallTime > as_hms("02:55:00") & OverallTime <= as_hms("03:05:00")) %>%
  mutate(Category = ifelse(OverallTime <= as_hms("03:00:00"), "Sub 3", "Over 3"))

# merge category_df with times_df to get the pace for each runner in each
# category and drop any rows with NA values
times_df <- merge(times_df, category_df,
                  by = "RunnerID", all.x = TRUE, sort = FALSE) %>%
  filter(!is.na(Category)) %>%
  mutate(on_par = time - (as_hms("03:00:00") /42.19 * distance))

ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = Category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = c("Sub 3" = "#ff6b59", "Over 3" = "#464c89")) +
  labs(title = "Difference from Goal Pace for Sub-3 and Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)",
       color = "Category") +
  theme_q() +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))
ggsave("Output/Plots/sub_over_3_comparison.png", width = 7, height = 4, dpi = 300)

ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>%
              filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>%
              filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#464c89", alpha = 0.2) +
  labs(title = "Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/over_3_comparison.png", width = 7, height = 4, dpi = 300)

ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>% filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>% filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#ff6b59", alpha = 0.2) +
  labs(title = "Sub-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/sub_3_comparison.png", width = 7, height = 4, dpi = 300)

# classify on_par into three categories: "Ahead of Par" for values less than
# -20, "On Par" for values between -20 and 20, and "Behind Par" for values
# greater than 20 at the 10K mark, i.e. 2 seconds per km * 10 km = 20 seconds
class_df <- times_df %>%
  mutate(par_category = case_when(
    distance == 10 & on_par < -20 ~ "Ahead",
    distance == 10 & on_par >= -20 & on_par <= 20 ~ "Goal Pace",
    distance == 10 & on_par > 20 ~ "Behind",
    TRUE ~ NA_character_
  )) %>% 
  filter(!is.na(par_category))
# paste Category and par_category together to make a new column called final_category
class_df <- class_df %>%
  mutate(final_category = paste(Category, par_category, sep = " - ")) %>% 
  select(RunnerID, final_category)
# merge class_df with times_df to get the final_category for each runner in each category and drop any rows with NA values
times_df <- merge(times_df, class_df, by = "RunnerID", all.x = TRUE) %>%
  filter(!is.na(final_category))
# use my_levels to get facets in the order of my_level
times_df$final_category <- factor(times_df$final_category, levels = my_levels)
# ggplot of on_par by distance colored by final_category
ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = final_category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = my_colours) +
  labs(title = "Pacing at 10 km and Overall Outcome",
       x = "Distance (km)",
       y = "Difference from Par Time (seconds)",
       color = "Category") +
  facet_wrap(~ final_category) +
  theme_q() +
  # set legend.position after theme_q(), which would otherwise reset it
  theme(legend.position = "none")
ggsave("Output/Plots/pacing_by_final_category.png", width = 10, height = 6, dpi = 300)


# calculate the probability of success

# list of unique distances in numerical order
distance_list <- sort(unique(times_df$distance))

all_p_df <- tibble()
for (i in seq_along(distance_list)) {
  dist <- distance_list[i]
  par <- as_hms("00:00:02") * dist
  class_df <- times_df %>%
    mutate(par_category = case_when(
      distance == dist & on_par < -par ~ "Ahead",
      distance == dist & on_par >= -par & on_par <= par ~ "Goal Pace",
      distance == dist & on_par > par ~ "Behind",
      TRUE ~ NA_character_
    )) %>% 
    filter(!is.na(par_category)) %>%
    select(RunnerID, Category, par_category) %>%
    group_by(Category, par_category) %>%
    summarise(count = n()) %>%
    group_by(par_category) %>%
    mutate(percentage = count / sum(count) * 100) %>%
    ungroup() %>% 
    mutate(distance = dist) %>% 
    select(distance, Category, par_category, percentage)
  all_p_df <- rbind(all_p_df, class_df)
}

all_p_df$final_category <- paste(all_p_df$Category, all_p_df$par_category, sep = " - ")
all_p_df$final_category <- factor(all_p_df$final_category, levels = my_levels)
all_p_df %>% 
  filter(grepl("^Sub 3", final_category)) %>% 
  ggplot(aes(x = distance, y = percentage, colour = final_category)) +
  geom_line() +
  scale_color_manual(values = my_colours) +
  labs(title = "Probability of Success for Pacing Strategies by Distance",
       x = "Distance (km)",
       y = "Probability of Sub-3 (%)",
       color = "Category") +
  theme_q()
ggsave("Output/Plots/probability_of_success_sub_3.png", width = 7, height = 4, dpi = 300)

The post title comes from “Marathon Man” by Ian Brown from his “My Way” album.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Marathon Man II: how to pace a marathon]]>
Do AI coding agents save scientists time? https://www.r-bloggers.com/2026/04/do-ai-coding-agents-save-scientists-time/ Mon, 13 Apr 2026 14:00:00 +0000 https://www.seascapemodels.org/posts/2026-04-14-does-agentic-AI-save-time/

I’m often asked if using AI coding agents saves time. Yes, they write code very quickly and can complete entire ecological data analyses.

Do agents really help when the deadlines are approaching?

Continue reading: Do AI coding agents save scientists time?]]>
[This article was first published on Seascapemodels, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m often asked if using AI coding agents saves time. Yes, they write code very quickly and can complete entire ecological data analyses.

Do agents really help when the deadlines are approaching?

But the code also requires careful checking for logical errors. Our recent analysis shows this. The best LLMs can complete entire analyses and all the code works well. But there was a decent chance of subtle logical errors. These logical errors would require pretty deep human understanding of the code to correct.

There’s another issue and that is using code you don’t understand. I often find the agents produce so much code, but I’m not comfortable using it until I understand the logic line-by-line.

In those cases I find it’s faster to use an autocomplete AI assistant so I’m going line-by-line, rather than an agentic loop that completes the entire piece of work.

I think the jury is still out on whether there is a net time benefit to using agents. The only way to really answer is a randomised controlled trial where you time how long it takes scientists to fully complete a task.

The only study I’m aware of is quite limited and looked at software developers. They found the developers often projected they would be faster with the AI tools, but were actually slower at tasks by the end of the project.

It’s true that using AI is fun because it makes so much progress, but that fun feeling might be a trap for us.

It’s likely that the answer is context-dependent.

I suspect for scientists most of the coding we do (like writing models that represent ecosystems) actually requires the human to understand what it does. In these cases agents don’t make sense, because you need to go back and review the code carefully to understand it anyway.

On the other hand, if you are making software tools that are easy to verify then agents are great. For instance, I often use them to write code for non-standard figures. I don’t need to know the code in that case because I can check the output is correct visually.

Likewise, interactive shiny apps are another example of time saving. The agent can take some (good) code you already have and turn it into an app. It’s easy to test and check because you just use the app.

People often point to advances in LLMs and say that soon they will be good enough to do all the coding for us. I’m not so sure that applies to science. In fact, we found the later version of Claude Sonnet performed about the same as an earlier version on scientific logic, it just made different types of errors.

I think the advances need to come in the ways we interact and use the LLMs.

The ultimate goal should be efficient but also high-quality work. That’s something I want to look at in my next agentic AI study.

To leave a comment for the author, please follow the link and comment on their blog: Seascapemodels.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Do AI coding agents save scientists time?]]>
reviser: Analyzing Real-Time Data Revisions in R https://www.r-bloggers.com/2026/04/reviser-analyzing-real-time-data-revisions-in-r/ Mon, 13 Apr 2026 00:00:00 +0000 https://ropensci.org/blog/2026/04/13/reviser/ Economic data are rarely static.
Gross domestic product (GDP), inflation, employment, and other official statistics arrive as early estimates, then get revised as new source data arrive, seasonal adjustment is updated, or benchmarking changes are appl...

Continue reading: reviser: Analyzing Real-Time Data Revisions in R]]>
[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Economic data are rarely static. Gross domestic product (GDP), inflation, employment, and other official statistics arrive as early estimates, then get revised as new source data arrive, seasonal adjustment is updated, or benchmarking changes are applied. Those revisions matter because they can change the narrative around turning points, policy mistakes, and forecast performance.

reviser is an R package by Marc Burri and Philipp Wegmüller for working with these vintage datasets directly. A vintage dataset records multiple published versions of the same time series, so you can compare what was known at each release date with what was reported later. reviser gives you a consistent workflow to:

  • reshape release vintages between wide and tidy formats;
  • extract revisions relative to earlier or final releases;
  • summarize bias, dispersion, and serial dependence in revisions;
  • identify the first release that is statistically close to the eventual benchmark;
  • nowcast future revisions with state-space models.

The package is aimed at users who already work with real-time macroeconomic data and want tools that go beyond plotting revision triangles by hand. One design goal is to keep that workflow in pure R.

reviser was reviewed through rOpenSci statistical software peer review. Many thanks to reviewers Alex Gibberd and Tanguy Barthelemy, and to editor Rebecca Killick, for feedback that improved the package.

Why revision analysis deserves its own workflow

Revisions are not just measurement noise. They encode how information enters the data-production process.

  • Some revisions reflect genuinely new information.
  • Others reflect noise that could, in principle, have been reduced earlier.
  • Still others come from methodological changes or benchmark updates.

These distinctions matter if you are evaluating early data releases, building nowcasts, or asking whether first releases are already efficient summaries of the available information.

The reviser vignettes organize this workflow into three layers:

  1. Structure vintages consistently.
  2. Measure and test revision properties.
  3. Model the revision process when you want to predict future changes.

A compact example with GDP vintages

The first step is to reshape the data into a tidy vintage format, where each row corresponds to an observed value, the date it refers to, and the publication date of that estimate.

The package ships with a GDP example dataset in long vintage format. Suppose we want to focus on U.S. GDP growth, visualize how estimates moved during the 2008-09 global financial crisis, and then ask whether early releases were systematically biased relative to a later benchmark.

library(reviser)
library(dplyr)
library(tsbox)

gdp_us <- gdp |>
  filter(id == "US") |>
  tsbox::ts_pc() |>
  tsbox::ts_span(start = "1980-01-01")

gdp_wide <- vintages_wide(gdp_us)
gdp_long <- vintages_long(gdp_wide, keep_na = FALSE)

With the vintages in tidy form, we can plot how the published path changed over time. The y-axis in the figure reports quarter-on-quarter GDP growth rates.

plot_vintages(
  gdp_long |>
    filter(
      pub_date >= as.Date("2009-01-01"),
      pub_date < as.Date("2010-01-01"),
      time > as.Date("2008-01-01"),
      time < as.Date("2010-01-01")
    ),
  type = "line",
  title = "Revisions of GDP during the 2008-09 global financial crisis",
  ylab = "Quarter-on-quarter GDP growth rate"
)
Multiple vintage paths for U.S. GDP growth, highlighting how estimates published in 2009 changed over time.

GDP growth vintages for the United States during the 2008-09 global financial crisis.

During volatile periods, the vintage paths can diverge enough that the story told by the first release is noticeably different from the story told a year later.

Once the data are in tidy vintage form, you can compare a set of early releases to a later benchmark release.

final_release <- get_nth_release(gdp_long, n = 10)
early_releases <- get_nth_release(gdp_long, n = 0:6)

summary_tbl <- get_revision_analysis(early_releases, final_release)

Warning: Both 'release' and 'pub_date' columns are present in 'df.
The 'release' column will be used for grouping.

summary_tbl |>
  select(id, release, `Bias (mean)`, `Bias (robust p-value)`, `Noise/Signal`)


=== Revision Analysis Summary ===
# A tibble: 7 × 5
  id    release   `Bias (mean)` `Bias (robust p-value)` `Noise/Signal`
  <chr> <chr>             <dbl>                   <dbl>          <dbl>
1 US    release_0        -0.014                   0.52           0.22
2 US    release_1        -0.015                   0.425          0.202
3 US    release_2        -0.013                   0.507          0.205
4 US    release_3        -0.003                   0.851          0.194
5 US    release_4        -0.014                   0.326          0.157
6 US    release_5        -0.021                   0.181          0.152
7 US    release_6        -0.018                   0.202          0.13
=== Interpretation ===
id=US, release=release_0:
• No significant bias detected (p = 0.52 )
• Moderate revision volatility (Noise/Signal = 0.22 )
id=US, release=release_1:
• No significant bias detected (p = 0.425 )
• Moderate revision volatility (Noise/Signal = 0.202 )
id=US, release=release_2:
• No significant bias detected (p = 0.507 )
• Moderate revision volatility (Noise/Signal = 0.205 )
id=US, release=release_3:
• No significant bias detected (p = 0.851 )
• Moderate revision volatility (Noise/Signal = 0.194 )
id=US, release=release_4:
• No significant bias detected (p = 0.326 )
• Moderate revision volatility (Noise/Signal = 0.157 )
id=US, release=release_5:
• No significant bias detected (p = 0.181 )
• Moderate revision volatility (Noise/Signal = 0.152 )
id=US, release=release_6:
• No significant bias detected (p = 0.202 )
• Moderate revision volatility (Noise/Signal = 0.13 )

This is where reviser moves beyond a reshape-and-plot package. The revision summary reports quantities that applied work often needs but usually rebuilds ad hoc: mean bias, quantiles, volatility, noise-to-signal ratios, and hypothesis tests for bias, serial correlation, and news-versus-noise interpretations.
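For intuition, the headline quantities can be reproduced by hand. Below is a minimal sketch (not reviser's implementation), assuming `early` and `final` are numeric vectors holding first-release and benchmark values for the same quarters:

```r
# Toy first-release and benchmark series for the same quarters
early <- c(0.5, 0.3, -0.2, 0.8, 0.4)
final <- c(0.6, 0.2, -0.1, 0.9, 0.4)

revision <- final - early             # revision from first release to benchmark
bias     <- mean(revision)            # mean bias of the first release
ns_ratio <- sd(revision) / sd(final)  # noise-to-signal ratio
t.test(revision)                      # naive test that mean revision is zero
```

reviser's versions of these statistics use robust standard errors and add serial-correlation and news-versus-noise tests on top.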

In the bundled example, the early U.S. GDP releases over this sample show little evidence of systematic bias relative to the later benchmark. The package also supports efficient-release diagnostics, where the question is not only whether revisions exist, but when additional revisions stop adding meaningful information.

efficient_release <- get_first_efficient_release(early_releases, final_release)
summary(efficient_release)

Efficient release: 0
Model summary:
Call:
stats::lm(formula = formula, data = df_wide)
Residuals:
Min 1Q Median 3Q Max
-0.89186 -0.12669 0.02046 0.11475 0.97986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00299 0.02223 0.134 0.893
release_0 0.97412 0.01692 57.567 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2518 on 166 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.9523, Adjusted R-squared: 0.952
F-statistic: 3314 on 1 and 166 DF, p-value: < 2.2e-16
Test summary:
Linear hypothesis test:
(Intercept) = 0
release_0 = 1
Model 1: restricted model
Model 2: final ~ release_0
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 168
2 166 2 1.9283 0.1486

That is exactly the kind of result that is hard to see from a revision triangle alone but straightforward to formalize once the workflow is standardized. In this sample, the result points to the first release as already being statistically close to the later benchmark, which suggests subsequent revisions add relatively little systematic information.
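The test reported above is a Mincer–Zarnowitz-style regression. A hand-rolled sketch of the same idea (again, not reviser's internal code), using simulated data where the benchmark equals the first release plus uncorrelated "news":

```r
library(car)  # for linearHypothesis()

set.seed(1)
first <- rnorm(100)                    # first-release values
final <- first + rnorm(100, sd = 0.2)  # benchmark = first release + "news"

# The first release is "efficient" if we cannot reject intercept = 0, slope = 1
fit <- lm(final ~ first)
car::linearHypothesis(fit, c("(Intercept) = 0", "first = 1"))
```

When the joint hypothesis is not rejected, as in the reviser output above, later revisions behave like unforecastable news rather than correctable noise.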

From descriptive analysis to revision nowcasting

For many users, revision summaries will be the main use case. But reviser also includes model-based tools for users who want to treat revisions as an explicit latent-data problem. That matters if you need to make decisions on preliminary data but also want a structured way to estimate how those figures are likely to change later.

Two vignettes walk through nowcasting revisions with:

  • the generalized Kishor-Koenig family via kk_nowcast();
  • the Jacobs-Van Norden model via jvn_nowcast().

Both approaches cast the revision problem into state-space form, which makes it possible to estimate the dynamics of news and noise in successive data releases. For technical users, this is the part of the package that turns revision analysis from retrospective diagnosis into a forecasting problem.

Here is a compact kk_nowcast() example following the Kishor-Koenig workflow from the vignette for the Euro Area (EA).
The key idea is to first identify an efficient release e, then estimate the revision system on the corresponding panel of releases. In this euro area example, the efficient-release step selects e = 2, so the model treats the third published release as the earliest one that is already close to the later benchmark. That is a useful substantive result on its own: it suggests that most of the economically relevant signal arrives within the first few releases, while later revisions are smaller adjustments around that path.

gdp_ea <- reviser::gdp |>
  tsbox::ts_pc() |>
  dplyr::filter(
    id == "EA",
    time >= min(pub_date),
    time <= as.Date("2020-01-01")
  )

releases <- get_nth_release(gdp_ea, n = 0:14)
final_release <- get_nth_release(gdp_ea, n = 15)

efficient_release <- get_first_efficient_release(releases, final_release)

fit_kk <- kk_nowcast(
  df = efficient_release$data,
  e = efficient_release$e,
  model = "KK",
  method = "MLE"
)

summary(fit_kk)


=== Kishor-Koenig Model ===
Convergence: Success
Log-likelihood: 125.7
AIC: -231.41
BIC: -198.23
Parameter Estimates:
Parameter Estimate Std.Error
F0 0.633 0.131
G0_0 0.950 0.031
G0_1 -0.037 0.152
G0_2 -0.181 0.220
G1_0 -0.009 0.011
G1_1 0.594 0.061
G1_2 0.194 0.092
v0 0.380 0.068
eps0 0.008 0.001
eps1 0.001 0.000

plot(fit_kk)
Diagnostic plot from a Kishor-Koenig nowcast model for euro area GDP revisions, summarizing the fitted revision dynamics.

Diagnostic plot from the Kishor-Koenig nowcast example.

The fitted object contains estimated parameters, filtered and smoothed latent states, and plotting methods for the implied efficient-release path. That gives you a direct route from descriptive revision analysis to a state-space nowcast of future revisions. For a broader audience, the main takeaway is not the individual coefficients. It is that the model converges cleanly on this sample, summarizes the revision process in a compact latent-state form, and provides a practical way to judge whether a new release is likely to be revised materially later on. Substantively, the model separates persistent signal from transitory revision noise, so the output is useful when you want to judge whether new releases are likely to be revised materially.

What reviser adds

What stands out in reviser is not a single function, but the coherence of the workflow:

  • the package has explicit conventions for vintage data;
  • descriptive revision analysis and formal testing sit in the same API;
  • efficient-release analysis connects directly to applied questions about which release to trust;
  • nowcasting tools extend the same workflow rather than forcing a separate modeling stack.

If you work with real-time macroeconomic data, that combination is useful because revision analysis is usually fragmented across custom scripts, one-off spreadsheets, and package combinations that do not share a common data structure.

Try it and push it further

You can install the package from the rOpenSci R-universe:

install.packages(
  "reviser",
  repos = c("https://ropensci.r-universe.dev", "https://cloud.r-project.org")
)

Then start with the package site and vignettes:

We would be happy to hear feedback from those of you trying out the package with different datasets. If you have a real-time dataset with a different release structure, that would be a good stress test for the package. If you find gaps in the workflow or have a use case to share, open an issue or contribute an example. Revision analysis gets more useful as it becomes easier to compare workflows across datasets rather than rebuilding them from scratch each time.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: reviser: Analyzing Real-Time Data Revisions in R]]>
`mlS3` — A Unified S3 Machine Learning Interface in R https://www.r-bloggers.com/2026/04/mls3-a-unified-s3-machine-learning-interface-in-r/ Sun, 12 Apr 2026 00:00:00 +0000 https://thierrymoudiki.github.io//blog/2026/04/12/r/intro-mlS3 Introduction to `mlS3` — A Unified S3 Machine Learning Interface in R
Continue reading: `mlS3` — A Unified S3 Machine Learning Interface in R]]>
[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Overview

Following the R6 object-based package unifiedml introduced last week, this blog post introduces the mlS3 R package, which strives to provide a unified, consistent S3 interface for training and predicting from a variety of popular machine learning models. Rather than learning the idiosyncratic API of each package (you’d still need to read their docs for parameter specifications, though), mlS3 wraps them under a common wrap_* / predict() pattern.

What You’ll Learn

  • How to install and load mlS3 (for now, from GitHub)
  • How to apply a consistent API across multiple ML algorithms for both classification and regression tasks

Models Covered

Wrapper           Underlying package                    Task(s)
wrap_glmnet()     glmnet (generalized linear models)    Classification, Regression
wrap_lightgbm()   lightgbm (gradient boosting)          Classification, Regression
wrap_ranger()     ranger (random forest)                Classification, Regression
wrap_svm()        e1071 (support vector machines)       Classification, Regression
wrap_caret()      caret (200+ models)                   Classification, Regression

Datasets Used

  • iris — binary and multiclass classification (setosa/versicolor, all three species)
  • mtcars — regression to predict miles per gallon (mpg)

Key Design Principle

All models follow the same two-step workflow:

mod  <- wrap_*(X_train, y_train, ...)       # Train
pred <- predict(mod, newx = X_test, ...)    # Predict

This makes it easy to swap algorithms and compare results without rewriting your pipeline.
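Because every wrapper shares the same signature, model comparison can be a simple loop. A sketch, assuming mlS3 and its backends are installed and that the wrappers' default arguments suit a regression task (as in the mtcars examples below):

```r
library(mlS3)

# Train/test split on mtcars, predicting mpg
set.seed(123)
idx     <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
X_train <- mtcars[idx, -1];  y_train <- mtcars$mpg[idx]
X_test  <- mtcars[-idx, -1]; y_test  <- mtcars$mpg[-idx]

wrappers <- list(glmnet = wrap_glmnet,
                 ranger = wrap_ranger,
                 svm    = wrap_svm)

rmse <- sapply(names(wrappers), function(nm) {
  mod  <- wrappers[[nm]](X_train, y_train)  # same call for every model
  pred <- predict(mod, newx = X_test)       # same predict() interface
  sqrt(mean((pred - y_test)^2))
})

sort(rmse)  # lowest error first
```

Swapping in a different algorithm means adding one entry to the list, not rewriting the pipeline.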

Prerequisites

  • R with the following packages: remotes, caret, randomForest, ggplot2
  • mlS3 installed from GitHub (for now) via remotes::install_github("Techtonique/mlS3")

Code

Install packages

install.packages(c("remotes", "caret", "randomForest"))

remotes::install_github("Techtonique/mlS3")

Predefined wrappers

# Classification

library(mlS3)

# =============================================================================
# Classification examples (no leakage)
# =============================================================================
set.seed(123)

# --- Binary classification: iris setosa vs versicolor ---
iris_bin <- iris[iris$Species != "virginica", ]
X_bin <- iris_bin[, 1:4]
y_bin <- droplevels(iris_bin$Species)

# Split into train/test
idx_bin <- sample(nrow(X_bin), 0.7 * nrow(X_bin))
X_bin_train <- X_bin[idx_bin, ]
y_bin_train <- y_bin[idx_bin]
X_bin_test  <- X_bin[-idx_bin, ]
y_bin_test  <- y_bin[-idx_bin]

# glmnet
mod <- wrap_glmnet(X_bin_train, y_bin_train, family = "binomial")
pred_bin_glmnet <- predict(mod, newx = X_bin_test, type = "class")
acc_glmnet <- mean(pred_bin_glmnet == y_bin_test)

cat("Accuracy (glmnet): ", acc_glmnet, "\n")


# --- Multiclass classification: iris all species ---
X_multi <- iris[, 1:4]
y_multi <- iris$Species

# Split into train/test
idx_multi <- sample(nrow(X_multi), 0.7 * nrow(X_multi))
X_multi_train <- X_multi[idx_multi, ]
y_multi_train <- y_multi[idx_multi]
X_multi_test  <- X_multi[-idx_multi, ]
y_multi_test  <- y_multi[-idx_multi]

# lightgbm
mod <- wrap_lightgbm(X_multi_train, y_multi_train,
                     params = list(objective = "multiclass",
                                   num_class = 3, verbose = -1),
                     nrounds = 150)
pred_multi_lightgbm <- predict(mod, newx = X_multi_test, type = "class")
acc_lightgbm <- mean(pred_multi_lightgbm == y_multi_test)

# ranger
mod <- wrap_ranger(X_multi_train, y_multi_train, num.trees = 100L)
pred_multi_ranger <- predict(mod, newx = X_multi_test, type = "class")
acc_ranger <- mean(pred_multi_ranger == y_multi_test)

# svm
mod <- wrap_svm(X_multi_train, y_multi_train, kernel = "radial")
pred_multi_svm <- predict(mod, newx = X_multi_test, type = "class")
acc_svm <- mean(pred_multi_svm == y_multi_test)

cat("Accuracy (lightgbm): ", acc_lightgbm, "\n")
cat("Accuracy (ranger): ", acc_ranger, "\n")
cat("Accuracy (svm): ", acc_svm, "\n")


# Regression


# =============================================================================
# Regression examples (mtcars)
# =============================================================================
X_reg <- mtcars[, -1]
y_reg <- mtcars$mpg

# Split into train/test
set.seed(123)
idx_reg <- sample(nrow(X_reg), 0.7 * nrow(X_reg))
X_reg_train <- X_reg[idx_reg, ];  y_reg_train <- y_reg[idx_reg]
X_reg_test  <- X_reg[-idx_reg, ]; y_reg_test  <- y_reg[-idx_reg]

# lightgbm
mod <- wrap_lightgbm(X_reg_train, y_reg_train,
                     params = list(objective = "regression", verbose = -1),
                     nrounds = 50)
pred_reg_lightgbm <- predict(mod, newx = X_reg_test)
rmse_lightgbm <- sqrt(mean((pred_reg_lightgbm - y_reg_test)^2))

# glmnet
mod <- wrap_glmnet(X_reg_train, y_reg_train, alpha = 0)
pred_reg_glmnet <- predict(mod, newx = X_reg_test)
rmse_glmnet <- sqrt(mean((pred_reg_glmnet - y_reg_test)^2))

# svm
mod <- wrap_svm(X_reg_train, y_reg_train)
pred_reg_svm <- predict(mod, newx = X_reg_test)
rmse_svm <- sqrt(mean((pred_reg_svm - y_reg_test)^2))

# ranger
mod <- wrap_ranger(X_reg_train, y_reg_train, num.trees = 100L)
pred_reg_ranger <- predict(mod, newx = X_reg_test)
rmse_ranger <- sqrt(mean((pred_reg_ranger - y_reg_test)^2))

cat("RMSE (lightgbm): ", rmse_lightgbm, "\n")
cat("RMSE (glmnet): ", rmse_glmnet, "\n")
cat("RMSE (svm): ", rmse_svm, "\n")
cat("RMSE (ranger): ", rmse_ranger, "\n")



Accuracy (glmnet):  1 
Accuracy (lightgbm):  0.2444444  # I'm probably doing something wrong here
Accuracy (ranger):  0.9333333 
Accuracy (svm):  0.9333333 
RMSE (lightgbm):  4.713678 
RMSE (glmnet):  2.972557 
RMSE (svm):  2.275837 
RMSE (ranger):  2.067692 
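The suspiciously low lightgbm multiclass accuracy above is worth diagnosing before blaming the model. A quick, wrapper-independent check in base R is a confusion matrix: if most of the mass sits off the diagonal in a systematic pattern, the class labels have likely been re-coded or shuffled somewhere. A minimal sketch (the simulated vectors below stand in for pred_multi_lightgbm and y_multi_test):

```r
# Cross-tabulate predictions vs. truth and recover accuracy from the diagonal
confusion_report <- function(pred, truth) {
  cm <- table(Predicted = pred, Actual = truth)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Simulated stand-ins for pred_multi_lightgbm / y_multi_test
set.seed(1)
species <- c("setosa", "versicolor", "virginica")
truth <- factor(sample(species, 45, replace = TRUE), levels = species)
pred  <- truth
pred[1:2] <- "setosa"  # introduce a couple of errors

out <- confusion_report(pred, truth)
out$confusion  # off-diagonal counts show where predictions go wrong
out$accuracy
```

Applied to the real lightgbm predictions, a pattern such as every true class being mapped to a single predicted label would suggest the wrapper and lightgbm disagree about label encoding (lightgbm expects 0-based integer class labels for multiclass objectives).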

caret wrapper

For this part, you need to install the caret and randomForest packages. The model parameters available through caret are listed at: https://topepo.github.io/caret/available-models.html

library(mlS3)
library(caret)

# ============================================================================
# Regression with mtcars dataset
# ============================================================================
data(mtcars)

# Prepare data
X_reg <- mtcars[, -1]  # All except mpg
y_reg <- mtcars$mpg     # Target variable

# Split into train/test
set.seed(123)
idx_reg <- sample(nrow(X_reg), 0.7 * nrow(X_reg))
X_reg_train <- X_reg[idx_reg, ]
y_reg_train <- y_reg[idx_reg]
X_reg_test <- X_reg[-idx_reg, ]
y_reg_test <- y_reg[-idx_reg]

# ----------------------------------------------------------------------------
# Example 1: Random Forest with specific parameters
# ----------------------------------------------------------------------------
cat("\n=== Example 1: Random Forest Regression ===\n")

mod_rf <- wrap_caret(X_reg_train, y_reg_train,
                     method = "rf",
                     mtry = 3)        # Number of variables sampled at each split

print(mod_rf)

# Predictions
pred_rf <- predict(mod_rf, newx = X_reg_test)
rmse_rf <- sqrt(mean((pred_rf - y_reg_test)^2))
r2_rf <- 1 - sum((y_reg_test - pred_rf)^2) / sum((y_reg_test - mean(y_reg_test))^2)

cat("RMSE:", round(rmse_rf, 3), "\n")
cat("R-squared:", round(r2_rf, 3), "\n")


=== Example 1: Random Forest Regression ===
$model
Random Forest 

22 samples
10 predictors

No pre-processing
Resampling: None 

$task
[1] "regression"

$method
[1] "rf"

$parameters
$parameters$mtry
[1] 3


attr(,"class")
[1] "wrap_caret"
RMSE: 2.007 
R-squared: 0.681 

library(ggplot2)

df <- data.frame(
  pred = pred_rf,
  actual = y_reg_test
)

ggplot(df, aes(x = pred, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_minimal() +
  labs(x = "Predicted", y = "Actual")

(Figure: predicted vs. actual mpg for the random forest model, with the red y = x line marking perfect prediction.)

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Test & Roll: Why Smaller A/B Tests Can Make More Money
https://www.r-bloggers.com/2026/04/test-roll-why-smaller-a-b-tests-can-make-more-money/ (Sun, 12 Apr 2026)
[This article was first published on Florian Teschner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Short practical advice on A/B testing:

  1. Stop sizing tests only for statistical significance – In finite campaigns, your goal is profit, not perfect inference.

  2. Treat testing as a trade-off – Every extra test exposure buys learning but also burns revenue if that exposure gets the weaker treatment.

  3. Use smaller tests when outcomes are noisy – This paper shows profit-maximizing test sizes rise much more slowly than classical power-based sizes.

  4. Scale test size with reachable audience – If your population is limited, test size should reflect that constraint directly.

  5. Allow unequal splits when priors differ – If one treatment is likely better a priori (e.g., treatment vs holdout), asymmetric test cells can be optimal.

Shiny App to test the implications:

Test and Roll Shiny App

Long Version

I just read Test & Roll: Profit-Maximizing A/B Tests by Elea McDonnell Feit and Ron Berman (2019), and it challenges one of the default habits in marketing experimentation: planning tests as if the main objective were statistical significance.

Their point is simple: in most real marketing experiments, you have a finite population (email list, campaign budget, limited traffic window). In that setting, the right objective is total expected profit across test + rollout, not p-values.

The core idea

A classic A/B setup has two stages:

  1. Test stage: expose n1 users to treatment A and n2 users to treatment B.
  2. Roll stage: deploy the winner to the remaining N - n1 - n2 users.

Bigger tests improve certainty, but they also create opportunity cost: more users in test means more users potentially seeing the weaker treatment before rollout.

The paper formalizes this as a decision problem and derives profit-maximizing sample sizes. Under Normal priors and Normal outcomes, they get closed-form solutions.
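A sketch of that closed form, as I read the paper's symmetric case: with reachable population N, outcome noise s (standard deviation of the response), and prior standard deviation sigma on each treatment mean, the profit-maximizing size of each equal test cell has a simple expression. Treat this as an approximation to be verified against the paper before real use:

```r
# Profit-maximizing per-cell test size under Normal priors and Normal
# outcomes (my reading of Feit & Berman 2019; verify before relying on it)
test_roll_n <- function(N, s, sigma) {
  r <- (s / sigma)^2
  sqrt(N / 4 * r + (3 / 4 * r)^2) - 3 / 4 * r
}

# Profit-maximizing size vs. a classical power calculation, for a noisy
# response (s = 4) and a prior expecting a small lift (sigma = 0.1)
n_profit  <- test_roll_n(N = 1e5, s = 4, sigma = 0.1)
n_classic <- power.t.test(delta = 0.1, sd = 4, power = 0.8)$n

round(c(profit_maximizing = n_profit, power_based = n_classic))
```

With these illustrative numbers the profit-maximizing cell is several times smaller than the power-based one, and quadrupling N roughly doubles n, consistent with the square-root scaling the paper reports.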

Why this matters in practice

If you use classical hypothesis-test sizing, recommended n can be huge, especially when effect sizes are small and response is noisy (which is exactly what we see in advertising).

Their framework produces much smaller test sizes because it optimizes business outcomes, not Type I/II error control.

Two important takeaways:

  1. Optimal test sizes grow sub-linearly with response noise, while classical sample size rules grow much faster.
  2. Optimal test sizes scale with the square root of population size N, which makes them workable for smaller markets and finite campaigns.

Comparison with bandits

The authors benchmark against Thompson sampling (multi-armed bandit). Bandits usually win on pure optimization, but the gap is often modest in their examples.

That is useful operationally: a two-stage “test then roll” process is far easier to implement, explain, and govern than a continuously-adapting bandit, especially in organizations with approval and reporting constraints.

The applications are the best part

They test the approach in three contexts:

  1. Website design experiments
  2. Display advertising decisions
  3. Catalog holdout tests

Across cases, profit-maximizing designs use substantially smaller test cells than classical power calculations and produce higher expected profit.

A particularly practical result: small holdout groups (common in catalog and CRM practice) can be fully rational when priors are asymmetric. In other words, “unequal splits” are not always bad design; they can be the optimal design.

What I changed in my own thinking

Before this, I treated “underpowered” mostly as a red flag. After this paper, I think a better question is:

Underpowered for what objective?

If the objective is publication-grade inference, classical power logic is right. If the objective is campaign profit in a finite horizon, a smaller test can be the better business decision.

Practical implementation checklist

If you run tactical tests (email, paid media, landing pages), this paper suggests a better workflow:

  1. Define total reachable population N for the decision horizon.
  2. Set priors for treatment means from past similar experiments.
  3. Estimate response variance from historical data.
  4. Compute profit-maximizing n1, n2.
  5. Pre-commit the rollout decision rule (posterior expected profit winner).
  6. Report expected regret alongside expected upside.

That last point is underrated: decision-makers usually understand “expected dollars at risk” better than p-values.
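Steps 2 and 5 of the checklist can be sketched with standard conjugate Normal updates (a toy illustration with made-up numbers; the function names and values are mine, not from the paper):

```r
# Toy Normal-Normal update: prior N(mu0, sigma0^2) on an arm's mean response,
# observed average ybar from n users with known outcome noise sd s
posterior_mean <- function(ybar, n, s, mu0, sigma0) {
  w <- (n / s^2) / (n / s^2 + 1 / sigma0^2)  # precision weight on the data
  w * ybar + (1 - w) * mu0
}

# Step 5: roll out whichever arm has the higher posterior expected response
pick_winner <- function(ybar_a, ybar_b, n_a, n_b, s, mu0, sigma0) {
  pa <- posterior_mean(ybar_a, n_a, s, mu0, sigma0)
  pb <- posterior_mean(ybar_b, n_b, s, mu0, sigma0)
  if (pa >= pb) "A" else "B"
}

# Made-up conversion-style numbers for illustration
pick_winner(ybar_a = 0.031, ybar_b = 0.027, n_a = 500, n_b = 500,
            s = 0.17, mu0 = 0.03, sigma0 = 0.005)
```

With equal cell sizes and a common prior this reduces to picking the higher observed mean, but with unequal cells or arm-specific priors the shrinkage can flip the decision, which is exactly why pre-committing the rule matters.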

Bottom line

For many real marketing tests, “smaller than textbook” is not bad science. It is better decision design.

If your experiment exists to drive a business action on a finite audience, Test & Roll gives a rigorous way to choose sample sizes that maximize profit instead of statistical purity.


Paper: Feit, E. M., & Berman, R. (2019). Test & Roll: Profit-Maximizing A/B Tests. SSRN: https://ssrn.com/abstract=3274875

To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.

Machine Learning Frameworks in R
https://www.r-bloggers.com/2026/04/machine-learning-frameworks-in-r/ (Sat, 11 Apr 2026)
[This article was first published on R'tichoke, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R’s ecosystem offers a rich selection of machine learning frameworks, each with distinct design philosophies and strengths. This post is a side-by-side comparison of five ML frameworks in R that provide unified interfaces over multiple algorithms, with runnable code examples on the same dataset so you can compare APIs directly. The focus is on packages that let you swap algorithms without rewriting your code.

Frameworks at a Glance

| Feature | tidymodels | caret | mlr3 | h2o | qeML |
|---|---|---|---|---|---|
| Built-in tuning | ✅ (tune) | ✅ | ✅ (mlr3tuning) | ✅ (AutoML) | ✅ (qeFT()) |
| Preprocessing pipeline | ✅ (recipes) | ✅ (preProcess) | ✅ (mlr3pipelines) | ✅ | ❌ |
| Model variety | 200+ engines | 230+ methods | 100+ learners | GBM, GLM, DL, DRF | 20+ wrappers |
| Relative speed | Moderate | Moderate | Moderate | Fast (distributed) | Depends on backend |
| Learning curve | Medium | Low | High | Low | Very low |
| Active development | ✅ | ⚠ Maintenance mode | ✅ | ✅ | ✅ |
| Best for | Production pipelines | Quick prototyping | Benchmarking | AutoML & scale | Teaching & exploration |

Setup and Data

All examples below use the iris classification task: predict Species from the four numeric measurements. A single train/test split is created up front so results are directly comparable.

library(dplyr)

# Reproducible train/test split (framework-agnostic)
set.seed(42)
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Store accuracy results for final comparison
results <- data.frame(
  Framework = character(),
  Model = character(),
  Accuracy = numeric(),
  stringsAsFactors = FALSE
)

cat("Training set:", nrow(train_data), "observations\n")
Training set: 105 observations
cat("Test set:    ", nrow(test_data), "observations\n")
Test set:     45 observations

1. tidymodels

The tidymodels ecosystem is the modern, tidyverse-native approach to modeling in R. It provides a consistent grammar for specifying models (parsnip), preprocessing (recipes), composing workflows (workflows), and tuning hyperparameters (tune).

library(tidymodels)

# Define a recipe (preprocessing)
rec <- recipe(Species ~ ., data = train_data)

# Define a model specification
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Combine into a workflow
rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)

# Fit the workflow
rf_fit <- rf_wf %>% fit(data = train_data)

# Predict on test set
preds_tidy <- predict(rf_fit, test_data) %>%
  bind_cols(test_data %>% select(Species))

# Evaluate
acc_tidy <- accuracy(preds_tidy, truth = Species, estimate = .pred_class)
acc_tidy
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.978
results <- rbind(results, data.frame(
  Framework = "tidymodels",
  Model = "Random Forest (ranger)",
  Accuracy = acc_tidy$.estimate
))
Tip: Key Strengths
  • Composable pipeline: recipe + model + workflow is easy to extend
  • Swap engines with one line (set_engine("xgboost"))
  • Seamless cross-validation and hyperparameter tuning via tune_grid() / tune_bayes()
  • Deep integration with the tidyverse

2. caret

The caret package (Classification And REgression Training) was the de facto standard for ML in R for over a decade. It wraps 230+ models behind a single train() interface. While it is now in maintenance mode (its creator, Max Kuhn, leads tidymodels), it remains widely used.

library(caret)

# Train a random forest with 5-fold CV
ctrl <- trainControl(method = "cv", number = 5)

rf_caret <- train(
  Species ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  tuneLength = 3  # Try 3 values of mtry
)

# Best tuning parameter
rf_caret$bestTune
  mtry
1    2
# Predict on test set
preds_caret <- predict(rf_caret, test_data)

# Evaluate
cm_caret <- confusionMatrix(preds_caret, test_data$Species)
cm_caret$overall["Accuracy"]
 Accuracy 
0.9555556 
results <- rbind(results, data.frame(
  Framework = "caret",
  Model = "Random Forest (rf)",
  Accuracy = as.numeric(cm_caret$overall["Accuracy"])
))
Tip: Key Strengths
  • Minimal API: a single train() call handles preprocessing, tuning, and fitting
  • 230+ model methods available out of the box
  • Built-in confusionMatrix() with extensive diagnostics
  • Massive community knowledge base and Stack Overflow coverage

3. mlr3

mlr3 is a modern, object-oriented ML framework built on R6 classes. It excels at systematic benchmarking, composable pipelines, and reproducible experiments. The learning curve is steeper, but the payoff is a powerful, extensible architecture.

library(mlr3)
library(mlr3learners)

# Define the task
task <- TaskClassif$new(
  id = "iris",
  backend = train_data,
  target = "Species"
)

# Define the learner
learner <- lrn("classif.ranger", num.trees = 500)

# Train
learner$train(task)

# Predict on test data — create a test task to avoid backend storage issues
test_task <- TaskClassif$new(
  id = "iris_test",
  backend = test_data,
  target = "Species"
)
pred_mlr3 <- learner$predict(test_task)

# Evaluate
acc_mlr3 <- pred_mlr3$score(msr("classif.acc"))
acc_mlr3
classif.acc 
  0.9555556 
results <- rbind(results, data.frame(
  Framework = "mlr3",
  Model = "Random Forest (ranger)",
  Accuracy = as.numeric(acc_mlr3)
))
Tip: Key Strengths
  • R6 object-oriented design — everything is an object with methods
  • First-class benchmarking: compare multiple learners on multiple tasks with benchmark()
  • Composable pipelines via mlr3pipelines (stacking, ensembling, feature engineering)
  • Built-in resampling strategies and performance measures

4. h2o (AutoML)

h2o is a distributed machine learning platform with a powerful R interface. Its standout feature is h2o.automl(): automatic model selection, hyperparameter tuning, and stacked-ensemble creation with a single function call. It runs on a local JVM, so Java must be installed.

Note

This section requires Java (JDK 8+) to be installed. h2o starts a local JVM-based server. If you don’t have Java, skip to the results comparison — the other four frameworks cover the same ground without this dependency.

library(h2o)

# Start a local h2o cluster (uses available cores)
h2o.init(nthreads = -1, max_mem_size = "2G")
H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\RIDDHI~1\AppData\Local\Temp\Rtmp8G4u9C\filec0caad7dcc/h2o_Riddhiman_Roy_started_from_r.out
    C:\Users\RIDDHI~1\AppData\Local\Temp\Rtmp8G4u9C\filec0c1660783f/h2o_Riddhiman_Roy_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 276 milliseconds 
    H2O cluster timezone:       Asia/Kolkata 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    2 years, 3 months and 23 days 
    H2O cluster name:           H2O_started_from_R_Riddhiman_Roy_axb153 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.98 GB 
    H2O cluster total cores:    24 
    H2O cluster allowed cores:  24 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.5.3 (2026-03-11 ucrt) 
h2o.no_progress()  # Suppress progress bars

# Convert data to h2o frames
train_h2o <- as.h2o(train_data)
test_h2o  <- as.h2o(test_data)

# Run AutoML — automatic model selection and stacking
aml <- h2o.automl(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",
  training_frame = train_h2o,
  max_models = 10,
  seed = 42
)
22:11:38.3: AutoML: XGBoost is not available; skipping it.
22:11:39.171: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 105.0.
# Leaderboard — best models ranked by cross-validated performance
h2o.get_leaderboard(aml) |> as.data.frame() |> head(5)
                                                 model_id mean_per_class_error
1    DeepLearning_grid_1_AutoML_1_20260412_221137_model_1           0.03988095
2                          GBM_2_AutoML_1_20260412_221137           0.05029762
3                          GLM_1_AutoML_1_20260412_221137           0.05029762
4    StackedEnsemble_AllModels_1_AutoML_1_20260412_221137           0.05982143
5 StackedEnsemble_BestOfFamily_1_AutoML_1_20260412_221137           0.05982143
     logloss      rmse        mse
1 0.10262590 0.1802887 0.03250400
2 0.13688347 0.1981121 0.03924839
3 0.09073184 0.1736624 0.03015862
4 0.12933104 0.2002635 0.04010548
5 0.11828660 0.1921656 0.03692762
# Predict with the best model
preds_h2o <- h2o.predict(aml@leader, test_h2o)
acc_h2o <- mean(as.vector(preds_h2o$predict) == as.vector(test_h2o$Species))
cat("Accuracy:", acc_h2o, "\n")
Accuracy: 0.9777778 
results <- rbind(results, data.frame(
  Framework = "h2o",
  Model = paste0("AutoML (", aml@leader@algorithm, ")"),
  Accuracy = acc_h2o
))

# Shutdown h2o
h2o.shutdown(prompt = FALSE)
Tip: Key Strengths
  • h2o.automl() — fully automatic model selection, tuning, and stacked ensembles
  • Trains GBM, XGBoost, GLM, DRF, and deep learning models in one call
  • Distributed computing — scales to datasets larger than memory
  • Built-in leaderboard for model comparison
  • Production deployment via MOJO/POJO model export

5. qeML

qeML (Quick and Easy Machine Learning) takes a different approach: minimizing boilerplate. Every algorithm (random forest, gradient boosting, SVM, KNN, LASSO, neural nets, and more) is wrapped behind a one-line qe*() function with a consistent (data, targetName) signature. No formula objects, no matrix conversions, no separate fitting ceremony: just results. It's ideal for teaching, exploration, and quick model comparisons.

library(qeML)

# qeML convention: pass full data + target name (string)
# It handles train/test splitting internally via holdout
# But to match our split, we'll train on train_data and predict on test_data
# predict() expects new data WITHOUT the target column
test_features <- test_data[, -which(names(test_data) == "Species")]

# Random Forest (wraps randomForest)
rf_qe <- qeRF(train_data, "Species")
preds_rf_qe <- predict(rf_qe, test_features)
acc_rf_qe <- mean(preds_rf_qe$predClasses == test_data$Species)
cat("Random Forest accuracy:", acc_rf_qe, "\n")
Random Forest accuracy: 0.9777778 
# Gradient Boosting (wraps gbm)
gb_qe <- qeGBoost(train_data, "Species")
preds_gb_qe <- predict(gb_qe, test_features)
acc_gb_qe <- mean(preds_gb_qe$predClasses == test_data$Species)
cat("Gradient Boosting accuracy:", acc_gb_qe, "\n")
Gradient Boosting accuracy: 0.9555556 
# SVM (wraps e1071)
svm_qe <- qeSVM(train_data, "Species")
preds_svm_qe <- predict(svm_qe, test_features)
acc_svm_qe <- mean(preds_svm_qe$predClasses == test_data$Species)
cat("SVM accuracy:", acc_svm_qe, "\n")
SVM accuracy: 0.9555556 
# Use the best-performing qeML model for the results table
best_acc_qe <- max(acc_rf_qe, acc_gb_qe, acc_svm_qe)
best_model_qe <- c("Random Forest", "Gradient Boosting", "SVM")[
  which.max(c(acc_rf_qe, acc_gb_qe, acc_svm_qe))
]

results <- rbind(results, data.frame(
  Framework = "qeML",
  Model = paste0(best_model_qe, " (qe wrapper)"),
  Accuracy = best_acc_qe
))
Tip: Key Strengths
  • One-line model fitting: qeRF(data, "target") — no formula, no matrix, no recipe
  • 20+ algorithms behind a uniform qe*() interface (RF, GBM, SVM, KNN, LASSO, neural nets, and more)
  • qeCompare() lets you benchmark multiple methods in a single call
  • Built-in holdout evaluation
  • Lowest learning curve of any framework showcased here

Results Comparison

All five frameworks were trained on the same 70/30 split of the iris dataset. Here’s how they stack up:

library(knitr)

results <- results %>% arrange(desc(Accuracy))
kable(results, digits = 4, caption = "Test Set Accuracy by Framework")
Test Set Accuracy by Framework

| Framework | Model | Accuracy |
|---|---|---|
| tidymodels | Random Forest (ranger) | 0.9778 |
| h2o | AutoML (deeplearning) | 0.9778 |
| qeML | Random Forest (qe wrapper) | 0.9778 |
| caret | Random Forest (rf) | 0.9556 |
| mlr3 | Random Forest (ranger) | 0.9556 |

On a clean, small dataset like iris, accuracy differences are minimal. The real differentiator is the API and workflow each framework provides. On real-world datasets the choice of framework matters more for how you structure your code than for raw accuracy.

Closing thoughts

There is no single “best” ML framework in R and the right choice depends on the task at hand:

  • Start with tidymodels for a modern, composable, production-ready pipeline.
  • Try qeML for the fastest path from data to results.
  • Use h2o for automatic model selection and stacking with minimal effort.
  • Consider mlr3 for rigorous benchmarking and advanced pipeline composition.
  • Stick with caret if you are maintaining existing code or prefer its battle-tested simplicity.
To leave a comment for the author, please follow the link and comment on their blog: R'tichoke.

New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA
https://www.r-bloggers.com/2026/04/new-r-package-bdlnm-released-on-cran-bayesian-distributed-lag-non-linear-models-in-r-via-inla/ (Fri, 10 Apr 2026)
[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

CRAN, GitHub

TL;DR: {bdlnm} brings Bayesian Distributed Lag Non-Linear Models (B-DLNMs) to R using INLA, making it possible to model complex DLNMs, quantify uncertainty, and produce rich visualizations.

Background

Climate change is increasing exposure to extreme environmental conditions such as heatwaves and air pollution. However, these exposures rarely have immediate effects. For example:

    • A heatwave today may increase mortality several days later
    • Air pollution can have cumulative and delayed impacts

Distributed Lag Non-Linear Models (DLNMs) are the standard framework for studying these effects. They simultaneously model:

    • How risk changes with exposure level (exposure-response)
    • How risk evolves over time (lag-response)

When effects are non-linear, splines are typically used to define these two relationships. The two bases are then combined through a cross-basis function.
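For intuition, the distributed-lag part is just a weighted sum of current and past exposures, with the weights tracing out the lag-response curve. A toy base-R sketch (hand-picked weights rather than the spline basis that {dlnm} constructs):

```r
# Toy lag-response weights: effect peaks at lag 2 and decays to zero by lag 5
w <- c(0.1, 0.3, 0.4, 0.15, 0.05, 0)

set.seed(42)
x <- rnorm(30)  # daily exposure series, e.g. temperature anomalies

# embed() lines up x_t, x_{t-1}, ..., x_{t-5} as columns; the distributed-lag
# contribution on day t is the weighted sum across lags
lagged <- embed(x, length(w))
effect <- as.vector(lagged %*% w)

length(effect)  # 30 - 5 = 25 days have a full lag history
```

The cross-basis generalizes this by also letting the lag weights vary smoothly with the exposure level, which is what the tensor-product construction in crossbasis() provides.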

As datasets become larger and more complex (e.g., studies with different regions and longer time periods), classical approaches show limitations. Bayesian DLNMs extend this framework by:

    • Supporting more flexible model structures
    • Providing full posterior distributions
    • Enabling richer uncertainty quantification

The new {bdlnm} package extends the framework of the {dlnm} package to a Bayesian setting, using Integrated Nested Laplace Approximation (INLA), a fast alternative to MCMC for Bayesian inference.

Installing and loading the package

As of March 2026, the package is available on CRAN:

install.packages("bdlnm")
library(bdlnm)

INLA (stable version 23.4.24 or newer) must be installed beforehand. You can install the newest stable INLA version with:

install.packages(
  "INLA",
  repos = c(
    getOption("repos"),
    INLA = "https://inla.r-inla-download.org/R/stable"
  ),
  dep = TRUE
)

Now, let’s load all the libraries we will need for this short tutorial:

Load required libraries
# DLNMs and splines
library(dlnm)
library(splines)

# Data manipulation
library(dplyr)
library(reshape2)
library(stringr)
library(lubridate)

# Visualization
library(ggplot2)
library(gganimate)
library(ggnewscale)
library(patchwork)
library(scales)
library(plotly)

# Tables
library(gt)

# Execution time
library(tictoc)

Hands-on example

We use the built-in london dataset with daily temperature and mortality (age 75+) from 2000-2012.

Before fitting any model, it is useful to explore the raw data. This plot shows daily mean temperature and mortality for the 75+ age group in London from 2000 to 2012, providing a first look at the time series we are trying to model:

col_mort <- "#2f2f2f"
col_temp <- "#8e44ad"

# Scaling parameters
a <- (max(london$mort_75plus) - min(london$mort_75plus)) /
  (max(london$tmean) - min(london$tmean))
b <- min(london$mort_75plus) - min(london$tmean) * a

p <- ggplot(london, aes(x = yday(date))) +
  geom_line(
    aes(y = a * tmean + b, color = "Mean Temperature"),
    linewidth = 0.4
  ) +
  geom_line(
    aes(y = mort_75plus, color = "Daily Mortality (+75 years)"),
    linewidth = 0.4
  ) +
  facet_wrap(~year, ncol = 3) +
  scale_y_continuous(
    name = "Daily Mortality (+75 years)",
    breaks = seq(0, 225, by = 50),
    sec.axis = sec_axis(
      name = "Mean Temperature (°C)",
      transform = ~ (. - b) / a,
      breaks = seq(-10, 30, by = 10)
    )
  ) +
  scale_x_continuous(
    breaks = yday(as.Date(paste0(
      "2000-",
      c("01", "03", "05", "07", "09", "11"),
      "-01"
    ))),
    labels = c("Jan", "Mar", "May", "Jul", "Sep", "Nov"),
    expand = c(0.01, 0)
  ) +
  scale_color_manual(
    values = c(
      "Daily Mortality (+75 years)" = col_mort,
      "Mean Temperature" = col_temp
    )
  ) +
  labs(x = NULL, color = NULL) +
  guides(color = "none") +
  theme_minimal() +
  theme(
    axis.title.y.left = element_text(
      color = col_mort,
      face = "bold",
      margin = margin(r = 8)
    ),
    axis.title.y.right = element_text(
      color = col_temp,
      face = "bold",
      margin = margin(l = 8)
    ),
    axis.text.y.left = element_text(color = col_mort),
    axis.text.y.right = element_text(color = col_temp)
  ) +
  transition_reveal(as.numeric(date))

animate(p, nframes = 300, fps = 10, end_pause = 100)



Model overview

Conceptually, DLNMs model:

    • Exposure-response: how risk changes with exposure level

    • Lag-response: how risk unfolds over time

A typical model is:

Y_t ~ Poisson(μ_t)
log(μ_t) = α + cb(x_t, …, x_{t−L}) · β + Σ_k γ_k u_{kt}

where:

    • α is the intercept
    • cb(·) is the cross-basis function, defining both the exposure-response and lag-response relationships
    • β are the coefficients associated with the cross-basis terms
    • u_{kt} are time-varying covariates with corresponding coefficients γ_k

Model specification & setup

Before fitting the model, we have to define the spline-based exposure-response and lag-response functions using the {dlnm} package.

For our example, we will use specifications that are common in the temperature-mortality literature:

    • Exposure-response: natural spline with three knots placed at the 10th, 75th, and 90th percentiles of daily mean temperature

    • Lag-response: natural spline with three knots equally spaced on the log scale up to a maximum lag of 21 days

# Exposure-response and lag-response spline parameters
dlnm_var <- list(
  var_prc = c(10, 75, 90),
  var_fun = "ns",
  lag_fun = "ns",
  max_lag = 21,
  lagnk = 3
)

# Cross-basis parameters
argvar <- list(
  fun = dlnm_var$var_fun,
  knots = quantile(london$tmean, dlnm_var$var_prc / 100, na.rm = TRUE),
  Bound = range(london$tmean, na.rm = TRUE)
)

arglag <- list(
  fun = dlnm_var$lag_fun,
  knots = logknots(dlnm_var$max_lag, nk = dlnm_var$lagnk)
)

# Create crossbasis
cb <- crossbasis(london$tmean, lag = dlnm_var$max_lag, argvar, arglag)

As is commonly done in these scenarios, we will also control for seasonality in the mortality time series using a natural spline with 8 degrees of freedom per year, which flexibly captures long-term and seasonal trends in mortality:

seas <- ns(london$date, df = round(8 * length(london$date) / 365.25))

Finally, we also have to define the temperature values for which predictions will be generated:

temp <- round(seq(min(london$tmean), max(london$tmean), by = 0.1), 1)

Fit the model

We fit the previously defined Bayesian DLNM using the function bdlnm(), drawing 1000 samples from the posterior distribution and setting a seed for reproducibility:

tictoc::tic()
mod <- bdlnm(
  mort_75plus ~ cb + factor(dow) + seas,
  data = london,
  family = "poisson",
  sample.arg = list(n = 1000, seed = 5243)
)
tictoc::toc()
8.33 sec elapsed

Internally, bdlnm():

    • fits the model using INLA

    • returns posterior samples for all parameters

Minimum mortality temperature

We estimate the minimum mortality temperature (MMT), defined as the temperature at which the overall mortality risk is minimized. This optimal value will later be used to center the estimated relative risks.

tictoc::tic()
mmt <- optimal_exposure(mod, exp_at = temp)
tictoc::toc()
7.3 sec elapsed

Unlike the frequentist approach, the Bayesian framework directly provides the full posterior distribution of the MMT. It is useful to inspect this distribution to assess whether multiple candidate optimal exposure values exist and to verify that the median provides a reasonable centering value:

ggplot(as.data.frame(mmt$est), aes(x = mmt$est)) +
  geom_histogram(
    fill = "#A8C5DA",
    bins = length(unique(mmt$est)),
    alpha = 0.6,
    color = "white"
  ) +
  geom_density(
    aes(y = after_stat(density) * length(mmt$est) / length(unique(mmt$est))),
    color = "#2E5E7E",
    linewidth = 1.2,
    adjust = 2 # <-- key change: higher = smoother
  ) +
  geom_vline(
    xintercept = mmt$summary["0.5quant"],
    color = "#2E5E7E",
    linewidth = 1.1,
    linetype = "dashed"
  ) +
  scale_x_continuous(breaks = seq(min(mmt$est), max(mmt$est), by = 0.1)) +
  labs(x = "Temperature (°C)", y = "Frequency") +
  theme_minimal()

The posterior distribution of the MMT is concentrated around 18.9ºC and is unimodal, so the median is a stable centering value for the relative risk estimates.

The posterior distribution of the MMT can also be visualized directly using the package’s plot() method: plot(mmt).

Predict exposure-lag-response effects

We predict the exposure-lag-response association between temperature and mortality from the fitted model at the supplied temperature grid:

cen <- mmt$summary[["0.5quant"]]
tictoc::tic()
cpred <- bcrosspred(mod, exp_at = temp, cen = cen)
tictoc::toc()
6.83 sec elapsed

Centering at the MMT means that relative risks (RR) are interpreted relative to this optimal temperature with minimum mortality.
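The centering itself is simple to illustrate. In a toy sketch (not the package's internals), if f(x) is a fitted log-risk curve, centering at cen turns it into relative risks via RR(x) = exp(f(x) - f(cen)), so RR equals 1 exactly at the centering temperature:

```r
# Hypothetical log-risk curve with its minimum at 19 °C
f <- function(x) 0.002 * (x - 19)^2

cen <- 19                       # centering value (plays the role of the MMT)
rr <- function(x) exp(f(x) - f(cen))

rr(cen)   # exactly 1 at the centering temperature
rr(28)    # above 1 away from it
```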

Several visualizations can be produced from these predictions. While simpler plots can be created using the package’s plot() method, here we will build more polished ggplot2 versions:

3D exposure-lag-response surface

We can plot the full exposure-lag-response association using a 3-D surface:

matRRfit_median <- cpred$matRRfit.summary[,, "0.5quant"]
x <- rownames(matRRfit_median)
y <- colnames(matRRfit_median)
z <- t(matRRfit_median)

zmin <- min(z, na.rm = TRUE)
zmax <- max(z, na.rm = TRUE)
mid <- (1 - zmin) / (zmax - zmin)

plot_ly() |>
  add_surface(
    x = x,
    y = y,
    z = z,
    surfacecolor = z,
    cmin = zmin,
    cmax = zmax,
    colorscale = list(
      c(0, "#00696e"),
      c(mid * 0.5, "#80c8c8"),
      c(mid, "#f5f0e8"),
      c(mid + (1 - mid) * 0.5, "#c2714f"),
      c(1, "#6b1c1c")
    ),
    colorbar = list(title = "RR")
  ) |>
  add_surface(
    x = x,
    y = y,
    z = matrix(1, nrow = length(y), ncol = length(x)),
    colorscale = list(c(0, "black"), c(1, "black")),
    opacity = 0.4,
    showscale = FALSE
  ) |>
  layout(
    title = "Exposure-Lag-Response Surface",
    scene = list(
      xaxis = list(title = "Temperature (°C)"),
      yaxis = list(title = "Lag", tickvals = y, ticktext = gsub("lag", "", y)),
      zaxis = list(title = "RR"),
      camera = list(eye = list(x = 1.5, y = -1.8, z = 0.8))
    )
  )
Click the image to explore the interactive Plotly version

The surface reveals two distinct risk regions. Hot temperatures produce a sharp, acute risk concentrated at the first lags, peaking at lag 0 and dissipating rapidly thereafter. Cold temperatures produce a more modest and gradual increase over the first lags that does not fully disappear at longer lags. Intermediate temperatures near the MMT sit close to the RR = 1 reference plane across all lags.

The differential lag structure observed for heat- and cold-related mortality is consistent with known physiological mechanisms. Heat-related mortality tends to occur rapidly after exposure due to acute physiological stress, whereas cold-related mortality develops more gradually through delayed cardiovascular and respiratory effects, leading to increasing risk over longer lag periods.

Lag-response curves

We can also visualize slices of the previous surface. For example, the lag-response relationship for different temperature values:

matRRfit <- cbind(
  melt(cpred$matRRfit.summary[,, "0.5quant"], value.name = "RR"),
  RR_lci = melt(
    cpred$matRRfit.summary[,, "0.025quant"],
    value.name = "RR_lci"
  )$RR_lci,
  RR_uci = melt(
    cpred$matRRfit.summary[,, "0.975quant"],
    value.name = "RR_uci"
  )$RR_uci
) |>
  rename(temperature = Var1, lag = Var2) |>
  mutate(
    lag = as.numeric(gsub("lag", "", lag))
  )

temps <- cpred$exp_at

p <- ggplot() +
  # Lag-responses curves colored by temperature
  geom_line(
    data = matRRfit,
    aes(x = lag, y = RR, group = temperature, color = temperature),
    alpha = 0.35,
    linewidth = 0.35
  ) +
  scale_color_gradientn(
    colours = c(
      "#2166ac",
      "#4393c3",
      "#92c5de",
      "#d1e5f0",
      "#f7f7f7",
      "#fddbc7",
      "#f4a582",
      "#d6604d",
      "#b2182b"
    ),
    name = "Temperature"
  ) +
  # Start a new color scale for highlighted curves
  ggnewscale::new_scale_color() +
  # RR = 1 reference
  geom_hline(
    yintercept = 1,
    linetype = "dashed",
    color = "grey30",
    linewidth = 0.5
  ) +
  scale_x_continuous(breaks = cpred$lag_at) +
  scale_y_continuous(trans = "log10", breaks = pretty_breaks(6)) +
  labs(
    title = "Lag-response curves by temperature",
    x = "Lag (days)",
    y = "Relative Risk (RR)"
  ) +
  theme_minimal() +
  theme(legend.position = "top", panel.grid.minor.x = element_blank()) +
  transition_states(
    temperature,
    transition_length = 1,
    state_length = 0
  ) +
  shadow_mark(past = TRUE, future = FALSE, alpha = 0.6)

animate(p, nframes = 300, fps = 15, end_pause = 100)

Cold temperatures (blue) increase gradually over the initial lags and then decline slowly without fully disappearing at longer lags. Hot temperatures (red) show a different pattern: a higher risk immediately after lag 0, which drops rapidly and largely dissipates after the first lags:



Exposure-response curves

We can also plot the exposure-response curves by lag, together with the overall cumulative curve across the full lag period:

allRRfit <- data.frame(
  temperature = as.numeric(rownames(cpred$allRRfit.summary)),
  lag = "overall",
  RR = cpred$allRRfit.summary[, "0.5quant"],
  RR_lci = cpred$allRRfit.summary[, "0.025quant"],
  RR_uci = cpred$allRRfit.summary[, "0.975quant"]
)

RRfit <- rbind(matRRfit, allRRfit)

# Split data
RRfit_lags <- RRfit |>
  filter(!lag %in% c("overall")) |>
  mutate(lag = as.numeric(lag))
RRfit_overall <- RRfit |>
  filter(lag %in% c("overall"))

temps <- cpred$exp_at
t_cold <- temps[which.min(abs(temps - quantile(temps, 0.01)))]
t_hot <- temps[which.min(abs(temps - quantile(temps, 0.99)))]

# Top plot: exposure-response curves for each lag and overall
p_main <- ggplot() +
  # Background: all lags, fading from vivid (small) to pale (large)
  geom_line(
    data = RRfit_lags,
    aes(x = temperature, y = RR, group = lag, color = lag),
    linewidth = 0.8
  ) +
  scale_color_gradientn(
    colours = c(
      "black",
      "#2b1d2f",
      "#4a2f5e",
      "#6a4c93",
      "#8b6bb8",
      "#b39cdb",
      "#d8c9f1",
      "#f3eef5"
    ),
    values = scales::rescale(c(0, 0.5, 1, 2, 3, 4, 5, 10, 20))
  ) +
  new_scale_color() +
  new_scale_fill() +
  # Credible intervals
  geom_ribbon(
    data = RRfit_overall,
    aes(
      x = temperature,
      ymin = RR_lci,
      ymax = RR_uci,
      fill = "1"
    ),
    alpha = 0.2
  ) +
  # Highlighted curves
  geom_line(
    data = RRfit_overall,
    aes(x = temperature, y = RR, color = "1"),
    linewidth = 1.2
  ) +
  geom_hline(
    yintercept = 1,
    linetype = "dashed"
  ) +
  scale_color_manual(values = "#a6761d", labels = "Overall (CrI95%)") +
  scale_fill_manual(values = "#a6761d", labels = "Overall (CrI95%)") +
  scale_y_continuous(
    transform = "log10",
    breaks = sort(c(0.8, pretty_breaks(5)(c(0.8, 4))))
  ) +
  labs(
    x = NULL,
    y = "Relative Risk (RR)",
    color = NULL,
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.text.x = element_blank(),
    plot.margin = margin(8, 8, 0, 8)
  )

# Bottom plot: histogram with percentile lines
p_hist <- ggplot(london, aes(x = tmean)) +
  geom_histogram(
    aes(y = after_stat(density), fill = after_stat(x)),
    binwidth = 0.5,
    color = "black",
    linewidth = 0.2
  ) +
  geom_vline(
    xintercept = t_cold,
    linetype = "dashed",
    color = "#053061",
    linewidth = 0.6
  ) +
  geom_vline(
    xintercept = t_hot,
    linetype = "dashed",
    color = "#67001f",
    linewidth = 0.6
  ) +
  geom_vline(
    xintercept = cen,
    linetype = "dashed",
    color = "grey20",
    linewidth = 0.6
  ) +
  annotate(
    "text",
    x = t_cold,
    y = Inf,
    label = "1st pct",
    vjust = 1.4,
    hjust = 1.1,
    size = 3.2,
    color = "#053061"
  ) +
  annotate(
    "text",
    x = t_hot,
    y = Inf,
    label = "99th pct",
    vjust = 1.4,
    hjust = -0.1,
    size = 3.2,
    color = "#67001f"
  ) +
  annotate(
    "text",
    x = cen,
    y = Inf,
    label = "MMT",
    vjust = 1.4,
    hjust = -0.1,
    size = 3.2,
    color = "grey20"
  ) +
  scale_x_continuous(limits = range(cpred$exp_at)) +
  scale_fill_gradientn(
    colours = c(
      "#053061",
      "#2166ac",
      "#4393c3",
      "#92c5de",
      "#d1e5f0",
      "#f7f7f7",
      "#fddbc7",
      "#f4a582",
      "#d6604d",
      "#b2182b",
      "#67001f"
    ),
    name = "Temperature"
  ) +
  labs(x = "Temperature (°C)", y = "Density") +
  theme_minimal() +
  theme(
    plot.margin = margin(20, 8, 8, 8),
    legend.position = "bottom"
  )

# Combine them:
p_main / p_hist + plot_layout(heights = c(3, 1))


The overall cumulative curve (mustard) is clearly asymmetric: risk increases on both sides of the MMT, but the rise is much steeper for hot temperatures than for cold temperatures. The lag-0 curve (black), which reflects the immediate effect, behaves differently for cold than heat: it is below 1 at cold temperatures (reflecting the delayed nature of cold temperature effects) and increases approximately linearly for heat. The histogram confirms that most London days fall between 5°C and 20°C, so extreme temperatures, despite their high individual risks, are relatively rare events.

Attributable risk

We can also calculate attributable numbers and fractions from a B-DLNM, which allows us to quantify the impact of all observed exposures on mortality in the 75+ population. We compute the number of mortality events attributable to the temperature exposures (the attributable number) and the fraction of all mortality events this constitutes (the attributable fraction).

Two different perspectives can be used:

    • Backward (dir = "back"): how many of today’s deaths are explained by past temperature exposures?

    • Forward (dir = "forw"): how many future deaths will today’s temperature exposure cause?

Let’s use the forward perspective, which is the more commonly used:

tictoc::tic()
attr_forw <- attributable(
  mod,
  london,
  name_date = "date",
  name_exposure = "tmean",
  name_cases = "mort_75plus",
  cen = cen,
  dir = "forw"
)
tictoc::toc()
110.12 sec elapsed
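The forward logic can be sketched with a toy calculation, following the attributable-risk formulation of Gasparrini & Leone (2014); this is an illustration with made-up numbers, not the package's implementation. The lag-specific log-RRs that one day's exposure contributes over the lag window are summed, and the attributable fraction is one minus the exponential of the negative sum:

```r
# Hypothetical lag-specific log-RRs contributed by one day's exposure
log_rr_lag <- c(0.05, 0.03, 0.01)

# Forward attributable fraction for that exposure
af <- 1 - exp(-sum(log_rr_lag))

# Hypothetical count of deaths over which this exposure acts
deaths <- 40
an <- af * deaths   # attributable number
```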

Attributable fraction evolution

We can plot the time series of daily attributable fractions (AF):

col_af <- "black"

temp_colours <- c(
  "#053061",
  "#2166ac",
  "#4393c3",
  "#92c5de",
  "#d1e5f0",
  "#f7f7f7",
  "#fddbc7",
  "#f4a582",
  "#d6604d",
  "#b2182b",
  "#67001f"
)

af_med <- attr_forw$af.summary[, "0.5quant"]

# Pre-compute range once
af_min <- min(af_med, na.rm = TRUE) - 0.01
af_max <- max(af_med, na.rm = TRUE) + 0.01

df <- data.frame(
  date = london$date,
  x = yday(london$date),
  year = year(london$date),
  tmean = london$tmean,
  af = af_med
)

ggplot(df, aes(x = x)) +
  # Full-height temperature background per day
  geom_rect(
    aes(
      xmin = x - 0.5,
      xmax = x + 0.5,
      ymin = af_min,
      ymax = af_max,
      fill = tmean
    )
  ) +
  scale_fill_gradientn(
    colours = temp_colours,
    name = "Temperature (°C)"
  ) +
  # AF line on top
  geom_line(
    aes(y = af),
    color = col_af,
    linewidth = 0.7
  ) +
  scale_y_continuous(
    name = "Attributable Fraction (AF)",
    breaks = seq(0, 1, by = 0.1),
    limits = c(af_min, af_max),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = yday(as.Date(paste0(
      "2000-",
      c("01", "03", "05", "07", "09", "11"),
      "-01"
    ))),
    labels = c("Jan", "Mar", "May", "Jul", "Sep", "Nov"),
    expand = c(0, 0)
  ) +
  facet_wrap(~year, ncol = 3, axes = "all_x") +
  labs(x = NULL) +
  theme_minimal(base_size = 11) +
  theme(
    panel.spacing.x = unit(0, "pt"),
    strip.text = element_text(face = "bold", size = 10),
    legend.position = "top",
    legend.key.width = unit(2.5, "cm")
  )



Sharp spikes in AF exceeding 60% are visible in summer 2003 and 2006, coinciding with the major European heatwaves. In general, summer episodes produce higher and more abrupt peaks in AF, whereas cold winter days are associated with more sustained elevations over time, though less pronounced in magnitude.

Total attributable burden

Summing across the full study period, the table quantifies the total mortality burden attributable to non-optimal temperature exposures in the 75+ population:

rbind(
  "Attributable fraction" = attr_forw$aftotal.summary,
  "Attributable number" = attr_forw$antotal.summary
) |>
  as.data.frame() |>
  round(3) |>
  gt(rownames_to_stub = TRUE)
                           mean        sd  0.025quant   0.5quant  0.975quant       mode
Attributable fraction     0.174     0.018       0.139      0.175       0.207      0.176
Attributable number   68857.597  7131.526   55071.066  69178.391   81995.459  69842.155
Over the full 2000-2012 period, approximately 17.5% (95% CrI: 13.9%-20.7%) of all deaths in the London 75+ population were attributable to non-optimal temperatures, corresponding to roughly 69,178 deaths (95% CrI: 55,071-81,996).
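As a rough consistency check (using the rounded posterior medians from the table, so the result is only approximate, and the ratio of medians is not exactly the median of the ratio), dividing the attributable number by the attributable fraction recovers the implied total number of 75+ deaths over the period:

```r
an_med <- 69178.391   # posterior median attributable number (from the table)
af_med <- 0.175       # posterior median attributable fraction (rounded)

round(an_med / af_med)   # implied total deaths, roughly 395,000
```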

Conclusions

The {bdlnm} package provides a powerful and accessible implementation of Bayesian Distributed Lag Non-Linear Models in R. By combining the flexibility of DLNMs with full Bayesian inference via INLA, it enables researchers to better quantify uncertainty and fit complex exposure-lag-response relationships. This makes it a valuable tool for studying the health impacts of climate change and other environmental risks in increasingly data-rich settings.

This framework is not limited to environmental epidemiology. In fact, it can be applied to any setting involving time-varying exposures and delayed effects (e.g., market shocks may affect asset prices over several days), making it a powerful and general tool for time series analysis.

Development is ongoing. Upcoming features include:

    • Multi-location analyses: pooling exposure-lag-response curves across different cities or regions within a single model
    • Spatial B-DLNMs (SB-DLNM): explicitly modelling spatial heterogeneity in the exposure-lag-response curves of different regions

The package is on CRAN. Bug reports and contributions are welcome via GitHub.

References

    • Gasparrini A. (2011). Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software, 43(8), 1-20. doi:10.18637/jss.v043.i08.

    • Quijal-Zamorano M., Martinez-Beneito M.A., Ballester J., Marí-Dell’Olmo M. (2024). Spatial Bayesian distributed lag non-linear models (SB-DLNM) for small-area exposure-lag-response epidemiological modelling. International Journal of Epidemiology, 53(3), dyae061. doi:10.1093/ije/dyae061.

    • Rue H., Martino S., Chopin N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2), 319-392. doi:10.1111/j.1467-9868.2008.00700.x.

    • Gasparrini A., Leone M. (2014). Attributable risk from distributed lag models. BMC Medical Research Methodology, 14, 55. doi:10.1186/1471-2288-14-55.


New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA was first posted on April 10, 2026 at 5:14 pm.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA]]>
TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics https://www.r-bloggers.com/2026/04/theseusplot-0-2-0-visualizing-decomposition-of-differences-in-rate-metrics/ Fri, 10 Apr 2026 00:00:00 +0000 https://hoxo-m.github.io/blog/posts/TheseusPlot-0-2-0/


Continue reading: TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics]]>
[This article was first published on HOXO-M Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

TheseusPlot is an R package that decomposes differences in rate metrics between two groups into contributions from individual subgroups and visualizes the results as a “Theseus Plot”.

The package is inspired by the Ship of Theseus thought experiment. It replaces subgroup data step by step, recalculates the overall metric at each step, and interprets each change as that subgroup’s contribution to the overall difference.

Suppose you notice that the click-through rate is lower in 2025 than in 2024 and want to examine how a particular attribute, such as gender, contributed to the change. If you obtain a Theseus Plot like the one below, it suggests that men contributed more to the decline in click-through rate than women.
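The replacement idea can be sketched with a toy CTR example (illustrative numbers and base R only, not TheseusPlot's API): swapping one subgroup's data for the other period's and recomputing the overall rate gives that subgroup's contribution to the change:

```r
# 2024 vs 2025 clicks and views by subgroup (made-up numbers)
clicks24 <- c(men = 300, women = 250); views24 <- c(men = 1000, women = 1000)
clicks25 <- c(men = 200, women = 240); views25 <- c(men = 1000, women = 1000)

ctr <- function(clicks, views) sum(clicks) / sum(views)

base  <- ctr(clicks24, views24)                       # overall 2024 CTR
step1 <- ctr(c(clicks25["men"], clicks24["women"]),   # swap in men's 2025 data
             c(views25["men"], views24["women"]))

contrib_men <- step1 - base   # men's contribution to the change
```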

What’s new in 0.2.0

Version 0.2.0 includes the following changes:

  • a fix for continuous-variable discretization with split = "rate", where bin boundaries for the second group could previously be computed from the first group’s data
  • a fix for the size bar of "Sum of ... other attributes", which could incorrectly use the first group’s counts for both groups
  • a fix for warnings in plot() and plot_flip() when multiple subgroups were tied for the largest absolute contribution
  • suppression of warnings generated internally by waterfalls::waterfall() during plot creation.

I would like to thank Kazuyuki Sano for reporting the first two issues and contributing to their fixes.

Installation

You can install TheseusPlot from CRAN with:

install.packages("TheseusPlot")

Try it out

TheseusPlot may be useful when you want to understand why metrics such as conversion rate, retention rate, or click-through rate changed.

For details on how to use it, please see the package website: https://hoxo-m.github.io/TheseusPlot/.

To leave a comment for the author, please follow the link and comment on their blog: HOXO-M Blog.

Continue reading: TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics]]>
Using R to Teach R: Lessons for Software Development https://www.r-bloggers.com/2026/04/using-r-to-teach-r-lessons-for-software-development/ Thu, 09 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/teaching-r-packages-reporting-gitlab/


Continue reading: Using R to Teach R: Lessons for Software Development]]>
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As we approach the decennial (10-year) anniversary since Jumping Rivers was founded in 2016, it’s a good time to reflect on what we have achieved in that time and share some lessons learned.

If you have read our blogs previously then you will be aware that Jumping Rivers is a consultancy and training provider in all things data science. But did you know that we offer over 50 different courses spanning R, Python, Git, SQL and more?

In this blog we will provide a glimpse into our internal process and share how we have streamlined the task of maintaining so many courses. Along the way we will share some good practices applicable to any big coding project, including packaging of source code and automated CI/CD.

The challenge

Let’s start by laying out the key challenges which face us.

1. Multilingual support

Our course catalogue consists of over 50 courses. The majority of these are either based on R or Python or both:

  • 50% R
  • 30% Python
  • 5% R and Python
  • 15% other (Git, SQL, Tableau, Posit and more)

At the very least, any solution that we come up with for standardising our courses must be compatible with both R and Python. Ideally it should also support some less commonly taught languages, including SQL and Git.

2. Maintenance

The world of R and Python is constantly changing. The languages themselves receive frequent updates, as do publicly available R packages on CRAN and Python packages on PyPI.

This has the consequence that code which worked one year ago (or even one day) may no longer be functional with the latest package versions. We will need some way to track this and ensure that the code examples covered in our courses remain relevant and error-free.
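One lightweight way to help with this (a simple sketch; fuller solutions such as {renv} lockfiles exist) is to snapshot the versions the materials were last tested against:

```r
# Record the installed version of each package a course depends on
# (base packages used here so the example is self-contained)
pkgs <- c("stats", "utils")
vers <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
data.frame(package = pkgs, version = vers, row.names = NULL)
```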

3. Demand

We deliver over 100 courses per year. For a relatively small team of data scientists, this can be a lot to juggle!

In an ideal world, the process of building the course materials, setting up the cloud environment for training, and managing all of the administration that goes along with this should be automated. That way, the trainer can focus on providing the highest quality experience for the attendees without having to worry about things going wrong on the day.

The solution

Our team is used to setting up data science workflows for clients, including automated reporting and migration of source code into packages. We have therefore applied these techniques in our internal processes, including training.

Automated reporting

Imagine you maintain a document that has to be updated on a regular basis, such as a monthly presentation showing the latest company revenues. Does this scenario sound familiar?

We could regenerate the plots and data tables and manually copy and paste them into the report document. Better still, we can take advantage of free-to-use automated reporting frameworks such as R Markdown and Quarto.

R Markdown and Quarto both work as follows:

  • We provide a “YAML header” at the top of the report document with configuration and formatting options:

    ---
    title: "Introduction to Python"
    authors:
    - "Myles Mitchell"
    date: "2026-04-02"
    output: pdf
    ---
    
  • The report body is formatted as Markdown and supports a mixture of plain text and code:

    ## Introduction
    At its most basic, Python is essentially a calculator.
    We can run basic calculations as follows:
    ```{python}
    2 + 1
    ```
    We can also assign the output of a calculation to a
    variable so that it can be reused later:
    ```{python}
    x = 2 + 1
    print(x)
    ```
    

Notice that we have included chunks of Python code. By making use of chunk options we can configure code chunks to be executed when rendering the report. Any outputs from the code (plots, tables, summary statistics) can then be displayed.
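For example, in Quarto, chunk execution options are written as `#|` comment lines at the top of a chunk; `eval` controls whether the chunk runs when rendering and `echo` whether its source is shown (a representative subset of the available options):

    ```{python}
    #| eval: true
    #| echo: false
    x = 2 + 1
    print(x)
    ```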

By migrating the code logic into the report itself, we can update our report assets at the click of a button whenever the data changes.

We have taken inspiration from this approach with our course notes and presentation slides. This forces us to be rigorous with the code examples. Any runtime errors that are produced by faulty or outdated code would be visible in the course notes and by extension to the attendees of our courses.

Crucially for us, R Markdown and Quarto are both compatible with R and Python. They also support syntax highlighting for languages like Git and SQL, as well as a variety of output formats including HTML and PDF.

Flow chart illustrating the automated reporting workflow with Quarto. Starting with a text-based .qmd file, this is converted into a Markdown format using Jupyter or knitr. Pandoc is then used to convert this into a variety of output formats including HTML, PDF and Word.

Internal R packages

So we have settled on a solution for building our course notes. But we have 50 different courses, and setting these up from scratch each time is going to get tedious!

A good practice in any coding project is to avoid duplication as much as possible. Instead of copying and pasting code, we should really be migrating code into functions which are self contained, reusable and easy to test. This will mean fewer places to debug when things inevitably go wrong.
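As a toy illustration (the helper name is hypothetical), logic that would otherwise be pasted into several course repositories becomes a single, testable function:

```r
# Reusable helper instead of copy-pasted summary code
summarise_feedback <- function(scores) {
  stopifnot(is.numeric(scores))
  c(mean = mean(scores), median = median(scores), n = length(scores))
}

summarise_feedback(c(4, 5, 3, 5))
```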

Following a similar philosophy for our training infrastructure, we have migrated any reusable assets for our courses—including logos, template files and styling—into a collection of internal R packages.

When building a new course, the developer can now focus on the aspects that are unique to that course:

  • Code examples
  • Notes
  • Exercises
  • Presentation slides

Everything else is taken care of automatically:

  • The appearance of the course notes and presentation slides.
  • Build routines including converting the R Markdown / Quarto text files into HTML.

In addition to course templates, we also have internal packages for managing the administrative side of training, including:

  • Calculating pricing quotes for clients.
  • Generating post-course certificates.
  • Spinning up a bespoke Posit Workbench environment for the course.
  • Summarising attendee feedback.

And the list goes on!

GitLab CI/CD

With automated reporting and packaging of source code, we have created standardised routines that can be applied to any of our courses.

This does not change the fact that we have over 50 courses to maintain. We still need a way of testing our courses and tracking issues. This is where CI/CD (Continuous Integration / Continuous Delivery and Deployment) comes in.

CI/CD defines a framework for software development, including:

  • Automated unit testing.
  • Branching of source code and code review.
  • Versioning and deployment of software.

If you maintain software then you have likely come across version control with Git. Cloud platforms like GitLab and GitHub provide tools for collaborative code development. Not only do they provide a cloud backup of your source code, they also provide the following features:

  • CI/CD tools for automated testing, build and deployment.
  • Branch rules for enforcing good practices like code review and unit testing.
  • Versioning and tagging of source code.

Each of our courses is maintained in its own GitLab repository. The CI/CD pipelines for our courses are defined in a separate repository along with the internal R packages mentioned above.

Flow chart illustrating how we have standardised our GitLab training repositories. The templates are defined in a central repository and pushed downstream to our course repositories.

When setting up a new course, the course repository will be automatically populated with the template CI/CD rules. All courses are therefore subject to the same stringent checks, including:

  • Ensuring that the course notes build without errors.
  • Enforcing code review of any course updates before these are merged into the main branch.
  • Building and storing the artifacts (the rendered HTML notes and coding scripts) for the latest version of the course.

These checks are triggered by any updates to a course. We also schedule monthly CI/CD pipelines for all courses, with any issues immediately flagged to our trainers.
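A scheduled check of this kind might look roughly as follows in a `.gitlab-ci.yml`. This is a hedged sketch: the job name, container image, and render command are illustrative, not our actual configuration.

```yaml
build-notes:
  image: rocker/verse            # container with R, Quarto and LaTeX
  script:
    - quarto render .            # fail the pipeline if any chunk errors
  artifacts:
    paths:
      - _site/                   # rendered HTML notes
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"             # monthly check
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"  # on course updates
```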

We have also taken advantage of GitLab’s folder-like structure for organising code repositories. Within the Jumping Rivers project on GitLab, we have a subproject called “training”. All of our course-related repositories are located “downstream” from this project. This means that any settings or environment variables defined at the “training” level are automatically applied to all of our courses.

In summary

The take-home lessons from this blog are applicable to any big coding project:

  • Avoid duplication: migrate any reusable logic or assets into standalone packages.
  • Utilise CI/CD workflows using GitLab, GitHub or similar.
  • Focus on what matters by automating as much of the process as possible.

Our training infrastructure has taken 10 years to build and is still constantly evolving; we have not even covered the full process in this blog! For a deeper dive, check out this talk by Myles at SatRdays London 2024.

For more on automated reporting, check out:

For more on packaging of source code, check out:

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Continue reading: Using R to Teach R: Lessons for Software Development]]>
Hold On Hope: publication lag times at cell biology journals https://www.r-bloggers.com/2026/04/hold-on-hope-publication-lag-times-at-cell-biology-journals/ Thu, 09 Apr 2026 11:22:53 +0000 https://quantixed.org/?p=3723
Continue reading: Hold On Hope: publication lag times at cell biology journals]]>
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve posted about publication lag times previously. The “lag” refers to the time between submitting a paper and its appearing in a journal.

Publication lag times are still a frustration for researchers. Although preprints circumvent the delay in sharing science with others, publication is still king when it comes to evaluation. Contracts are short and publication delays can be long…

I recently saw a post comparing median publication lag times for genetics journals. This motivated me to update my code and rerun the analysis for cell biology journals to see what, if anything, has changed.

Methodology

I wrote an R package, PubMedLagR, which uses {rentrez} to retrieve the publication data from PubMed. Once we have these data, it is a matter of producing some plots, which I have covered previously. To ensure that we are only looking at research papers, we use "journal article"[pt] as a search term, and also remove anything else (Reviews, Commentaries, etc.) from the data. Feel free to use it to look at other journals.
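As a rough sketch of the kind of {rentrez} query that sits underneath this (the journal, year, and exact search term here are illustrative, not PubMedLagR's internals):

```r
# Illustrative {rentrez} query counting research articles for one
# journal-year; the real retrieval logic lives inside PubMedLagR
library(rentrez)

res <- entrez_search(
  db     = "pubmed",
  term   = '"J Cell Biol"[ta] AND 2025[pdat] AND "journal article"[pt]',
  retmax = 0  # we only want the hit count here, not the records
)
res$count
```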

Caveats

Before we get started, there are some caveats. The analysis is only as good as the PubMed data. Not all journals submit their date information to PubMed, while for others it is incomplete (as we’ll see below). There are inaccuracies for sure. I found a paper that was supposedly submitted on 1970-01-01, more than 40 years before the journal started. Also, it’s well known that some journals “restart the clock” on a paper by rejecting it and allowing resubmission, then only counting the resubmitted version. So, comparison between journals is a little tricky, but it allows us to look at trends.

Let’s dive in

We can use the following code to grab the data we are interested in.

library(PubMedLagR)

# Journals and years to analyse
jrnl_list <- c("J Cell Sci", "Mol Biol Cell", "J Cell Biol", "Nat Cell Biol",
               "EMBO J", "Biochem J", "Dev Cell", "FASEB J", "J Biol Chem",
               "Cells", "Front Cell Dev Biol", "Nature Communications",
               "Cell Reports", "Mol Cell", "Autophagy", "Cell Death Differ",
               "Cell Death Dis", "Cell Res", "Sci Adv", "Cell")
yrs <- 2006:2026

# Fetch PubMed records in batches of 250, then parse the XML into a data frame
retrieve_journal_year_records(jrnl_list, yrs, batch_size = 250)
pprs <- pubmed_xmls_to_df()

The list of journals is somewhat arbitrary. I have included Nature Communications and Science Advances although they carry many other papers besides cell biology. I included Cell (even though there’s not much cell biology in there these days) and left out Nature and Science.

How many papers are in each of these journals and how has that changed over time?

There’s a huge increase in the number of papers published by Nature Communications, Science Advances and Cell Reports. Nature Communications is a behemoth, publishing ~12,500 papers in 2025.

There was a boom and bust in publications at Front Cell Dev Biol and Cells, which could be due to reputational problems (like this). Other journals have declined, most noticeably J Biol Chem, but others have taken a hit too. The reasons behind these dynamics are discussed elsewhere.

Median publication lag time

We’ll use received-to-published as our measure of publication lag time. This is the time from submission to it appearing in the journal “in print”. It’s measured here in days.
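In R terms, the lag for a single paper is simply the difference between two Date objects (the dates here are invented for illustration):

```r
# Received-to-published lag in days for one (invented) paper
received  <- as.Date("2024-03-01")
published <- as.Date("2025-01-15")
lag_days  <- as.numeric(published - received)
lag_days  # 320
```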

For the last 5 years, the median lag time at Nature Cell Biology has been over one year. Other outlets are close to this, e.g. Cell and Dev Cell, whereas others linger around 200 days or lower, as with J Cell Sci and FASEB J. The journal Autophagy is missing here because there are no data for it. Others, like Sci Adv and MBoC, have only minimal data available. The shortest lag times were for Cells (which might not surprise some people).

The lion’s share of this lag time is the time from submission to acceptance (received-accepted), with a small contribution from the time taken to formally publish the article (accepted-published).

Journal              Year  Accepted-Published  Received-Published  Received-Accepted
Biochem J            2025                   6                 125                116
Cell                 2025                  29               308.5              275.5
Cell Death Differ    2025                  13                 235                221
Cell Death Dis       2025                  18                 218                193
Cell Rep             2025                  23                 233                209
Cell Res             2025                  34               208.5                172
Cells                2025                  16                  57                 41
Dev Cell             2025                  27                 322                296
EMBO J               2025                  27                 222                196
FASEB J              2025                  13                 131                117
Front Cell Dev Biol  2025                  35                 111                 71
J Biol Chem          2025                   9                 128                117
J Cell Biol          2025                  35                 259                218
J Cell Sci           2025                  13               168.5                155
Mol Cell             2025                  27                 258                231
Nat Cell Biol        2025                  48               381.5                327
Nat Commun           2025                  20                 266                241

This is using the median time. Obviously, some papers whizz straight in, whereas others… don’t. Let’s have a look.

You can see many examples of papers that have lag times of 1, 2 or 3 years. I scaled all the plots to a maximum lag time of 3 years. There were several examples of papers with lag times of up to 7 years that looked to be genuine, but they distorted the view.

So what trends can we see?

The lag times are creeping up at some journals, but not at others. It’s possible to see some very short lag times (in amongst the longer ones) recently at Dev Cell, Mol Cell and EMBO J. I assume these are transfers into the journal, leading to rapid publication. Cell also has many of those, though, so perhaps there are other explanations.

To click through in more detail, I made some graphics for each journal, which use ridgelines to show the profile of lag times for each year.
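A minimal version of such a ridgeline plot might look like this with {ggridges}; the data frame and column names (`pprs`, `lag_days`, `year`) are assumed here, not taken from the original code:

```r
library(ggplot2)
library(ggridges)

# One ridge per year, showing the distribution of lag times
ggplot(pprs, aes(x = lag_days, y = factor(year))) +
  geom_density_ridges(alpha = 0.7) +
  labs(x = "Received-to-published lag (days)", y = "Year")
```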

And finally

Incidentally, nine of the top ten papers with the longest lag times were published in Nature Communications. The longest was this one (3263 days). Almost nine years! All papers have their battle stories and I’m sure this one has a tale to tell.

The post title comes from “Hold On Hope” by Guided by Voices.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Hold On Hope: publication lag times at cell biology journals]]>
400426
Developer Engagement and Bioconductor https://www.r-bloggers.com/2026/04/developer-engagement-and-bioconductor/ Thu, 09 Apr 2026 00:00:00 +0000 https://blog.bioconductor.org/posts/2026-04-09-developer-engagement/ Introduction
During the Chan Zuckerberg Institute’s Essential Open Source Software for Science cycle 6 funding round, the Bioconductor Community Manager, Maria Doyle, secured a grant to fund a developer engagement position for Bioconductor, and...

Continue reading: Developer Engagement and Bioconductor]]>
[This article was first published on Bioconductor community blog, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

During the Chan Zuckerberg Initiative’s Essential Open Source Software for Science cycle 6 funding round, the Bioconductor Community Manager, Maria Doyle, secured a grant to fund a developer engagement position for Bioconductor, and I was fortunate enough to be offered that role. I am Nick Cooley, and I’m excited to see what this role can bring to Bioconductor. My background is relatively diverse: I received my PhD in organic chemistry from the University of Missouri, and I worked on prokaryotic genomics and functional genomics at the University of Pittsburgh from 2017 to 2025.

Role responsibilities

The mandate of this role is somewhat broad. Bioconductor, and academic computing generally, face a myriad of distinct and interrelated challenges as hardware, computing paradigms, and education environments change rapidly. Improving developer resources for tackling new and existing challenges, modernizing Bioconductor developer onboarding materials (particularly for early career researchers), and improving recognition mechanisms for community members who volunteer time and effort to the Bioconductor project are all general themes within the role’s scope.

Some Specific Efforts

A few of the specific efforts I’ll be working on in this role include:

Developer Forum

The Developer Forum had previously been run on a volunteer basis, and served as a community resource for discussing technical and infrastructure issues, concerns, and opportunities. The creation of the Developer Engagement Lead allowed us to include the Forum as a direct responsibility of this role.

Developer Champions Program

Bioconductor working groups have been a pillar of Bioconductor for a while, and represent a considerable amount of volunteer work towards the project. Improving the visibility of the working groups themselves, and the recognition that project contributors receive for their participation in them, can go a long way towards ensuring that this work is valued by contributors’ home institutions and funding mechanisms. The Champions Program aims to create a clear recognition mechanism for those volunteer efforts.

Bioconductor hackathon events

Community and collaboration are irreplaceable engines of strong research. Many Bioconductor contributors find community and collaboration within their own disciplines or institutions. Providing an avenue for collaborative and technical events within Bioconductor can fill persistent gaps in the research tooling present in the project, and create networking opportunities for early career researchers. Part of this role is planning and running these events.

Bioconductor documentation and LLMs

The ways that researchers search for information, tools, and workflow examples are changing with the rise of large language models and their interfaces. There are opportunities for improving how bioinformaticians, especially those outside of the Bioconductor community, find and familiarize themselves with research solutions within the Bioconductor project, including through improvements to website search and documentation discoverability. A long-term goal of this role is to work on documentation templates and checking tools to improve their searchability by LLMs, and to explore the feasibility of Bioconductor-sanctioned and -managed LLMs.

How to get in touch

For developer discussions and ideas, the Bioconductor Zulip is the best place to connect.

© 2026 Bioconductor. Content is published under Creative Commons CC-BY-4.0 License for the text and BSD 3-Clause License for any code. | R-Bloggers

To leave a comment for the author, please follow the link and comment on their blog: Bioconductor community blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Developer Engagement and Bioconductor]]>
400428
Collaborating between Bioconductor and R-universe on Development of Common Infrastructure https://www.r-bloggers.com/2026/04/collaborating-between-bioconductor-and-r-universe-on-development-of-common-infrastructure-2/ Wed, 08 Apr 2026 00:00:00 +0000 https://blog.bioconductor.org/posts/2026-04-08-r-universe-collaboration/

This article is cross-posted on rOpenSci and R-Consortium blogs.
For more than two decades, the Bioconductor project has been a cornerstone of the R ecosystem, providing high-quality, peer-reviewed tools for bioinformatics and computational biol...

Continue reading: Collaborating between Bioconductor and R-universe on Development of Common Infrastructure]]>
[This article was first published on Bioconductor community blog, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article is cross-posted on rOpenSci and R-Consortium blogs.

For more than two decades, the Bioconductor project has been a cornerstone of the R ecosystem, providing high-quality, peer-reviewed tools for bioinformatics and computational biology. Its curated repository model, rigorous review standards, and tightly coordinated release process have helped establish Bioconductor as one of the most trusted distribution channels in scientific computing.

However, the infrastructure that supports such a long-standing and large-scale project inevitably accumulates technical debt. Legacy build systems, bespoke tooling, and historically grown workflows add up to costly and unsustainable maintenance work. For this reason, Bioconductor is collaborating with R-universe to gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes. In turn, Bioconductor is helping R-universe expand and refine its features as we learn to serve the complex needs of the Bioconductor community.

This collaboration reflects a core principle of R-universe as an R Consortium Infrastructure Steering Committee (ISC) top-level project: supporting reviewed package repositories such as rOpenSci and Bioconductor, and providing modern, open, and reusable infrastructure that strengthens the broader R ecosystem.

A Shared Mission: Tooling for Managed Repositories

R-universe was designed as a next-generation package distribution and build system for R. It provides:

  • Continuous building and checking of R packages across platforms
  • Binary packages for Windows, macOS, Linux, and WebAssembly
  • Transparent and reproducible build environments managed via GitHub Actions
  • Dashboards and metadata APIs for monitoring ecosystem health and activity
  • CRAN-like package repositories with discoverable metrics and documentation

From the outset, a key objective has been to support curated and reviewed communities — such as rOpenSci and Bioconductor — by offering modern infrastructure without requiring them to redesign their governance model or review processes.

For Bioconductor, this means incrementally introducing piece-wise functionality, with consideration for established release cycles and quality control mechanisms:

  1. Setting up independent build and dashboard tooling, replicating processes from the current Bioconductor build systems on R-universe infrastructure
  2. Mirroring Windows and macOS binaries produced on R-universe to Bioconductor
  3. Exploring further integration of results and metadata produced by R-universe for Bioconductor health/activity monitoring and aiding the curation processes
  4. Potential future steps toward deeper automation and harmonization

By taking small gradual steps towards adopting R-universe components, everyone gets the opportunity to experiment with new tooling and evaluate where adjustments may be needed in order to minimize disruption to existing practices.

An important milestone in this venture is that Bioconductor now uses R-universe to build the Windows and macOS binaries, which significantly reduces costs and the maintenance load on the Bioconductor team. Beyond binary distribution, we are currently exploring deeper integration of R-universe’s continuous check results into Bioconductor’s quality control and release processes.

Two Universes: Release and Development

Bioconductor maintains two distinct repositories:

  • A release branch for stable packages
  • A devel branch for ongoing development and the next release cycle

To mirror this structure, we currently operate two dedicated R-universe instances:

These universes integrate directly with Bioconductor’s existing Git infrastructure and provide continuous builds for packages in both branches.

Through the R-universe dashboard, package maintainers and users can:

  • Inspect cross-platform check results
  • Review extended BiocCheck diagnostics
  • Monitor build logs and dependency graphs
  • Explore rich package metadata and metrics
  • Publish binary packages for Windows, macOS, and Linux

This provides a familiar yet modern interface for Bioconductor contributors, aligned with what users increasingly expect from contemporary R package infrastructure.

Information about each package is available on https://bioc.r-universe.dev/{pkgname}. For example, https://bioc.r-universe.dev/DESeq2 provides details on the DESeq2 package as shown below:

screenshot of r-universe

If this is your first time visiting R-universe, we recommend clicking the “Website Tour” button which will walk you through the most important information in 1 or 2 minutes.
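Each universe also behaves as a CRAN-like repository, so packages can be installed from it directly. The snippet below follows the standard R-universe pattern, with the repository URL taken from the post:

```r
# Install the R-universe build of a Bioconductor package by placing
# the universe's repository URL ahead of CRAN
install.packages(
  "DESeq2",
  repos = c("https://bioc.r-universe.dev", "https://cloud.r-project.org")
)
```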

Technical Documentation for Bioconductor Maintainers

The R-universe project maintains comprehensive technical documentation at https://docs.r-universe.dev. For Bioconductor specifically, we created a dedicated section summarizing the most relevant topics for developers to get started with R-universe: https://docs.r-universe.dev/bioconductor/

As the collaboration evolves and new components get introduced, the documentation will continue to be expanded. The goal is to provide Bioconductor maintainers with a clear reference point for understanding how R-universe fits into their development workflow, while maintaining compatibility with the established practices that have made Bioconductor a successful project within the R community.

Looking Ahead

Adopting new infrastructure inevitably involves adjustments. For Bioconductor developers, integrating with a new build and distribution system will likely require some changes to workflows, and time to become familiar with new or different package checks, build diagnostics, and binary distribution.

However, by gradually moving toward common infrastructure, the Bioconductor project will benefit from improvements that are being continuously developed and maintained for the broader R ecosystem. A system based on modern continuous integration (CI) will provide developers with improved tooling, and will give the core team more time to focus on community coordination and quality control, rather than on maintaining costly infrastructure. At the same time, the shared platform provided by R-universe can help to increase the visibility and accessibility of Bioconductor software to the greater R community.

We look forward to continuing this alliance and to working with the Bioconductor community to ensure that the next generation of infrastructure supports the project for many years to come.

© 2025 Bioconductor. Content is published under Creative Commons CC-BY-4.0 License for the text and BSD 3-Clause License for any code. | R-Bloggers

To leave a comment for the author, please follow the link and comment on their blog: Bioconductor community blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Collaborating between Bioconductor and R-universe on Development of Common Infrastructure]]>
400403
EM-DAT, the world’s disaster memory, is at risk https://www.r-bloggers.com/2026/04/em-dat-the-worlds-disaster-memory-is-at-risk/ Tue, 07 Apr 2026 00:00:00 +0000 https://statsandr.com/blog/em-dat-the-world-s-disaster-memory-is-at-risk/

I do not usually write posts that are calls to action. But sometimes, something important enough comes along that it would feel wrong to stay silent. This is one of those times.

What is EM-DAT?
EM-DAT, the Emergency Events Database, is the world’s...

Continue reading: EM-DAT, the world’s disaster memory, is at risk]]>
[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I do not usually write posts that are calls to action. But sometimes, something important enough comes along that it would feel wrong to stay silent. This is one of those times.

What is EM-DAT?

EM-DAT, the Emergency Events Database, is the world’s most widely used and trusted global database for tracking natural and technological disasters. It has been maintained since 1988 by the Centre for Research on the Epidemiology of Disasters (CRED), which is part of UCLouvain.

The database currently contains data on the occurrence and impacts of over 27,000 mass disasters worldwide, from 1900 to the present day. It covers floods, storms, earthquakes, droughts, wildfires, extreme temperatures, landslides, volcanic activity, and technological accidents, across virtually every country on earth.

Crucially, it is:

  • Open access (for non-commercial use)
  • Globally comparable, using transparent and consistent inclusion criteria
  • Cross-verified across multiple sources (UN agencies, NGOs, reinsurance companies, research institutes, press agencies)
  • The reference dataset for thousands of peer-reviewed studies, national risk assessments, and international policy processes

If you have ever read a paper or report about global disaster trends, the probability is high that EM-DAT was the data source behind it.

Why is it at risk?

For more than 25 years, EM-DAT was primarily funded by the United States Agency for International Development (USAID). Following the recent dismantling of USAID, that funding is gone, and no sustainable alternative has been secured.

This is not a minor budget shortfall. Without a replacement funding mechanism, EM-DAT risks shutting down entirely.

Why does it matter?

The open letter drafted in support of EM-DAT puts it well: in an era of intensifying climate extremes, cascading risks, and compounding crises, reliable data are not a luxury. They are the infrastructure for informed decision-making.

Concretely, EM-DAT underpins:

  • Disaster risk reduction and prevention policies, used by governments to assess national risks and prioritise investments
  • Humanitarian operations, relied upon by multilateral agencies and NGOs to plan and forecast needs
  • Climate research, providing historical baselines for understanding trends in extreme weather events
  • Monitoring of global commitments, such as the Sendai Framework for Disaster Risk Reduction, the SDGs, and the Paris Agreement
  • Insurance and risk modelling, used by the private sector alongside other data to benchmark losses and refine exposure models

EM-DAT’s value is not just in the quantity of records. It lies in the rigour and consistency of its methodology over time and across countries. That is exactly what makes it irreplaceable. In a world awash with data, curated and quality-controlled datasets of this kind are rare. If EM-DAT were to close, the result would not be a smooth substitution. It would be fragmentation, proprietary data silos, and reduced access, particularly for lower-income countries that are already under-represented in global evidence.

A personal note

I signed the open letter after being informed of the issue by my colleague Prof. Niko Speybroeck, a leading epidemiologist at UCLouvain and program director of the CRED.

I do not have direct expertise in disaster epidemiology. But I do care about open data, open science, and the integrity of global research infrastructure. And EM-DAT is exactly the kind of resource that the whole scientific community relies on, often without fully realising it.

How you can help

If you share these values, I encourage you to sign the open letter: “The World’s collective disaster memory must be preserved”.

The letter calls on governments, multilateral development banks, philanthropic foundations, and international organisations to step forward with a coordinated and sustainable funding arrangement for EM-DAT. The cost of maintaining the world’s primary disaster database is modest set against the billions spent on disaster response and recovery each year. The cost of losing it would be profound.

Please also consider sharing this post or the open letter with your own network (researchers, policymakers, students, practitioners, or anyone who cares about data-driven approaches to global challenges).

More information

As always, if you have any thoughts or questions related to this post, feel free to leave a comment below.

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: EM-DAT, the world’s disaster memory, is at risk]]>
Marathon Man: how to pace a marathon https://www.r-bloggers.com/2026/04/marathon-man-how-to-pace-a-marathon/ Mon, 06 Apr 2026 13:36:35 +0000 https://quantixed.org/?p=3654 How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question. The ideal strategy would be to “even split” the race. This is where you run continually at the ...
Continue reading: Marathon Man: how to pace a marathon]]>
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question.

The ideal strategy would be to “even split” the race. This is where you run continually at the same pace from kilometre 0 to the finish. Let’s forget about “negative splitting”. This is where you speed up through the race, usually by running at a constant pace for the first half or three-quarters and then increasing the pace. Negative splits are for the pros, not mere mortals! The difficulty with even-splitting the race is that it is very hard to know what pace you can maintain. The marathon gets hard for everyone after 30 km, so a slow down is almost inevitable. Certainly, if you have started too fast, you will fade. This situation is known as “positive splitting”.

Why is it so hard to know what pace you can maintain? Well, you can predict a pace based on existing races, e.g. a half marathon, and there are various ways to do this, but it is difficult to tell if you can hold that pace for the marathon. It’s such a brutal event that training up to run one takes time, and it equally takes a while to recover, so experimentation is limited. Running a full marathon (at pace) in training is not advised. So determining an ideal pace involves quite a bit of guesswork.

Let’s take a look at a big dataset of marathon times – we’ll use the New York City Marathon from 2025 – to see if we can understand how to pace a marathon. There’s an available dataset of chip times (meaning we don’t have to worry about dodgy GPS data) and the course has similar first and second half profiles, allowing us to use these times to understand negative/even/positive splitting. Let’s dive in.

You can skip to the code to play along or just see the analysis here.

First, using histograms of the difference between the second and first halves of the marathon, we can see that most runners positive-split the race. There are very few runners who run a negative split (blue bars, left of the dashed line). More runners even-split (yellow), but the majority run positive (red) split times.

For marathoners finishing in under 3 h, the modal split is only +2 minutes. Over 21.1 km this is a loss of just 6 s per km. For marathoners finishing in over three hours, the loss gets more severe. Those finishing outside of 5 h ship 20 minutes or more in the second half.

At first glance this looks like better pace management by the faster runners, but these positive splits could be proportional to the paces being run. In other words, a slower runner should ship more time in the second half, because they’re running more slowly.

We can look at this data a different way and directly compare the first and second half times for each runner. Again this highlights just how few runners negative- or even-split the marathon. Most are positive splitting and are in the upper left half of the plot. We can also see that the data veers away from the ideal even-split (dashed line) with the slower paces. This veering looks linear (straight line).

We can fit a line to this data, and constrain it to go through (1,1) i.e. a 2 h marathoner even-splitting the race. To do this in R we can use lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting) and this gives the coefficient for I(x - 60) as 1.239. This is essentially the fade coefficient for the average runner in the 2025 edition of this race.

What does that mean? Well, for a runner achieving a 90 minute first half, their second half would most likely be: 60 + 1.239 * (90 - 60) = 97.17 minutes, so this would be a finish time of 3:07:10.

For anyone looking to run a 3 h New York Marathon, the average runner would therefore need to run 60 / 2.239 + 60 = 86.8 minutes for the first half (where 2.239 = 1 + 1.239) to anticipate the fade. So 1:26:48 for the first half, and then 1:33:12 for the second half.

A simpler calculation is to take the mean of the ratio between the two half times for everyone in the dataset. This gives a fade coefficient of 1.13. The difference between these two fade coefficients is due to the lack of constraint used in the fit. The ratio predicts a positive split being inevitable for the fastest runners, which is probably not true. Anyhow, this puts the first half time at 88 minutes for folks looking to run 3 h. These fade coefficients are good predictors for a range of times, and I suspect they would be similar at other marathon events with a similar profile. You can use them to calculate your ideal pace for a target finish time.
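To see the difference between the two coefficients numerically, we can compare the second-half times each model predicts. This is a quick sketch using the values reported above (1.128 is the mean ratio before rounding):

```r
fit_fade   <- 1.239  # constrained lm through (60, 60)
ratio_fade <- 1.128  # mean of second-half / first-half ratios

first_half   <- c(60, 75, 90, 105)                 # first-half times (min)
second_fit   <- 60 + fit_fade * (first_half - 60)  # even split at 60 min
second_ratio <- ratio_fade * first_half            # positive split everywhere

round(rbind(first_half, second_fit, second_ratio), 1)
```

The constrained fit allows an even split at the fast end, while the ratio model predicts a positive split at every pace, which is exactly the point about the fastest runners.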

Finally, for the most accurate answer about sub-3 h pacing, we can look directly at runners finishing between 02:50:00 and 03:00:00 and see what they actually ran. The median first half time was 86.3 min (IQR 84.42-87.87) and the median second half was 89.62 min (IQR 88.07-91.12). This gives a median finish time of 2:56:00. So running a 1:26:18 first half would give someone their best chance of finishing in under 3 h, allowing for the inevitable fade.

The takeaway message is: to finish within a goal time, do not assume even splits. That is, if you want to run 3 hours 30 min and bank on 105 minutes per half (4:59/km), you will most likely fail to hit the target. Build in a buffer of time to allow for the inevitable fade. A pace of 4:45/km is a better target pace (see below).

Good luck!

| Finish Time | Even split pace | Target pace |
|-------------|-----------------|-------------|
| 03:00:00    | 00:04:16        | 00:04:07    |
| 03:30:00    | 00:04:59        | 00:04:45    |
| 04:00:00    | 00:05:41        | 00:05:23    |
| 04:30:00    | 00:06:24        | 00:06:01    |
| 05:00:00    | 00:07:07        | 00:06:39    |
| 06:00:00    | 00:08:32        | 00:07:55    |
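The target paces in this table can be reproduced from the constrained fade coefficient. Here is a sketch, assuming the 1.239 coefficient from the fit above and the standard 42.195 km distance (the min:sec formatting is simplified and would need a carry for seconds that round to 60):

```r
fade <- 1.239           # fade coefficient from the constrained fit
half_km <- 42.195 / 2   # half-marathon distance in km

target_pace <- function(goal_min) {
  # solve x + (60 + fade * (x - 60)) = goal_min for the first-half time x
  x <- (goal_min - 60 + fade * 60) / (1 + fade)
  pace <- x / half_km   # min/km to hold through the first half
  sprintf("%02d:%02d", as.integer(pace), as.integer(round((pace %% 1) * 60)))
}

target_pace(180)  # 3 h goal
target_pace(210)  # 3 h 30 goal
```

Running this for the finish times in the table recovers the "Target pace" column.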

The code

This analysis was possible thanks to the uploader for making the chip time data available. Also, a shoutout to Nicola Rennie for sharing how to style social media handles in {ggplot2} graphics. This part of my code requires my {qBrand} library and should be skipped if you are running the code yourself (remove the caption = cap argument in the ggplot calls).

library(ggplot2)
library(ggtext)

sysfonts::font_add_google("Roboto", "roboto")
showtext::showtext_auto()

## data wrangling ----

# load csv file from url
url <- paste0("https://huggingface.co/datasets/donaldye8812/",
              "nyc-2025-marathon-splits/resolve/main/",
              "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
df <- read.csv(url)

# the data frame is a long table
# we need to grab the time values where splitCode is "HALF" or "MAR"
df <- df[df$splitCode %in% c("HALF", "MAR"), c("RunnerID", "splitCode", "time")]
# reshape to wide format, values are in time
df <- reshape(df, idvar = "RunnerID", timevar = "splitCode", direction = "wide")
# calculate the split times in minutes
df$split_HALF <- as.numeric(
  as.difftime(df$time.HALF, format = "%H:%M:%S", units = "mins"))
df$split_MAR <- as.numeric(
  as.difftime(df$time.MAR, format = "%H:%M:%S", units = "mins"))
# calculate the second half time
df$split_SECOND_HALF <- df$split_MAR - df$split_HALF
# remove rows with NA values
df <- df[!is.na(df$split_SECOND_HALF), ]
# calculate the difference
df$Difference <- df$split_SECOND_HALF - df$split_HALF
# difference as a fraction of first half
df$Difference_Fraction <- df$Difference / df$split_HALF * 100
# classify by finish time: sub 3 h, 3:00-3:30, 3:30-4:00, 4:00-5:00, over 5 h
df$Category <- cut(df$split_MAR,
                           breaks = c(0, 180, 210, 240, 300, Inf),
                           labels = c("Sub 3 h", "3:00-3:30", "3:30-4:00",
                                      "4:00-5:00", "Over 5 h"))

## plot styling ----

social <- qBrand::qSocial()
cap <-  paste0(
  "**Data:** New York City Marathon 2025 Results<br>**Graphic:** ",social
)

my_palette <- c("Sub 3 h" = "#cb2029",
                "3:00-3:30" = "#147f77",
                "3:30-4:00" = "#cf6d21",
                "4:00-5:00" = "#28a91b",
                "Over 5 h" = "#a31a6d")

## make the plots ----

ggplot(df, aes(x = Difference, fill = after_stat(x))) +
  # vertical line at x = 0
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  geom_histogram(breaks = seq(
    from = -59.5, to = 81.5, by = 1), color = "black") +
  scale_colour_gradient2(
    low = "#2b83ba",
    mid = "#ffffbf",
    high = "#d7191c",
    midpoint = 0,
    limits = c(-15,15),
    na.value = "#ffffffff",
    guide = "colourbar",
    aesthetics = "fill",
    oob = scales::squish
  ) +
  scale_x_continuous(breaks = seq(-45,90,15), limits = c(-40, 80)) +
  facet_wrap(~ Category, ncol = 1, scales = "free_y") +
  labs(title = "Most runners positive split the marathon",
       x = "Difference in minutes (Second Half - First Half)",
       y = "Number of Runners",
       caption = cap) +
  theme_classic() +
  # hide legend
  theme(legend.position = "none") +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 5),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16),
    plot.title = element_text(size = rel(1.2),
                              face = "bold")
  )

ggsave("Output/Plots/nyc_marathon_2025_split_difference_histogram.png",
       width = 900, height = 1200, dpi = 72, units = "px", bg = "white")

ggplot() +
  geom_abline(slope = 1, linetype = "dashed", color = "black") +
  geom_point(data = df,
             aes(x = split_HALF, y = split_SECOND_HALF, colour = Category),
             shape = 16, size = 1.5, alpha = 0.1) +
  scale_x_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_y_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_colour_manual(values = my_palette) +
  labs(x = "First half time (h)",
       y = "Second half time (h)",
       caption = cap) +
  theme_bw() +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 10),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16)
  ) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

ggsave("Output/Plots/nyc_marathon_2025_split_difference_scatter.png",
       width = 1000, height = 800, dpi = 72, units = "px", bg = "white")

From this data we can also make some calculations to understand…

## fitting ----

# to fit, we'll constrain the line to go through (60,60), i.e. a
# 2 h marathoner who runs even splits
fitting <- data.frame(x = df$split_HALF,y = df$split_SECOND_HALF)
lm( I(y-60) ~ I(x-60) + 0, data = fitting)


# Call:
#   lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting)
# 
# Coefficients:
#   I(x - 60)  
# 1.239  

# so for a 90 minute first half, second half would be:
# 60 + 1.239 * (90 - 60) = 97.17 minutes, a finish time of 3:07:10

# to run a 3 h New York Marathon, the average runner needs to run
# 60 / 2.239 + 60 = 86.8 minutes for the first half
# so 1:26:48 for the first half, and 1:33:12 for the second half

# a more simple approach is to calculate the mean of the ratios
mean_ratio <- mean(df$split_SECOND_HALF / df$split_HALF)
mean_ratio
# [1] 1.127581

# filter the df for finish times between 170 and 180 minutes
target <- df[df$split_MAR > 170 & df$split_MAR < 180,]
summary(target)


    RunnerID         time.HALF           time.MAR           split_HALF      split_MAR     split_SECOND_HALF   Difference     
 Min.   :48819892   Length:1289        Length:1289        Min.   :70.25   Min.   :170.0   Min.   : 82.70    Min.   :-7.6833  
 1st Qu.:48834548   Class :character   Class :character   1st Qu.:84.42   1st Qu.:173.6   1st Qu.: 88.07    1st Qu.: 0.7167  
 Median :48849752   Mode  :character   Mode  :character   Median :86.30   Median :176.0   Median : 89.62    Median : 3.0500  
 Mean   :48849498                                         Mean   :85.98   Mean   :175.7   Mean   : 89.73    Mean   : 3.7585  
 3rd Qu.:48864551                                         3rd Qu.:87.87   3rd Qu.:178.2   3rd Qu.: 91.12    3rd Qu.: 5.9000  
 Max.   :48878979                                         Max.   :92.87   Max.   :180.0   Max.   :106.02    Max.   :35.7667  
 Difference_Fraction      Category   
 Min.   :-8.5008     Sub 3 h  :1289  
 1st Qu.: 0.8051     3:00-3:30:   0  
 Median : 3.5390     3:30-4:00:   0  
 Mean   : 4.5178     4:00-5:00:   0  
 3rd Qu.: 7.0055     Over 5 h :   0  
 Max.   :50.9134   

The post title comes from “Marathon Man” by Ian Brown from his “My Way” album. He’s wearing a track suit on the cover but that’s not optimal wear for running a marathon.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Continue reading: Marathon Man: how to pace a marathon]]>
One interface, (Almost) Every Classifier: unifiedml v0.2.1 https://www.r-bloggers.com/2026/04/one-interface-almost-every-classifier-unifiedml-v0-2-1/ Sat, 04 Apr 2026 00:00:00 +0000 https://thierrymoudiki.github.io//blog/2026/04/04/r/more-unifiedml-classifiers A new version of `unifiedml` is out; available on CRAN. `unifiedml` is an effort to offer a unified interface to R's machine learning models.
Continue reading: One interface, (Almost) Every Classifier: unifiedml v0.2.1]]>
[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new version of unifiedml is out; available on CRAN. unifiedml is an effort to offer a unified interface to R’s machine learning models.

The main change in version 0.2.1 is the removal of the type (of prediction) argument from predict, and the use of ... instead, which is more generic and flexible.

This post contains advanced examples of use of unifiedml for classification, with ranger and xgboost. More examples have been added to the package vignettes too.

install.packages("unifiedml")

install.packages(c("ranger"))

library("unifiedml")

Loading required package: doParallel

Loading required package: foreach

Loading required package: iterators

Loading required package: parallel

Loading required package: R6

1 – ranger example

library(ranger)



# 2 - 'ranger' classification ---------------------------

# -------------------------------
# S3 wrapper for ranger
# -------------------------------

# Fit function remains the same
my_ranger <- function(x, y, ...) {
  if (!is.data.frame(x)) x <- as.data.frame(x)
  y <- as.factor(y)
  colnames(x) <- paste0("X", seq_len(ncol(x)))
  df <- data.frame(y = y, x)
  fit <- ranger::ranger(y ~ ., data = df, probability = TRUE, ...)
  structure(list(fit = fit), class = "my_ranger")
}

# Predict only with newdata
predict.my_ranger <- function(object, newdata = NULL, newx = NULL, ...) {
  if (!is.null(newx)) newdata <- newx
  if (is.null(newdata)) stop("No data provided for prediction")
#  misc::debug_print(newx)
#  misc::debug_print(newdata)
  if (is.matrix(newdata)) newdata <- as.data.frame(newdata)
#  misc::debug_print(newdata)
  # Unconditionally rename to match training
  colnames(newdata) <- paste0("X", seq_len(ncol(newdata)))
#  misc::debug_print(newdata)
  preds <- predict(object$fit, data = newdata)$predictions
#  misc::debug_print(newdata)
  if (is.matrix(preds) && ncol(preds) == 2) {
    lvls <- colnames(preds)
    return(ifelse(preds[, 2] > 0.5, lvls[2], lvls[1]))
  }

  preds
}

# Print method
print.my_ranger <- function(x, ...) {
  cat("my_ranger model\n")
  print(x$fit)
}

# -------------------------------
# Example: Iris binary classification
# -------------------------------

set.seed(123)
iris_binary <- iris[iris$Species %in% c("setosa", "versicolor"), ]
X_binary <- iris_binary[, 1:4]
y_binary <- as.factor(as.character(iris_binary$Species))

# Train/test split
train_idx <- sample(seq_len(nrow(X_binary)), size = 0.7 * nrow(X_binary))
X_train <- X_binary[train_idx, ]
y_train <- y_binary[train_idx]
X_test <- X_binary[-train_idx, ]
y_test <- y_binary[-train_idx]

# Initialize model
mod <- Model$new(my_ranger)

# Fit on training data only
mod$fit(X_train, y_train, num.trees = 150L)

# Predict on test set
preds <- mod$predict(X_test)

# Evaluate
table(Predicted = preds, True = y_test)
mean(preds == y_test)  # Accuracy



# 5-fold cross-validation on training set
cv_scores <- cross_val_score(
  mod,
  X_train,
  y_train,
  num.trees = 150L,
  cv = 5L
)

cv_scores
mean(cv_scores)  # average CV accuracy


            True
Predicted    setosa versicolor
  setosa         15          0
  versicolor      0         15

[1] 1

  |======================================================================| 100%

[1] 1 1 1 1 1

[1] 1

2 - xgboost example

library(xgboost)

my_xgboost <- function(x, y, ...) {
  
  # Convert to matrix safely
  if (!is.matrix(x)) {
    x <- as.matrix(x)
  }
  
  # Handle factors
  if (is.factor(y)) {
    y <- as.numeric(y) - 1
  }
  
  fit <- xgboost::xgboost(
    data = x,
    label = y,
    ...
  )
  
  structure(list(fit = fit), class = "my_xgboost")
}


predict.my_xgboost <- function(object, newdata = NULL, newx = NULL, ...) {
  
  # Accept both conventions
  if (!is.null(newx)) {
    newdata <- newx
  }
  
  newdata <- as.matrix(newdata)
  
  preds <- predict(object$fit, newdata)
  
  # Binary classification → class labels
  if (!is.null(object$fit$params$objective) &&
      grepl("binary", object$fit$params$objective)) {
    
    return(ifelse(preds > 0.5, 1, 0))
  }
  
  preds
}

print.my_xgboost <- function(x, ...) {
  cat("my_xgboost model\n")
  print(x$fit)
}


set.seed(123)  # for reproducibility

# Binary subset
iris_binary <- iris[iris$Species %in% c("setosa", "versicolor"), ]
X_binary <- as.matrix(iris_binary[, 1:4])
y_binary <- as.factor(as.character(iris_binary$Species))

# Split indices: 70% train, 30% test
train_idx <- sample(seq_len(nrow(X_binary)), size = 0.7 * nrow(X_binary))
X_train <- X_binary[train_idx, ]
y_train <- y_binary[train_idx]
X_test <- X_binary[-train_idx, ]
y_test <- y_binary[-train_idx]

# Initialize model
mod <- Model$new(my_xgboost)

# Fit on training data only
mod$fit(X_train, y_train, nrounds = 50, objective = "binary:logistic")

# Predict on test set
preds <- mod$predict(X_test)

# Evaluate
table(Predicted = preds, True = y_test)
mean(preds == y_test)  # Accuracy



# 5-fold cross-validation on training set
cv_scores <- cross_val_score(
  mod, 
  X_train, 
  y_train, 
  nrounds = 50, 
  objective = "binary:logistic", 
  cv = 5L
)

cv_scores
mean(cv_scores)  # average CV accuracy
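The same three-function pattern (fit wrapper, predict method accepting both newdata and newx, print method) carries over to any classifier. As a hypothetical sketch, not part of unifiedml itself, here is the analogous wrapper for base R's stats::glm logistic regression; the resulting my_glm function could then be passed to Model$new() exactly like my_ranger and my_xgboost above:

```r
# Fit wrapper: same conventions as the wrappers above
my_glm <- function(x, y, ...) {
  if (!is.data.frame(x)) x <- as.data.frame(x)
  y <- as.factor(y)
  colnames(x) <- paste0("X", seq_len(ncol(x)))
  fit <- stats::glm(y ~ ., data = data.frame(y = y, x),
                    family = stats::binomial(), ...)
  structure(list(fit = fit, levels = levels(y)), class = "my_glm")
}

# Predict method: accepts both newdata and newx, returns class labels
predict.my_glm <- function(object, newdata = NULL, newx = NULL, ...) {
  if (!is.null(newx)) newdata <- newx
  if (is.null(newdata)) stop("No data provided for prediction")
  if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  colnames(newdata) <- paste0("X", seq_len(ncol(newdata)))
  probs <- predict(object$fit, newdata = newdata, type = "response")
  object$levels[ifelse(probs > 0.5, 2L, 1L)]
}
```

With these two functions defined, mod <- Model$new(my_glm) followed by mod$fit(X_train, y_train) would slot into the train/test and cross_val_score workflow shown above.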


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

Continue reading: One interface, (Almost) Every Classifier: unifiedml v0.2.1]]>
You can just build your own programming language https://www.r-bloggers.com/2026/04/you-can-just-build-your-own-programming-language/ Fri, 03 Apr 2026 00:00:00 +0000 https://b-rodrigues.github.io/posts/2026-04-03-tproject.html

Last summer, while relaxing on the beaches of Berck, a French town known for treating tuberculosis in kids by exposing them to the fresh maritime air (back in 19th century, they have antibiotics these days), I found myself daydreaming abou...

Continue reading: You can just build your own programming language]]>
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last summer, while relaxing on the beaches of Berck, a French town known for treating tuberculosis in kids by exposing them to the fresh maritime air (back in the 19th century; they have antibiotics these days), I found myself daydreaming about building my own programming language.

Spoiler alert: I don’t know how to build programming languages, but I have developed extremely strong opinions over the years about the features a modern data science language should have. So could I use them fancy LLMs to build one?

Also, let’s get one question answered straight away: why create a new language instead of contributing to existing ones? I certainly do contribute, I maintain several R packages like {rix}, {rixpress}, and {chronicler}, and even have two Python packages (cronista and ryxpress), but I wanted a clean slate to build a system centered around a few non-negotiable principles and features I’ve implemented over the years in R:

  • Reproducibility-First: A language where reproducibility isn’t a bolt-on afterthought managed by external tools, but the very foundation of the runtime.
  • Aggressive Re-use: Instead of reinventing the wheel, this language would stand on the shoulders of giants. It’d use Nix for package management and environment isolation, and Apache Arrow as its high-performance backbone for data frames. R, Python, Julia and other languages would provide the algorithms and models.
  • First-Class Pipelines: Scripts shouldn’t be a sequence of side-effects. In this language, pipelines would be mandatory and first-class citizens.
  • Fail Early and Loudly: No silent type conversions or hidden NAs. If something is wrong, the language breaks immediately so you can fix it.
  • Errors as Objects: Inspired by functional programming, errors are first-class values that can be inspected and handled gracefully.
  • Two Pipes: I want two pipes, one for linear transformations, |>, and a maybe-pipe, ?|> for error recovery. Unlike the standard pipe, ?|> always forwards its value, including Errors, to the next function, allowing you to write handlers that inspect and potentially recover from them. Since Errors are just values, this composes naturally with the rest of the language.
  • Polyglot by Design: Rather than re-implementing every statistical algorithm, this language would be designed to orchestrate and bridge R, Python, and Julia seamlessly.

Also, we’re in a post LLM world, and like them or not, they’re here to stay. They’re pretty useful to write boilerplate code and so any new language would be dead on arrival if it didn’t play nicely with LLMs. So such a new language would need to be written for LLMs primarily, because I don’t expect anyone to learn any new language. This is where the declarative nature of Nix is a huge advantage. Because environments are precisely described, it is much easier for LLMs to focus on generating code and not have to fight with environment setup. This is also the reason I took another radical decision: since Nix would be mandatory for setting up the environment, why bother building OS-specific binaries? I’d just build a Nix package for this language and let Nix handle the rest.

This architecture results in a DSL for orchestration, making it trivial to transfer data objects between different ecosystems without the usual FFI (Foreign Function Interface) friction.

With these ideas in mind, I started prompting Gemini to brainstorm and started by generating specification files. Very broad at first, but as days went by, more and more focused. The way I went about it (and still do) is that I first brainstorm an idea with an LLM, then I ask it to generate a specification file, then I refine it, ask it to generate a new specification file, and so on. Once I’m happy with the spec, I ask an LLM to generate a minimal implementation of the spec. Usually writing the spec and a first implementation is a task shared between Claude and Gemini (through Antigravity). Then I open a pull request and ask GitHub Copilot to review it (usually with GPT-5.x). I repeat this process until I’m happy with the implementation. I always ask for documentation and unit tests (and golden tests when relevant, more on this later).

I started to really believe that I had something interesting, so I gave it a shot, and called it T. I had long joked that the natural successor to R should be called T (because R is the successor to S… and no, I’m not going to call it Q because that sounds like the word for ass in French).

Something else that made me confident I could succeed, besides my own hubris, was that I am pretty familiar with unit testing, test-driven development, trunk-based development and Nix. When you combine all these elements, it makes developing with LLMs quite safe.

So I just started prompting. And now I’m quite happy to announce that there is a beta version of T that you can use today!

By leveraging Nix as a build engine, T can treat complex data science workflows as buildable derivations. A typical T pipeline looks like this:

p = pipeline {
  -- 1. Python node: read data with pandas
  mtcars_pl = pyn(
    command = <{
import pandas as pd
pd.read_csv("data/mtcars.csv", sep="|")
    }>,
    include = ["data/mtcars.csv"],
    serializer = ^csv
  )

  -- 2. Python node: filter and serialize as CSV
  mtcars_pl_am = pyn(
    command = <{
mtcars_pl[mtcars_pl['am'] == 1]
    }>,
    deserializer = ^csv,
    serializer = ^csv
  )

  -- 3. R node: read CSV and take head using functions.R
  mtcars_head = rn(
    command = <{
my_head(mtcars_pl_am)
    }>,
    functions = ["src/functions.R"],
    deserializer = ^csv,
    serializer = ^csv
  )

  -- 4. R node: select column with dplyr
  mtcars_mpg = rn(
    command = <{
library(dplyr)
mtcars_head %>% select(mpg)
    }>,
    deserializer = ^csv,
    serializer = ^csv
  )

  -- Render Quarto report
  report = node(script = "src/report.qmd", runtime = Quarto)
}

-- Materialize the pipeline
populate_pipeline(p, build = true)
pipeline_copy() -- Copy the outputs from the Nix store to your working directory

As you can see, each node has a command argument where you can write literal R or Python code. It is also possible to provide the path to a script instead. If packages need to be loaded for the code to work, you can just write the calls to load the required packages in the command argument as well.

While T is heavily inspired by the {targets} package in R, it takes the concept a step further by making pipelines first-class objects within the language itself. This means you can:

  • Compose Pipelines: You can define small, modular pipelines and then merge them into larger ones using standard operators.
  • Static Analysis: Because the DAG (Directed Acyclic Graph) is defined within the language, T can validate your entire workflow (checking for circular dependencies or missing data) before a single line of code even runs.
  • Heterogeneous Execution: A single pipeline can effortlessly mix R, Python, and native T code. Data is passed between these nodes using built-in serializers like ^csv, ^arrow, or even specialized formats like ^pmml for traditional models and ^onnx for deep learning architectures. It is also possible to define your own serializers.
  • Immutable State: Each node output is managed by Nix, meaning if you haven’t changed the code or the data for a specific node, T (via Nix) will simply pull the cached result from previous runs.

But don’t let the “orchestrator” label fool you; T is also a capable language in its own right. It features a selection of built-in packages inspired by the tidyverse for data manipulation. Thanks to its Arrow backend, it is surprisingly fast. I even maintain a CI benchmark running on NYC Taxi data to ensure performance remains competitive.

I made sure that T is pretty easy to use with LLMs by providing a file called summary.md in the root of the GitHub repository. This file is meant to be used by LLMs to quickly learn the language’s syntax and generate code accordingly. You could also provide the whole help documentation to the LLM (found in the repository under help/docs.json), but I found that a summary is usually enough. There is also another experimental feature I’m thinking about, called intent blocks. These blocks would essentially be first-class structured comments used to anchor an LLM’s behaviour and make it more deterministic. These blocks would be parsed by T and used to generate code accordingly. I have some ideas about how these could look, something like this:

intent {
  description: "Customer churn prediction",
  assumptions: ["Age > 18", "NA imputed with multiple imputation"],
  requires: ["dataset.csv"]
}

Is this slop?

There’s a lot of skepticism about building your own language using LLMs, and I get it. I was pretty skeptical myself. So let me tell you what actually gives me confidence in T’s correctness: as of writing, 1753 unit tests, 122 golden tests, 13 end-to-end tests, and 18 full project demos are executed on every push and PR, on both Linux and macOS via GitHub Actions. That’s the verification regime, and it has to be rigorous precisely because I can’t audit the OCaml implementation by eye. This is actually one of the more interesting lessons from this project: when you can’t rely on code review, you have to over-invest in tests and specifications. The spec files, the enriched changelog, the summary.md, all of that context makes the LLM’s output more predictable, and the test suite tells you immediately when it isn’t.

From personal experience, when I generate R or Python code, the output looks a lot like what I would have written myself. The main failure mode I’ve noticed is lack of context: the more you give the model, the better the result. Letting separate LLMs review PRs and iterating through several loops helps catch what any single model misses.

I’m also confident in T’s safety from a different angle: it’s ultimately orchestrating Python and R code you write yourself, and that you can test independently.

Interested?

If you’re interested in trying it out or contributing, check out the official repository or the website, and don’t hesitate to open an issue or a PR or contact me on the dedicated Matrix (https://matrix.to/#/#tproject:matrix.org) channel.

Appendix

For the interested reader, here’s how to get started with T.

How to get started

If you have Nix installed, getting started with a new project is just a single command away:

# 1. Initialize a new project
nix run github:b-rodrigues/tlang -- init --project my_t_project
cd my_t_project

There will be no other way to start a T project. As explained above, I don’t want to deal with providing OS-specific binaries, and since Nix is T’s build engine, you’ll need Nix installed on your system anyway. Might as well reuse it to manage the installation of T itself!

Inside the project’s folder, you’ll find a tproject.toml file. This is where you list the R and Python packages you’ll need. For example:

[project]
name = "r_py_xgboost_t"
description = "A T data analysis project"

[dependencies]
# T packages this project depends on
# Format: package = { git = "repository-url", tag = "version" }
# Example:
# stats = { git = "https://github.com/t-lang/stats", tag = "v0.5.0" }

[r-dependencies]
packages = ["dplyr", "yardstick"]

[py-dependencies]
version = "python313"
packages = ["numpy", "pandas", "scikit-learn", "xgboost"]

[additional-tools]
packages = ["quarto"]

[t]
# Minimum T language version required
min_version = "0.51.2"

Under [additional-tools] you can add any package that is available in nixpkgs. If you need LaTeX, you can also add a dedicated section:

[tex-dependencies]
packages = ["amsmath", "geometry", "hyperref", "biblatex"]
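
Since anything in nixpkgs is fair game for these sections, it can help to check that a tool is actually packaged before listing it. A quick sketch using the standard nix CLI (the package name quarto here is just an example):

```shell
# Search nixpkgs for a package before adding it to [additional-tools]
nix search nixpkgs quarto
```

If the search returns a match, the attribute name it reports is what you put in the packages list.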

You may have noticed that there is also a section for T packages: that’s right, T supports user-defined packages. Instead of initializing a project, you’d initialize a package:

nix run github:b-rodrigues/tlang -- init --package my_package
cd my_package

Instead of a tproject.toml file, you’ll have to fill in a DESCRIPTION.toml file:

[package]
name = "my_package"
version = "0.1.0"
description = "A brief description of what my_package does"
authors = ["brodriguesco"]
license = "EUPL-1.2"
homepage = ""
repository = ""

[dependencies]
# T packages this package depends on
# Format: package = { git = "repository-url", tag = "version" }

[t]
# Minimum T language version required
min_version = "0.5.0"

Another important file is flake.nix, which is generated automatically. You shouldn’t have to touch it: this flake.nix is what provides the reproducible development environment for running your project. To enter that environment, simply run:

nix develop

This will install T and activate the environment. If you’ve added packages to tproject.toml, you’ll have to run t update to sync them to the flake, then rebuild the environment (exit the development shell with exit and rebuild it with nix develop again). Oh, and by the way: T requires a Linux-like environment, so if you’re on Windows, you’ll have to run T inside WSL2 (Windows Subsystem for Linux).
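
Putting that update cycle together, a typical session after editing tproject.toml looks roughly like this (a sketch of the workflow described above):

```shell
# Inside the development shell: sync tproject.toml changes to the flake
t update
# Leave the current shell, then rebuild the environment with the new packages
exit
nix develop
```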

Once inside the nix develop shell, everything you need (the T interpreter, your specific versions of R and Python, and all project tools) is ready to use. You don’t need to manage virtual environments or Docker containers manually; T handles the heavy lifting via Nix under the hood.

You can browse examples on this repository.

Tooling and Editor Support

A language is only as good as its developer experience. I politely asked LLMs to implement a full Language Server Protocol (LSP) server for T, which provides autocompletion, real-time diagnostics, and “Go to Definition” support.

  • For VS Code / Positron: A dedicated extension providing syntax highlighting and LSP integration.
  • For Vim / Emacs: Detailed configuration guides and syntax files are available.
  • For Quarto: T is fully compatible with Quarto for literate programming, allowing you to run executable {t} chunks directly in your documents.

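As a sketch of the Quarto integration, a minimal .qmd document with an executable T chunk might look like this (the chunk contents are hypothetical):

````markdown
---
title: "My T report"
---

```{t}
# a hypothetical T chunk, executed when the document is rendered
print("Hello from T")
```
````

When the document is rendered, the chunk runs inside the project’s Nix environment, so the same package versions apply.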
For detailed setup instructions, check out the Editor Support guide in the official documentation.

There’s much more I haven’t covered here, so check out the official repository or the website.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
