R-bloggers https://www.r-bloggers.com R news and tutorials contributed by hundreds of R bloggers

What’s new in R 4.6.0? https://www.r-bloggers.com/2026/04/whats-new-in-r-4-6-0/ Thu, 16 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/whats-new-r46/

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R 4.6.0 (“Because it was There”) is set for release on April 24th 2026. Here we summarise some of the more interesting changes that have been introduced. In previous blog posts, we have discussed the new features introduced in R 4.5.0 and earlier versions (see the links at the end of this post).

Once R 4.6.0 is released, the full changelog will be available at the r-release ‘NEWS’ page. If you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

! (values %in% collection)

Code should be readable and easily understood. And although it isn’t a natural language, there’s something off about code that reads like:

If not a blog post is readable, I close the browser tab.

To check if one (or more) value is in some collection, R has the %in% operator:

"a" %in% letters
[1] TRUE

"a" %in% LETTERS
[1] FALSE

This is different from the in keyword, which you use when iterating over a collection:

for (x in letters[1:3]) {
 message(x)
}
# a
# b
# c

Sometimes you want to know whether a value is absent from a collection. The standard way to do this is to invert results from %in%:

! "a" %in% LETTERS
[1] TRUE

It’s unambiguous to the R interpreter, but it can be hard to read and understand – on scanning that statement, you might forget that ! applies after the %in%. As such, we often wrap the %in% expression in parentheses to make the code clearer:

! ("a" %in% LETTERS)
[1] TRUE

For the sake of clarity, many developers have implemented their own absence-checking operator. Writing a custom operator in R uses similar syntax to that used when writing a function:

`%NOTIN%` = function(x, y) {
 ! (x %in% y)
}

"a" %NOTIN% LETTERS
[1] TRUE

Were you to write the same code multiple times in the same project, you would write a function. Similarly, if you (or your team) wrote the same function in multiple files or projects, you might add it to a package and import it. So if lots of package developers have implemented the same operator or function, across their CRAN packages, maybe it should be pushed to a higher plane…

That is what has happened with the introduction of %notin% in R 4.6.0. An operator that was found across lots of separate packages has been moved up into base R:

"a" %notin% LETTERS
[1] TRUE

"a" %notin% letters
[1] FALSE

DOI citations

If you use R in your publications or your projects, you may need to provide a citation for it. rOpenSci has a blog post about citing R and R packages – why, when and how to do it.

For the R project as a whole, there is a simple function citation() that provides the information you need:

citation()

To cite R in publications use:

 R Core Team (2026). _R: A Language and Environment for Statistical
 Computing_. R Foundation for Statistical Computing, Vienna, Austria.
 doi:10.32614/R.manuals <https://doi.org/10.32614/R.manuals>.
 <https://www.R-project.org/>.

A BibTeX entry for LaTeX users is

 @Manual{,
 ...
 }
...

In R 4.6.0, a DOI (Digital Object Identifier) has been added to the citation entry to make it easier to reference R in your published work.

summary(character_vector, character.method = "factor")

str() and summary() are two of the first functions I reach for when exploring a dataset. For a data-frame, these tell me

  • what type of columns are present (str(): chr, num, Date, …); and
  • what is present in each column (summary(): gives the min, max, mean of each numeric column, for example).
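As a quick illustration of that first bullet (using the penguins data that has shipped with base R since 4.5.0), str() reports each column’s type at a glance:

```r
# Compact structural overview of the first three penguins columns:
# column types (factor, factor, numeric, ...) plus a preview of the values
str(penguins[1:3])
```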

summary() works with data-structures other than just data-frames.

For factors, summary() tells you how many observations of the factor levels were observed:

# 'species' and 'island' are factor columns in `penguins`
summary(penguins[1:3])

 species island bill_len
 Adelie :152 Biscoe :168 Min. :32.10
 Chinstrap: 68 Dream :124 1st Qu.:39.23
 Gentoo :124 Torgersen: 52 Median :44.45
 Mean :43.92
 3rd Qu.:48.50
 Max. :59.60
 NA's :2

For character columns, the output from summary() has been a little opaque:

# R 4.5.0

# 'studyName' and 'Species' are character columns in `penguins_raw`
summary(penguins_raw[1:3])

 studyName Sample Number Species
 Length:344 Min. : 1.00 Length:344
 Class :character 1st Qu.: 29.00 Class :character
 Mode :character Median : 58.00 Mode :character
 Mean : 63.15
 3rd Qu.: 95.25
 Max. :152.00

R 4.6.0 adds a neater way to summarise character vectors/columns to summary():

# R 4.6.0
summary(penguins_raw[1:3])

 studyName Sample Number Species
 Length :344 Min. : 1.00 Length :344
 N.unique : 3 1st Qu.: 29.00 N.unique : 3
 N.blank : 0 Median : 58.00 N.blank : 0
 Min.nchar: 7 Mean : 63.15 Min.nchar: 33
 Max.nchar: 7 3rd Qu.: 95.25 Max.nchar: 41
 Max. :152.00

We can also summarise character vectors/columns as if they were factors:

# R 4.6.0
summary(penguins_raw[1:3], character.method = "factor")

 studyName Sample Number Species
 PAL0708:110 Min. : 1.00 Adelie Penguin (Pygoscelis adeliae) :152
 PAL0809:114 1st Qu.: 29.00 Chinstrap penguin (Pygoscelis antarctica): 68
 PAL0910:120 Median : 58.00 Gentoo penguin (Pygoscelis papua) :124
 Mean : 63.15
 3rd Qu.: 95.25
 Max. :152.00

list.files(..., fixed = TRUE)

Suppose there are three files in my working directory: abc.Rmd, CONTRIBUTING.md and README.md. If I want to obtain the filenames of the “.md” files from an R script, I can list the files that match a pattern:

# R 4.5.0
list.files(pattern = ".md")
[1] "abc.Rmd" "CONTRIBUTING.md" "README.md"

Hmmm.

In the pattern, . actually matches any character. In R 4.5.0, if I want to match the . character explicitly, I can escape it in the pattern. But pattern matching can lead to complicated code in R, because some characters are treated specially by the pattern matcher, and some are treated specially by R’s string parser.

To tell R to find files with a literal ‘.md’ in the filename, we escape the . character twice: once for the pattern matcher (to match a literal ., rather than any character), and once for R’s string parser (to tell R that the subsequent \ is really a backslash):

# R 4.5.0
list.files(pattern = "\\.md")
[1] "CONTRIBUTING.md" "README.md"

R 4.0.0 cleaned that up a bit by introducing ‘raw strings’. Everything between the parentheses in the next pattern is passed directly to the pattern matcher:

# R 4.5.0
list.files(pattern = r"(\.md)")

Now, in R 4.6.0, we can indicate to list.files() (and the synonym dir()) that our pattern is a fixed string (rather than a regular expression). With this, we don’t need to escape the . character to match the “.md” suffix.

list.files(pattern = ".md", fixed = TRUE)
[1] "CONTRIBUTING.md" "README.md"

Other matters

  • read.dcf() now allows comment lines, which means you can annotate your config files (e.g., for {lintr}).
  • df |> plot(col2 ~ col1) can now be used for base plotting; this is a little neater for exploratory work than df |> plot(col2 ~ col1, data = _)
  • C++20 is now the default C++ standard
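The piped-plot change in the second bullet can be sketched as follows; the data frame and column names here are invented for illustration:

```r
# Toy data frame for illustrating the R 4.6.0 pipe-to-plot change
df <- data.frame(col1 = 1:10, col2 = (1:10)^2)

# R 4.5.0: the formula interface needed an explicit data placeholder
# df |> plot(col2 ~ col1, data = _)

# R 4.6.0: the data frame piped in on the left-hand side is used directly
df |> plot(col2 ~ col1)
```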

Trying out R 4.6.0

To take away the pain of installing the latest development version of R, you can use Docker. To try out the devel version of R, use the following commands:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.6 is the released version of R and the r-docker repository has been updated, you should use the following commands to test out R 4.6:

docker pull rstudio/r-base:4.6-jammy
docker run --rm -it rstudio/r-base:4.6-jammy

An alternative way to install multiple versions of R on the same machine is using rig.

See also

The R 4.x versions have introduced a wealth of interesting changes. These have been summarised in our earlier blog posts:

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
My Domain: proteome-wide scanning of TMDs https://www.r-bloggers.com/2026/04/my-domain-proteome-wide-scanning-of-tmds/ Thu, 16 Apr 2026 12:04:00 +0000 https://quantixed.org/?p=3753
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

I wanted to know:

  1. how many proteins in the human proteome have transmembrane domains?
  2. of those that do, how many have 1 or 2 or n transmembrane domains?

After a little bit of searching, I couldn’t find any answers. So I decided to use R to retrieve the necessary info from Uniprot and calculate it myself. I thought I’d post it here in case it’s useful for others.

Human

We’ll start with the info I wanted. According to Uniprot there are 20,659 proteins in the human proteome. One quarter of these have one or more TMD. The majority have one TMD, and there are almost 1,000 7TM proteins (all those GPCRs, I guess). There are 413 4TM and 327 2TM proteins. We can find examples of 1 through 17 TMDs; there are no proteins with 18 TMDs, 4 proteins with 19, 21 with 24 and 2 with 38.

The analysis is done simply by counting how many TRANSMEM features Uniprot lists for each ID in the reference proteome. I have not distinguished between helical and partial entries, and of course it’s possible that the annotations are not quite correct.

TMDs   Count   Frequency (proteome as %)   Frequency (TMDPs as %)
1      2402    11.6                        45.8
2      327     1.6                         6.2
3      159     0.8                         3.0
4      413     2.0                         7.9
5      77      0.4                         1.5
6      276     1.3                         5.3
7      947     4.6                         18.0
8      83      0.4                         1.6
9      63      0.3                         1.2
10     123     0.6                         2.3
11     75      0.4                         1.4
12     202     1.0                         3.8
13     24      0.1                         0.5
14     26      0.1                         0.5
15     13      0.1                         0.2
16     1       0.0                         0.0
17     9       0.0                         0.2
19     4       0.0                         0.1
24     21      0.1                         0.4
38     2       0.0                         0.0
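As a sanity check, the two percentage columns can be recomputed from the raw counts and the 20,659-protein proteome size quoted above:

```r
# Rebuild the table's percentage columns from the raw TMD counts
tmds   <- c(1:17, 19, 24, 38)
counts <- c(2402, 327, 159, 413, 77, 276, 947, 83, 63, 123,
            75, 202, 24, 26, 13, 1, 9, 4, 21, 2)
proteome_pct <- round(100 * counts / 20659, 1)        # % of whole proteome
tmdp_pct     <- round(100 * counts / sum(counts), 1)  # % of TM proteins only
data.frame(tmds, counts, proteome_pct, tmdp_pct)
```

Note that sum(counts) / 20659 comes to about 25%, matching the “one quarter” figure above.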

Having written this code, I decided to run some other proteomes to see how they compare.

Model organisms

In these four organisms, between 18% and 29% of the proteome is made up of proteins with TMDs. The pattern of TMD counts is broadly similar, although yeast has no 7TM peak, and the peaks at 2, 4, 6 and 12 TMs differ from human.

Maybe this information is out there in some database or other; as I said, I couldn’t find it easily. Even if there are more precise ways of determining TMDs, I think these data are good enough to show roughly what the proportions are.

The code

I manually downloaded the fasta.gz files for the reference proteomes. They are currently linked here.

To extract all the Uniprot IDs, I used a shell one-liner:

awk '/^>sp\|.*\|/{gsub(/^>sp\|/,""); gsub(/\|.*/,""); print ">" $0; next} {print}' file.fasta | grep "^>" | sed 's/>//g' > species_uniprot.txt
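The same extraction could also be done without leaving R; a minimal sketch, assuming Swiss-Prot headers of the form >sp|ACCESSION|NAME (the file names are placeholders):

```r
# Hypothetical base-R equivalent of the shell one-liner above:
# pull accessions out of ">sp|P12345|..." FASTA headers.
fasta   <- readLines("file.fasta")
headers <- grep("^>sp\\|", fasta, value = TRUE)
ids     <- sub("^>sp\\|([^|]+)\\|.*$", "\\1", headers)
writeLines(ids, "species_uniprot.txt")
```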

Then I used this R script. The main function can probably be simplified. I had to add several checks to make sure I got all the data back from the API. Before posting, I tried to cut it back to make it easier to read, but I only succeeded in breaking the script! This is the working version.

# if (!require("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager") 
# }
# BiocManager::install("biomaRt")
library(httr)
library(stringr)
library(ggplot2)
library(biomaRt)
library(dplyr)
library(tidyr)
library(cowplot)

## FUNCTIONS ----

isJobReady <- function(jobId, pollingInterval = 5, maxWaitSeconds = 3600) {
  if (is.null(jobId) || length(jobId) == 0 || is.na(jobId) || !nzchar(jobId)) {
    return(FALSE)
  }
  nTries <- ceiling(maxWaitSeconds / pollingInterval)
  for (i in 1:nTries) {
    url <- paste("https://rest.uniprot.org/idmapping/status/", jobId, sep = "")
    r <- GET(url = url, accept_json())
    status <- content(r, as = "parsed")
    if (!is.null(status[["results"]]) || !is.null(status[["failedIds"]])) {
      return(TRUE)
    }
    if (!is.null(status[["messages"]])) {
      print(status[["messages"]])
      return(FALSE)
    }
    Sys.sleep(pollingInterval)
  }
  return(FALSE)
}

retrieveUniprotInfo <- function(x,
                                chunk_size = 5000,
                                maxWaitSeconds = 3600,
                                taxId = "9606",
                                progress = TRUE) {
  normalize_uniprot_ids <- function(values) {
    values <- trimws(values)
    # Accept FASTA-style headers like: sp|P12345|... or tr|A0A...|...
    m <- str_match(values, regex("^>?\\s*(?:sp|tr)\\|([^|]+)\\|", ignore_case = TRUE))
    values <- ifelse(!is.na(m[, 2]), m[, 2], values)
    values
  }
  
  ids <- unique(normalize_uniprot_ids(x))
  ids <- ids[!is.na(ids) & nzchar(ids)]
  if (length(ids) == 0) {
    stop("No valid identifiers were provided to retrieveUniprotInfo().")
  }
  
  fields <- "accession,id,protein_name,gene_names,ft_transmem,length,cc_function,cc_subcellular_location,go_p,go_c"
  acc_pattern <- "^[OPQ][0-9][A-Z0-9]{3}[0-9](-[0-9]+)?$|^[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}(-[0-9]+)?$"
  is_accession <- str_detect(ids, acc_pattern)
  
  split_into_chunks <- function(values, chunk_size) {
    split(values, ceiling(seq_along(values) / chunk_size))
  }
  
  get_next_link <- function(link_header) {
    if (is.null(link_header)) {
      return(NULL)
    }
    links <- unlist(strsplit(link_header, ",\\s*"))
    next_link <- links[str_detect(links, "rel=\\\"next\\\"")]
    if (length(next_link) == 0) {
      return(NULL)
    }
    next_url <- str_extract(next_link[1], "(?<=<).+?(?=>)")
    if (is.na(next_url) || !nzchar(next_url)) {
      return(NULL)
    }
    next_url
  }
  
  read_tsv_response <- function(resp) {
    read.table(
      text = content(resp, as = "text", encoding = "UTF-8"),
      sep = "\t",
      header = TRUE,
      fill = TRUE,
      quote = "",
      comment.char = "",
      check.names = FALSE
    )
  }
  
  fetch_from_redirect <- function(redirect_url) {
    if (is.null(redirect_url) || length(redirect_url) == 0 ||
        is.na(redirect_url) || !nzchar(redirect_url)) {
      return(NULL)
    }
    
    # The paged idmapping results endpoint is capped at size <= 500.
    # Use stream endpoint to retrieve the full chunk in one response.
    stream_url <- gsub("/results/", "/results/stream/", redirect_url)
    sep <- ifelse(str_detect(stream_url, "\\?"), "&", "?")
    stream_url <- paste0(
      stream_url,
      sep,
      "fields=", URLencode(fields, reserved = TRUE),
      "&format=tsv"
    )
    
    r <- GET(url = stream_url)
    if (status_code(r) < 400) {
      return(read_tsv_response(r))
    }
    
    # Fallback to paged endpoint if stream is unavailable.
    sep <- ifelse(str_detect(redirect_url, "\\?"), "&", "?")
    url <- paste0(
      redirect_url,
      sep,
      "fields=", URLencode(fields, reserved = TRUE),
      "&format=tsv&size=500"
    )
    
    r <- GET(url = url)
    stop_for_status(r)
    resultsTable <- read_tsv_response(r)
    
    next_url <- get_next_link(headers(r)[["link"]])
    while (!is.null(next_url) && !is.na(next_url) && nzchar(next_url)) {
      r <- GET(url = next_url)
      stop_for_status(r)
      resultsTable <- rbind(resultsTable, read_tsv_response(r))
      next_url <- get_next_link(headers(r)[["link"]])
    }
    
    resultsTable
  }
  
  map_ids <- function(values, from_db, to_db, chunk_size, taxId = NULL,
                      label = "ids") {
    if (length(values) == 0) {
      return(NULL)
    }
    
    results_list <- list()
    chunks <- split_into_chunks(values, chunk_size = chunk_size)
    n_chunks <- length(chunks)
    for (i in seq_along(chunks)) {
      chunk <- chunks[[i]]
      if (isTRUE(progress)) {
        cat(sprintf("[UniProt] %s chunk %d/%d (%d ids)\n",
                    label, i, n_chunks, length(chunk)))
      }
      
      files <- list(
        ids = paste0(chunk, collapse = ","),
        from = from_db,
        to = to_db
      )
      if (!is.null(taxId)) {
        files$taxId <- taxId
      }
      
      r <- POST(url = "https://rest.uniprot.org/idmapping/run", body = files,
                encode = "multipart", accept_json())
      stop_for_status(r)
      submission <- content(r, as = "parsed", encoding = "UTF-8")
      
      job_id <- submission[["jobId"]]
      if (is.null(job_id) || length(job_id) == 0 || is.na(job_id) || !nzchar(job_id)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: no jobId returned\n",
                      label, i, n_chunks))
        }
        next
      }
      if (!isJobReady(job_id, maxWaitSeconds = maxWaitSeconds)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: timeout/not ready\n",
                      label, i, n_chunks))
        }
        next
      }
      
      details_url <- paste("https://rest.uniprot.org/idmapping/details/",
                           job_id, sep = "")
      r <- GET(url = details_url, accept_json())
      stop_for_status(r)
      details <- content(r, as = "parsed", encoding = "UTF-8")
      
      redirect_url <- details[["redirectURL"]]
      if (is.null(redirect_url) || length(redirect_url) == 0 ||
          is.na(redirect_url) || !nzchar(redirect_url)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: missing redirectURL\n",
                      label, i, n_chunks))
        }
        next
      }
      chunk_result <- fetch_from_redirect(redirect_url)
      if (is.null(chunk_result)) {
        if (isTRUE(progress)) {
          cat(sprintf("[UniProt] %s chunk %d/%d: invalid redirectURL\n",
                      label, i, n_chunks))
        }
        next
      }
      
      results_list[[length(results_list) + 1]] <- chunk_result
      if (isTRUE(progress)) {
        cat(sprintf("[UniProt] %s chunk %d/%d: completed\n",
                    label, i, n_chunks))
      }
    }
    
    if (length(results_list) == 0) {
      return(NULL)
    }
    do.call(rbind, results_list)
  }
  
  accession_ids <- ids[is_accession]
  gene_like_ids <- ids[!is_accession]
  
  acc_results <- map_ids(
    values = accession_ids,
    from_db = "UniProtKB_AC-ID",
    to_db = "UniProtKB",
    chunk_size = chunk_size,
    label = "accessions"
  )
  
  gene_results <- map_ids(
    values = gene_like_ids,
    from_db = "Gene_Name",
    to_db = "UniProtKB-Swiss-Prot",
    chunk_size = chunk_size,
    taxId = taxId,
    label = "gene_names"
  )
  
  results_list <- list()
  if (!is.null(acc_results)) {
    results_list[[length(results_list) + 1]] <- acc_results
  }
  if (!is.null(gene_results)) {
    results_list[[length(results_list) + 1]] <- gene_results
  }
  
  if (length(results_list) == 0) {
    stop("No UniProt results were returned. Check identifiers and taxId.")
  }
  
  resultsTable <- do.call(rbind, results_list)
  if ("Entry" %in% colnames(resultsTable)) {
    resultsTable <- resultsTable[!duplicated(resultsTable$Entry), ]
  }
  return(resultsTable)
}



## SCRIPT ----

species <- c("human", "zebrafish", "drosophila", "worm", "yeast")
sci_names <- c("human" = "Homo sapiens", "zebrafish" = "Danio rerio", "drosophila" = "Drosophila melanogaster",
               "worm" = "Caenorhabditis elegans", "yeast" = "Saccharomyces cerevisiae")

for (org in species) {
  # look up scientific name of org
  sci_name <- sci_names[org]
  output_path <- paste0("Output/Data/", org, "_uniprot.csv")
  if(file.exists(output_path)) {
    message(paste("File", output_path, "already exists. Loading", org))
    df <- read.csv(output_path)
  } else {
    message(paste("Retrieving UniProt info for", org))
    species_ids <- read.delim(paste0("Data/", org, "_uniprot.txt"), header = FALSE)
    names(species_ids) <- c("uniprot_id")
    df <- retrieveUniprotInfo(species_ids$uniprot_id)
    # save this result
    write.csv(df, output_path, row.names = FALSE)
  }
  
  df$tms <- str_count(df$Transmembrane, "TRANSMEM")
  tm_counts <- df %>%
    group_by(tms) %>%
    summarise(count = n()) %>% 
    filter(tms > 0)
  
  p1 <- ggplot(tm_counts, aes(x = tms, y = count)) +
    geom_col(fill = "#009988") +
    labs(x = "Transmembrane domains", y = "Count",
         title = sci_name) +
    lims(x = c(0.5,NA), y = c(0,NA)) +
    theme_bw(10)
  
  p2 <- SuperPlotR::pieplot(x1 = c(sum(df$tms == 0), sum(df$tms > 0)),
                            cols = c("#bbbbbb", "#009988")) +
    # blank background and no legend
    theme_void() +
    theme(legend.position = "none")
  
  # inset p2 in p1 and add information about the percentages
  p <- ggdraw() +
    draw_plot(p1) +
    # top right
    draw_plot(p2, x = 0.9, y = 0.9, hjust = 1, vjust = 1, width = 0.4, height = 0.4) +
    draw_label(paste0("Total Proteins: ", nrow(df),
                      "\nNo TM: ", round(sum(df$tms == 0) / nrow(df) * 100, 1),
                      "%\nWith TM(s): ",
                      round(sum(df$tms > 0) / nrow(df) * 100, 1), "%"),
               x = 0.97, y = 0.85, hjust = 1, vjust = 1, size = 8)
  plot_path <- paste0("Output/Plots/", org, "_uniprot_tm_counts.png")
  ggsave(plot_path, p, width = 1600, height = 900, units = "px", dpi = 300)
}

Note that I am using {cowplot} at the end to inset the pie chart and to add the text onto the main plot.

The post title comes from “My Domain” by Bernard Butler from his People Move On album.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Expanding the Editorial Team: Alec Robitaille and Lucy D’Agostino McGowan Join as Editors https://www.r-bloggers.com/2026/04/expanding-the-editorial-team-alec-robitaille-and-lucy-dagostino-mcgowan-join-as-editors/ Thu, 16 Apr 2026 00:00:00 +0000 https://ropensci.org/blog/2026/04/16/editors2026/
[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

At rOpenSci, we’re continually grateful for the support and engagement of our community, who help make research open-source stronger, more inclusive, and more collaborative. The software peer review program continues to grow, and today we announce that our editorial team keeps expanding:

We’re excited to welcome Alec Robitaille and Lucy D’Agostino McGowan as new editors. Alec joins our general review team, and Lucy our statistical software review team. Their expertise and dedication will help sustain and strengthen software peer review, ensuring that software reviews continue to meet the highest standards of quality, transparency, and impact.

Meet our new editors!

Alec Robitaille

headshot of Alec Robitaille

Alec is a graduate student at Memorial University of Newfoundland and Labrador (Canada) studying foraging ecology, habitat selection and social networks in caribou and other ungulates. He has been involved in many projects along the way, from measuring muskrat habitat and lake ice dynamics at McGill University to estimating drought sensitivity in Canada’s forests with the Canadian Forestry Service. Passionate about teaching open science and programming, he regularly mentors peers, runs workshops, develops example repositories, and organizes a Bayesian stats colearning group. He maintains the package spatsoc (reviewed by rOpenSci in 2018) and has developed smaller packages with diverse applications including remote sensing (irg), social networks (hwig), camera trap monitoring (camtrapmonitoring), and animal movement (distanceto). He reviewed the ohun, chopin, and emodnet.wfs packages for rOpenSci, guest edited for the rredlist package and is currently handling the reviews for the ActiGlobe and saperlipopette packages.

Alec on GitHub, Website.

I first connected with the rOpenSci community through the review process for the package spatsoc in 2018 as part of our manuscript submission at Methods in Ecology and Evolution. During the review, I was thoroughly impressed by how welcoming the community was, and how effective the process was in helping me learn how to improve the package. rOpenSci is a landmark in the R and open science ecosystems with an ever evolving community to learn from and to be a part of. I am very grateful to be given the opportunity to continue contributing to rOpenSci in this new role as editor.

Alec L. Robitaille

Lucy D’Agostino McGowan

headshot of Lucy D'Agostino McGowan

Lucy D’Agostino McGowan is an associate professor in the Department of Statistical Sciences at Wake Forest University. She received her PhD in Biostatistics from Vanderbilt University and completed her postdoctoral training at Johns Hopkins University Bloomberg School of Public Health. Her research focuses on causal inference, statistical communication, analytic design theory, and data science pedagogy. Lucy can be found blogging at livefreeordichotomize.com, on Blue Sky @LucyStats.bsky.social, and podcasting on Casual Inference.

Lucy on GitHub, Website.

I am so thrilled to join the rOpenSci editorial team! I love the rOpenSci community and mission and am grateful for the opportunity to contribute.

Lucy D’Agostino McGowan

About the Software Peer Review Program

rOpenSci’s software peer review program brings together volunteers to collaboratively review scientific and statistical software according to transparent, constructive, and open standards. Editors manage submissions, coordinate reviewers, and help guide packages through review to improve code quality, documentation, and usability.

This program is possible thanks to the many community members: authors submitting their packages, reviewers volunteering their time and expertise, and editors like Alec and Lucy who help manage reviews and maintain a supportive process.

Get Involved

Are you considering submitting your package for review? These resources will help:

Would you like to review packages? Fill out the rOpenSci Reviewer Sign-Up Form to volunteer to review.

Welcome again Alec and Lucy! We’re thrilled to have you join the editorial team.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Stage II OSCC — Health Economics Model https://www.r-bloggers.com/2026/04/stage-ii-oscc-health-economics-model/ Thu, 16 Apr 2026 00:00:00 +0000 https://rworks.dev/posts/oscc-patient-model/

[This article was first published on R Works, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

Most health care economics models are constructed from the perspective of a managed health care system such as those offered in Canada and several European countries, or from the perspective of some other third party such as an insurance company. Although the benefits of constructing models from the patient’s perspective have been discussed in the literature, see for example Ioannidis & Garber (2011) and Tai et al. (2016), few models have been published. Obvious reasons for this situation would include the privacy issues associated with obtaining patient specific data and the apparent lack of economic incentives to commit resources required to abstract individual patient trajectories from patient level data.

This post explores the hypothesis that detailed, data-driven, patient-specific models may not be necessary in order to achieve most of the benefits of a health economics model from the patient’s perspective. It presents a lightweight cohort model that captures the majority of the benefits to both physicians and patients that might result from a patient-focused model.

The model is intended to be a proof of concept, a minimal viable model that may be useful to physicians both in deciding on treatment options and in informing patients about the potential outcomes and how they may experience these outcomes. To see how the medical literature may be helpful for this kind of modeling I explore the specific case of choosing between surgery and definitive radiation therapy for patients with stage II oral squamous cell carcinoma (OSCC).

Note that the model developed here is not being presented as a medical analysis. No medical experts have been consulted in its construction. It is merely offered as an example of what could be done to build a health care economics model from a patient’s perspective with today’s modeling tools. Also note that I made significant use of Posit Assistant for the RStudio IDE configured with Claude Sonnet 4.6, which I found immensely helpful both for code construction and for literature searches, even though the utility of the latter was limited by Claude’s vexing proclivity to hallucinate.

Overview of the Model

The recovery of patients undergoing treatment for OSCC, either surgery or radiation, is conceived as a stochastic journey through various health states. The states visited, the sequence in which they are visited and the length of stay in each health state are modeled as random variables within the framework of an eight-state, continuous-time Markov chain. Estimates of the transition probabilities among states and of the mean time patients remain in each state drive the Markov chain. In the model below, these estimates are derived from the medical literature; however, it is conceivable that clinicians may feel comfortable making their own estimates, or modifying the literature-derived estimates according to their experience and their evaluations of their patients.

The next step uses the theory of continuous-time Markov chains (CTMCs) to construct a synthetic data set for a cohort of patients who vary in age and tumor size. The key insight here is that the synthetic data set is the underlying statistical model. Empirical survival curves, the time spent in each state, and other useful quantities are then calculated from the synthetic data. The patient-specific healthcare model is then constructed by estimating the utility of each health state to the patient. This is accomplished by constructing quality-adjusted life years (QALYs) based on EQ-5D values reported in the literature.
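The QALY construction described above reduces to a weighted sum: expected years spent in each state multiplied by that state's utility. The sketch below illustrates the arithmetic; the occupancy times and utility weights here are illustrative placeholders, not the literature-derived EQ-5D values used in the model itself.

```r
# Minimal QALY sketch: expected months in each state weighted by a
# per-state utility. All numbers below are hypothetical placeholders,
# NOT literature-derived EQ-5D estimates.
months_in_state <- c(Surgery = 1.5, AdjuvantRT = 3.0,
                     NED = 45.0, Recurrence = 6.0)
utility         <- c(Surgery = 0.70, AdjuvantRT = 0.65,
                     NED = 0.85, Recurrence = 0.55)

# QALYs = sum over states of (years in state x utility of state)
qalys <- sum(months_in_state / 12 * utility)
round(qalys, 2)
# 3.71
```

With real inputs, `months_in_state` would come from the CTMC occupancy calculation later in the post and `utility` from published EQ-5D values.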

State Diagram for a Continuous Time Markov Chain

Packages used throughout the post
library(ggplot2)
library(grid)
library(dplyr)
library(msm)
library(gt)

It is my understanding that although most patients with stage II OSCC are treated with surgery, there are times when surgery is either not possible or not preferable, and in these exceptional cases radiation treatment is the primary alternative. The following state diagram is an attempt to abstract both the health states a patient will experience following treatment and the probable paths taken among them. Eight states seemed to me to be the minimum number required to represent the complexities of treatment.

Show code to build state diagram
# library(ggplot2)
# library(grid)

BG <- "#F0F3F7"

# ── 1. Nodes ─────────────────────────────────────────────────────────────────
# Layout: left=initial treatment, center=adjuvant/NED, right=outcomes/death
nodes <- data.frame(
  id    = 1:8,
  label = c(
    "S1\nSurgery\n± Neck Dissection",
    "S2\nDefinitive\nRadiation",
    "S3\nPost-op\nSurveillance",
    "S4\nAdjuvant RT\n(PORT)",
    "S5\nAdjuvant\nChemoRT (POCRT)",
    "S6\nNED\n(Health)",
    "S7\nLocoregional\nRecurrence",
    "S8\nDeath"
  ),
  x    = c(1.5,  1.5,  4.5,  4.5,  4.5,  7.0,  9.5,  11.5),
  y    = c(8.5,  3.0,  9.5,  6.5,  3.5,  6.5,  7.5,   5.5),
  fill = c("#E67E22","#8E44AD","#F39C12","#2980B9","#16A085",
           "#27AE60","#C0392B","#2C3E50"),
  stringsAsFactors = FALSE
)

# ── 2. Edge definitions: from, to, label, curvature, nudge_x, nudge_y ────────
edge_defs <- list(
  # Surgery → downstream
  c(1, 3,  "No high-risk\nfeatures",    0.00,  0.10,  0.30),
  c(1, 4,  "High-risk:\nmargins/PNI/\nLVI/nodes",  0.10, -0.20,  0.20),
  c(1, 5,  "ECE/pos\nmargins",          0.18,  0.00, -0.30),
  c(1, 8,  "Peri-op\ndeath",           -0.22,  0.10, -0.20),
  # Definitive RT → downstream
  c(2, 6,  "NED",                       0.18,  0.00,  0.25),
  c(2, 7,  "Treatment\nfailure",        0.00,  0.00,  0.25),
  c(2, 8,  "Death",                     0.14, -0.10, -0.20),
  # Post-op Surveillance → downstream
  c(3, 6,  "",                          0.00,  0.10,  0.25),
  c(3, 7,  "",                          0.22,  0.20, -0.10),
  c(3, 8,  "",                         -0.18,  0.10, -0.18),
  # PORT → downstream
  c(4, 6,  "",                          0.00,  0.10,  0.25),
  c(4, 7,  "",                          0.12,  0.10,  0.18),
  c(4, 8,  "",                          0.14,  0.10, -0.18),
  # POCRT → downstream
  c(5, 6,  "",                          0.00,  0.10,  0.25),
  c(5, 7,  "",                          0.00,  0.10,  0.22),
  c(5, 8,  "",                          0.12,  0.10, -0.18),
  # NED → outcomes
  c(6, 7,  "LR recurrence",             0.00, -0.10,  0.25),
  c(6, 8,  "Death",                     0.18,  0.10, -0.14),
  # Locoregional Recurrence → outcomes
  c(7, 8,  "Death",                     0.00,  0.00,  0.25),
  c(7, 6,  "Salvage\nsuccess",         -0.28, -0.28,  0.00)
)

edges <- do.call(rbind, lapply(edge_defs, function(r) {
  data.frame(from=as.integer(r[[1]]), to=as.integer(r[[2]]),
             prob=r[[3]], curvature=as.numeric(r[[4]]),
             nudge_x=as.numeric(r[[5]]), nudge_y=as.numeric(r[[6]]),
             stringsAsFactors=FALSE)
}))

# Attach node coordinates
edges <- merge(edges, nodes[, c("id","x","y")], by.x="from", by.y="id", sort=FALSE)
names(edges)[names(edges)=="x"] <- "x0"; names(edges)[names(edges)=="y"] <- "y0"
edges <- merge(edges, nodes[, c("id","x","y","fill")], by.x="to", by.y="id", sort=FALSE)
names(edges)[names(edges)=="x"] <- "x1"; names(edges)[names(edges)=="y"] <- "y1"
names(edges)[names(edges)=="fill"] <- "dest_col"

# Shorten endpoints to node perimeter
NODE_R <- 0.58
shorten <- function(x0,y0,x1,y1,r) {
  dx<-x1-x0; dy<-y1-y0; d<-sqrt(dx^2+dy^2)
  list(xs=x0+dx*r/d, ys=y0+dy*r/d, xe=x1-dx*r/d, ye=y1-dy*r/d)
}
segs     <- Map(shorten, edges$x0, edges$y0, edges$x1, edges$y1, MoreArgs=list(r=NODE_R))
edges$xs <- sapply(segs,`[[`,"xs"); edges$ys <- sapply(segs,`[[`,"ys")
edges$xe <- sapply(segs,`[[`,"xe"); edges$ye <- sapply(segs,`[[`,"ye")
edges$lx <- (edges$xs+edges$xe)/2 + edges$nudge_x
edges$ly <- (edges$ys+edges$ye)/2 + edges$nudge_y

# ── 3. Build layers ───────────────────────────────────────────────────────────
arrow_layers <- lapply(seq_len(nrow(edges)), function(i) {
  e <- edges[i,,drop=FALSE]
  geom_curve(data=e, mapping=aes(x=xs,y=ys,xend=xe,yend=ye),
             curvature=e$curvature[[1]], colour=e$dest_col[[1]],
             linewidth=1.0, alpha=0.80,
             arrow=arrow(length=unit(9,"pt"), type="closed", ends="last"),
             show.legend=FALSE)
})

label_layers <- lapply(seq_len(nrow(edges)), function(i) {
  e <- edges[i,,drop=FALSE]
  if (nchar(trimws(e$prob[[1]])) == 0) return(NULL)
  geom_label(data=e, mapping=aes(x=lx,y=ly,label=prob),
             colour=e$dest_col[[1]], fill=BG, size=2.6, fontface="bold",
             label.size=0.22, label.r=unit(0.10,"lines"),
             label.padding=unit(0.15,"lines"), show.legend=FALSE)
})
label_layers <- Filter(Negate(is.null), label_layers)

# ── 4. Base plot ──────────────────────────────────────────────────────────────
base_plot <- ggplot() +
  theme_void(base_size=10) +
  theme(
    plot.background  = element_rect(fill=BG, colour=NA),
    panel.background = element_rect(fill=BG, colour=NA),
    plot.title  = element_text(family="serif", face="bold", size=16,
                               colour="#1A2535", hjust=0.5, margin=margin(b=4)),
    plot.subtitle = element_text(size=9, colour="#555555", hjust=0.5,
                                 margin=margin(b=10))
  ) +
  coord_fixed(xlim=c(0, 13), ylim=c(1.5, 11), clip="off") +
  labs(title="Stage II Oral Squamous Cell Carcinoma",
       subtitle="CTMC State Transition Diagram")

# ── 5. Assemble ───────────────────────────────────────────────────────────────
all_layers <- c(arrow_layers, label_layers,
  list(
    # Column labels
    annotate("text", x=1.5,  y=10.8, label="Initial\nTreatment",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=4.5,  y=10.8, label="Post-surgical\nPathway",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=7.0,  y=10.8, label="No Evidence\nof Disease",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=9.5,  y=10.8, label="Disease\nProgression",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    annotate("text", x=11.5, y=10.8, label="Absorbing\nState",
             size=3.2, colour="#555", fontface="italic", hjust=0.5),
    # Nodes
    geom_point(data=nodes, aes(x=x, y=y, colour=fill),
               size=28, show.legend=FALSE),
    scale_colour_identity(),
    geom_text(data=nodes, aes(x=x, y=y, label=label),
              colour="white", size=2.8, fontface="bold", lineheight=0.9),
    # Surgical candidate bracket
    annotate("segment", x=0.25, xend=0.25, y=2.2, yend=9.3,
             colour="#888", linewidth=0.5),
    annotate("segment", x=0.25, xend=0.45, y=9.3, yend=9.3,
             colour="#888", linewidth=0.5),
    annotate("segment", x=0.25, xend=0.45, y=2.2, yend=2.2,
             colour="#888", linewidth=0.5),
    annotate("text", x=0.0, y=5.75, label="Surgical\ncandidate?\nYes ↑\nNo ↓",
             size=2.8, colour="#666", hjust=0.5, lineheight=0.9)
  )
)

Reduce("+", all_layers, init=base_plot)

State Label Role
S1 Surgery Initial surgical treatment (days–weeks)
S2 DefinitiveRT Non-surgical candidates; 6–7 week RT course
S3 PostOpSurveillance Surgery, no high-risk pathological features
S4 AdjuvantRT (PORT) High-risk: margins, PNI, LVI, multiple nodes
S5 AdjuvantChemoRT (POCRT) Highest-risk: ECE or positive margins
S6 NED No evidence of disease; long-term surveillance
S7 LR Recurrence Locoregional recurrence at primary or nodes
S8 Death Absorbing state

Synthetic Data Simulation

This section builds a synthetic patient cohort for Stage II (T2 N0 M0) oral squamous cell carcinoma using a continuous-time Markov chain (CTMC). The first code block sets up the simulation by defining the covariates: patient age and tumor size. Mean patient age is set at 62 years. Tumor size ranges from 2-4 cm, with depth of invasion (DOI) 5-10 mm.

Simulation setup
set.seed(6413)

N_PATIENTS <- 1000
MAX_TIME   <- 60      # months (5-year follow-up)
obs_base   <- c(0, 1, 2, 3, 6, 9, 12, 18, 24, 36, 48, 60)

state_labels_ii <- c(
  "Surgery", "DefinitiveRT", "PostOpSurveillance",
  "AdjuvantRT_PORT", "AdjuvantChemoRT_POCRT",
  "NED", "LR_Recurrence", "Death"
)

# ── Patient covariates ────────────────────────────────────────────────────────
# Stage II OSCC (T2 N0 M0): tumor 2-4 cm, DOI 5-10 mm
age           <- as.integer(round(rnorm(N_PATIENTS, mean = 62, sd = 11)))
age           <- pmax(32L, pmin(85L, age))
tumor_size_cm <- round(runif(N_PATIENTS, 2.0, 4.0), 1)
DOI_mm        <- round(runif(N_PATIENTS, 5.0, 10.0), 1)

# Surgical candidacy: logistic model, ~78% surgery at age 60
surgery_prob <- plogis(1.5 - 0.04 * (age - 60))
treatment    <- ifelse(runif(N_PATIENTS) < surgery_prob, "Surgery", "DefinitiveRT")

Clinician Inputs

The simulation is driven by two sets of clinician-interpretable inputs that are entered at the beginning of this section of code:

  • Jump chain probabilities jump_P[i,j]: given that a patient leaves state i, the probability they transition directly to state j.
  • Mean sojourn times mean_sojourn[i]: the average time (months) a patient spends in state i before transitioning.

The jump chain may be thought of as an embedded discrete-time Markov chain that gives the probabilities of the state the chain will jump to next when it is time to jump. The sojourn times are inverted to obtain the transition rates. The values for these parameters have been derived from the literature; however, the model has been set up so that they can be easily changed if the physician thinks that a particular class of patients would be better represented by other values. The justifications for the default parameter values driving the model are provided in the comments in the code and in the references below.
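Because the generator is derived mechanically from these two inputs, it is worth validating them before simulating: each transient row of the jump chain must sum to 1, and each transient state needs a positive, finite mean sojourn time. A minimal, self-contained checker (the toy three-state chain below is illustrative, not part of the OSCC model):

```r
# Validate clinician inputs for a CTMC specified as a jump chain plus
# mean sojourn times. `transient` indexes the non-absorbing states.
check_inputs <- function(jump_P, mean_sojourn, transient) {
  stopifnot(
    all(abs(rowSums(jump_P)[transient] - 1) < 1e-8),  # rows sum to 1
    all(is.finite(mean_sojourn[transient])),          # sojourns defined
    all(mean_sojourn[transient] > 0)                  # and positive
  )
  invisible(TRUE)
}

# Toy example: healthy -> ill -> dead, with dead (state 3) absorbing
P <- matrix(c(0, 0.8, 0.2,
              0, 0,   1,
              0, 0,   0), nrow = 3, byrow = TRUE)
check_inputs(P, c(24, 6, NA), transient = 1:2)
```

For the model in this post the same call would be `check_inputs(jump_P, mean_sojourn, transient = 1:7)`, run just before `Q_base` is derived.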

Show code for model inputs
# == CLINICIAN INPUTS ==========================================================
#
# jump_P[i, j]  : probability of moving to state j *given* a transition occurs
#                 from state i.  Rows must sum to 1 for states 1-7.
#
# mean_sojourn  : average time (months) in each transient state.
#
# Together these fully specify the CTMC generator:
#   Q[i, j] = jump_P[i, j] / mean_sojourn[i]   (off-diagonal)
# All downstream simulation code derives Q_base automatically from these inputs.

# ── Jump chain transition probabilities ───────────────────────────────────────


jump_P <- matrix(0, 8, 8,
  dimnames = list(paste0("S", 1:8), paste0("S", 1:8)))

# S1 (Surgery) -> {S3=no high-risk, S4=PORT, S5=POCRT, S8=peri-op death}
# ~20-30% high-risk features -> PORT (Tassone et al. 2023, NCDB n=53,503; DOI:10.1002/ohn.205)
# ~10-15% ECE/pos margins -> POCRT; peri-operative mortality ~1-2% (Nathan et al. 2025, NSQIP n=866; DOI:10.18203/issn.2454-5929.ijohns20252980)
# NB: S8 revised from 0.05 to 0.015 — the original 5% exceeded the literature range of 1-2%;
# freed mass (0.035) reallocated to S3 (residual no-adjuvant group), S4 and S5 unchanged.
jump_P[1, c(3, 4, 5, 8)] <- c(0.585, 0.25, 0.15, 0.015)

# S2 (DefinitiveRT) -> {S6=NED, S7=treatment failure, S8=death}
# Definitive RT alone locoregional control ~65% for T2N0M0 (Dana-Farber Group 2011 PMID 21531515, 2-yr LRC 64%; Studer et al. 2007 T2-4 LC ~50-60%)
jump_P[2, c(6, 7, 8)] <- c(0.65, 0.28, 0.07)

# S3 (PostOpSurveillance) -> {S6=NED, S7=LR recurrence, S8=death}
# 5-yr LR recurrence ~12-18% after margin-negative R0 surgery alone (Ord 2006 15.5%; Luryi 2014 NCDB early-stage series)
jump_P[3, c(6, 7, 8)] <- c(0.74, 0.18, 0.08)

# S4 (PORT) -> {S6=NED, S7=LR recurrence, S8=death}
# After PORT, 3-yr OS ~71% (95% CI 67-75%) in adjuvant PORT cohort (Hosni et al. 2019, PMID: 30244160)
jump_P[4, c(6, 7, 8)] <- c(0.67, 0.23, 0.10)

# S5 (POCRT) -> {S6=NED, S7=LR recurrence, S8=death}
# After POCRT, locoregional control ~60%; Stage II mortality revised down to 10% (Bernier 2004, Cooper 2004; Stage II subgroup)
jump_P[5, c(6, 7, 8)] <- c(0.60, 0.30, 0.10)

# S6 (NED) -> {S7=LR recurrence, S8=death}
# ~75% leave NED via recurrence vs other-cause death (SEER Stage II split)
jump_P[6, c(7, 8)] <- c(0.75, 0.25)

# S7 (LR Recurrence) -> {S6=salvage success, S8=death}
# Salvage success 0.43 / mortality 0.57: Lee et al. 2024 pooled 5-yr OS = 43% (n=2069); revised from c(0.25, 0.75) — previous value over-estimated post-recurrence mortality
jump_P[7, c(6, 8)] <- c(0.43, 0.57)

# ── Mean sojourn times (months) ───────────────────────────────────────────────
mean_sojourn <- c(
  1.5,   # S1  Surgery: weighted mean accounting for real-world S-PORT delays ~8-9 wks (Correia 2026; Dayan 2023; Graboyes 2017)
  3.0,   # S2  DefinitiveRT: RT course (1.5 mo) + post-RT response assessment window >=12 wks (NCCN HNC Guidelines; Mehanna et al. 2016 [PET-NECK] PMID 26958921)
  22.0,  # S3  PostOpSurveillance: weighted mean to event (NED@24mo x0.74 + recur@18mo x0.18 + death@14mo x0.08) = 22 mo (Blatt 2022; Brands 2019; Ord 2006)
  3.0,   # S4  PORT: ~6-week adjuvant RT + recovery — supported, unchanged
  4.0,   # S5  POCRT: ~6-week concurrent chemoRT + extended recovery — supported, unchanged
  120.0, # S6  NED: calibrated to SEER ~20% 5-yr recurrence; exit rate 0.0083/mo -> mean 120 mo (SEER localized OCC; Brands 2019; Luryi 2014)
  12.5,  # S7  LR Recurrence: median OS ~12-18 mo all-comers (Liu 2007; Contrera 2022) — within supported range, unchanged
  NA     # S8  Death (absorbing)
)
names(mean_sojourn) <- paste0("S", 1:8)

# == END CLINICIAN INPUTS ======================================================

# ── Derive Q_base from clinician inputs ───────────────────────────────────────
# Q[i,j] = jump_P[i,j] / mean_sojourn[i]
Q_base <- matrix(0, 8, 8)
for (i in 1:7) {
  for (j in which(jump_P[i, ] > 0)) {
    Q_base[i, j] <- jump_P[i, j] / mean_sojourn[i]
  }
}
diag(Q_base) <- -rowSums(Q_base)
Jump Chain Transition Probabilities
Show code
as.data.frame(jump_P)
   S1 S2    S3   S4   S5   S6   S7    S8
S1  0  0 0.585 0.25 0.15 0.00 0.00 0.015
S2  0  0 0.000 0.00 0.00 0.65 0.28 0.070
S3  0  0 0.000 0.00 0.00 0.74 0.18 0.080
S4  0  0 0.000 0.00 0.00 0.67 0.23 0.100
S5  0  0 0.000 0.00 0.00 0.60 0.30 0.100
S6  0  0 0.000 0.00 0.00 0.00 0.75 0.250
S7  0  0 0.000 0.00 0.00 0.43 0.00 0.570
S8  0  0 0.000 0.00 0.00 0.00 0.00 0.000
Mean Sojourn Times and Implied Rates
Show code
data.frame(
  #State        = paste0("S", 1:7),
  Label        = state_labels_ii[1:7],
  Mean_sojourn = mean_sojourn[1:7],
  Total_rate   = round(-diag(Q_base)[1:7], 4)
)
                   Label Mean_sojourn Total_rate
S1               Surgery          1.5     0.6667
S2          DefinitiveRT          3.0     0.3333
S3    PostOpSurveillance         22.0     0.0455
S4       AdjuvantRT_PORT          3.0     0.3333
S5 AdjuvantChemoRT_POCRT          4.0     0.2500
S6                   NED        120.0     0.0083
S7         LR_Recurrence         12.5     0.0800

Synthetic Data

This code block constructs the synthetic data and prints out the first few rows of the synthetic data set.

Show code
# ── Patient-specific Q (covariates scale recurrence & mortality rates) ─────────
make_Q <- function(age_i, size_i) {
  Q <- Q_base
  age_eff  <- exp(0.025 * (age_i  - 60))   # older  -> higher mortality
  size_eff <- exp(0.180 * (size_i -  3.0))  # larger -> higher recurrence
  Q[3, 7] <- Q_base[3, 7] * size_eff        # PostOpSurv -> LR Recurrence
  Q[6, 7] <- Q_base[6, 7] * size_eff * age_eff
  Q[6, 8] <- Q_base[6, 8] * age_eff
  Q[7, 8] <- Q_base[7, 8] * age_eff
  diag(Q) <- 0
  diag(Q) <- -rowSums(Q)
  Q
}

# ── Simulate one patient's full CTMC path ─────────────────────────────────────
sim_ctmc <- function(init_s, Q_i, max_t) {
  t_vec <- 0; s_vec <- init_s; s <- init_s; t <- 0
  while (t < max_t && s != 8L) {
    total_rate <- -Q_i[s, s]
    if (total_rate == 0) break
    t <- t + rexp(1L, total_rate)
    if (t > max_t) break
    probs    <- pmax(Q_i[s, ], 0)
    probs[s] <- 0
    s        <- sample.int(8L, 1L, prob = probs)
    t_vec    <- c(t_vec, t); s_vec <- c(s_vec, s)
  }
  list(times = t_vec, states = s_vec)
}

# State at a given observation time (last known state)
state_at <- function(traj, obs_t) {
  traj$states[max(which(traj$times <= obs_t))]
}

# ── Generate panel data ────────────────────────────────────────────────────────
records <- vector("list", N_PATIENTS)

for (i in seq_len(N_PATIENTS)) {
  init_s  <- if (treatment[i] == "Surgery") 1L else 2L
  Q_i     <- make_Q(age[i], tumor_size_cm[i])
  traj    <- sim_ctmc(init_s, Q_i, MAX_TIME)

  jitter  <- c(0, runif(length(obs_base) - 1L, -0.5, 0.5))
  obs_t   <- sort(unique(pmax(0, obs_base + jitter)))

  death_t <- if (8L %in% traj$states)
    traj$times[which(traj$states == 8L)[1L]]
  else Inf

  obs_t <- obs_t[obs_t < death_t]
  if (is.finite(death_t) && death_t <= MAX_TIME)
    obs_t <- c(obs_t, death_t)

  obs_s <- sapply(obs_t, state_at, traj = traj)

  records[[i]] <- data.frame(
    id            = i,
    time          = round(obs_t, 2),
    state         = obs_s,
    age           = age[i],
    tumor_size_cm = tumor_size_cm[i],
    DOI_mm        = DOI_mm[i],
    treatment     = treatment[i],
    stringsAsFactors = FALSE
  )
}

oscc2_data <- do.call(rbind, records)
rownames(oscc2_data) <- NULL
oscc2_data$state_label <- state_labels_ii[oscc2_data$state]

# Save synthetic data
write.csv(oscc2_data, "OSCC_HE_Synthetic_Data.csv", row.names = FALSE)

# First 10 rows
head(oscc2_data, 10)
   id  time state age tumor_size_cm DOI_mm treatment        state_label
1   1  0.00     1  59           2.2    7.8   Surgery            Surgery
2   1  1.38     1  59           2.2    7.8   Surgery            Surgery
3   1  1.82     1  59           2.2    7.8   Surgery            Surgery
4   1  3.01     1  59           2.2    7.8   Surgery            Surgery
5   1  6.04     3  59           2.2    7.8   Surgery PostOpSurveillance
6   1  8.50     3  59           2.2    7.8   Surgery PostOpSurveillance
7   1 11.61     3  59           2.2    7.8   Surgery PostOpSurveillance
8   1 17.91     3  59           2.2    7.8   Surgery PostOpSurveillance
9   1 24.01     3  59           2.2    7.8   Surgery PostOpSurveillance
10  1 35.86     6  59           2.2    7.8   Surgery                NED

Note that two covariates, patient age and tumor size, are included in the data and encoded in the make_Q function used to create the synthetic data as follows:

  • age_eff <- exp(0.025 * (age_i - 60)) # older -> higher mortality
  • size_eff <- exp(0.180 * (size_i - 3.0)) # larger -> higher recurrence
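To make these effects concrete, the hazard multipliers implied by the two formulas can be evaluated at a few covariate values (this simply restates the scaling inside make_Q; no new model content):

```r
# Hazard multipliers implied by the covariate effects in make_Q()
age_mult  <- function(age)  exp(0.025 * (age  - 60))   # scales mortality rates
size_mult <- function(size) exp(0.180 * (size - 3.0))  # scales recurrence rates

round(age_mult(c(50, 60, 70, 80)), 3)
# 0.779 1.000 1.284 1.649  -> a 70-year-old's mortality rates are ~28% higher
round(size_mult(c(2.0, 3.0, 4.0)), 3)
# 0.835 1.000 1.197        -> a 4 cm tumour raises recurrence rates by ~20%
```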

Summary of Data Set

This table shows some of the meta-data describing the synthetic data set.

Show code for cohort summary
# Collect components for footnotes
trt_counts <- table(treatment)
died <- oscc2_data |>
  dplyr::filter(state == 8) |>
  dplyr::distinct(id) |>
  nrow()

# Main table: state visit counts
as.data.frame(table(State = oscc2_data$state_label)) |>
  gt() |>
  tab_options(table.font.size = px(12)) |>
  cols_label(State = "State", Freq = "Observations") |>
  tab_source_note(md(paste0(
    "**Dataset:** ", nrow(oscc2_data), " observations; ",
    N_PATIENTS, " patients"
  ))) |>
  tab_source_note(md(paste0(
    "**Treatment:** Surgery = ", trt_counts["Surgery"],
    "; DefinitiveRT = ", trt_counts["DefinitiveRT"]
  ))) |>
  tab_source_note(md(paste0(
    "**Mortality:** ", died, " / ", N_PATIENTS,
    " patients reached Death (",
    round(100 * died / N_PATIENTS, 1), "%)"
  )))
State Observations
AdjuvantChemoRT_POCRT 228
AdjuvantRT_PORT 333
Death 370
DefinitiveRT 570
LR_Recurrence 767
NED 4407
PostOpSurveillance 2381
Surgery 1578
Dataset: 10634 observations; 1000 patients
Treatment: Surgery = 794; DefinitiveRT = 206
Mortality: 370 / 1000 patients reached Death (37%)

First Patient Trajectory

This data frame shows the complete trajectory of states visited by the first synthetic patient.

Show code for trajectory
oscc2_data[oscc2_data$id == 1,
           c("id", "time", "state", "state_label", "treatment",
             "age", "tumor_size_cm", "DOI_mm")]
   id  time state        state_label treatment age tumor_size_cm DOI_mm
1   1  0.00     1            Surgery   Surgery  59           2.2    7.8
2   1  1.38     1            Surgery   Surgery  59           2.2    7.8
3   1  1.82     1            Surgery   Surgery  59           2.2    7.8
4   1  3.01     1            Surgery   Surgery  59           2.2    7.8
5   1  6.04     3 PostOpSurveillance   Surgery  59           2.2    7.8
6   1  8.50     3 PostOpSurveillance   Surgery  59           2.2    7.8
7   1 11.61     3 PostOpSurveillance   Surgery  59           2.2    7.8
8   1 17.91     3 PostOpSurveillance   Surgery  59           2.2    7.8
9   1 24.01     3 PostOpSurveillance   Surgery  59           2.2    7.8
10  1 35.86     6                NED   Surgery  59           2.2    7.8
11  1 47.91     6                NED   Surgery  59           2.2    7.8
12  1 59.89     6                NED   Surgery  59           2.2    7.8

Survival Analysis

Overall Survival

This next plot shows overall survival (Kaplan-Meier) curves for patients in both treatment arms. Note that surgery exhibits a slightly better overall survival curve (approximately 73% vs approximately 65% at 5 years). However, both curves show very good five-year survival probabilities.

Three external benchmark points are overlaid on the Kaplan–Meier curves as filled diamonds.

  • Siegel et al. (2024) report approximately 84% five-year relative survival for localised oral cavity cancer in SEER (diagnoses 2013–2019); adjusting for background mortality at a median age of approximately 62 years using the SSA 2021 Period Life Table yields an estimated absolute five-year OS of ~77%, shown here against the Surgery arm curve. See the 5 year adjustment note below.
  • Hosni et al. (2019) observed a three-year OS of 71% (95% CI 67–75%) in 601 OSCC patients receiving adjuvant PORT at Princess Margaret Cancer Centre, also plotted against the Surgery arm.
  • Sher et al. (2011) reported a two-year OS of 63% in 12 oral cavity patients treated with definitive IMRT at Dana-Farber Cancer Institute, shown against the Definitive RT arm.

While the benchmark points do not validate the survival curves, they indicate that the curves are a plausible starting point for analysis.

Kaplan-Meier survival curves
library(survival)
library(survminer)

# One row per patient: time to death or end of follow-up
km_input <- oscc2_data |>
  dplyr::group_by(id, treatment) |>
  dplyr::summarise(
    time  = max(time),
    event = as.integer(any(state == 8)),
    .groups = "drop"
  )

km_fit_2arm <- survfit(Surv(time, event) ~ treatment, data = km_input)

km_plot <- ggsurvplot(
  km_fit_2arm,
  data        = km_input,
  conf.int    = FALSE,
  risk.table  = TRUE,
  palette     = c("#D55E00", "#0072B2"),   # DefinitiveRT=orange, Surgery=blue (alphabetical strata order)
  legend.labs = c("Definitive RT", "Surgery"),
  xlab        = "Time (months)",
  ylab        = "Survival probability",
  title       = "Overall Survival — Stage II OSCC",
  ylim        = c(0.25, 1),
  ggtheme     = theme_bw()
)

# Overlay external benchmark points (diamonds) on the KM plot
km_plot$plot <- km_plot$plot +
  annotate("point", x = 60, y = 0.77, shape = 23, size = 4,
           fill = "#0072B2", color = "black") +
  annotate("text",  x = 60, y = 0.77,
           label = "Siegel 2024\n(SEER ~77%)", vjust = -0.35, hjust = 0.5, size = 2.8) +
  annotate("point", x = 36, y = 0.71, shape = 23, size = 4,
           fill = "#0072B2", color = "black") +
  annotate("text",  x = 36, y = 0.71,
           label = "Hosni 2019\n(71% at 3 yr)", vjust = -0.35, hjust = 0.5, size = 2.8) +
  annotate("point", x = 24, y = 0.63, shape = 23, size = 4,
           fill = "#D55E00", color = "black") +
  annotate("text",  x = 24, y = 0.63,
           label = "Sher 2011\n(63% at 2 yr)", vjust = 1.6, hjust = 0.5, size = 2.8)

km_plot

Survival by Age and Tumour Size

This next panel of plots shows overall survival broken out by the covariates age and tumor size. Each panel shows curves for the four possible combinations of the covariate categories.

KM curves stratified by age × tumour size
library(broom)
library(ggplot2)
library(dplyr)

# ── One row per patient with covariate categories ─────────────────────────────
# Age split at 62 (simulation mean); tumour size split at 3 cm (Stage II midpoint)
km_cov <- oscc2_data |>
  dplyr::group_by(id, treatment) |>
  dplyr::summarise(
    time          = max(time),
    event         = as.integer(any(state == 8)),
    age           = first(age),
    tumor_size_cm = first(tumor_size_cm),
    .groups       = "drop"
  ) |>
  dplyr::mutate(
    age_grp  = factor(ifelse(age  < 62,  "Age < 62",   "Age ≥ 62"),
                      levels = c("Age < 62", "Age ≥ 62")),
    size_grp = factor(ifelse(tumor_size_cm < 3.0, "Tumour < 3 cm", "Tumour ≥ 3 cm"),
                      levels = c("Tumour < 3 cm", "Tumour ≥ 3 cm")),
    covariate_group = factor(
      interaction(age_grp, size_grp, sep = ", "),
      levels = c(
        "Age < 62, Tumour < 3 cm",
        "Age < 62, Tumour ≥ 3 cm",
        "Age ≥ 62, Tumour < 3 cm",
        "Age ≥ 62, Tumour ≥ 3 cm"
      )
    )
  )

pal4  <- c("#2166AC", "#92C5DE", "#D6604D", "#B2182B")
labs4 <- c("Age < 62, Tumour < 3 cm",
           "Age < 62, Tumour ≥ 3 cm",
           "Age ≥ 62, Tumour < 3 cm",
           "Age ≥ 62, Tumour ≥ 3 cm")

# ── Tidy survival estimates per arm ──────────────────────────────────────────
tidy_arm <- function(arm, data) {
  df  <- dplyr::filter(data, treatment == arm)
  fit <- survfit(Surv(time, event) ~ covariate_group, data = df)
  broom::tidy(fit, conf.int = FALSE) |>
    dplyr::mutate(
      treatment = arm,
      covariate_group = factor(
        sub("covariate_group=", "", strata),
        levels = levels(data$covariate_group)
      )
    )
}

surv_tidy <- dplyr::bind_rows(
  tidy_arm("Surgery",      km_cov),
  tidy_arm("DefinitiveRT", km_cov)
) |>
  dplyr::mutate(treatment = factor(treatment,
    levels = c("Surgery", "DefinitiveRT"),
    labels = c("Surgery Arm", "Definitive RT Arm")))

# ── Two-panel KM plot ─────────────────────────────────────────────────────────
ggplot(surv_tidy, aes(x = time, y = estimate,
                      colour = covariate_group,
                      group  = covariate_group)) +
  geom_step(linewidth = 0.8) +
  facet_wrap(~treatment, ncol = 2) +
  scale_colour_manual(values = pal4, labels = labs4) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1), breaks = seq(0, 1, 0.25)) +
  scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, 12)) +
  labs(
    x      = "Time (months)",
    y      = "Survival probability",
    colour = "Age / Tumour Size",
    title  = "Overall Survival by Age and Tumour Size — Stage II OSCC"
  ) +
  theme_bw(base_size = 11) +
  theme(
    legend.position  = "bottom",
    legend.direction = "horizontal",
    legend.text      = element_text(size = 9),
    legend.title     = element_text(size = 10, face = "bold"),
    strip.text       = element_text(size = 11, face = "bold"),
    plot.title       = element_text(size = 12, face = "bold")
  ) +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))


What the subgroup curves show:

  • Surgery arm (n = 794): age is the dominant predictor of survival. The youngest, smallest-tumour subgroup (Age < 62, Tumour < 3 cm; dark blue) reaches 74% at 5 years, while the youngest with larger tumours (Age < 62, Tumour ≥ 3 cm; light blue) reaches 71%. Among older patients (Age ≥ 62), those with smaller tumours (Tumour < 3 cm; orange) fare worst at 58%, while those with larger tumours (dark red) reach 51%. Even if these counterintuitive orderings are due to sampling variation rather than a more fundamental underlying cause, they may indicate a limitation of this particular synthetic data set.

  • Definitive RT arm (n = 206): the four subgroups together contain only 206 patients, yielding wide confidence bands and coarse staircase steps; point estimates (69%, 68%, 55%, 51% for the dark-blue, light-blue, orange, and dark-red subgroups respectively) should be interpreted with caution. The broad pattern mirrors the Surgery arm in that older age is associated with lower survival, but the small-n instability makes subgroup comparisons across treatment arms unreliable.

Progression-Free Survival

Progression-free survival (PFS) is defined here as the time from treatment start to the first occurrence of either locoregional recurrence (S7) or death (S8), whichever comes first. Patients who reached neither endpoint by month 60 are censored at their last observed time.

Progression-free survival — Kaplan-Meier with at-risk table
library(survival)
library(survminer)

# ── Build per-patient PFS endpoint ───────────────────────────────────────────
# Event = first entry into LR_Recurrence (state 7) or Death (state 8)
pfs_events <- oscc2_data |>
  dplyr::filter(state %in% c(7L, 8L)) |>
  dplyr::group_by(id) |>
  dplyr::slice_min(time, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(id, pfs_time = time, pfs_event_type = state_label)

last_obs <- oscc2_data |>
  dplyr::group_by(id) |>
  dplyr::slice_max(time, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::select(id, last_time = time, treatment)

pfs_df <- last_obs |>
  dplyr::left_join(pfs_events, by = "id") |>
  dplyr::mutate(
    pfs_time  = dplyr::coalesce(pfs_time, last_time),
    pfs_event = as.integer(!is.na(pfs_event_type))
  )

# ── Fit KM curves ─────────────────────────────────────────────────────────────
km_pfs <- survfit(Surv(pfs_time, pfs_event) ~ treatment, data = pfs_df)
km_pfs_named <- km_pfs
names(km_pfs_named$strata) <- c("Definitive RT", "Surgery")

# ── Plot with at-risk table ───────────────────────────────────────────────────
ggsurvplot(
  km_pfs_named,
  data              = pfs_df,
  palette           = c("#D55E00", "#0072B2"),
  conf.int          = TRUE,
  conf.int.alpha    = 0.15,
  size              = 0.9,
  risk.table        = TRUE,
  risk.table.col    = "strata",
  risk.table.height = 0.25,
  tables.theme      = theme_cleantable(),
  xlab              = "Time (months)",
  ylab              = "Progression-free probability",
  title             = "Progression-Free Survival — Stage II OSCC",
  legend.title      = "Treatment",
  legend.labs       = c("Definitive RT", "Surgery"),
  legend            = "bottom",
  surv.median.line  = "hv",
  break.time.by     = 12,
  xlim              = c(0, 60),
  ylim              = c(0, 1),
  ggtheme           = theme_minimal(base_size = 13),
  fontsize          = 4
)

PFS at key timepoints

PFS estimates at 12, 24, 36, 48, 60 months
tbl_fun <- function(s) {
  data.frame(
    Treatment = as.character(s$strata),
    Time      = s$time,
    PFS       = round(s$surv,  3),
    Lower95   = round(s$lower, 3),
    Upper95   = round(s$upper, 3)
  )
}
tbl_fun(summary(km_pfs_named, times = c(12, 24, 36, 48, 60))) |>
  knitr::kable(
    col.names = c("Treatment", "Month", "PFS", "95% CI lower", "95% CI upper"),
    caption   = "Kaplan-Meier PFS estimates at annual checkpoints"
  )
Kaplan-Meier PFS estimates at annual checkpoints
Treatment Month PFS 95% CI lower 95% CI upper
Definitive RT 12 0.631 0.569 0.701
Definitive RT 24 0.573 0.509 0.645
Definitive RT 36 0.544 0.480 0.616
Definitive RT 48 0.519 0.455 0.592
Definitive RT 60 0.449 0.386 0.523
Surgery 12 0.783 0.755 0.813
Surgery 24 0.683 0.651 0.716
Surgery 36 0.617 0.584 0.652
Surgery 48 0.564 0.531 0.600
Surgery 60 0.511 0.477 0.548

What the PFS curves show

  • Surgery achieves notably higher early PFS: approximately 78% at 12 months versus 63% for Definitive RT — a 15 percentage point difference driven by the higher probability of early locoregional treatment failure in the RT arm.
  • The gap narrows progressively over the 60-month horizon, with Surgery at 51% and Definitive RT at 45% at month 60.
  • The wider confidence band on the Definitive RT curve reflects the smaller sample size in that arm.

Time Patients Spend in Each Health State

The total time that patients are expected to spend in each state over a 60-month horizon drives the economic analysis. The ability to make this calculation is the primary motivation for modeling the treatment regimen as a continuous-time Markov chain.

State occupancy function
library(expm)

# Adapted from plot_msm_state_occupancy() (Breast_Cancer_v3.qmd).
# Accepts Q_base directly instead of a fitted msm model, and accepts
# a start_state argument so the two treatment arms can be compared.
plot_state_occupancy <- function(Q,
                                 start_state  = 1,
                                 tmax         = 60,
                                 tstep        = 0.5,
                                 state_labels = NULL,
                                 title        = "State Occupancy Probabilities") {
  n_states    <- nrow(Q)
  start_probs <- rep(0, n_states)
  start_probs[start_state] <- 1
  time <- seq(0, tmax, by = tstep)

  occ <- matrix(NA, nrow = length(time), ncol = n_states)
  colnames(occ) <- paste0("S", seq_len(n_states))
  for (i in seq_along(time))
    occ[i, ] <- start_probs %*% expm(Q * time[i])

  df_long <- as.data.frame(occ) |>
    dplyr::mutate(time = time) |>
    tidyr::pivot_longer(-time, names_to = "state", values_to = "prob")

  if (!is.null(state_labels))
    df_long$state <- factor(df_long$state,
      levels = paste0("S", seq_len(n_states)),
      labels = state_labels)

  ggplot(df_long, aes(x = time, y = prob, colour = state)) +
    geom_line(linewidth = 1.0) +
    labs(x = "Time (months)", y = "Probability",
         title = title, colour = "State") +
    theme_bw(base_size = 12)
}

# Expected time in each state: trapezoid integral of P(t) over 0-60 months
compute_state_time <- function(Q, start_state, tmax = 60, tstep = 0.5) {
  n_states    <- nrow(Q)
  start_probs <- rep(0, n_states)
  start_probs[start_state] <- 1
  time <- seq(0, tmax, by = tstep)
  occ  <- matrix(NA, nrow = length(time), ncol = n_states)
  for (i in seq_along(time))
    occ[i, ] <- start_probs %*% expm(Q * time[i])
  apply(occ, 2, function(p)
    sum(diff(time) * (p[-1] + p[-length(p)]) / 2))
}
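As a quick sanity check on the trapezoid integration above (a toy example, not part of the analysis): for a two-state chain S1 → S2 with constant rate λ, the expected time in S1 over [0, T] has the closed form (1 − e^(−λT))/λ, and because the occupancy probabilities always sum to 1, the per-state times must sum to the horizon T. The sketch below uses a small eigen-based matrix exponential (`mexp`, a helper introduced here) in place of `expm::expm()` so it runs in base R.

```r
# Toy 2-state generator: leave S1 at rate lambda; S2 is absorbing.
lambda <- 0.1
Q_toy  <- matrix(c(-lambda, lambda,
                    0,      0), nrow = 2, byrow = TRUE)

# Minimal matrix exponential via eigendecomposition (fine for this
# diagonalisable toy Q; the post uses expm::expm() instead).
mexp <- function(M) {
  e <- eigen(M)
  Re(e$vectors %*% diag(exp(e$values), nrow(M)) %*% solve(e$vectors))
}

tmax <- 60; tstep <- 0.5
time <- seq(0, tmax, by = tstep)
occ  <- t(sapply(time, function(t) c(1, 0) %*% mexp(Q_toy * t)))

# Trapezoid integral of each occupancy curve, as in compute_state_time()
state_time <- apply(occ, 2, function(p)
  sum(diff(time) * (p[-1] + p[-length(p)]) / 2))

state_time[1]    # close to the closed form (1 - exp(-lambda * tmax)) / lambda
sum(state_time)  # equals tmax, since occupancy rows sum to 1
```

The closed-form comparison catches sign and indexing mistakes in the trapezoid weights before the function is trusted with the full eight-state generator.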
Expected state time by arm
time_surg <- compute_state_time(Q_base, start_state = 1)
time_def  <- compute_state_time(Q_base, start_state = 2)

data.frame(
  State      = paste0("S", 1:8),
  Label      = state_labels_ii,
  Surgery_mo = round(time_surg, 2),
  DefRT_mo   = round(time_def,  2)
)
  State                 Label Surgery_mo DefRT_mo
1    S1               Surgery       1.51     0.00
2    S2          DefinitiveRT       0.00     3.01
3    S3    PostOpSurveillance      11.96     0.00
4    S4       AdjuvantRT_PORT       0.75     0.00
5    S5 AdjuvantChemoRT_POCRT       0.60     0.00
6    S6                   NED      28.83    35.54
7    S7         LR_Recurrence       4.09     5.67
8    S8                 Death      12.26    15.78

The state occupancy plots show the probability of being in each state as time progresses. For both plots, focusing on a particular time point gives an estimate of a patient’s probable health state.

State Occupancy: Surgery

Surgery occupancy plot
plot_state_occupancy(
  Q            = Q_base,
  start_state  = 1,
  state_labels = state_labels_ii,
  title        = "State Occupancy Probabilities — Surgery Arm"
)

State Occupancy: Definitive RT

Definitive RT occupancy plot
plot_state_occupancy(
  Q            = Q_base,
  start_state  = 2,
  state_labels = state_labels_ii,
  title        = "State Occupancy Probabilities — Definitive RT Arm"
)


Health Economic Evaluation — Patient Perspective

The health economics model presented uses QALYs calculated with EQ-5D utilities. Although these utilities are abstractions, they do provide a measure, however imperfect, of how patients treated for OSCC have rated the ensemble of experiences associated with each possible health state.

The model does not calculate financial costs, for the following reasons:

  1. In the United States, there is no practical way to estimate the final direct cost to patients who are covered by private insurance or Medicare. The U.S. health care system is set up so that the final cost of the treatments borne by patients is only made known after the treatments have been received.
  2. For patients with adequate insurance, cost is unlikely to be directly relevant to treatment decisions.
  3. For patients without insurance, the financial burden of treatment would dominate all other considerations.
  4. From a patient’s point of view, there is merit in clarifying the quality of life to be expected after undergoing a particular treatment. For example, it is not difficult to imagine that elderly patients would choose palliative care over a course of chemotherapy.

Quality-Adjusted Life Years (QALYs)

A Quality-Adjusted Life Year (QALY) combines the quantity and quality of life into a single metric. One QALY represents one year lived in perfect health; a year spent in a health state with diminished quality counts for less than one QALY in proportion to that state’s utility weight.

In this model, QALYs are calculated in three steps:

  1. Assign utility weights: Each of the eight health states is assigned an EQ-5D utility value between 0 (death) and 1 (perfect health), drawn from peer-reviewed literature on head-and-neck cancer patients.

  2. Estimate time in each state: The continuous-time Markov chain (CTMC) simulation provides the expected number of months the cohort spends in each state over a 60-month horizon, separately for the Surgery arm and the Definitive Radiation Therapy arm.

  3. Compute utility-weighted life years: For each state, the time (in months) is multiplied by its utility weight and the products are summed across all states. Dividing by 12 converts the result to years:

QALYs = (1/12) × Σ_s (t_s × u_s)

where t_s is the expected number of months in state s and u_s is its EQ-5D utility. The two treatment arms are evaluated on the same 60-month horizon, allowing a direct QALY comparison from the patient’s perspective.
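To make the arithmetic concrete, here is a minimal worked example (the numbers are invented for illustration, not model output): a single year split evenly between NED and LR recurrence.

```r
# Illustrative only: 6 months in NED (utility 0.82), 6 months in
# LR recurrence (utility 0.55). QALYs = sum(months * utility) / 12.
months  <- c(NED = 6, LR = 6)
utility <- c(NED = 0.82, LR = 0.55)
sum(months * utility) / 12   # 0.685 QALYs for that year
```

A year in perfect health would score 1.0; the utility weights discount each month by the quality of life in its state.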

Utility Values

Health state utility values map each model state to a preference-based quality-of-life weight on the 0–1 scale (0 = death, 1 = perfect health). The weights below are EQ-5D-based defaults drawn from the published literature. S6 (NED), the long-term disease-free state, serves as the reference; a NED-normalized column (NED = 1.00) is included for sensitivity work.

The sources for the utility values are provided as comments in the code below. Complete references are provided in the Reference section at the end.

Define and display utility weights
# Health state utility values (EQ-5D scale, 0 = death, 1 = perfect health).
# S6 (NED) is the reference (highest) living state.
# Sources: 
# S1 de Almeida et al. 2014; Noel et al. 2015
# S2 Sprave et al. 2022 (during-RT EQ-5D ~0.83, n=366); Truong/RTOG 0522 2017 retained for S5
# S3 Govers et al. 2016
# S4 Sprave et al. 2022 (during-RT EQ-5D ~0.83, n=366); Govers et al. 2016
# S5 Truong et al. / RTOG 0522 2017 (primary); Sprave et al. 2022 adjuvant CRT data support ~0.83 but POCRT toxicity justifies retaining 0.775 — see inline note
# S6 Noel et al. 2015 (reference state)
# S7 Meregaglia & Cairns 2017 (systematic review confirming evidence gap; only patient EQ-5D for recurrence found is median 0.70, del Barco et al. 2016, palliative/metastatic context); 0.55 is modeller's assumption
# S8 Convention (absorbing state)
utility <- c(
  S1 = 0.60,   # Surgery: acute peri-operative period
  S2 = 0.83,   # DefinitiveRT: during 6-7 week RT course (Sprave et al. 2022, mean EQ-5D at RT completion = 0.830, n=366)
  S3 = 0.72,   # PostOpSurveillance: recovery phase
  S4 = 0.83,   # PORT: adjuvant RT toxicity (Sprave et al. 2022, mean EQ-5D at RT completion = 0.830, n=366)
  S5 = 0.775, # POCRT: concurrent chemoRT (revised from 0.60).
              # Primary source: Truong/RTOG 0522 3-month EQ-5D proxy
              # (CIS arm 0.78, CET/CIS arm 0.77); end-of-treatment value
              # collected but not reported in paper.
              # Comparison: Sprave et al. (2022) adjuvant CRT cohort had
              # baseline HI = 0.849 with no significant within-group change
              # to RT completion, and CRT vs RT-alone did not differ at RT
              # completion (p = 0.624); the adjuvant cohort data would
              # support ~0.83. Value retained at 0.775 as a deliberate
              # conservative estimate: POCRT involves high-dose cisplatin
              # concurrent with post-surgical RT (higher toxicity than the
              # Sprave 2022 mixed RT cohort), and RTOG 0522 provides the
              # only patient-reported utility from an actual CRT trial.
              # A sensitivity analysis using 0.83 would narrow the Surgery
              # vs. Def RT QALY difference by ~0.009 QALYs (0.598 months in S5).
  S6 = 0.82,   # NED: reference state (Noel 2015)
  S7 = 0.55,   # LR Recurrence: modeller's assumption (no directly reported mean EQ-5D for LR recurrence; Meregaglia & Cairns 2017 confirms evidence gap; only available patient EQ-5D is median 0.70 from del Barco 2016 in palliative recurrent/metastatic HNC)
  S8 = 0.00    # Death: absorbing state
)

df_utility <- data.frame(
  State        = paste0("S", 1:8),
  Label        = state_labels_ii,
  Primary_Source = c(
    "de Almeida (2014); Noel (2015)",
    "Sprave (2022)",
    "Govers (2016)",
    "Sprave (2022)",
    "Truong / RTOG 0522 (2017)",
    "Noel (2015)",
    "Modeller's assumption",
    "Convention"
  ),
  Utility_EQ5D = utility,
  Utility_NED1 = round(utility / utility["S6"], 3),
  QALY_Surgery = round(time_surg * utility / 12, 3),
  QALY_DefRT   = round(time_def  * utility / 12, 3),
  QALY_Diff    = round((time_surg - time_def) * utility / 12, 3)
)
rownames(df_utility) <- NULL

df_utility |>
  gt() |>
  tab_options(table.font.size = px(10)) |>
  cols_label(
    State        = "State",
    Label        = "Health State",
    Primary_Source = "Primary Source",
    Utility_EQ5D = "EQ-5D Utility",
    Utility_NED1 = "NED-Normalised",
    QALY_Surgery = "QALYs — Surgery",
    QALY_DefRT   = "QALYs — Definitive RT",
    QALY_Diff    = "Difference (Surg − RT)"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body(rows = State == "S6")
  ) |>
  tab_style(
    style = cell_text(color = "#27AE60", weight = "bold"),
    locations = cells_body(columns = QALY_Diff, rows = QALY_Diff > 0)
  ) |>
  tab_style(
    style = cell_text(color = "#C0392B", weight = "bold"),
    locations = cells_body(columns = QALY_Diff, rows = QALY_Diff < 0)
  ) |>
  tab_spanner(
    label   = "Utility",
    columns = c(Utility_EQ5D, Utility_NED1)
  ) |>
  tab_spanner(
    label   = "QALYs by Treatment Arm",
    columns = c(QALY_Surgery, QALY_DefRT, QALY_Diff)
  ) |>
  grand_summary_rows(
    columns  = c(QALY_Surgery, QALY_DefRT, QALY_Diff),
    fns      = list(Total ~ sum(.)),
    fmt      = ~ fmt_number(., decimals = 3)
  ) |>
  tab_footnote(
    footnote  = md("Modeller's assumption. No directly reported patient EQ-5D for LR recurrence exists; best available evidence is median 0.70 (del Barco et al. 2016, palliative/metastatic context). See Meregaglia & Cairns (2017) in References."),
    locations = cells_body(columns = Primary_Source, rows = State == "S7")
  ) |>
  tab_source_note("NED-Normalised: utility relative to S6 NED = 1.00. QALYs = utility-weighted months / 12 over a 60-month horizon. Difference = Surgery − Definitive RT; green = Surgery advantage, red = RT advantage.")
State Health State Primary Source EQ-5D Utility NED-Normalised QALYs — Surgery QALYs — Definitive RT Difference (Surg − RT)
S1 Surgery de Almeida (2014); Noel (2015) 0.600 0.732 0.076 0.000 0.076
S2 DefinitiveRT Sprave (2022) 0.830 1.012 0.000 0.208 -0.208
S3 PostOpSurveillance Govers (2016) 0.720 0.878 0.718 0.000 0.718
S4 AdjuvantRT_PORT Sprave (2022) 0.830 1.012 0.052 0.000 0.052
S5 AdjuvantChemoRT_POCRT Truong / RTOG 0522 (2017) 0.775 0.945 0.039 0.000 0.039
S6 NED Noel (2015) 0.820 1.000 1.970 2.429 -0.459
S7 LR_Recurrence Modeller's assumption1 0.550 0.671 0.188 0.260 -0.072
S8 Death Convention 0.000 0.000 0.000 0.000 0.000
Total 3.043 2.897 0.146
1 Modeller’s assumption. No directly reported patient EQ-5D for LR recurrence exists; best available evidence is median 0.70 (del Barco et al. 2016, palliative/metastatic context). See Meregaglia & Cairns (2017) in References.
NED-Normalised: utility relative to S6 NED = 1.00. QALYs = utility-weighted months / 12 over a 60-month horizon. Difference = Surgery − Definitive RT; green = Surgery advantage, red = RT advantage.

QALY Comparison by Treatment Arm

With utility weights and state occupancy times established, QALYs can be compared across the two treatment arms. The stacked bar chart below shows the utility-weighted life years accrued in each health state over the 60-month horizon. Each bar is broken down by health state, so the chart reveals not just the total QALY difference between arms but where that difference arises: which states contribute most to each arm's advantage and which contribute similarly under both treatments.

QALY comparison chart
# Expected utility-weighted months per state per arm,
# using state occupancy integrals and EQ-5D utility weights.
# Total QALYs = sum(time_in_state_months * utility) / 12
qaly_surg <- time_surg * utility
qaly_def  <- time_def  * utility

state_palette <- c(
  S1 = "#E67E22", S2 = "#8E44AD", S3 = "#F39C12",
  S4 = "#2980B9", S5 = "#16A085", S6 = "#27AE60", S7 = "#C0392B"
)

df_qaly <- data.frame(
  State   = rep(paste0("S", 1:8), 2),
  Label   = rep(state_labels_ii, 2),
  Arm     = rep(c("Surgery", "Definitive RT"), each = 8),
  QALY_mo = c(qaly_surg, qaly_def)
) |>
  dplyr::filter(State != "S8") |>
  dplyr::mutate(
    State = factor(State, levels = paste0("S", 1:7)),
    Arm   = factor(Arm,   levels = c("Surgery", "Definitive RT"))
  )

totals_label <- data.frame(
  Arm   = factor(c("Surgery", "Definitive RT"),
                 levels = c("Surgery", "Definitive RT")),
  total = c(sum(qaly_surg), sum(qaly_def)) / 12,
  label = paste0(round(c(sum(qaly_surg), sum(qaly_def)) / 12, 2), " QALYs")
)

ggplot(df_qaly, aes(x = Arm, y = QALY_mo / 12, fill = State)) +
  geom_col(width = 0.55) +
  geom_text(
    data        = totals_label,
    aes(x = Arm, y = total + 0.06, label = label),
    inherit.aes = FALSE,
    fontface    = "bold", size = 4
  ) +
  scale_fill_manual(
    values = state_palette,
    labels = setNames(state_labels_ii[1:7], paste0("S", 1:7))
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.08)),
    breaks = seq(0, 3, 0.5)
  ) +
  labs(
    x        = NULL,
    y        = "Quality-Adjusted Life Years (QALYs)",
    fill     = "Health State",
    title    = "Expected QALYs by Health State and Treatment Arm",
    subtitle = "60-month horizon; utility weights from EQ-5D literature defaults",
    caption  = "State S8 (Death, utility = 0) excluded; contributes 0 QALYs"
  ) +
  theme_bw(base_size = 12)

A few things worth noting from the chart:

  • Surgery accumulates 3.04 QALYs vs. 2.90 QALYs for Definitive RT over the 60-month horizon, a difference of roughly 0.15 QALYs.
  • The dominant contributor in both arms is NED (green), which is expected given its high utility (0.82) and long sojourn time.
  • Surgery’s advantage comes largely from the PostOpSurveillance (orange) segment, which has a better utility (0.72) than Definitive RT’s LR Recurrence (dark red) burden.
  • The LR Recurrence segment is visibly larger for the Definitive RT arm (0.26 QALYs vs. 0.19), consistent with the higher recurrence probability built into that arm’s jump chain.
  • These QALY estimates are model-based (from Q_base × utility) and do not yet incorporate cost.

Discussion

The intended users of the class of models I am proposing are physicians and their support teams who are making treatment decisions and explaining the possible consequences of those treatments to their patients. In the first use case, physicians may find the formality of the model and its baseline estimates useful in finalizing their decisions and documenting their decision-making process. Having a model to inform treatment decisions should be especially helpful where multiple physicians collaborate on deciding treatment options. At the very least, a model would be useful in verifying shared assumptions.

The second use of the model, helping physicians and their care teams communicate with patients about the consequences of various treatment options, is more speculative. To be truly helpful to patients, physicians and their care teams must generate a compelling narrative for each health state, explaining the side effects of the treatments experienced in that state along with the strategies that previous patients have employed to cope with them. At the very least, the exercise of explaining a journey through the various states and interpreting the QALYs may be useful in structuring patient conversations and setting expectations.

Technical Notes

Five-year adjustment process

Five-year background survival for the general US population at age 62, S_bg(62), is computed from the 2021 SSA Period Life Table as the product of annual survival probabilities across ages 62 to 66:

S_bg(62) = ∏_{x=62}^{66} (1 − q_x)

where q_x is the one-year death probability at exact age x (SSA notation). Using the 2021 table values for ages 62–66: q_x = 0.01648, 0.01762, 0.01876, 0.01991, 0.02110 (male) and q_x = 0.01014, 0.01085, 0.01155, 0.01222, 0.01295 (female), giving S_bg(62) = 0.9096 (male) and S_bg(62) = 0.9436 (female). The sex-weighted background survival is:

S_bg = 0.70 × 0.9096 + 0.30 × 0.9436 = 0.920

where the 70:30 sex split is from SEER oral cavity incidence data. Absolute five-year OS is then obtained as:

OS_abs = RS_5yr × S_bg = 0.84 × 0.920 ≈ 0.773

where RS_5yr = 0.84 is the SEER five-year relative survival for localized oral cavity cancer (Siegel et al. 2024, CA Cancer J Clin, Figure 5). The code below details this calculation and provides a sensitivity check at age 65 (note: age-65 values are pending verification).

Siegel 2024: relative → absolute OS derivation
# US Social Security Administration period life tables (2021 release, used for 2019 SEER cohort).
# 5-year survival probabilities by sex at age 62 (closest to OSCC median diagnosis age).
# Computed as prod(1 - q_x) for x = 62:66 from Table 4c6 (2021 Trustees Report cycle).
# Source: Social Security Administration. Period Life Table, 2021 (2024 Trustees Report).
# https://www.ssa.gov/oact/STATS/table4c6.html
surv_male_62   <- 0.9096  # 5-yr background survival, males age 62;   SSA 2021 prod(1-qx), x=62:66
surv_female_62 <- 0.9436  # 5-yr background survival, females age 62; SSA 2021 prod(1-qx), x=62:66

# OSCC sex distribution: ~70% male, ~30% female (SEER, oral cavity)
prop_male <- 0.70
expected_survival_5yr <- prop_male * surv_male_62 + (1 - prop_male) * surv_female_62

# SEER 5-year relative survival for localized oral cavity (Siegel 2024, Fig. 5)
relative_survival_5yr <- 0.84

# Absolute OS = relative survival × expected (background) survival
absolute_os_5yr <- relative_survival_5yr * expected_survival_5yr

# Sensitivity: 5-yr background survival at age 65 (upper end of median diagnosis age range).
# Source: Social Security Administration. Period Life Table, 2021 (2024 Trustees Report).
# Computed as prod(1 - q_x) for x = 65:69. https://www.ssa.gov/oact/STATS/table4c6.html
surv_male_65   <- 0.8923  # 5-yr background survival, males age 65;   SSA 2021 prod(1-qx), x=65:69
surv_female_65 <- 0.9320  # 5-yr background survival, females age 65; SSA 2021 prod(1-qx), x=65:69
expected_surv_65 <- prop_male * surv_male_65 + (1 - prop_male) * surv_female_65
absolute_os_65   <- relative_survival_5yr * expected_surv_65

data.frame(
  Assumption              = c("Age 62 (lower bound)", "Age 65 (upper bound)"),
  Background_5yr_survival = round(c(expected_survival_5yr, expected_surv_65), 3),
  Absolute_5yr_OS         = round(c(absolute_os_5yr, absolute_os_65), 3)
)
            Assumption Background_5yr_survival Absolute_5yr_OS
1 Age 62 (lower bound)                   0.920           0.773
2 Age 65 (upper bound)                   0.904           0.760
Siegel 2024: relative → absolute OS derivation
# Age 62: background survival 0.920 × relative survival 0.84 = 0.773 (~77%).
# Age 65 sensitivity: background survival 0.904 × relative survival 0.84 = 0.760 (~76%).

Using the Code with Real Data

There is no doubt that models based on real patient data would be more convincing. To this end, I have structured the format of the generated synthetic data file so that it is suitable for fitting a continuous-time Markov chain with the R package msm. I have verified that msm is able to fit the eight-state sample model from the synthetic data. It is reasonable to expect that models of similar complexity could be fit from real patient data.
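A minimal sketch of what such a fit could look like with msm. The column names (id, time, state), the allowed-transition skeleton, and patient_data are assumptions for illustration, not the post's actual data layout: the real skeleton must mirror the model's jump chain, and since the data file is not available here the fitting calls are left commented.

```r
# Illustrative allowed-transition skeleton (1 = transition permitted).
# Entries are guided by the jump probabilities named in the References
# (e.g. jump_P[1,4], jump_P[2,6], jump_P[3,7], jump_P[7,6]) plus
# plausible companions; treat it as a placeholder, not the model.
Q_skeleton <- matrix(0, 8, 8,
                     dimnames = list(paste0("S", 1:8), paste0("S", 1:8)))
Q_skeleton[1, c(3, 4, 5, 8)] <- 1   # Surgery -> surveillance/PORT/POCRT/death
Q_skeleton[2, c(6, 7, 8)]    <- 1   # Definitive RT -> NED/recurrence/death
Q_skeleton[3, c(6, 7)]       <- 1   # Surveillance -> NED or LR recurrence
Q_skeleton[4, c(6, 7, 8)]    <- 1   # PORT -> NED/recurrence/death
Q_skeleton[5, c(6, 7, 8)]    <- 1   # POCRT -> NED/recurrence/death
Q_skeleton[6, c(7, 8)]       <- 1   # NED -> recurrence or death
Q_skeleton[7, c(6, 8)]       <- 1   # Recurrence -> salvage success or death

# library(msm)
# Q_init <- crudeinits.msm(state ~ time, subject = id,
#                          data = patient_data, qmatrix = Q_skeleton)
# fit <- msm(state ~ time, subject = id, data = patient_data,
#            qmatrix = Q_init, deathexact = 8)  # S8 death times known exactly
```

`crudeinits.msm()` turns observed transition counts into starting intensities, which usually makes the likelihood optimisation in `msm()` far more stable than hand-picked initial values.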

References

The references in this section are neither complete nor definitive. The literature is enormous. They have been selected to support the default values for the user inputs.

References for Background Mortality

  • Social Security Administration. Period Life Table, 2021 (as used in the 2024 Trustees Report). Available at: https://www.ssa.gov/oact/STATS/table4c6.html. Accessed 2026-03-18. Five-year background survival probabilities for the general US population at ages 62 and 65, computed as ∏(1 − q_x) over the five ages, where q_x is the one-year death probability at exact age x. Values at age 62: male 0.9096, female 0.9436. Used to convert SEER relative survival (Siegel 2024) to absolute OS for the calibration benchmark. Values at age 65: male 0.8923, female 0.9320 (computed from q_x = 0.01991, 0.02110, 0.02242, 0.02385, 0.02536 male; 0.01222, 0.01295, 0.01384, 0.01486, 0.01603 female).

References for Transition Probabilities and Sojourn Times

  • Bernier et al. (2004) “Postoperative Irradiation with or without Concomitant Chemotherapy for Locally Advanced Head and Neck Cancer”, N Engl J Med. 350:1945–1952. PMID: 15128894. Landmark EORTC 22931 RCT establishing concurrent chemoRT as standard of care for high-risk post-operative HNSCC. Supports jump_P[5,6] and the locoregional control rates in the POCRT arm (S5).

  • Blatt et al. (2022) “Tumor Recurrence and Follow-Up Intervals in Oral Squamous Cell Carcinoma”, J Clin Med. 11(23):7061. PMID: 36498636. PMC: PMC9740063. DOI: 10.3390/jcm11237061. University Medical Centre Mainz, n = 760 OSCC patients. Supports S3 NED declaration timing ~24 months (mean recurrence interval 24 ± 26 months; 50% of recurrences by 24 months) and S3 mean sojourn revision to 22 months.

  • Brands et al. (2019) “Time Patterns of Recurrence and Second Primary Tumors in a Large Cohort of Patients Treated for Oral Cavity Cancer”, Cancer Med. 8(12):5810–5819. DOI: 10.1002/cam4.2124. Retrospective cohort of 594 OSCC patients; the great majority of recurrences occur in the first 2 years; 5-year cumulative second-event risk ~30%. Supports S3 sojourn revision to 22 months and S6 NED calibration.

  • Contrera et al. (2022) “Outcomes for Recurrent Oral Cavity Squamous Cell Carcinoma”, Oral Oncol. 134:106127. PMID: 36155359. DOI: 10.1016/j.oraloncology.2022.106127. MD Anderson, n = 259 salvage surgeries (1997-2011); 5-year OS 44.2% for surgical candidates vs. 1.5% for nonsurgical therapy; 51% second recurrence at median 17 months. Provides contextual support for the salvage surgery framework; primary source for current values jump_P[7,6] = 0.43 and jump_P[7,8] = 0.57 is Lee et al. (2024).

  • Cooper et al. (2004) “Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck”, N Engl J Med. 350:1937–1944. PMID: 15084618. RTOG 9501 RCT; locoregional control ~60% at 5 years in the concurrent chemoRT arm. Supports S5 jump probabilities.

  • Correia et al. (2026) “Timely Matters: Predictors of Delay in Oral Cavity Cancer Patients Across the Care Continuum”. Laryngoscope Investig Otolaryngol. 11(2):e70363. DOI: 10.1002/lio2.70363. (Also available via eScholarship.) n = 93 OCSCC patients. Median surgery-to-adjuvant RT interval 8.4 weeks (same institution) and 9.3 weeks (different facility). Supports S1 mean sojourn revision to 1.5 months.

  • Dayan et al. (2023) “Predictors of prolonged treatment time intervals in oral cavity cancer”, Oral Oncol. 106622. DOI: 10.1016/j.oraloncology.2023.106622. CHUM Montreal, n = 145 multimodal OCSCC patients. Median surgery-to-PORT interval 64 days = 2.1 months. Supports S1 mean sojourn revision.

  • Goodwin (2000) “Salvage Surgery for Patients with Recurrent Squamous Cell Carcinoma of the Upper Aerodigestive Tract: When Do the Ends Justify the Means?”, Laryngoscope. 110(suppl 93):1–18. Cited in v2 for salvage success rate 15–25%. Superseded in v3 by Lee et al. 2024 for jump_P[7,6]; retained for historical context.

  • Graboyes et al. (2017) “Adherence to National Comprehensive Cancer Network Guidelines for Time to Initiation of Postoperative Radiation Therapy for Patients with Head and Neck Cancer”, Cancer. 123(14):2651–2660. DOI: 10.1002/cncr.30651. NCDB analysis, n = 47,273 HNSCC patients; over 50% failed to initiate PORT within the NCCN-recommended 6 weeks. Supports real-world S1 mean sojourn estimate of 1.5 months.

  • Hosni et al. (2019) “Predictors of Early Recurrence Prior to Planned Postoperative Radiation Therapy for Oral Cavity Squamous Cell Carcinoma and Outcomes Following Salvage Intensified Radiation Therapy”, Int J Radiat Oncol Biol Phys. 103(2):363–373. PMID: 30244160. Princess Margaret Cancer Centre, n = 601 OSCC patients (2003–2015); 3-year OS ~71% (95% CI 67–75%) after adjuvant PORT. Supports S4 jump probabilities (jump_P[4, 6:8]).

  • Katsoulakis et al. (2018) “Long-term outcomes in oral cavity squamous cell carcinoma with adjuvant and salvage radiotherapy after surgery”, Laryngoscope. PMID: 29637571. DOI: 10.1002/lary.27191. Memorial Sloan Kettering / VA Tampa; Evangelia Katsoulakis (first author). Provides contextual reference for the adjuvant RT pathway; jump_P[2,6] is now sourced from Dana-Farber Group 2011 and Studer et al. 2007 above.

  • Lee et al. (2024) “Clinical Outcome of Salvage Surgery in Patients with Recurrent Oral Cavity Cancer: A Systematic Review and Meta-Analysis”, Head Neck. 46(11):2901–2909. PMID: 39243149. DOI: 10.1002/hed.27928. 14 retrospective cohort studies, n = 2,069; pooled 5-year OS after salvage surgery = 43.0%; late-relapse subgroup 63.8% vs. 30.0% for early relapse. Primary source for jump_P[7,6] = 0.43 (salvage success) and jump_P[7,8] = 0.57 (mortality); revised from 0.25/0.75 — previous values over-estimated post-recurrence mortality relative to this pooled estimate.

  • Liu et al. (2007) “Impact of Recurrence Interval on Survival of Oral Cavity Squamous Cell Carcinoma Patients after Local Relapse”, Otolaryngol Head Neck Surg. 136(1):112–118. PMID: 17210345. DOI: 10.1016/j.otohns.2006.07.002. n = 1,687 oral cancer patients; 5-year OS after local recurrence 31.56%; recurrence within 18 months predicted higher mortality. Supports S7 sojourn of 12.5 months.

  • Mehanna et al. (2016) “PET-CT Surveillance versus Neck Dissection in Advanced Head and Neck Cancer”, N Engl J Med. 374(15):1444–1454. PMID: 26958921. DOI: 10.1056/NEJMoa1514493. PET-NECK RCT, n = 564 patients; response imaging performed at 12 weeks post-chemoradiotherapy; established ≥12-week post-RT interval as standard for response assessment in HNSCC. Supports S2 mean sojourn of 3.0 months (RT course 1.5 mo + ≥12-week response assessment window).

  • Luryi et al. (2014) “Positive Surgical Margins in Early Stage Oral Cavity Cancer: An Analysis of 20,602 Cases”, Otolaryngol Head Neck Surg. 151(6):984–990. PMID: 25210849. Large NCDB analysis of surgical margin rates and downstream outcomes in early-stage oral cavity cancer; early-stage N0 patients with negative margins have materially lower recurrence rates than all-stage series. Supports S3 transition probability framework, jump_P[3,7] = 0.18, and S6 NED calibration to 120 months.

  • Nathan et al. (2025) “The Influence of Reconstruction following Hemiglossectomy on Perioperative Outcomes”, Int J Otorhinolaryngol Head Neck Surg. DOI: 10.18203/issn.2454-5929.ijohns20252980. NSQIP database study, n = 866 hemiglossectomy patients (2008–2022); consistent with ~1–2% 30-day mortality in modern series. Supports jump_P[1,8] = 0.015.

  • Ord, Kolokythas & Reynolds (2006) “Surgical Salvage for Local and Regional Recurrence in Oral Cancer”, J Oral Maxillofac Surg. 64(9):1409–1414. PMID: 16916677. DOI: 10.1016/j.joms.2006.05.026. Reports outcomes of surgical salvage in recurrent oral cancer; data on local and regional recurrence rates and survival after primary surgery. Supports jump_P[3,7] = 0.18 (LR recurrence 12–18% for Stage II) and S3 mean sojourn revision to 22 months.

  • Sher et al. (2011) “Treatment of Oral Cavity Squamous Cell Carcinoma with Adjuvant or Definitive Intensity-Modulated Radiation Therapy”, Int J Radiat Oncol Biol Phys. 2011. PMID: 21531515. Dana-Farber Cancer Institute; n = 42 OCSCC (30 adjuvant IMRT, 12 definitive IMRT); Stage I–IV (24% Stage II); 2-yr LRC for definitive IMRT = 64%, 2-yr OS = 63%. Corroborated by Studer G, Zwahlen RA, Graetz KW et al. (2007), “IMRT in oral cavity cancer”, Radiat Oncol. 2:16. PMID: 17430599. DOI: 10.1186/1748-717X-2-16. University Hospital Zurich; n = 58 OCC IMRT patients; T1 LC = 95%; T2–4 and recurred stages LC ~50–60% at 2 years; definitive IMRT LC = 43% (but cohort was 69% T3/4 or recurred). Primary reference pair for jump_P[2,6] = 0.65 — definitive RT arm locoregional control for T2N0M0 Stage II; value is consistent with the early-stage subgroup in both series.

  • Shetty et al. (2022) “Salvage Surgery in Recurrent Oral Squamous Cell Carcinoma”, Front Oral Health. PMC8831824. Review of salvage surgery outcomes; radiation-naive recurrent OCSCC after salvage surgery: 5-year OS 59%, recurrence-free survival 60%. Provides supporting context for the salvage surgery framework; current value jump_P[7,6] = 0.43 is sourced from Lee et al. (2024).

  • Siegel et al. (2024) “Cancer Statistics, 2024”, CA Cancer J Clin. 74(1):12–49. PMID: 38230766. DOI: 10.3322/caac.21820. Annual American Cancer Society SEER-based statistics report. Figure 5 reports 5-year relative survival for localized oral cavity cancer (SEER localised stage, encompassing Stage I–II) as approximately 84% (diagnoses 2013–2019, follow-up through 2020). Note: relative survival overstates absolute OS — adjusting for sex-weighted background mortality from the SSA 2021 Period Life Table at median age 62 (background survival 0.920) yields an estimated absolute 5-yr OS of approximately 77% (0.84 × 0.920 ≈ 0.773) for Stage II localised oral cavity. Used as the primary population-level calibration benchmark for the Surgery arm.

  • Szturz et al. (2017) “Weekly Low-Dose Versus Three-Weekly High-Dose Cisplatin for Concurrent Chemoradiation in Locoregionally Advanced Non-Nasopharyngeal Head and Neck Cancer: A Systematic Review and Meta-Analysis”, Oncologist. 22(9):1056–1066. PMID: 28533474.

  • Tam et al. (2017) “Estimating Survival After Salvage Surgery for Recurrent Oral Cavity Cancer”, JAMA Otolaryngol Head Neck Surg. 143(7):685–690. PMID: 28448645. Reports survival outcomes following salvage surgery for recurrent oral cavity SCC. Supports S7 transition structure.

  • Tassone et al. (2023) “Going Off Guidelines: An NCDB Analysis of Missed Adjuvant Therapy Among Surgically Treated Oral Cavity Cancer”, Otolaryngol Head Neck Surg. 169(3):632–641. PMID: 36939392. DOI: 10.1002/ohn.205. NCDB analysis, n = 53,503; establishes PORT vs. POCRT indications and allocation rates. Supports jump_P[1,4] = 0.25 and jump_P[1,5] = 0.15.

  • Temam et al. (2005) “Treatment of the N0 Neck during Salvage Surgery after Radiotherapy of Head and Neck Squamous Cell Carcinoma”, Head Neck. 27(8):653–658. Cited in v2 for salvage success rates. Superseded in v3 by Lee et al. 2024; retained for historical context.

References for EQ-5D Utilities

  • de Almeida et al. (2014) “Preferences and Utilities for Health States after Treatment for Oropharyngeal Cancer: Transoral Robotic Surgery versus Definitive (Chemo)radiotherapy”, Head Neck. 36(4):529–539. Reports EQ-5D utility values across treatment modalities; informs utility weights for treatment and NED states.

  • Govers et al. (2016) “Quality of Life after Different Procedures for Regional Control in Oral Cancer Patients: Cross-Sectional Survey”, Clin Otolaryngol. 41(3):228–235. EQ-5D-3L measurement in oral cavity OSCC patients following different regional treatment approaches.

  • Meregaglia & Cairns (2017) “A Systematic Literature Review of Health State Utility Values in Head and Neck Cancer”, Health Qual Life Outcomes. 15(1):174. DOI: 10.1186/s12955-017-0748-z. PMID: 28865475. Systematic review of 28 studies and 346 health state utility values in HNC; recommends EQ-5D as the preferred instrument. Confirms that evidence for recurrent and metastatic HNC states is sparse: the only patient-reported EQ-5D for recurrence found in the review is a median of 0.70 (del Barco et al. 2016), from a palliative-intent recurrent/metastatic cohort — a different clinical context from S7 (locoregional recurrence with potential salvage intent) in this model. The S7 utility of 0.55 is a modeller’s assumption informed by this evidence gap, not a directly reported value from this review.

  • Noel et al. (2015) “Comparison of Health State Utility Measures in Patients With Head and Neck Cancer”, JAMA Otolaryngol Head Neck Surg. 141(8):696–703. Prospective, cross-sectional, and longitudinal study of 100 patients with squamous cell carcinoma of the upper aerodigestive tract. Mean EQ-5D-5L = 0.82 three months post-treatment with no evidence of recurrence. Primary source for NED (S6) utility value.

  • Ramaekers et al. (2011) “The Impact of Late Treatment-Toxicity on Generic Health-Related Quality of Life in Head and Neck Cancer Patients after Radiotherapy”, Oral Oncol. 47(8):768–774. DOI: 10.1016/j.oraloncology.2011.05.012. Multi-centre cross-sectional survey; EQ-5D measured in HNC patients at least 6 months post-RT (late-effects survivorship cohort, not active treatment). Reports xerostomia and dysphagia as independent predictors of reduced utility post-RT. Retained as contextual reference for late RT toxicity; not the primary source for S4 (PORT) utility — that cohort does not represent patients during active adjuvant radiotherapy.

  • Sprave et al. (2022) “Patient Reported Outcomes Based on EQ-5D-5L Questionnaires in Head and Neck Cancer Patients: A Real-World Study”, BMC Cancer 22:1236. DOI: 10.1186/s12885-022-10346-4. PMID: 36447175. Freiburg University; n = 366 H&N cancer patients undergoing modern RT; prospective real-world PRO study; mean EQ-5D-5L at end of RT = 0.830 (SD not reported for full cohort). Primary source for S2 (Definitive RT) and S4 (PORT) utility weights = 0.83. Corroborated by Sprave et al. (2020, n=49): mean EQ-5D at RT completion = 0.828.

  • Sprave et al. (2020) “Characterization of Health-Related Quality of Life Based on the EQ-5D-5L Questionnaire in Head-and-Neck Cancer Patients Undergoing Modern Radiotherapy”, Expert Rev Pharmacoecon Outcomes Res. 20(6):673–682. DOI: 10.1080/14737167.2020.1823220. PMID: 32912005. Freiburg University Medical Center; n = 49 H&N cancer patients (57% definitive, 41% adjuvant RT); mean EQ-5D-5L at RT completion = 0.828 (SD 0.16). Corroborates Sprave 2022; supports S4 utility revision. Note: no significant permanent HRQOL deterioration observed at 6-month follow-up.

  • Truong et al. / RTOG 0522 (2017) “Quality of Life and Performance Status from a Substudy Conducted Within a Prospective Phase 3 Randomized Trial of Concurrent Accelerated Radiation Plus Cisplatin With or Without Cetuximab for Locally Advanced Head and Neck Carcinoma: NRG Oncology RTOG 0522”, Int J Radiat Oncol Biol Phys. 97(4):687–699. PMC5303682. PMID: 27727066. DOI: 10.1016/j.ijrobp.2016.08.003. Prospective phase III RCT, n = 818 analyzable patients (oropharynx, hypopharynx, larynx; Stage III–IV); EQ-5D-3L collected at pretreatment, last 2 weeks of treatment, 3 months, and annually. Reported EQ-5D values: baseline ~0.79 (CIS 0.78, CET/CIS 0.80); at 3 months from treatment start ~0.775 (CIS 0.78, CET/CIS 0.77); at 1 year ~0.84. Note: the EQ-5D value for the ‘within last 2 weeks of treatment’ time point was collected but not reported in the paper — the authors identified omission of an acute FACT-HN assessment as a study limitation. The 3-month value (0.775) is used as the closest available proxy for active-treatment utility in S5 (POCRT). Secondary limitation: cohort is definitively treated oropharynx/larynx (not postoperative oral cavity POCRT); adopted as best available CRT utility source. Comparison with Sprave et al. (2022): the Sprave 2022 adjuvant CRT cohort (baseline HI = 0.849, stable to RT completion; CRT vs RT-alone not significantly different at RT completion, p = 0.624) would support a value of ~0.83 for S5. The RTOG 0522-derived value of 0.775 is retained as a conservative estimate reflecting the higher toxicity of POCRT (high-dose cisplatin, post-surgical) relative to the mixed Sprave 2022 cohort. A sensitivity analysis using 0.83 would reduce the Surgery vs. Def RT QALY difference by ~0.009 QALYs. Primary source for S5 (POCRT) utility = 0.775. Not the primary source for S2 (RT alone); see Sprave et al. (2022).

To leave a comment for the author, please follow the link and comment on their blog: R Works.

Continue reading: Stage II OSCC — Health Economics Model
Why Most Time Series Models Fail Before They Start https://www.r-bloggers.com/2026/04/why-most-time-series-models-fail-before-they-start/ Wed, 15 Apr 2026 21:00:00 +0000 https://mfatihtuzen.netlify.app/posts/2026-04-16_timeseries_stationary/

[This article was first published on A Statistician's R Notebook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

1 A model can run and still be fundamentally wrong

Many time series models fail before they even begin. Not because the software crashes. Not because the code is wrong. But because the data entering the model violate one of the most important assumptions in time series analysis: stationarity.

This is where many analyses quietly go off the rails. A model is estimated, forecasts are produced, coefficients look serious, and the graphs appear convincing. But the model may be chasing a moving target rather than learning a stable data-generating mechanism.

In this post, we will work with a real macroeconomic series rather than a toy example. The data come from the Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL), published by the U.S. Bureau of Labor Statistics and distributed through FRED. FRED describes CPIAUCSL as a monthly, seasonally adjusted price index and notes that percent changes in the index are commonly used to measure inflation.

Because live API access may fail in some institutional or offline environments, this workflow uses a locally downloaded CSV file instead of fetching the series on the fly. You can download the file directly from the CPIAUCSL page on FRED.

The goal is simple: show why raw time series levels often mislead us, what stationarity really means, and why transformations such as differencing and log-differencing are not cosmetic tricks but conceptual necessities.

2 What stationarity really means

In informal language, a stationary series is one whose behavior does not drift in a systematic way over time. More formally, a weakly stationary process (X_t) satisfies three conditions:

  • E[X_t] = μ for all t,
  • Var(X_t) = σ² for all t,
  • Cov(X_t, X_{t+k}) = γ(k) for all t and every lag k.

The first condition says the mean does not change over time. The second says the variance is constant. The third says the covariance between observations depends only on the lag k, not on calendar time itself.

This matters because a large part of classical time series modeling is built on the idea that the stochastic structure is stable. When that structure is drifting, many familiar tools become unreliable or at least much harder to interpret. A trending series can generate strong autocorrelation even when the underlying dynamic structure is weak. A persistent upward path can trick the analyst into seeing “model fit” where the model is merely inheriting inertia from the level of the series.

Put differently: without stationarity, a model may explain movement without actually explaining the mechanism.
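A short simulation (added here for illustration, not part of the original post) makes the contrast concrete: an AR(1) process with |φ| < 1 is weakly stationary, while a random walk is the canonical unit-root process.

```r
# Stationary AR(1) vs. a random walk: compare half-sample means.
# For the AR(1), both halves hover near the same level; for the
# random walk, the local mean typically drifts between halves.
set.seed(123)
n <- 500
ar1       <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
rand_walk <- cumsum(rnorm(n))

half_means <- function(x) c(first = mean(x[1:(n/2)]), second = mean(x[(n/2 + 1):n]))
half_means(ar1)        # both halves close to 0
half_means(rand_walk)  # halves usually far apart
```

Because `arima.sim()` and `cumsum(rnorm())` come from base R's stats package, no extra packages are needed for this check.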

3 Load the CPI data from a CSV file

Download the CSV file for CPIAUCSL from the official FRED series page and save it in your working directory with the name CPIAUCSL.csv. The file typically includes the columns observation_date and CPIAUCSL. FRED is the distribution platform, while the source agency for the series is the U.S. Bureau of Labor Statistics.

library(readr)
library(dplyr)
library(ggplot2)
library(tibble)
library(zoo)
library(scales)
library(patchwork)
library(tseries)

cpi_tbl <- read_csv("CPIAUCSL.csv", show_col_types = FALSE) %>%
  transmute(
    date = as.Date(observation_date),
    cpi  = as.numeric(CPIAUCSL)
  ) %>%
  arrange(date) %>%
  filter(!is.na(date), !is.na(cpi))

cpi_tbl %>% slice_head(n = 5)
# A tibble: 5 × 2
  date         cpi
  <date>     <dbl>
1 1947-01-01  21.5
2 1947-02-01  21.6
3 1947-03-01  22  
4 1947-04-01  22  
5 1947-05-01  22.0

The line filter(!is.na(date), !is.na(cpi)) is important. If your CSV has an NA for a month such as October 2025, that observation is safely excluded from the analysis instead of silently breaking the workflow.
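As a tiny made-up illustration (the values below are invented, not taken from the CPI file), the row carrying an NA is simply dropped:

```r
library(dplyr)
library(tibble)

demo <- tibble(
  date = as.Date(c("2025-09-01", "2025-10-01")),
  cpi  = c(322.1, NA)  # hypothetical values; October is missing
)

demo %>% filter(!is.na(date), !is.na(cpi))
# Only the 2025-09-01 row remains
```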

4 Start with the visual story, not the test statistic

In time series analysis, the first serious diagnostic is often visual rather than formal. That is not because tests are unimportant. It is because plots let us see the basic character of the data before we start compressing everything into a p-value.

If a series has a visible trend, changing volatility, sudden level shifts, or unusual gaps, that already tells us something about whether a stationary model is likely to behave well.

4.1 The raw CPI level

p_level <- ggplot(cpi_tbl, aes(x = date, y = cpi)) +
  geom_line(linewidth = 0.9, color = "#1B4965") +
  labs(
    title = "U.S. CPI (CPIAUCSL): level series",
    subtitle = "Monthly, seasonally adjusted index from FRED",
    x = NULL,
    y = "Index"
  ) +
  scale_y_continuous(labels = label_number()) +
  theme_minimal(base_size = 12)

p_level

Even before applying a formal statistical test, the visual pattern already tells us something important. The CPI level series does not oscillate around a stable mean; instead, it follows a persistent upward path over time. This alone raises an immediate warning against modeling the raw level series as if it were stationary.

The graph also suggests that the increase is not perfectly uniform across the entire sample. In some periods, the slope becomes steeper, indicating faster price growth, while in others the series evolves more gradually. In other words, the series appears to contain not only a long-run trend but also changes in inflation dynamics over time.

This is precisely why visual inspection should be the first step in time series analysis. Before looking at test statistics or fitting a model, we should ask a simpler question: does the series look like it fluctuates around a constant level? In this case, the answer is clearly no.

A smooth and steadily rising curve may look statistically innocent at first glance, but in practice it is often a sign that the raw series is carrying trend information that must be addressed before modeling.

4.2 Rolling summaries to deepen the visual diagnosis

A single line plot is useful, but local summaries make the visual argument sharper. Below, I compute a 24-month rolling mean and rolling standard deviation.

cpi_roll <- cpi_tbl %>%
  mutate(
    roll_mean_24 = zoo::rollmean(cpi, k = 24, fill = NA, align = "right"),
    roll_sd_24   = zoo::rollapply(cpi, width = 24, FUN = sd, fill = NA, align = "right")
  )

p_roll_mean <- ggplot(cpi_roll, aes(date, roll_mean_24)) +
  geom_line(linewidth = 0.9, color = "#2A9D8F") +
  labs(
    title = "24-month rolling mean of CPI",
    x = NULL,
    y = "Rolling mean"
  ) +
  theme_minimal(base_size = 12)

p_roll_sd <- ggplot(cpi_roll, aes(date, roll_sd_24)) +
  geom_line(linewidth = 0.9, color = "#E76F51") +
  labs(
    title = "24-month rolling standard deviation of CPI",
    x = NULL,
    y = "Rolling SD"
  ) +
  theme_minimal(base_size = 12)

p_roll_mean / p_roll_sd

If the series were approximately stationary, we would expect these rolling statistics to fluctuate around relatively stable levels over time. In particular, the rolling mean should remain close to a constant value, and the rolling standard deviation should not exhibit systematic shifts.

However, the evidence here points in the opposite direction. The rolling mean shows a clear and persistent upward drift, reinforcing what we observed in the raw series: the central tendency is not stable, but evolving over time.

The rolling standard deviation tells a more nuanced story. While it remains relatively moderate for long periods, there are noticeable fluctuations and, more importantly, a pronounced spike in recent years. This indicates that the variability of the series is not constant and may respond to underlying economic conditions or shocks.

Taken together, these two plots suggest that the series violates the key assumptions of stationarity—both in terms of mean and variance. While rolling statistics alone do not formally prove non-stationarity, they provide strong visual evidence that the raw series is not suitable for direct modeling without transformation.

5 Why raw CPI levels are a good example

CPI is ideal for illustrating this problem because the level series typically trends upward over time. That is not a defect in the data; it is what a price index often does. But from a modeling perspective, it creates trouble.

If the level keeps drifting upward, then the mean is not constant. If the size of movements changes as the level rises, the variance may also appear unstable. In such a setting, fitting a model directly to the raw series can mix long-run inflationary drift with short-run dynamic behavior.

Economically, analysts are usually not interested in the index level itself as much as they are interested in inflation, that is, the rate at which the price level changes. Statistically, this is convenient too, because transforming the series from levels to changes often brings it closer to stationarity.

6 A statistical check: the Augmented Dickey-Fuller test

Visual diagnosis matters, but it is usually not enough. A commonly used statistical tool is the Augmented Dickey-Fuller (ADF) test, which tests for the presence of a unit root. In practical terms, the test is often used to assess whether a series behaves like a non-stationary process with persistent stochastic trend.

The null hypothesis of the ADF test is that the series has a unit root. That means the burden of proof is asymmetric:

  • a large p-value means we do not have strong evidence against non-stationarity,
  • a small p-value means the data are more consistent with stationarity.

That distinction is easy to say and easy to misuse. Failing to reject the null is not the same thing as proving a series is non-stationary beyond all doubt. It simply means the test did not find enough evidence against the unit-root view.
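A quick sanity check on simulated data (an illustration added here, not part of the original analysis) shows the asymmetry in action:

```r
library(tseries)

set.seed(1)
wn <- rnorm(300)          # white noise: stationary by construction
rw <- cumsum(rnorm(300))  # random walk: has a unit root

adf.test(wn)  # typically a small p-value: evidence against a unit root
adf.test(rw)  # typically a large p-value: fails to reject the unit root
```

Note that `adf.test()` may warn that the p-value is smaller (or greater) than what it can print; the reported value is truncated at tabulated bounds.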

Let us start with the raw CPI level.

adf_level <- tseries::adf.test(cpi_tbl$cpi)
adf_level
    Augmented Dickey-Fuller Test

data:  cpi_tbl$cpi
Dickey-Fuller = -0.1813, Lag order = 9, p-value = 0.99
alternative hypothesis: stationary

The Augmented Dickey–Fuller (ADF) test provides a formal way to assess whether the series contains a unit root. The null hypothesis of the test is that the series is non-stationary (i.e., it has a unit root), while the alternative hypothesis is stationarity.

In this case, the p-value is extremely high (p ≈ 0.99), meaning that we fail to reject the null hypothesis. In other words, there is no statistical evidence to support that the CPI level series is stationary.

However, this result should not be interpreted in isolation. Statistical tests and visual diagnostics should complement each other. The high p-value is entirely consistent with what we observed earlier: the series exhibits a strong upward trend and does not fluctuate around a constant mean.

Taken together, both the visual evidence and the ADF test point to the same conclusion — the raw CPI level behaves more like a drifting (unit root) process than a stationary one. This reinforces the need for transforming the series before attempting any meaningful modeling.

7 The first rescue: differencing

One of the oldest and most important ideas in time series analysis is that differencing can remove certain forms of trend. The first difference is

ΔX_t = X_t − X_{t−1}

This transformation asks a different question. Instead of modeling the level, we model the change from one period to the next.

cpi_diff_tbl <- cpi_tbl %>%
  mutate(diff_cpi = c(NA, diff(cpi))) %>%
  filter(!is.na(diff_cpi))

p_diff <- ggplot(cpi_diff_tbl, aes(x = date, y = diff_cpi)) +
  geom_line(linewidth = 0.8, color = "#6D597A") +
  labs(
    title = "First difference of CPI",
    subtitle = "Absolute month-to-month change in the index",
    x = NULL,
    y = expression(Delta*CPI)
  ) +
  theme_minimal(base_size = 12)

p_diff

Taking the first difference removes a large part of the visible trend in the series. Compared to the raw CPI level, the differenced series fluctuates much more around a relatively stable center, which is an encouraging sign from a modeling perspective.

However, differencing does not fully solve the problem. While it helps stabilize the mean, the variability of the series still appears to change over time, particularly in more recent periods where larger fluctuations are observed. This suggests that the series may still violate the constant variance assumption.

There is also a more subtle but important issue: interpretation. The first difference represents absolute changes in the index, not relative ones. In macroeconomic data, a one-point increase in CPI does not carry the same meaning when the index is around 100 versus when it exceeds 300. As the scale of the series grows, the same absolute change reflects a smaller proportional movement.

In other words, differencing improves the statistical properties of the series, but it does not yet provide a fully consistent or interpretable measure of change. This is why we often go one step further and consider transformations based on relative (percentage) changes.
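A line of arithmetic (added for illustration) makes the scale-dependence concrete:

```r
# The same one-point change in the index, expressed as a percentage:
100 * 1 / 100  # at an index level of 100: a 1% move
100 * 1 / 300  # at an index level of 300: only ~0.33%
```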

8 The more meaningful rescue: log differences

This is where the log transformation becomes more than a technical detail. Consider

Δ log X_t = log X_t − log X_{t−1} = log(X_t / X_{t−1})

For moderate changes, this is approximately the proportional growth rate. In the CPI context, it moves us from the language of index levels toward the language of inflation.

That shift is both statistical and economic.
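A quick numeric check of the approximation (added here for illustration):

```r
# log(X_t / X_{t-1}) is close to the exact growth rate for small changes:
log(102 / 100)  # ~0.0198, vs. exact growth of 0.02 (good approximation)
log(300 / 100)  # ~1.0986, vs. exact growth of 2.00 (approximation breaks down)
```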

cpi_log_tbl <- cpi_tbl %>%
  mutate(
    log_cpi = log(cpi),
    dlog_cpi = c(NA, diff(log_cpi)),
    annualized_inflation_pct = 1200 * dlog_cpi,
    yoy_inflation_pct = 100 * (cpi / lag(cpi, 12) - 1)
  )

p_dlog <- cpi_log_tbl %>%
  filter(!is.na(annualized_inflation_pct)) %>%
  ggplot(aes(x = date, y = annualized_inflation_pct)) +
  geom_line(linewidth = 0.8, color = "#D62828") +
  labs(
    title = "Monthly log-difference of CPI (annualized)",
    subtitle = "A close cousin of short-run inflation",
    x = NULL,
    y = "Percent"
  ) +
  theme_minimal(base_size = 12)

p_yoy <- cpi_log_tbl %>%
  filter(!is.na(yoy_inflation_pct)) %>%
  ggplot(aes(x = date, y = yoy_inflation_pct)) +
  geom_line(linewidth = 0.8, color = "#F4A261") +
  labs(
    title = "Year-over-year CPI inflation",
    subtitle = "A slower-moving inflation measure",
    x = NULL,
    y = "Percent"
  ) +
  theme_minimal(base_size = 12)

p_dlog / p_yoy

Two key insights emerge from these transformations.

First, moving from levels to rates of change fundamentally improves interpretability. The log-difference series represents approximate percentage changes — in this context, a close proxy for short-run inflation. This is the quantity economists actually care about. A 1% increase has the same meaning regardless of whether the index is at 100 or 300, making comparisons over time much more meaningful.

Second, the transformation has a clear impact on the statistical properties of the series. Compared to the raw level and even the first difference, the log-differenced series fluctuates more consistently around a stable mean. While it still exhibits volatility spikes and occasional outliers, the overall behavior is much closer to what we would expect from a stationary process.

The comparison between the two plots is also instructive. The monthly log-difference captures short-term fluctuations and reacts quickly to shocks, while the year-over-year inflation series smooths out this noise and highlights longer-term inflation dynamics. Both are useful, but they answer different questions.

To put it bluntly: you did not just transform the data — you changed the question.

9 Re-test after transformation

Let us apply the ADF test again, this time to the log-differenced series.

adf_dlog <- cpi_log_tbl %>%
  filter(!is.na(dlog_cpi)) %>%
  pull(dlog_cpi) %>%
  tseries::adf.test()

adf_dlog
    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -4.3862, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary

The contrast between the two ADF test results is striking and highly informative.

For the raw CPI level, we failed to reject the null hypothesis of a unit root, indicating that the series behaves as a non-stationary process. In contrast, for the log-differenced series, the p-value drops to around 0.01, allowing us to reject the null hypothesis and conclude that the transformed series is consistent with stationarity.

This shift is not just a technical detail — it reflects a fundamental change in how the data behaves. The transformation has effectively removed the persistent trend component and brought the series closer to a stable statistical structure.

That said, the test result should always be interpreted alongside the visual evidence. The ADF test provides formal confirmation, but the intuition comes from the plots. What we saw visually — a drifting level series versus a mean-reverting transformed series — is now supported by statistical testing.

In essence, the workflow comes full circle:
we start with a problematic series, diagnose the issue visually, apply a transformation, and then verify the improvement formally.

This is the core of time series thinking.

10 A subtle but crucial point: transformation changes interpretation

This is the point where many explanations remain superficial.

When you difference a series, you are not merely “cleaning” it — you are redefining the object of analysis.

  • Modeling CPI levels asks how the price index evolves over time.
  • Modeling first differences asks how much the index changes from one period to the next.
  • Modeling log differences asks about proportional change, which is directly linked to inflation.

These are not equivalent statistical questions, and they are certainly not equivalent economic questions.

This is why time series preprocessing should never be treated as a mechanical step. Every transformation involves a trade-off: it improves certain statistical properties while simultaneously altering the meaning of the data.

Understanding that trade-off is not optional — it is central to sound time series analysis.

11 Why this matters for ARIMA-style modeling

ARIMA models are often presented as if the workflow were mechanical: inspect the series, difference if needed, identify orders, estimate parameters, check residuals, and forecast. While this workflow is useful, it can create the misleading impression that differencing is simply a procedural step — a box to tick.

It is not.

Differencing is a deliberate modeling choice. Its purpose is to separate persistent, trend-like behavior from shorter-run dynamics. If you skip it when it is needed, your model may inherit non-stationarity and produce unreliable or misleading inference. If you apply it excessively, you risk removing meaningful structure and end up modeling noise.

The real question, therefore, is not “Should I difference?” but rather:
What feature of the data am I trying to stabilize, and what question do I want the model to answer?

12 A compact comparison


Series version        | What it represents               | Typical issue                    | When it helps (and when it does not)
CPI level             | The price index itself           | Strong trend, likely unit root   | Poor starting point for stationary modeling
First difference      | Absolute period-to-period change | Still scale-dependent            | Reduces trend, but interpretation remains limited
Log difference        | Approximate proportional change  | May still show volatility bursts | More suitable for modeling inflation-type dynamics
Year-over-year change | Annual percentage change         | Smoother, less responsive        | Useful for communication, less suited for short-run analysis

14 Common mistakes

Most mistakes in time series analysis are not computational — they are conceptual.

Mistake 1: fitting models directly to raw levels because the plot “looks smooth.”
Smoothness is not stationarity. A strong trend can produce visually smooth series that are statistically problematic.

Mistake 2: treating differencing as a harmless default.
Differencing changes the meaning of the data. It may improve statistical properties while quietly reducing interpretability if applied without care.

Mistake 3: relying on a single test result as final truth.
The ADF test is useful, but it is only one piece of evidence. Visual inspection, domain knowledge, structural breaks, and alternative tests all matter.

Mistake 4: forgetting the economics.
In the case of CPI, the focus is typically on inflation, not the index level itself. A good transformation is one that improves statistical validity while remaining aligned with the economic question.

Taken together, these mistakes point to a simple lesson:
time series analysis is not about applying steps — it is about making informed choices.

15 Final thoughts

Most time series models do not fail because we cannot estimate them. They fail because we model the wrong object.

The raw CPI series is a clear reminder that not every observed series is ready for modeling. A trending index is rarely an appropriate input for a stationary model. Once we difference — and especially log-difference — the data, the series becomes more interpretable, more stable, and much closer to the type of process that classical time series methods are designed to handle.

So before asking whether your model is sophisticated enough, ask a more fundamental question:

Am I modeling a stable process — or just chasing drift?

In many cases, the answer to this question matters far more than whether you choose AR(1), ARIMA(1,1,1), or any other fashionable specification.

16 References and further reading

16.1 Data sources

  • U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL), monthly, seasonally adjusted; distributed through FRED (Federal Reserve Bank of St. Louis).

16.2 Core time series references

  • Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. Wiley.

  • Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.).
    https://otexts.com/fpp3/

  • Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.


16.3 Stationarity and unit root testing

  • Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association.

  • Said, S. E., & Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika.


16.4 Transformations and interpretation

  • Stock, J. H., & Watson, M. W. (2019). Introduction to Econometrics. Pearson.

  • Tsay, R. S. (2010). Analysis of Financial Time Series. Wiley.


16.5 Practical R resources

  • Documentation for the R packages used in this post: readr, dplyr, ggplot2, zoo, patchwork, and tseries.

16.6 Suggested next steps for readers

If you want to go deeper, consider exploring:

  • Unit root tests beyond ADF (KPSS, Phillips–Perron)
  • Structural breaks and regime changes
  • Seasonal differencing and SARIMA models
  • Volatility modeling (ARCH/GARCH)

These topics build directly on the ideas discussed in this article and will deepen your understanding of time series behavior.

To leave a comment for the author, please follow the link and comment on their blog: A Statistician's R Notebook.

Continue reading: Why Most Time Series Models Fail Before They Start
logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines/ Wed, 15 Apr 2026 08:05:50 +0000 https://guillaumepressiat.github.io/blog/2026/04/logrittr

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


dplyr verbs are descriptive: let’s make them more verbose!

Yet another pipe for R.






Motivation

In SAS, every DATA step prints a log:

NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 7153 observations were deleted.
NOTE: The data set WORK.RESULT has 112847 observations and 11 variables.

R’s dplyr pipelines are silent. logrittr fills that gap with %>=%, a drop-in pipe that logs row counts, column counts, added/dropped columns, and timing at every step, with no function masking.

With Fira Code ligatures, %>=% renders as a single wide arrow that looks like %>% with an underline added: like a subtitle, it lets you read between the lines of a pipeline and see what happened.

Multiple contexts

Things happen:

NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 120000 observations were deleted.
NOTE: The data set WORK.RESULT has 0 observations and 11 variables.

“This is where we lost all the rows during script execution.”

Pros

Reading this log long after a script has run helps you to:

  • see what happened at each stage of data processing without rerunning the code, for example in a production environment where the input data is constantly changing
  • monitor key processes
  • explain what happened (during an audit, for example)

In professional contexts this is often needed.

Educational

A console log also makes pipelines clearer for those with little experience of the tidyverse: people taking their first steps in programming by following a tutorial or teaching themselves.

Installation

install.packages('logrittr', repos = 'https://guillaumepressiat.r-universe.dev')

# or from github
# devtools::install_github("GuillaumePressiat/logrittr")

See github or r-universe.

Usage

library(logrittr)
library(dplyr)

iris %>=%
  as_tibble() %>=%
  filter(Sepal.Length < 5)  %>=%
  mutate(rn = row_number()) %>=%
  semi_join(
    iris %>% as_tibble() %>=%
      filter(Species == "setosa"),
    by = "Species"
  )  %>=%
  group_by(Species) %>=%
  summarise(n = n_distinct(rn))
── iris  [rows:       150  cols:    5] ─────────────────────────────────────────────────────
ℹ as_tibble()                            rows:       150 +0        cols:    5 +0    [   0.0 ms]
ℹ filter(Sepal.Length < 5)               rows:        22 -128      cols:    5 +0    [   3.0 ms]
ℹ mutate(rn = row_number())              rows:        22 +0        cols:    6 +1    [   1.0 ms]
  added: rn
ℹ > filter(Species == "setosa")          rows:        50 -100      cols:    5 +0    [   1.0 ms]
ℹ semi_join(iris %>% as_tibble() %>=%    rows:        20 -2        cols:    6 +0    [   5.0 ms]
  filter(Species == "setosa"), by =
  "Species")
ℹ group_by(Species)                      rows:        20 +0        cols:    6 +0    [   3.0 ms]
ℹ summarise(n = n_distinct(rn))          rows:         1 -19       cols:    2 -4    [   2.0 ms]
  dropped: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, rn
  added: n

Screenshot


library(dplyr)
library(logrittr)

logrittr_options(lang = "en", big_mark = ",", wrap_width = NULL, max_cols = 3)

nycflights13::flights %>=% 
  as_tibble() %>=%
  group_by(year, month, day) %>=% 
  count() %>=% 
  tidyr::pivot_wider(values_from = "n", names_from = "day") %>=% 
  glimpse()

tidylog is a really neat package that motivated this one. However, tidylog works by masking dplyr functions, which is not ideal to me.

This was also an opportunity for me to test one of the new programming tools that are widely used at the moment.

logrittr uses a custom pipe operator and never touches the dplyr namespace. Its console output is colorful and informative thanks to the cli package.

Working with lumberjack

If you already know the lumberjack package, logrittr is compatible with it (timings are approximate).

Calling logrittr_logger$new():

library(lumberjack)
library(dplyr)

l <- logrittr_logger$new(verbose = TRUE)
logfile <- tempfile(fileext=".-r.log.csv")

iris %L>%
   start_log(log = l, label = "iris step") %L>%
   as_tibble() %L>%
   filter(Sepal.Length < 5) %L>%
   mutate(rn = row_number()) %L>%
   group_by(Species) %L>%
   summarise(n = n_distinct(rn)) %L>%
   dump_log(file=logfile, stop = FALSE)
   

mtcars %>% 
  start_log(log = l, label = "mtcars step") %L>%
   count() %L>%
   dump_log(file=logfile, stop = TRUE)


logdata <- read.csv(logfile)

This writes the logrittr log content of multiple data steps to the same CSV file.

Limitations

  • Like tidylog, logrittr only works with dplyr pipelines on in-memory R data.frames; it cannot handle dbplyr pipelines on databases (remote/lazy tables).

  • The join cardinalities that tidylog reports so nicely are difficult to obtain from the pipe, since the join has already happened by the time we see it; for now we only show how row and column counts evolve (before/after).

  • Yes, it is another pipe, which is not ideal. We can dream of a with_logging(TRUE) context that would activate logrittr’s behaviour inside |> or %>%.

Take another pipe for a spin

logrittr prioritizes the user experience with a structured and colorful display in the console.

For now, this package is just a proof of concept that gave me a chance to experiment a bit with the cli package and a few other things. But I think there’s a need for this in R, in a specific area where SAS outputs are so informative.

To leave a comment for the author, please follow the link and comment on their blog: Guillaume Pressiat.

Dealing with correlation in designed field experiments: part II https://www.r-bloggers.com/2026/04/dealing-with-correlation-in-designed-field-experiments-part-ii-4/ Wed, 15 Apr 2026 00:00:00 +0000 https://www.statforbiology.com/2026/correlation/

[This article was first published on R on Fixing the bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

With field experiments, studying the correlation between the observed traits may not be an easy task. For example, we can consider a genotype experiment, laid out in randomised complete blocks, with 27 wheat genotypes and three replicates, where several traits were recorded, including yield (Yield) and weight of thousand kernels (TKW). We might be interested in studying the correlation between those two traits, but we would need to face two fundamental problems:

  1. the concept of correlation in such a setting is not unique, as we might either consider the correlation between the plot measurements, or the correlation between the residuals or the correlation between genotype means or the correlation between block means;
  2. the experimental units are not independent, but are grouped by genotype and block, which invalidates all inferences based on the independence assumption.

I have dealt with these two problems (particularly the first one) in a previous post, where I gave a solution based on traditional methods of data analyses.

In this post, I would like to present a more advanced solution, advocated by Hans-Peter Piepho in a relatively recent manuscript (Piepho, 2018). Such a solution is based on mixed models and was implemented in SAS, using PROC MIXED. I spent a few hours ‘transporting’ those models to R, which turned out to be a difficult task, although, in the end, I seem to have come to an acceptable solution, which I would like to share here.

First of all, we can load the ‘WheatQuality’ dataset, that is available in the ‘statforbiology’ package; it consists of 81 records (plots) of 6 variables, i.e. the Genotype and Block factors, as well as the four responses height, TKW, weight per hectolitre and yield. The code below loads the necessary packages, the data and transforms the numeric variable ‘Block’ into a factor.

rm(list = ls())
library(statforbiology)
library(dplyr)
library(tidyr)
library(sommer)
library(nlme)
#
# Loading data
dataset <- getAgroData("WheatQuality") |>
  mutate(Block = factor(Block),
         Genotype = factor(Genotype))
head(dataset)
##     Genotype Block Height  TKW Whectol Yield
## 1 arcobaleno     1     90 44.5    83.2 64.40
## 2 arcobaleno     2     90 42.8    82.2 60.58
## 3 arcobaleno     3     88 42.7    83.1 59.42
## 4       baio     1     80 40.6    81.8 51.93
## 5       baio     2     75 42.7    81.3 51.34
## 6       baio     3     76 41.1    81.1 47.78

Piepho (2018) showed that, for an experiment like this one, all the correlation coefficients can be estimated by coding a multi-response mixed model, as follows:

\[ Y_{ijk} = \mu_i + \beta_{ik} + \tau_{ij} + \epsilon_{ijk}\]

where \(Y_{ijk}\) is the response for the trait \(i\), the genotype \(j\) and the block \(k\), \(\mu_i\) is the mean for the trait \(i\), \(\beta_{ik}\) is the effect of the block \(k\) and trait \(i\), \(\tau_{ij}\) is the effect of genotype \(j\) for the trait \(i\) and \(\epsilon_{ijk}\) is the residual for the trait \(i\), the genotype \(j\) and the block \(k\).

In the above model, the residuals \(\epsilon_{ijk}\) need to be normally distributed and heteroscedastic, with trait-specific variances. Furthermore, residuals belonging to the same plot (the two observed values for the two traits) need to be correlated (correlation of errors).

Hans-Peter Piepho, in his paper, put forward the idea that the ‘genotype’ and ‘block’ effects for the two traits can be taken as random, which makes sense, because, for this application, we are mainly interested in variances and covariances. Both random effects (for the genotype and for the block terms) need to be heteroscedastic (trait-specific variance components) and there must be a correlation between the two traits.

It should be noted that, for other applications, the genotype and block effects (especially the former) might be better regarded as fixed, but we will not pursue such an idea in this post.

Fitting a bivariate model

To the best of my knowledge, there is no way to fit such a complex model with the ‘nlme’ package and the related ‘lme()’ function (I’ll give a hint later on for a simpler model). In a previous post at this link, I gave a solution based on the ‘asreml’ package (Butler et al., 2018), but this is a paid option. More recently I discovered the ‘sommer’ package (Covarrubias-Pazaran, 2016), which seems very useful and suitable for the requirements of this post. The key function of ‘sommer’ is mmer() and, in order to fit the above model, we need to specify the following components.

  1. The response variables. When we set a bivariate model with ‘sommer’, we can ‘cbind()’ Yield and TKW.
  2. The fixed model, which contains no effects but the intercept (by default, the means for the two traits are estimated separately, as in Piepho, 2018).
  3. The random model, which is composed of the ‘genotype’ and ‘block’ effects. For both, I specified a general unstructured variance-covariance matrix, so that we can estimate two different variance components (one per trait) and one covariance component. The resulting coding is ‘~ vsr(usr(Genotype)) + vsr(usr(Block))’.
  4. The residual structure, where the two observations in the same plot are heteroscedastic and correlated. This structure is fitted by default and it does not require any specific coding.

The model call is:

mod.bimix <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(usr(Genotype)) + vsr(usr(Block)),
                   rcov = ~ vsr(units),  
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
bimix.obj <- summary(mod.bimix)
coefs <- bimix.obj$varcomp
coefs
##                           VarComp  VarCompSE     Zratio Constraint
## u:Genotype.Yield-Yield 77.6342608 22.0978545  3.5132035   Positive
## u:Genotype.Yield-TKW   38.8320973 15.0930429  2.5728475   Unconstr
## u:Genotype.TKW-TKW     53.8613303 15.3539585  3.5079768   Positive
## u:Block.Yield-Yield     3.7104682  3.9363372  0.9426195   Positive
## u:Block.Yield-TKW      -0.2428322  1.9074202 -0.1273092   Unconstr
## u:Block.TKW-TKW         1.6684549  1.8343512  0.9095613   Positive
## u:units.Yield-Yield     6.0939217  1.1951482  5.0988836   Positive
## u:units.Yield-TKW       0.1635821  0.7242898  0.2258518   Unconstr
## u:units.TKW-TKW         4.4718011  0.8770118  5.0989065   Positive

The box above shows the results about the variance-covariance parameters. In order to get the correlations, I used the delta method, as implemented in the gnlht() function of the ‘statforbiology’ package (the accompanying package for this blog). First of all I extracted the variance parameters together with the covariance matrix for the variance parameters from the mixed model object. For simplicity, I assigned simple names to the coefficients (V1, V2, … Vn), according to their ordering in model output.

# Correlation between genotype means
coefsVec <- coefs[,1]
vcovMat <- mod.bimix$sigmaSE # Variance-covariance for varcomp
names(coefsVec) <- paste("V", 1:9, sep = "")
gnlht(coefsVec, func = list(~ V2 / (sqrt(V1)*sqrt(V3) )),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form  Estimate        SE  Z-value      p-value
## 1 V2/(sqrt(V1) * sqrt(V3)) 0.6005174 0.1306699 4.595684 4.313326e-06
#
# Correlation between block means
gnlht(coefsVec, func = list(~ V5 / (sqrt(V4)*sqrt(V6) ) ),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form    Estimate        SE   Z-value   p-value
## 1 V5/(sqrt(V4) * sqrt(V6)) -0.09759658 0.7571256 0.1289041 0.8974335
#
# Correlation of residuals
gnlht(coefsVec, func = list(~ V8 / (sqrt(V7)*sqrt(V9) )),
      vcov. = as.matrix(vcovMat),
      parameterNames = paste("V", 1:9, sep = ""))
##                       Form   Estimate        SE   Z-value   p-value
## 1 V8/(sqrt(V7) * sqrt(V9)) 0.03133619 0.1385421 0.2261854 0.8210572

We see that the estimates are very close to those obtained by using the Pearson’s correlation coefficients (see my previous post). The advantage of this mixed model solution is that we can also test hypotheses in a relatively reliable way. For example, we can look at the Wald tests in the output above to judge about the significance of correlations and conclude that only the genotype means are significantly correlated to one another.
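To make the Wald test explicit: it is simply the estimate divided by its delta-method standard error, compared against a standard normal. Using the values from the gnlht() output above for the genotype-level correlation:

```r
# Wald z-test for the genotype-level correlation
# (estimate and SE taken from the gnlht() output above)
est <- 0.6005174   # estimated correlation
se  <- 0.1306699   # delta-method standard error
z   <- est / se              # ~4.60, matching the reported Z-value
p   <- 2 * pnorm(-abs(z))    # two-sided p-value, ~4.3e-06
```

The block-level and residual correlations fail this test in the same way, which is why only the genotype-level correlation is declared significant.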

A solution with ‘lme()’

Considering that the block means are not correlated, if we were willing to take the ‘block’ effect as fixed, we could fit a bivariate mixed model also with the ‘nlme’ package and the function lme() (Pinheiro and Bates, 2000). However, we should cast the model as a ‘univariate’ model.

To this aim, the two variables (Yield and TKW) need to be piled up and a new factor (‘Trait’) needs to be introduced to identify the observations for the two traits. Another factor is also necessary to identify the different plots, i.e. the observational units. To perform such a restructuring, I used the pivot_longer() function in the ‘tidyr’ package (Wickham et al., 2024) and assigned the name ‘Y’ to the response variable, that is now composed by the two original variables Yield and TKW, one on top of the other.

dataset$Plot <- 1:81
mdataset <- dataset |>
  select(-Whectol, -Height) |>
  pivot_longer(names_to = "Trait", values_to = "Y", cols = c("Yield", "TKW")) |>
  mutate(Trait = factor(Trait))
head(mdataset)
## # A tibble: 6 × 5
##   Genotype   Block  Plot Trait     Y
##   <fct>      <fct> <int> <fct> <dbl>
## 1 arcobaleno 1         1 Yield  64.4
## 2 arcobaleno 1         1 TKW    44.5
## 3 arcobaleno 2         2 Yield  60.6
## 4 arcobaleno 2         2 TKW    42.8
## 5 arcobaleno 3         3 Yield  59.4
## 6 arcobaleno 3         3 TKW    42.7
tail(mdataset)
## # A tibble: 6 × 5
##   Genotype Block  Plot Trait     Y
##   <fct>    <fct> <int> <fct> <dbl>
## 1 vitromax 1        79 Yield  54.4
## 2 vitromax 1        79 TKW    41.6
## 3 vitromax 2        80 Yield  51.0
## 4 vitromax 2        80 TKW    43.6
## 5 vitromax 3        81 Yield  48.8
## 6 vitromax 3        81 TKW    43.1

The fixed model is:

Y ~ Trait*Block

In order to introduce a totally unstructured variance-covariance matrix for the random effect, I used the ‘pdMat’ construct:

random = list(Genotype = pdSymm(~Trait - 1))

Regarding the residuals, heteroscedasticity can be included by using the ‘weight’ argument and the ‘varIdent’ variance function, which allows a different standard deviation per trait:

weight = varIdent(form = ~1|Trait)

I also allowed the residuals to be correlated within each plot, by using the ‘correlation’ argument and specifying a general ‘corSymm()’ correlation form (with only two observations per plot, this is equivalent to the compound-symmetry form ‘corCompSymm()’ used in the final call). Plots are nested within genotypes, thus I used a nesting operator, as follows:

correlation = corSymm(form = ~1|Genotype/Plot)

The final model call is:

mod <- lme(Y ~ Trait*Block, 
                 random = list(Genotype = pdSymm(~Trait - 1)),
                 weight = varIdent(form=~1|Trait), 
                 correlation = corCompSymm(form=~1|Genotype/Plot),
                 data = mdataset)

Retrieving the variance-covariance matrices needs some effort, as the function ‘getVarCov()’ does not appear to work properly with this multi-stratum model. First of all, we can retrieve the variance-covariance matrix for the genotype random effect (G) and the corresponding correlation matrix.

#Variance structure for random effects
obj <- mod
G <- matrix( as.numeric(getVarCov(obj)), 2, 2 )
G
##          [,1]     [,2]
## [1,] 53.86053 38.83124
## [2,] 38.83124 77.63402
cov2cor(G)
##           [,1]      [,2]
## [1,] 1.0000000 0.6005096
## [2,] 0.6005096 1.0000000

Second, we can retrieve the ‘conditional’ variance-covariance matrix (R), that describes the correlation of errors:

#Conditional variance-covariance matrix (residual error)
V <- corMatrix(obj$modelStruct$corStruct)[[1]] #Correlation for residuals
sds <- 1/varWeights(obj$modelStruct$varStruct)[1:2]
sds <- obj$sigma * sds #Standard deviations for residuals (one per trait)
R <- t(V * sds) * sds #Going from correlation to covariance
R
##           [,1]      [,2]
## [1,] 6.0939152 0.1634968
## [2,] 0.1634968 4.4718077
cov2cor(R)
##            [,1]       [,2]
## [1,] 1.00000000 0.03131984
## [2,] 0.03131984 1.00000000

The total correlation matrix is simply obtained as the sum of G and R:

Tr <- G + R
cov2cor(Tr)
##          [,1]     [,2]
## [1,] 1.000000 0.555787
## [2,] 0.555787 1.000000

We see that the same results can be obtained by using ‘sommer’ and regarding the block effect as fixed, although the coding below is much neater!

mod.bimix5 <- mmer(cbind(Yield, TKW) ~ Block,
                   random = ~ vsr(usr(Genotype)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
mod.bimix5$sigma
## $`u:Genotype`
##          Yield      TKW
## Yield 77.63425 38.83209
## TKW   38.83209 53.86133
## 
## $units
##           Yield       TKW
## Yield 6.0939217 0.1635824
## TKW   0.1635824 4.4718011

Hope this was useful… should you have any better solutions, I’d be happy to learn them; please, drop me a line at the address below. Thanks for reading and happy coding!

And … don’t forget to check out my new book!

Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it

Book cover


References

  1. Butler, D., Cullis, B.R., Gilmour, A., Gogel, B., Thomson, R., 2018. ASReml-r reference manual - version 4. UK.
  2. Covarrubias-Pazaran, G., 2016. Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer. PLOS ONE 11, e0156744. https://doi.org/10.1371/journal.pone.0156744
  3. Piepho, H.-P., 2018. Allowing for the structure of a designed experiment when estimating and testing trait correlations. The Journal of Agricultural Science 156, 59–70.
  4. Pinheiro, J.C., Bates, D.M., 2000. Mixed-effects models in s and s-plus. Springer-Verlag Inc., New York.
  5. Wickham, H., Vaughan, D., Girlich, M., 2024. tidyr: Tidy Messy Data. R package version 1.3.1. https://doi.org/10.32614/CRAN.package.tidyr, https://CRAN.R-project.org/package=tidyr

This post was originally published on 2025-02-10 and updated on 2026-04-15

To leave a comment for the author, please follow the link and comment on their blog: R on Fixing the bridge between biologists and statisticians.

Programming with LLMs in R & Python https://www.r-bloggers.com/2026/04/programming-with-llms-in-r-python/ Tue, 14 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/programming-llms-r-python/

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Working with LLMs in Practice

Large Language Models are becoming part of everyday data science work. But using them through chat interfaces is only one part of the picture.

In this upcoming webinar, we focus on how to work with LLMs programmatically, using R and Python to integrate them into real workflows and applications.

Secure your place by registering through the webinar registration form

What We’ll Cover

We begin with a short introduction to how LLMs work, including how they are priced, where they perform well, and where they can fall short.

From there, the session moves into practical examples of working with LLMs in code:

  • Sending prompts to an LLM API from R using the {ellmer} package
  • Including additional instructions through system prompts
  • Structuring prompts to return clean, tabular outputs
  • Summarising images and PDFs using LLMs

While the examples will focus primarily on R, we will also briefly explore the {chatlas} package for Python, which offers similar functionality.
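As a flavour of what programmatic access looks like, here is a minimal sketch using {ellmer}. It assumes the package is installed, that an OpenAI API key is set in your environment, and the model name is only an example; the exact prompts and models covered in the webinar may differ.

```r
library(ellmer)

# Create a chat object; the system prompt steers every subsequent response
chat <- chat_openai(
  model = "gpt-4o-mini",  # example model name; substitute your own
  system_prompt = "You are a concise R tutor. Answer in one sentence."
)

# Send a prompt and print the model's reply to the console
chat$chat("What does the %in% operator do in R?")
```

The same chat object keeps the conversation state, so repeated `chat$chat()` calls build on earlier turns, which is what makes it usable inside scripts and pipelines.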

Why This Matters

Using LLMs through chat tools is useful for exploration, but it has limits.

For data scientists and developers, the value comes from:

  • Automating repetitive tasks

  • Embedding LLMs into applications and pipelines

  • Generating structured outputs that can be reused downstream

This webinar focuses on that shift, from interactive use to integration in code.

Who Should Attend

This webinar is suitable for:

  • Data scientists working with R or Python

  • Developers interested in integrating AI into applications

  • Teams exploring how to move from experimentation to production

No prior experience with LLM APIs is required, but familiarity with R or Python will be helpful.

Webinar Details

  • Date: 23rd April 2026
  • Time: 1:15 PM (UK time)
  • Location: Online
  • Cost: Free

Speaker

The session will be led by Myles Mitchell, Principal Data Scientist at Jumping Rivers.

If you would like to explore these topics further, our 6-hour course, LLM-Driven Applications with R and Python, covers:

  • Building LLM-powered dashboards
  • Setting up retrieval-augmented generation (RAG) pipelines
  • Responsible use of AI

Join Us

LLMs are quickly becoming part of the standard toolkit for data science.

Understanding how to use them programmatically opens up far more possibilities than using them through chat alone.

This session is designed to give you a clear starting point.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Marathon Man II: how to pace a marathon https://www.r-bloggers.com/2026/04/marathon-man-ii-how-to-pace-a-marathon/ Tue, 14 Apr 2026 12:30:00 +0000 https://quantixed.org/?p=3760
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

It’s often the way. I posted recently about how to pace a marathon and very quickly received feedback that would’ve improved the original post. Oh well, no going back. This is take two.

So, we have a dataset of all runners from the 2025 New York City Marathon. We want to know how you should pace a marathon. What is the best strategy?

Determining your optimal pace is complex. There’s the theoretical pace that you can achieve – a mix of biomechanics, physiology and training – but it can be very hard to know what this pace is. Anyway, this theoretical pace is what you could achieve when all goes well. You need to factor in the conditions on the day – how you slept, how you fuel, mental attitude, is it windy? can you get in a group and work with others? and so on. A runner may toe the line in the shape to run a sub 3 h marathon, but by the 30 km mark, the story may be very different.

In the last post, we saw that positive splitting (otherwise known as slowing down) is inevitable. So it seems the best strategy is to start out faster than your goal pace and bank some time to account for the fade.

A reader responded with this insightful comment:

What I question, though, is whether a (very thorough) analysis of how marathons get run tells us much about how they should be run? This seems to be saying, “Forget about an optimal pace, here’s how to compensate for the sub-optimal pace you’re going to run despite your plans.”

This is correct. Any post hoc analysis like this can only tell us how the marathon was run, not how it should be run. This is because we don’t know the intention of any runner in the dataset. If we did, then we would know how a runner intended to run the race (i.e. what their pacing strategy was) and then we could ask: did that work out for them?

If only we knew their intention… hmm…

The idea

The sub-3 marathon is one of the big goals in running. That is, trying to run it in less than 3 hours. So we know that there are a bunch of runners in the dataset trying to do just that. We know the finish times too. So by definition, the runners finishing between 02:55:00 and 03:00:00 were the folks shooting for sub-3 and who achieved it, while those finishing between 03:00:00 and 03:05:00 were those who didn’t make it. Sure, there will be some in this window who were hoping for 02:50:00 and failed and some who were hoping to do 03:10:00 and ran amazingly well. But by narrowing the window to 5 min either side of 3 h, we have fewer of those than if we took 10 min either side.

If we assume that runners in the 02:55:00 to 03:05:00 finishing window intended to run for a finish time of 3 h, we can analyse how they paced the marathon and how it worked out for them.
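Carving out this window can be sketched with dplyr. The `runners` tibble and its `finish_s` column (finish time in seconds) are placeholders, not the actual columns of the NYC Marathon dataset:

```r
library(dplyr)

# Placeholder data: one row per runner, finish time in seconds
runners <- tibble(
  runner_id = 1:4,
  finish_s  = c(10380, 10620, 10980, 11220)  # 02:53, 02:57, 03:03, 03:07
)

# Keep the 02:55:00-03:05:00 window and label the outcome
window <- runners |>
  filter(finish_s >= 2 * 3600 + 55 * 60,    # at or after 02:55:00
         finish_s <  3 * 3600 + 5 * 60) |>  # before 03:05:00
  mutate(outcome = if_else(finish_s < 3 * 3600, "Sub 3", "Over 3"))
```

Here the first and last runners fall outside the window and are dropped; the remaining two are labelled "Sub 3" and "Over 3" respectively.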

Analysing this window also has the advantage that runners of this calibre know how to pace well, compared to those trying for 03:30:00 or 04:00:00. There are also plenty of them, given that it is a popular goal.

So let’s take a look. Plots first and then the code below.

Going for sub-3

We’ll use the difference from goal pace to visualise runners’ progress. The goal pace here is ~04:15/km. Below 0 means running ahead of pace (banking time) and above 0 means being behind schedule.

We colour the runners by whether they made it, sub-3 (red), or failed and went over 3 (blue).

770 runners were sub 3 whereas 628 were over 3. This can be difficult to see, so let’s take a different view.

For both outcomes we have several runners who were clearly shooting for a faster time, but something went wrong and they ended up in our window. They appear as U-shapes in the difference plots. Rather than remove them, we’ll accept these contaminants and assume that most folks in this window are shooting for a 3 h finish.

We can see different pacing as the race progresses for different runners. Some folks are behind schedule but end up making sub-3, others are ahead of time and fail. To answer our question we need to know: what is the best strategy?

On-pace, positive split, negative split?

We’ll take 10 km as our marker point. It’s almost one-quarter of the way through. Any excitement of the start, with all the crowds messing up pacing, is done, and we can see at this point who is intending to run at what pace. Let’s say that “on pace” means within 2 s/km of goal pace. So at 10 km, an “on pace” runner could be ± 20 s from where they should be (00:42:40). If they are more than 20 s slower than that, we’ll say they are behind pace; if more than 20 s faster, they are ahead of pace.
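That classification can be sketched as follows. The `split_10k_s` column (each runner’s elapsed seconds at 10 km) is an assumed name standing in for the real dataset:

```r
library(dplyr)

goal_10k <- 10 * (3 * 3600 / 42.195)  # goal 10 km split: ~2560 s (~00:42:40)
tol <- 2 * 10                         # 2 s/km over 10 km = 20 s

# Placeholder splits at 10 km, in seconds
runners <- tibble(runner_id = 1:3, split_10k_s = c(2500, 2555, 2620))

runners <- runners |>
  mutate(pace_group = case_when(
    split_10k_s < goal_10k - tol ~ "Ahead",
    split_10k_s > goal_10k + tol ~ "Behind",
    TRUE                         ~ "Goal Pace"
  ))
```

The same `case_when()` works at any waypoint by scaling the goal split and tolerance to the distance covered.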

Knowing this, we can look at the outcome. Of the runners going for 3 h, what was the best strategy?

We can see that most people do not negative split a sub-3 marathon. The majority of people making the goal run the first 10 km (and indeed most of the race) ahead of goal pace.

There’s a risk here though: going out faster than goal pace means that you might fail. The yellow traces really show how, at 30-35 km, the race gets very tough and people can slow down significantly. Anyone who has run a marathon will tell you that “the race only starts at the 30 km mark”. It’s where people start to hit the wall and this plot really shows that. These folks could have misjudged their theoretical best pace or just struggled on this occasion.

I find the strategy for success interesting. A lot of advice out there is to start out a marathon at an easier pace and speed up if you can. While it’s true you shouldn’t go too fast and blow up, the advice should be to train to run at more than 2 s ahead of goal pace and try to maintain that.

Tell me the odds

With all the caveats in place, let’s try and get some individual-level probabilities from our population-level data.

We looked at the 10 km point, applying a ± 2 s/km threshold for goal pace, and the behind/ahead classifications. We can do this for every waypoint that we have data for. Now, for a given waypoint, we can ask: of the runners that were, say, ahead of pace, how many finished sub-3 (succeeded) and how many were over 3 (failed)? This gives us a probability of success for that strategy at that waypoint. We can then plot these probabilities out.
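The tallying step can be sketched like this. The long-format `splits` tibble and its column names are placeholders for the real per-waypoint data:

```r
library(dplyr)

# Placeholder long-format data: one row per runner per waypoint, with the
# pacing class at that point and the final race outcome
splits <- tibble(
  waypoint   = rep(c(5, 10), each = 4),
  pace_group = rep(c("Ahead", "Ahead", "Behind", "Goal Pace"), 2),
  outcome    = c("Sub 3", "Over 3", "Over 3", "Sub 3",
                 "Sub 3", "Sub 3", "Over 3", "Over 3")
)

# Share of sub-3 finishers for each waypoint x pacing-class combination
probs <- splits |>
  group_by(waypoint, pace_group) |>
  summarise(n = n(), p_sub3 = mean(outcome == "Sub 3"), .groups = "drop")
```

Because `outcome == "Sub 3"` is logical, its mean is exactly the proportion of successes in each group, which is the probability plotted below for each strategy.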

If your strategy was to go ahead of pace and you were ahead of pace at 5 km, you have a 65% chance of going sub-3. If you are ahead of pace at 30 km, it climbs to a 72% chance. Obviously it keeps climbing to certain success the further the race progresses.

Running at goal pace gives a 50/50 chance of making it if you’re on pace at 5 km. But if you are only on-pace at the halfway point, your chance of success drops to 37%.

If you are behind pace at 10 km, you have a 19% chance of success and this probability drops as the race continues. Eventually, we hit the point where it is not possible to make up the time that’s lost and it is 100% likely that you will fail.

The best strategy?

The best strategy is to go out faster than goal pace and this is what you should train for.

Negative splitting is rare. Slowing down after 30 km is highly likely. Failing to account for this means potentially missing out on your goal.

This message is not too different from the previous post, but we now have some probabilities to back up advice on how the race should be run.

Caveats

Most people running a marathon are first-timers who will run this one race and their goal is to simply finish. Let’s face it, most non-runners have no idea whether your finish time was good/bad/whatever. They will just be impressed that you finished! This post is intended for repeat offenders who strive to improve their time. Maybe the best advice is to just go out there, enjoy running your marathon and not worry about pacing. It’s the best feeling in the world to have achieved it whether it’s your first or fifth.

This analysis is obviously limited to one dataset, the 2025 New York City Marathon. It has a flat profile, so any of the probabilities will likely only apply over a similarly flat course in similar conditions. I also mentioned that we assume a 3 h goal for the runners in the window and we saw how this is not perfect, but it is the best we can do. Obviously, the pacing for other goal times may be different, but we saw in the previous analysis that positive splitting is the most likely scenario regardless of pace.

The code

library(ggplot2)
library(ggtext)
library(dplyr)
library(hms)

## plot styling ----

# Custom qBrand plot styling is used below via theme_q(); if you don't have it,
# substitute e.g. theme_minimal() and the code should run fine

my_colours <- c("Sub 3 - Behind" = "#003d5c",
                "Sub 3 - Goal Pace" = "#954e9b",
                "Sub 3 - Ahead" = "#ff6b59",
                "Over 3 - Behind" = "#464c89",
                "Over 3 - Goal Pace" = "#dd4d88",
                "Over 3 - Ahead" = "#ffa600")

my_levels <- c("Sub 3 - Behind",
               "Sub 3 - Goal Pace",
               "Sub 3 - Ahead",
               "Over 3 - Behind",
               "Over 3 - Goal Pace",
               "Over 3 - Ahead")

## data wrangling ----

# load csv file from url
# url <- paste0("https://huggingface.co/datasets/donaldye8812/",
#               "nyc-2025-marathon-splits/resolve/main/",
#               "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
# df <- read.csv(url)

# save locally
# write.csv(df, "Output/Data/nyc_marathon_2025_splits.csv", row.names = FALSE)          

## main script ----

df <- read.csv("Output/Data/nyc_marathon_2025_splits.csv")

times_df <- df %>%
  select(RunnerID, splitCode, time)
runners_df <- df %>%
  select(RunnerName, RunnerID, OverallTime, OverallPlace, Gender,
         Age, City, Country, Bib) %>% 
  unique()
runners_df$OverallTime <- as_hms(runners_df$OverallTime)

# unique pairs of splitCode and distance -- and add distance in km
split_distances <- df %>%
  select(splitCode, distance) %>%
  unique()
split_distances$distance <- c(4.83,5.00,6.44,8.05,9.66,10.00,11.27,12.87,14.48,
                              15.00,16.09,17.70,19.31,20.00,20.92,21.08,22.53,
                              24.14,25.00,25.75,27.36,28.97,30.00,30.58,32.19,
                              33.80,35.00,35.41,37.01,38.62,40.00,40.23,41.84,
                              42.20)

# merge split distances with times_df
times_df <- merge(times_df, split_distances, by = "splitCode", sort = FALSE)

# order the table by RunnerID and then by distance
times_df <- times_df[order(times_df$RunnerID, times_df$distance), ]
row.names(times_df) <- NULL
# time is character, change it
times_df$time <- as_hms(times_df$time)

# make a df of RunnerID, OverallTime, and a new column called Category which is
# "Sub 3" or "Over 3"
category_df <- runners_df %>%
  select(RunnerID, OverallTime) %>% 
  filter(OverallTime > as_hms("02:55:00") & OverallTime <= as_hms("03:05:00")) %>%
  mutate(Category = ifelse(OverallTime <= as_hms("03:00:00"), "Sub 3", "Over 3"))

# merge category_df with times_df to get the pace for each runner in each
# category and drop any rows with NA values
times_df <- merge(times_df, category_df,
                  by = "RunnerID", all.x = TRUE, sort = FALSE) %>%
  filter(!is.na(Category)) %>%
  mutate(on_par = time - (as_hms("03:00:00") /42.19 * distance))

ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = Category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = c("Sub 3" = "#ff6b59", "Over 3" = "#464c89")) +
  labs(title = "Difference from Goal Pace for Sub-3 and Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)",
       color = "Category") +
  theme_q() +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))
ggsave("Output/Plots/sub_over_3_comparison.png", width = 7, height = 4, dpi = 300)

ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>%
              filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>%
              filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#464c89", alpha = 0.2) +
  labs(title = "Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/over_3_comparison.png", width = 7, height = 4, dpi = 300)

ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>% filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>% filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#ff6b59", alpha = 0.2) +
  labs(title = "Sub-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/sub_3_comparison.png", width = 7, height = 4, dpi = 300)

# classify on_par into three categories: "Ahead of Par" for values less than
# -20, "On Par" for values between -20 and 20, and "Behind Par" for values
# greater than 20 at the 10K mark, i.e. 2 seconds per km * 10 km = 20 seconds
class_df <- times_df %>%
  mutate(par_category = case_when(
    distance == 10 & on_par < -20 ~ "Ahead",
    distance == 10 & on_par >= -20 & on_par <= 20 ~ "Goal Pace",
    distance == 10 & on_par > 20 ~ "Behind",
    TRUE ~ NA_character_
  )) %>% 
  filter(!is.na(par_category))
# paste Category and par_category together to make a new column called final_category
class_df <- class_df %>%
  mutate(final_category = paste(Category, par_category, sep = " - ")) %>% 
  select(RunnerID, final_category)
# merge class_df with times_df to get the final_category for each runner in each category and drop any rows with NA values
times_df <- merge(times_df, class_df, by = "RunnerID", all.x = TRUE) %>%
  filter(!is.na(final_category))
# use my_levels to get facets in the order of my_level
times_df$final_category <- factor(times_df$final_category, levels = my_levels)
# ggplot of on_par by distance colored by final_category
ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = final_category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = my_colours) +
  labs(title = "Pacing at 10 km and Overall Outcome",
       x = "Distance (km)",
       y = "Difference from Par Time (seconds)",
       color = "Category") +
  facet_wrap(~ final_category) +
  theme_q() +
  # set legend.position after theme_q(), which would otherwise reset it
  theme(legend.position = "none")
ggsave("Output/Plots/pacing_by_final_category.png", width = 10, height = 6, dpi = 300)


# calculate the probability of success

# list of unique distances in numerical order
distance_list <- sort(unique(times_df$distance))

all_p_df <- tibble()
for (i in seq_along(distance_list)) {
  dist <- distance_list[i]
  par <- as_hms("00:00:02") * dist
  class_df <- times_df %>%
    mutate(par_category = case_when(
      distance == dist & on_par < -par ~ "Ahead",
      distance == dist & on_par >= -par & on_par <= par ~ "Goal Pace",
      distance == dist & on_par > par ~ "Behind",
      TRUE ~ NA_character_
    )) %>% 
    filter(!is.na(par_category)) %>%
    select(RunnerID, Category, par_category) %>%
    group_by(Category, par_category) %>%
    summarise(count = n()) %>%
    group_by(par_category) %>%
    mutate(percentage = count / sum(count) * 100) %>%
    ungroup() %>% 
    mutate(distance = dist) %>% 
    select(distance, Category, par_category, percentage)
  all_p_df <- rbind(all_p_df, class_df)
}

all_p_df$final_category <- paste(all_p_df$Category, all_p_df$par_category, sep = " - ")
all_p_df$final_category <- factor(all_p_df$final_category, levels = my_levels)
all_p_df %>% 
  filter(grepl("^Sub 3", final_category)) %>% 
  ggplot(aes(x = distance, y = percentage, colour = final_category)) +
  geom_line() +
  scale_color_manual(values = my_colours) +
  labs(title = "Probability of Success for Pacing Strategies by Distance",
       x = "Distance (km)",
       y = "Probability of Sub-3 (%)",
       color = "Category") +
  theme_q()
ggsave("Output/Plots/probability_of_success_sub_3.png", width = 7, height = 4, dpi = 300)

The post title comes from “Marathon Man” by Ian Brown from his “My Way” album.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Marathon Man II: how to pace a marathon]]>
Do AI coding agents save scientists time? https://www.r-bloggers.com/2026/04/do-ai-coding-agents-save-scientists-time/ Mon, 13 Apr 2026 14:00:00 +0000 https://www.seascapemodels.org/posts/2026-04-14-does-agentic-AI-save-time/

I’m often asked if using AI coding agents saves time. Yes, they write code very quickly and can complete entire ecological data analyses.

Do agents really help when the deadlines are approaching?

Continue reading: Do AI coding agents save scientists time?]]>
[This article was first published on Seascapemodels, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m often asked if using AI coding agents saves time. Yes, they write code very quickly and can complete entire ecological data analyses.

Do agents really help when the deadlines are approaching?

But the code also requires careful checking for logical errors. Our recent analysis shows this. The best LLMs can complete entire analyses and all the code works well. But there was a decent chance of subtle logical errors. These logical errors would require pretty deep human understanding of the code to correct.

There’s another issue and that is using code you don’t understand. I often find the agents produce so much code, but I’m not comfortable using it until I understand the logic line-by-line.

In those cases I find it’s faster to use an autocomplete AI assistant so I’m going line-by-line, rather than an agentic loop that completes the entire piece of work.

I think the jury is still out on whether there is a net time benefit to using agents. The only way to really answer is a randomised controlled trial where you time how long it takes scientists to fully complete a task.

The only study I’m aware of is quite limited and looked at software developers. They found the developers often projected they would be faster with the AI tools, but were actually slower at tasks by the end of the project.

It’s true that using AI is fun because it makes so much progress, but that fun feeling might be a trap for us.

It’s likely that the answer is context-dependent.

I suspect for scientists most of the coding we do (like writing models that represent ecosystems) actually requires the human to understand what it does. In these cases agents don’t make sense, because you need to go back and review the code carefully to understand it anyway.

On the other hand, if you are making software tools that are easy to verify then agents are great. For instance, I often use them to write code for non-standard figures. I don’t need to know the code in that case because I can check the output is correct visually.

Likewise, interactive shiny apps are another example of time saving. The agent can take some (good) code you already have and turn it into an app. It’s easy to test and check because you just use the app.

People often point to advances in LLMs and say that soon they will be good enough to do all the coding for us. I’m not so sure that applies to science. In fact, we found the later version of Claude Sonnet performed about the same as an earlier version on scientific logic, it just made different types of errors.

I think the advances need to come in the ways we interact and use the LLMs.

The ultimate goal should be efficient but also high-quality work. That’s something I want to look at in my next agentic AI study.

To leave a comment for the author, please follow the link and comment on their blog: Seascapemodels.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Do AI coding agents save scientists time?]]>
reviser: Analyzing Real-Time Data Revisions in R https://www.r-bloggers.com/2026/04/reviser-analyzing-real-time-data-revisions-in-r/ Mon, 13 Apr 2026 00:00:00 +0000 https://ropensci.org/blog/2026/04/13/reviser/ Economic data are rarely static.
Gross domestic product (GDP), inflation, employment, and other official statistics arrive as early estimates, then get revised as new source data arrive, seasonal adjustment is updated, or benchmarking changes are appl...

Continue reading: reviser: Analyzing Real-Time Data Revisions in R]]>
[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Economic data are rarely static. Gross domestic product (GDP), inflation, employment, and other official statistics arrive as early estimates, then get revised as new source data arrive, seasonal adjustment is updated, or benchmarking changes are applied. Those revisions matter because they can change the narrative around turning points, policy mistakes, and forecast performance.

reviser is an R package by Marc Burri and Philipp Wegmüller for working with these vintage datasets directly. A vintage dataset records multiple published versions of the same time series, so you can compare what was known at each release date with what was reported later. reviser gives you a consistent workflow to:

  • reshape release vintages between wide and tidy formats;
  • extract revisions relative to earlier or final releases;
  • summarize bias, dispersion, and serial dependence in revisions;
  • identify the first release that is statistically close to the eventual benchmark;
  • nowcast future revisions with state-space models.

The package is aimed at users who already work with real-time macroeconomic data and want tools that go beyond plotting revision triangles by hand. One design goal is to keep that workflow in pure R.

reviser was reviewed through rOpenSci statistical software peer review. Many thanks to reviewers Alex Gibberd and Tanguy Barthelemy, and to editor Rebecca Killick, for feedback that improved the package.

Why revision analysis deserves its own workflow

Revisions are not just measurement noise. They encode how information enters the data-production process.

  • Some revisions reflect genuinely new information.
  • Others reflect noise that could, in principle, have been reduced earlier.
  • Still others come from methodological changes or benchmark updates.

These distinctions matter if you are evaluating early data releases, building nowcasts, or asking whether first releases are already efficient summaries of the available information.

The reviser vignettes organize this workflow into three layers:

  1. Structure vintages consistently.
  2. Measure and test revision properties.
  3. Model the revision process when you want to predict future changes.

A compact example with GDP vintages

The first step is to reshape the data into a tidy vintage format, where each row corresponds to an observed value, the date it refers to, and the publication date of that estimate.

The package ships with a GDP example dataset in long vintage format. Suppose we want to focus on U.S. GDP growth, visualize how estimates moved during the 2008-09 global financial crisis, and then ask whether early releases were systematically biased relative to a later benchmark.

library(reviser)
library(dplyr)
library(tsbox)

gdp_us <- gdp |>
  filter(id == "US") |>
  tsbox::ts_pc() |>
  tsbox::ts_span(start = "1980-01-01")

gdp_wide <- vintages_wide(gdp_us)
gdp_long <- vintages_long(gdp_wide, keep_na = FALSE)

With the vintages in tidy form, we can plot how the published path changed over time. The y-axis in the figure reports quarter-on-quarter GDP growth rates.

plot_vintages(
  gdp_long |>
    filter(
      pub_date >= as.Date("2009-01-01"),
      pub_date < as.Date("2010-01-01"),
      time > as.Date("2008-01-01"),
      time < as.Date("2010-01-01")
    ),
  type = "line",
  title = "Revisions of GDP during the 2008-09 global financial crisis",
  ylab = "Quarter-on-quarter GDP growth rate"
)
Multiple vintage paths for U.S. GDP growth, highlighting how estimates published in 2009 changed over time.

GDP growth vintages for the United States during the 2008-09 global financial crisis.

During volatile periods, the vintage paths can diverge enough that the story told by the first release is noticeably different from the story told a year later.

Once the data are in tidy vintage form, you can compare a set of early releases to a later benchmark release.

final_release <- get_nth_release(gdp_long, n = 10)
early_releases <- get_nth_release(gdp_long, n = 0:6)

summary_tbl <- get_revision_analysis(early_releases, final_release)

Warning: Both 'release' and 'pub_date' columns are present in 'df.
The 'release' column will be used for grouping.

summary_tbl |>
  select(id, release, `Bias (mean)`, `Bias (robust p-value)`, `Noise/Signal`)


=== Revision Analysis Summary ===
# A tibble: 7 × 5
  id    release   `Bias (mean)` `Bias (robust p-value)` `Noise/Signal`
  <chr> <chr>             <dbl>                   <dbl>          <dbl>
1 US    release_0        -0.014                   0.52           0.22
2 US    release_1        -0.015                   0.425          0.202
3 US    release_2        -0.013                   0.507          0.205
4 US    release_3        -0.003                   0.851          0.194
5 US    release_4        -0.014                   0.326          0.157
6 US    release_5        -0.021                   0.181          0.152
7 US    release_6        -0.018                   0.202          0.13
=== Interpretation ===
id=US, release=release_0:
• No significant bias detected (p = 0.52 )
• Moderate revision volatility (Noise/Signal = 0.22 )
id=US, release=release_1:
• No significant bias detected (p = 0.425 )
• Moderate revision volatility (Noise/Signal = 0.202 )
id=US, release=release_2:
• No significant bias detected (p = 0.507 )
• Moderate revision volatility (Noise/Signal = 0.205 )
id=US, release=release_3:
• No significant bias detected (p = 0.851 )
• Moderate revision volatility (Noise/Signal = 0.194 )
id=US, release=release_4:
• No significant bias detected (p = 0.326 )
• Moderate revision volatility (Noise/Signal = 0.157 )
id=US, release=release_5:
• No significant bias detected (p = 0.181 )
• Moderate revision volatility (Noise/Signal = 0.152 )
id=US, release=release_6:
• No significant bias detected (p = 0.202 )
• Moderate revision volatility (Noise/Signal = 0.13 )

This is where reviser moves beyond a reshape-and-plot package. The revision summary reports quantities that applied work often needs but usually rebuilds ad hoc: mean bias, quantiles, volatility, noise-to-signal ratios, and hypothesis tests for bias, serial correlation, and news-versus-noise interpretations.
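For intuition, the headline quantities can be reproduced by hand. Below is a minimal sketch (not reviser's implementation), assuming `early` and `final` are numeric vectors holding first-release and benchmark values for the same quarters:

```r
# Toy first-release and benchmark series for the same quarters
early <- c(0.5, 0.3, -0.2, 0.8, 0.4)
final <- c(0.6, 0.2, -0.1, 0.9, 0.4)

revision <- final - early             # revision from first release to benchmark
bias     <- mean(revision)            # mean bias of the first release
ns_ratio <- sd(revision) / sd(final)  # noise-to-signal ratio
t.test(revision)                      # naive test that mean revision is zero
```

reviser's versions of these statistics use robust standard errors and add serial-correlation and news-versus-noise tests on top.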

In the bundled example, the early U.S. GDP releases over this sample show little evidence of systematic bias relative to the later benchmark. The package also supports efficient-release diagnostics, where the question is not only whether revisions exist, but when additional revisions stop adding meaningful information.

efficient_release <- get_first_efficient_release(early_releases, final_release)
summary(efficient_release)

Efficient release: 0
Model summary:
Call:
stats::lm(formula = formula, data = df_wide)
Residuals:
Min 1Q Median 3Q Max
-0.89186 -0.12669 0.02046 0.11475 0.97986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00299 0.02223 0.134 0.893
release_0 0.97412 0.01692 57.567 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2518 on 166 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.9523, Adjusted R-squared: 0.952
F-statistic: 3314 on 1 and 166 DF, p-value: < 2.2e-16
Test summary:
Linear hypothesis test:
(Intercept) = 0
release_0 = 1
Model 1: restricted model
Model 2: final ~ release_0
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 168
2 166 2 1.9283 0.1486

That is exactly the kind of result that is hard to see from a revision triangle alone but straightforward to formalize once the workflow is standardized. In this sample, the result points to the first release as already being statistically close to the later benchmark, which suggests subsequent revisions add relatively little systematic information.
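The test reported above is a Mincer–Zarnowitz-style regression. A hand-rolled sketch of the same idea (again, not reviser's internal code), using simulated data where the benchmark equals the first release plus uncorrelated "news":

```r
library(car)  # for linearHypothesis()

set.seed(1)
first <- rnorm(100)                    # first-release values
final <- first + rnorm(100, sd = 0.2)  # benchmark = first release + "news"

# The first release is "efficient" if we cannot reject intercept = 0, slope = 1
fit <- lm(final ~ first)
car::linearHypothesis(fit, c("(Intercept) = 0", "first = 1"))
```

When the joint hypothesis is not rejected, as in the reviser output above, later revisions behave like unforecastable news rather than correctable noise.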

From descriptive analysis to revision nowcasting

For many users, revision summaries will be the main use case. But reviser also includes model-based tools for users who want to treat revisions as an explicit latent-data problem. That matters if you need to make decisions on preliminary data but also want a structured way to estimate how those figures are likely to change later.

Two vignettes walk through nowcasting revisions with:

  • the generalized Kishor-Koenig family via kk_nowcast();
  • the Jacobs-Van Norden model via jvn_nowcast().

Both approaches cast the revision problem into state-space form, which makes it possible to estimate the dynamics of news and noise in successive data releases. For technical users, this is the part of the package that turns revision analysis from retrospective diagnosis into a forecasting problem.

Here is a compact kk_nowcast() example following the Kishor-Koenig workflow from the vignette for the Euro Area (EA).
The key idea is to first identify an efficient release e, then estimate the revision system on the corresponding panel of releases. In this euro area example, the efficient-release step selects e = 2, so the model treats the third published release as the earliest one that is already close to the later benchmark. That is a useful substantive result on its own: it suggests that most of the economically relevant signal arrives within the first few releases, while later revisions are smaller adjustments around that path.

gdp_ea <- reviser::gdp |>
  tsbox::ts_pc() |>
  dplyr::filter(
    id == "EA",
    time >= min(pub_date),
    time <= as.Date("2020-01-01")
  )

releases <- get_nth_release(gdp_ea, n = 0:14)
final_release <- get_nth_release(gdp_ea, n = 15)

efficient_release <- get_first_efficient_release(releases, final_release)

fit_kk <- kk_nowcast(
  df = efficient_release$data,
  e = efficient_release$e,
  model = "KK",
  method = "MLE"
)

summary(fit_kk)


=== Kishor-Koenig Model ===
Convergence: Success
Log-likelihood: 125.7
AIC: -231.41
BIC: -198.23
Parameter Estimates:
Parameter Estimate Std.Error
F0 0.633 0.131
G0_0 0.950 0.031
G0_1 -0.037 0.152
G0_2 -0.181 0.220
G1_0 -0.009 0.011
G1_1 0.594 0.061
G1_2 0.194 0.092
v0 0.380 0.068
eps0 0.008 0.001
eps1 0.001 0.000

plot(fit_kk)
Diagnostic plot from a Kishor-Koenig nowcast model for euro area GDP revisions, summarizing the fitted revision dynamics.

Diagnostic plot from the Kishor-Koenig nowcast example.

The fitted object contains estimated parameters, filtered and smoothed latent states, and plotting methods for the implied efficient-release path. That gives you a direct route from descriptive revision analysis to a state-space nowcast of future revisions. For a broader audience, the main takeaway is not the individual coefficients. It is that the model converges cleanly on this sample, summarizes the revision process in a compact latent-state form, and provides a practical way to judge whether a new release is likely to be revised materially later on. Substantively, the model separates persistent signal from transitory revision noise, so the output is useful when you want to judge whether new releases are likely to be revised materially.

What reviser adds

What stands out in reviser is not a single function, but the coherence of the workflow:

  • the package has explicit conventions for vintage data;
  • descriptive revision analysis and formal testing sit in the same API;
  • efficient-release analysis connects directly to applied questions about which release to trust;
  • nowcasting tools extend the same workflow rather than forcing a separate modeling stack.

If you work with real-time macroeconomic data, that combination is useful because revision analysis is usually fragmented across custom scripts, one-off spreadsheets, and package combinations that do not share a common data structure.

Try it and push it further

You can install the package from the rOpenSci R-universe:

install.packages(
  "reviser",
  repos = c("https://ropensci.r-universe.dev", "https://cloud.r-project.org")
)

Then start with the package site and vignettes:

We would be happy to hear feedback from those of you trying out the package with different datasets. If you have a real-time dataset with a different release structure, that would be a good stress test for the package. If you find gaps in the workflow or have a use case to share, open an issue or contribute an example. Revision analysis gets more useful as it becomes easier to compare workflows across datasets rather than rebuilding them from scratch each time.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: reviser: Analyzing Real-Time Data Revisions in R]]>
`mlS3` — A Unified S3 Machine Learning Interface in R https://www.r-bloggers.com/2026/04/mls3-a-unified-s3-machine-learning-interface-in-r/ Sun, 12 Apr 2026 00:00:00 +0000 https://thierrymoudiki.github.io//blog/2026/04/12/r/intro-mlS3 Introduction to `mlS3` — A Unified S3 Machine Learning Interface in R
Continue reading: `mlS3` — A Unified S3 Machine Learning Interface in R]]>
[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Overview

Following the R6 object-based package unifiedml introduced last week, this blog post introduces the mlS3 R package, which strives to provide a unified, consistent S3 interface for training and predicting from a variety of popular machine learning models. Rather than learning the idiosyncratic API of each package (you’d still need to read their docs for parameter specifications, though), mlS3 wraps them under a common wrap_* / predict() pattern.

What You’ll Learn

  • How to install and load mlS3 (for now, from GitHub)
  • How to apply a consistent API across multiple ML algorithms for both classification and regression tasks

Models Covered

Wrapper           Underlying package                    Task(s)
wrap_glmnet()     glmnet (generalized linear models)    Classification, Regression
wrap_lightgbm()   lightgbm (gradient boosting)          Classification, Regression
wrap_ranger()     ranger (random forest)                Classification, Regression
wrap_svm()        e1071 (support vector machines)       Classification, Regression
wrap_caret()      caret (200+ models)                   Classification, Regression

Datasets Used

  • iris — binary and multiclass classification (setosa/versicolor, all three species)
  • mtcars — regression to predict miles per gallon (mpg)

Key Design Principle

All models follow the same two-step workflow:

mod  <- wrap_*(X_train, y_train, ...)       # Train
pred <- predict(mod, newx = X_test, ...)    # Predict

This makes it easy to swap algorithms and compare results without rewriting your pipeline.
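Because every wrapper shares the same signature, model comparison can be a simple loop. A sketch, assuming mlS3 and its backends are installed and that the wrappers' default arguments suit a regression task (as in the mtcars examples below):

```r
library(mlS3)

# Train/test split on mtcars, predicting mpg
set.seed(123)
idx     <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
X_train <- mtcars[idx, -1];  y_train <- mtcars$mpg[idx]
X_test  <- mtcars[-idx, -1]; y_test  <- mtcars$mpg[-idx]

wrappers <- list(glmnet = wrap_glmnet,
                 ranger = wrap_ranger,
                 svm    = wrap_svm)

rmse <- sapply(names(wrappers), function(nm) {
  mod  <- wrappers[[nm]](X_train, y_train)  # same call for every model
  pred <- predict(mod, newx = X_test)       # same predict() interface
  sqrt(mean((pred - y_test)^2))
})

sort(rmse)  # lowest error first
```

Swapping in a different algorithm means adding one entry to the list, not rewriting the pipeline.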

Prerequisites

  • R with the following packages: remotes, caret, randomForest, ggplot2
  • mlS3 installed from GitHub (for now) via remotes::install_github("Techtonique/mlS3")

Code

Install packages

install.packages(c("remotes", "caret", "randomForest"))

remotes::install_github("Techtonique/mlS3")

Predefined wrappers

# Classification

library(mlS3)

# =============================================================================
# Classification examples (no leakage)
# =============================================================================
set.seed(123)

# --- Binary classification: iris setosa vs versicolor ---
iris_bin <- iris[iris$Species != "virginica", ]
X_bin <- iris_bin[, 1:4]
y_bin <- droplevels(iris_bin$Species)

# Split into train/test
idx_bin <- sample(nrow(X_bin), 0.7 * nrow(X_bin))
X_bin_train <- X_bin[idx_bin, ]
y_bin_train <- y_bin[idx_bin]
X_bin_test  <- X_bin[-idx_bin, ]
y_bin_test  <- y_bin[-idx_bin]

# glmnet
mod <- wrap_glmnet(X_bin_train, y_bin_train, family = "binomial")
pred_bin_glmnet <- predict(mod, newx = X_bin_test, type = "class")
acc_glmnet <- mean(pred_bin_glmnet == y_bin_test)

cat("Accuracy (glmnet): ", acc_glmnet, "\n")


# --- Multiclass classification: iris all species ---
X_multi <- iris[, 1:4]
y_multi <- iris$Species

# Split into train/test
idx_multi <- sample(nrow(X_multi), 0.7 * nrow(X_multi))
X_multi_train <- X_multi[idx_multi, ]
y_multi_train <- y_multi[idx_multi]
X_multi_test  <- X_multi[-idx_multi, ]
y_multi_test  <- y_multi[-idx_multi]

# lightgbm
mod <- wrap_lightgbm(X_multi_train, y_multi_train,
                     params = list(objective = "multiclass",
                                   num_class = 3, verbose = -1),
                     nrounds = 150)
pred_multi_lightgbm <- predict(mod, newx = X_multi_test, type = "class")
acc_lightgbm <- mean(pred_multi_lightgbm == y_multi_test)

# ranger
mod <- wrap_ranger(X_multi_train, y_multi_train, num.trees = 100L)
pred_multi_ranger <- predict(mod, newx = X_multi_test, type = "class")
acc_ranger <- mean(pred_multi_ranger == y_multi_test)

# svm
mod <- wrap_svm(X_multi_train, y_multi_train, kernel = "radial")
pred_multi_svm <- predict(mod, newx = X_multi_test, type = "class")
acc_svm <- mean(pred_multi_svm == y_multi_test)

cat("Accuracy (lightgbm): ", acc_lightgbm, "\n")
cat("Accuracy (ranger): ", acc_ranger, "\n")
cat("Accuracy (svm): ", acc_svm, "\n")


# Regression


# =============================================================================
# Regression examples (mtcars)
# =============================================================================
X_reg <- mtcars[, -1]
y_reg <- mtcars$mpg

# Split into train/test
set.seed(123)
idx_reg <- sample(nrow(X_reg), 0.7 * nrow(X_reg))
X_reg_train <- X_reg[idx_reg, ];  y_reg_train <- y_reg[idx_reg]
X_reg_test  <- X_reg[-idx_reg, ]; y_reg_test  <- y_reg[-idx_reg]

# lightgbm
mod <- wrap_lightgbm(X_reg_train, y_reg_train,
                     params = list(objective = "regression", verbose = -1),
                     nrounds = 50)
pred_reg_lightgbm <- predict(mod, newx = X_reg_test)
rmse_lightgbm <- sqrt(mean((pred_reg_lightgbm - y_reg_test)^2))

# glmnet
mod <- wrap_glmnet(X_reg_train, y_reg_train, alpha = 0)
pred_reg_glmnet <- predict(mod, newx = X_reg_test)
rmse_glmnet <- sqrt(mean((pred_reg_glmnet - y_reg_test)^2))

# svm
mod <- wrap_svm(X_reg_train, y_reg_train)
pred_reg_svm <- predict(mod, newx = X_reg_test)
rmse_svm <- sqrt(mean((pred_reg_svm - y_reg_test)^2))

# ranger
mod <- wrap_ranger(X_reg_train, y_reg_train, num.trees = 100L)
pred_reg_ranger <- predict(mod, newx = X_reg_test)
rmse_ranger <- sqrt(mean((pred_reg_ranger - y_reg_test)^2))

cat("RMSE (lightgbm): ", rmse_lightgbm, "\n")
cat("RMSE (glmnet): ", rmse_glmnet, "\n")
cat("RMSE (svm): ", rmse_svm, "\n")
cat("RMSE (ranger): ", rmse_ranger, "\n")



Accuracy (glmnet):  1 
Accuracy (lightgbm):  0.2444444  # I'm probably doing something wrong here
Accuracy (ranger):  0.9333333 
Accuracy (svm):  0.9333333 
RMSE (lightgbm):  4.713678 
RMSE (glmnet):  2.972557 
RMSE (svm):  2.275837 
RMSE (ranger):  2.067692 
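The suspiciously low lightgbm multiclass accuracy above is worth diagnosing before blaming the model. A quick, wrapper-independent check in base R is a confusion matrix: if most of the mass sits off the diagonal in a systematic pattern, the class labels have likely been re-coded or shuffled somewhere. A minimal sketch (the simulated vectors below stand in for pred_multi_lightgbm and y_multi_test):

```r
# Cross-tabulate predictions vs. truth and recover accuracy from the diagonal
confusion_report <- function(pred, truth) {
  cm <- table(Predicted = pred, Actual = truth)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Simulated stand-ins for pred_multi_lightgbm / y_multi_test
set.seed(1)
species <- c("setosa", "versicolor", "virginica")
truth <- factor(sample(species, 45, replace = TRUE), levels = species)
pred  <- truth
pred[1:2] <- "setosa"  # introduce a couple of errors

out <- confusion_report(pred, truth)
out$confusion  # off-diagonal counts show where predictions go wrong
out$accuracy
```

Applied to the real lightgbm predictions, a pattern such as every true class being mapped to a single predicted label would suggest the wrapper and lightgbm disagree about label encoding (lightgbm expects 0-based integer class labels for multiclass objectives).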

caret wrapper

For this part, you need to install the caret and randomForest packages. The model parameters available through caret are listed at: https://topepo.github.io/caret/available-models.html

library(mlS3)
library(caret)

# ============================================================================
# Regression with mtcars dataset
# ============================================================================
data(mtcars)

# Prepare data
X_reg <- mtcars[, -1]  # All except mpg
y_reg <- mtcars$mpg     # Target variable

# Split into train/test
set.seed(123)
idx_reg <- sample(nrow(X_reg), 0.7 * nrow(X_reg))
X_reg_train <- X_reg[idx_reg, ]
y_reg_train <- y_reg[idx_reg]
X_reg_test <- X_reg[-idx_reg, ]
y_reg_test <- y_reg[-idx_reg]

# ----------------------------------------------------------------------------
# Example 1: Random Forest with specific parameters
# ----------------------------------------------------------------------------
cat("\n=== Example 1: Random Forest Regression ===\n")

mod_rf <- wrap_caret(X_reg_train, y_reg_train,
                     method = "rf",
                     mtry = 3)        # Number of variables sampled at each split

print(mod_rf)

# Predictions
pred_rf <- predict(mod_rf, newx = X_reg_test)
rmse_rf <- sqrt(mean((pred_rf - y_reg_test)^2))
r2_rf <- 1 - sum((y_reg_test - pred_rf)^2) / sum((y_reg_test - mean(y_reg_test))^2)

cat("RMSE:", round(rmse_rf, 3), "\n")
cat("R-squared:", round(r2_rf, 3), "\n")


=== Example 1: Random Forest Regression ===
$model
Random Forest 

22 samples
10 predictors

No pre-processing
Resampling: None 

$task
[1] "regression"

$method
[1] "rf"

$parameters
$parameters$mtry
[1] 3


attr(,"class")
[1] "wrap_caret"
RMSE: 2.007 
R-squared: 0.681 

library(ggplot2)

df <- data.frame(
  pred = pred_rf,
  actual = y_reg_test
)

ggplot(df, aes(x = pred, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_minimal() +
  labs(x = "Predicted", y = "Actual")

(Figure: predicted vs. actual mpg for the random forest model, with the red y = x line marking perfect prediction.)

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Test & Roll: Why Smaller A/B Tests Can Make More Money
https://www.r-bloggers.com/2026/04/test-roll-why-smaller-a-b-tests-can-make-more-money/ (Sun, 12 Apr 2026)
[This article was first published on Florian Teschner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Short practical advice on A/B testing:

  1. Stop sizing tests only for statistical significance – In finite campaigns, your goal is profit, not perfect inference.

  2. Treat testing as a trade-off – Every extra test exposure buys learning but also burns revenue if that exposure gets the weaker treatment.

  3. Use smaller tests when outcomes are noisy – This paper shows profit-maximizing test sizes rise much more slowly than classical power-based sizes.

  4. Scale test size with reachable audience – If your population is limited, test size should reflect that constraint directly.

  5. Allow unequal splits when priors differ – If one treatment is likely better a priori (e.g., treatment vs holdout), asymmetric test cells can be optimal.

Shiny App to test the implications:

Test and Roll Shiny App

Long Version

I just read Test & Roll: Profit-Maximizing A/B Tests by Elea McDonnell Feit and Ron Berman (2019), and it challenges one of the default habits in marketing experimentation: planning tests as if the main objective were statistical significance.

Their point is simple: in most real marketing experiments, you have a finite population (email list, campaign budget, limited traffic window). In that setting, the right objective is total expected profit across test + rollout, not p-values.

The core idea

A classic A/B setup has two stages:

  1. Test stage: expose n1 users to treatment A and n2 users to treatment B.
  2. Roll stage: deploy the winner to the remaining N - n1 - n2 users.

Bigger tests improve certainty, but they also create opportunity cost: more users in test means more users potentially seeing the weaker treatment before rollout.

The paper formalizes this as a decision problem and derives profit-maximizing sample sizes. Under Normal priors and Normal outcomes, they get closed-form solutions.
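A sketch of that closed form, as I read the paper's symmetric case: with reachable population N, outcome noise s (standard deviation of the response), and prior standard deviation sigma on each treatment mean, the profit-maximizing size of each equal test cell has a simple expression. Treat this as an approximation to be verified against the paper before real use:

```r
# Profit-maximizing per-cell test size under Normal priors and Normal
# outcomes (my reading of Feit & Berman 2019; verify before relying on it)
test_roll_n <- function(N, s, sigma) {
  r <- (s / sigma)^2
  sqrt(N / 4 * r + (3 / 4 * r)^2) - 3 / 4 * r
}

# Profit-maximizing size vs. a classical power calculation, for a noisy
# response (s = 4) and a prior expecting a small lift (sigma = 0.1)
n_profit  <- test_roll_n(N = 1e5, s = 4, sigma = 0.1)
n_classic <- power.t.test(delta = 0.1, sd = 4, power = 0.8)$n

round(c(profit_maximizing = n_profit, power_based = n_classic))
```

With these illustrative numbers the profit-maximizing cell is several times smaller than the power-based one, and quadrupling N roughly doubles n, consistent with the square-root scaling the paper reports.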

Why this matters in practice

If you use classical hypothesis-test sizing, recommended n can be huge, especially when effect sizes are small and response is noisy (which is exactly what we see in advertising).

Their framework produces much smaller test sizes because it optimizes business outcomes, not Type I/II error control.

Two important takeaways:

  1. Optimal test sizes grow sub-linearly with response noise, while classical sample size rules grow much faster.
  2. Optimal test sizes scale with the square root of population size N, which makes them workable for smaller markets and finite campaigns.

Comparison with bandits

The authors benchmark against Thompson sampling (multi-armed bandit). Bandits usually win on pure optimization, but the gap is often modest in their examples.

That is useful operationally: a two-stage “test then roll” process is far easier to implement, explain, and govern than a continuously-adapting bandit, especially in organizations with approval and reporting constraints.

The applications are the best part

They test the approach in three contexts:

  1. Website design experiments
  2. Display advertising decisions
  3. Catalog holdout tests

Across cases, profit-maximizing designs use substantially smaller test cells than classical power calculations and produce higher expected profit.

A particularly practical result: small holdout groups (common in catalog and CRM practice) can be fully rational when priors are asymmetric. In other words, “unequal splits” are not always bad design; they can be the optimal design.

What I changed in my own thinking

Before this, I treated “underpowered” mostly as a red flag. After this paper, I think a better question is:

Underpowered for what objective?

If the objective is publication-grade inference, classical power logic is right. If the objective is campaign profit in a finite horizon, a smaller test can be the better business decision.

Practical implementation checklist

If you run tactical tests (email, paid media, landing pages), this paper suggests a better workflow:

  1. Define total reachable population N for the decision horizon.
  2. Set priors for treatment means from past similar experiments.
  3. Estimate response variance from historical data.
  4. Compute profit-maximizing n1, n2.
  5. Pre-commit the rollout decision rule (posterior expected profit winner).
  6. Report expected regret alongside expected upside.

That last point is underrated: decision-makers usually understand “expected dollars at risk” better than p-values.
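Steps 2 and 5 of the checklist can be sketched with standard conjugate Normal updates (a toy illustration with made-up numbers; the function names and values are mine, not from the paper):

```r
# Toy Normal-Normal update: prior N(mu0, sigma0^2) on an arm's mean response,
# observed average ybar from n users with known outcome noise sd s
posterior_mean <- function(ybar, n, s, mu0, sigma0) {
  w <- (n / s^2) / (n / s^2 + 1 / sigma0^2)  # precision weight on the data
  w * ybar + (1 - w) * mu0
}

# Step 5: roll out whichever arm has the higher posterior expected response
pick_winner <- function(ybar_a, ybar_b, n_a, n_b, s, mu0, sigma0) {
  pa <- posterior_mean(ybar_a, n_a, s, mu0, sigma0)
  pb <- posterior_mean(ybar_b, n_b, s, mu0, sigma0)
  if (pa >= pb) "A" else "B"
}

# Made-up conversion-style numbers for illustration
pick_winner(ybar_a = 0.031, ybar_b = 0.027, n_a = 500, n_b = 500,
            s = 0.17, mu0 = 0.03, sigma0 = 0.005)
```

With equal cell sizes and a common prior this reduces to picking the higher observed mean, but with unequal cells or arm-specific priors the shrinkage can flip the decision, which is exactly why pre-committing the rule matters.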

Bottom line

For many real marketing tests, “smaller than textbook” is not bad science. It is better decision design.

If your experiment exists to drive a business action on a finite audience, Test & Roll gives a rigorous way to choose sample sizes that maximize profit instead of statistical purity.


Paper: Feit, E. M., & Berman, R. (2019). Test & Roll: Profit-Maximizing A/B Tests. SSRN: https://ssrn.com/abstract=3274875

To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.

Machine Learning Frameworks in R
https://www.r-bloggers.com/2026/04/machine-learning-frameworks-in-r/ (Sat, 11 Apr 2026)
[This article was first published on R'tichoke, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R’s ecosystem offers a rich selection of machine learning frameworks, each with distinct design philosophies and strengths. This post is a side-by-side comparison of five ML frameworks in R that provide unified interfaces over multiple algorithms, with runnable code examples on the same dataset so you can compare APIs directly. The focus is on packages that let you swap algorithms without rewriting your code.

Frameworks at a Glance

| Feature | tidymodels | caret | mlr3 | h2o | qeML |
|---|---|---|---|---|---|
| Built-in tuning | ✅ (tune) | ✅ | ✅ (mlr3tuning) | ✅ (AutoML) | ✅ (qeFT()) |
| Preprocessing pipeline | ✅ (recipes) | ✅ (preProcess) | ✅ (mlr3pipelines) | ✅ | ❌ |
| Model variety | 200+ engines | 230+ methods | 100+ learners | GBM, GLM, DL, DRF | 20+ wrappers |
| Relative speed | Moderate | Moderate | Moderate | Fast (distributed) | Depends on backend |
| Learning curve | Medium | Low | High | Low | Very low |
| Active development | ✅ | ⚠ Maintenance mode | ✅ | ✅ | ✅ |
| Best for | Production pipelines | Quick prototyping | Benchmarking | AutoML & scale | Teaching & exploration |

Setup and Data

All examples below use the iris classification task: predict Species from the four numeric measurements. A single train/test split is created up front so results are directly comparable.

library(dplyr)

# Reproducible train/test split (framework-agnostic)
set.seed(42)
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Store accuracy results for final comparison
results <- data.frame(
  Framework = character(),
  Model = character(),
  Accuracy = numeric(),
  stringsAsFactors = FALSE
)

cat("Training set:", nrow(train_data), "observations\n")
Training set: 105 observations
cat("Test set:    ", nrow(test_data), "observations\n")
Test set:     45 observations

1. tidymodels

The tidymodels ecosystem is the modern, tidyverse-native approach to modeling in R. It provides a consistent grammar for specifying models (parsnip), preprocessing (recipes), composing workflows (workflows), and tuning hyperparameters (tune).

library(tidymodels)

# Define a recipe (preprocessing)
rec <- recipe(Species ~ ., data = train_data)

# Define a model specification
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Combine into a workflow
rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)

# Fit the workflow
rf_fit <- rf_wf %>% fit(data = train_data)

# Predict on test set
preds_tidy <- predict(rf_fit, test_data) %>%
  bind_cols(test_data %>% select(Species))

# Evaluate
acc_tidy <- accuracy(preds_tidy, truth = Species, estimate = .pred_class)
acc_tidy
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.978
results <- rbind(results, data.frame(
  Framework = "tidymodels",
  Model = "Random Forest (ranger)",
  Accuracy = acc_tidy$.estimate
))
Tip: Key Strengths
  • Composable pipeline: recipe + model + workflow is easy to extend
  • Swap engines with one line (set_engine("xgboost"))
  • Seamless cross-validation and hyperparameter tuning via tune_grid() / tune_bayes()
  • Deep integration with the tidyverse

2. caret

The caret package (Classification And REgression Training) was the de facto standard for ML in R for over a decade. It wraps 230+ models behind a single train() interface. While it is now in maintenance mode (its creator, Max Kuhn, leads tidymodels), it remains widely used.

library(caret)

# Train a random forest with 5-fold CV
ctrl <- trainControl(method = "cv", number = 5)

rf_caret <- train(
  Species ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  tuneLength = 3  # Try 3 values of mtry
)

# Best tuning parameter
rf_caret$bestTune
  mtry
1    2
# Predict on test set
preds_caret <- predict(rf_caret, test_data)

# Evaluate
cm_caret <- confusionMatrix(preds_caret, test_data$Species)
cm_caret$overall["Accuracy"]
 Accuracy 
0.9555556 
results <- rbind(results, data.frame(
  Framework = "caret",
  Model = "Random Forest (rf)",
  Accuracy = as.numeric(cm_caret$overall["Accuracy"])
))
Tip: Key Strengths
  • Minimal API: a single train() call handles preprocessing, tuning, and fitting
  • 230+ model methods available out of the box
  • Built-in confusionMatrix() with extensive diagnostics
  • Massive community knowledge base and Stack Overflow coverage

3. mlr3

mlr3 is a modern, object-oriented ML framework built on R6 classes. It excels at systematic benchmarking, composable pipelines, and reproducible experiments. The learning curve is steeper, but the payoff is a powerful, extensible architecture.

library(mlr3)
library(mlr3learners)

# Define the task
task <- TaskClassif$new(
  id = "iris",
  backend = train_data,
  target = "Species"
)

# Define the learner
learner <- lrn("classif.ranger", num.trees = 500)

# Train
learner$train(task)

# Predict on test data — create a test task to avoid backend storage issues
test_task <- TaskClassif$new(
  id = "iris_test",
  backend = test_data,
  target = "Species"
)
pred_mlr3 <- learner$predict(test_task)

# Evaluate
acc_mlr3 <- pred_mlr3$score(msr("classif.acc"))
acc_mlr3
classif.acc 
  0.9555556 
results <- rbind(results, data.frame(
  Framework = "mlr3",
  Model = "Random Forest (ranger)",
  Accuracy = as.numeric(acc_mlr3)
))
Tip: Key Strengths
  • R6 object-oriented design — everything is an object with methods
  • First-class benchmarking: compare multiple learners on multiple tasks with benchmark()
  • Composable pipelines via mlr3pipelines (stacking, ensembling, feature engineering)
  • Built-in resampling strategies and performance measures

4. h2o (AutoML)

h2o is a distributed machine learning platform with a powerful R interface. Its standout feature is h2o.automl(): automatic model selection, hyperparameter tuning, and stacked-ensemble creation with a single function call. It runs on a local JVM, so Java must be installed.

Note

This section requires Java (JDK 8+) to be installed. h2o starts a local JVM-based server. If you don’t have Java, skip to the results comparison — the other four frameworks cover the same ground without this dependency.

library(h2o)

# Start a local h2o cluster (uses available cores)
h2o.init(nthreads = -1, max_mem_size = "2G")
H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\RIDDHI~1\AppData\Local\Temp\Rtmp8G4u9C\filec0caad7dcc/h2o_Riddhiman_Roy_started_from_r.out
    C:\Users\RIDDHI~1\AppData\Local\Temp\Rtmp8G4u9C\filec0c1660783f/h2o_Riddhiman_Roy_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 276 milliseconds 
    H2O cluster timezone:       Asia/Kolkata 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    2 years, 3 months and 23 days 
    H2O cluster name:           H2O_started_from_R_Riddhiman_Roy_axb153 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.98 GB 
    H2O cluster total cores:    24 
    H2O cluster allowed cores:  24 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.5.3 (2026-03-11 ucrt) 
h2o.no_progress()  # Suppress progress bars

# Convert data to h2o frames
train_h2o <- as.h2o(train_data)
test_h2o  <- as.h2o(test_data)

# Run AutoML — automatic model selection and stacking
aml <- h2o.automl(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",
  training_frame = train_h2o,
  max_models = 10,
  seed = 42
)
22:11:38.3: AutoML: XGBoost is not available; skipping it.
22:11:39.171: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 105.0.
# Leaderboard — best models ranked by cross-validated performance
h2o.get_leaderboard(aml) |> as.data.frame() |> head(5)
                                                 model_id mean_per_class_error
1    DeepLearning_grid_1_AutoML_1_20260412_221137_model_1           0.03988095
2                          GBM_2_AutoML_1_20260412_221137           0.05029762
3                          GLM_1_AutoML_1_20260412_221137           0.05029762
4    StackedEnsemble_AllModels_1_AutoML_1_20260412_221137           0.05982143
5 StackedEnsemble_BestOfFamily_1_AutoML_1_20260412_221137           0.05982143
     logloss      rmse        mse
1 0.10262590 0.1802887 0.03250400
2 0.13688347 0.1981121 0.03924839
3 0.09073184 0.1736624 0.03015862
4 0.12933104 0.2002635 0.04010548
5 0.11828660 0.1921656 0.03692762
# Predict with the best model
preds_h2o <- h2o.predict(aml@leader, test_h2o)
acc_h2o <- mean(as.vector(preds_h2o$predict) == as.vector(test_h2o$Species))
cat("Accuracy:", acc_h2o, "\n")
Accuracy: 0.9777778 
results <- rbind(results, data.frame(
  Framework = "h2o",
  Model = paste0("AutoML (", aml@leader@algorithm, ")"),
  Accuracy = acc_h2o
))

# Shutdown h2o
h2o.shutdown(prompt = FALSE)
Tip: Key Strengths
  • h2o.automl() — fully automatic model selection, tuning, and stacked ensembles
  • Trains GBM, XGBoost, GLM, DRF, and deep learning models in one call
  • Distributed computing — scales to datasets larger than memory
  • Built-in leaderboard for model comparison
  • Production deployment via MOJO/POJO model export

5. qeML

qeML (Quick and Easy Machine Learning) takes a different approach: minimizing boilerplate. Every algorithm (random forest, gradient boosting, SVM, KNN, LASSO, neural nets, and more) is wrapped behind a one-line qe*() function with a consistent (data, targetName) signature. No formula objects, no matrix conversions, no separate fitting ceremony: just results. It's ideal for teaching, exploration, and quick model comparisons.

library(qeML)

# qeML convention: pass full data + target name (string)
# It handles train/test splitting internally via holdout
# But to match our split, we'll train on train_data and predict on test_data
# predict() expects new data WITHOUT the target column
test_features <- test_data[, -which(names(test_data) == "Species")]

# Random Forest (wraps randomForest)
rf_qe <- qeRF(train_data, "Species")
preds_rf_qe <- predict(rf_qe, test_features)
acc_rf_qe <- mean(preds_rf_qe$predClasses == test_data$Species)
cat("Random Forest accuracy:", acc_rf_qe, "\n")
Random Forest accuracy: 0.9777778 
# Gradient Boosting (wraps gbm)
gb_qe <- qeGBoost(train_data, "Species")
preds_gb_qe <- predict(gb_qe, test_features)
acc_gb_qe <- mean(preds_gb_qe$predClasses == test_data$Species)
cat("Gradient Boosting accuracy:", acc_gb_qe, "\n")
Gradient Boosting accuracy: 0.9555556 
# SVM (wraps e1071)
svm_qe <- qeSVM(train_data, "Species")
preds_svm_qe <- predict(svm_qe, test_features)
acc_svm_qe <- mean(preds_svm_qe$predClasses == test_data$Species)
cat("SVM accuracy:", acc_svm_qe, "\n")
SVM accuracy: 0.9555556 
# Use the best-performing qeML model for the results table
best_acc_qe <- max(acc_rf_qe, acc_gb_qe, acc_svm_qe)
best_model_qe <- c("Random Forest", "Gradient Boosting", "SVM")[
  which.max(c(acc_rf_qe, acc_gb_qe, acc_svm_qe))
]

results <- rbind(results, data.frame(
  Framework = "qeML",
  Model = paste0(best_model_qe, " (qe wrapper)"),
  Accuracy = best_acc_qe
))
Tip: Key Strengths
  • One-line model fitting: qeRF(data, "target") — no formula, no matrix, no recipe
  • 20+ algorithms behind a uniform qe*() interface (RF, GBM, SVM, KNN, LASSO, neural nets, and more)
  • qeCompare() lets you benchmark multiple methods in a single call
  • Built-in holdout evaluation
  • Lowest learning curve of any framework showcased here

Results Comparison

All five frameworks were trained on the same 70/30 split of the iris dataset. Here’s how they stack up:

library(knitr)

results <- results %>% arrange(desc(Accuracy))
kable(results, digits = 4, caption = "Test Set Accuracy by Framework")
Test Set Accuracy by Framework

| Framework | Model | Accuracy |
|---|---|---|
| tidymodels | Random Forest (ranger) | 0.9778 |
| h2o | AutoML (deeplearning) | 0.9778 |
| qeML | Random Forest (qe wrapper) | 0.9778 |
| caret | Random Forest (rf) | 0.9556 |
| mlr3 | Random Forest (ranger) | 0.9556 |

On a clean, small dataset like iris, accuracy differences are minimal. The real differentiator is the API and workflow each framework provides. On real-world datasets the choice of framework matters more for how you structure your code than for raw accuracy.

Closing thoughts

There is no single “best” ML framework in R and the right choice depends on the task at hand:

  • Start with tidymodels for a modern, composable, production-ready pipeline.
  • Try qeML for the fastest path from data to results.
  • Use h2o for automatic model selection and stacking with minimal effort.
  • Consider mlr3 for rigorous benchmarking and advanced pipeline composition.
  • Stick with caret if you are maintaining existing code or prefer its battle-tested simplicity.
To leave a comment for the author, please follow the link and comment on their blog: R'tichoke.

New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA
https://www.r-bloggers.com/2026/04/new-r-package-bdlnm-released-on-cran-bayesian-distributed-lag-non-linear-models-in-r-via-inla/ (Fri, 10 Apr 2026)
[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

CRAN, GitHub

TL;DR: {bdlnm} brings Bayesian Distributed Lag Non-Linear Models (B-DLNMs) to R using INLA, making it possible to model complex DLNMs, quantify uncertainty, and produce rich visualizations.

Background

Climate change is increasing exposure to extreme environmental conditions such as heatwaves and air pollution. However, these exposures rarely have immediate effects. For example:

    • A heatwave today may increase mortality several days later
    • Air pollution can have cumulative and delayed impacts

Distributed Lag Non-Linear Models (DLNMs) are the standard framework for studying these effects. They simultaneously model:

    • How risk changes with exposure level (exposure-response)
    • How risk evolves over time (lag-response)

When effects are non-linear, splines are typically used to define these two relationships. The two bases are then combined through a cross-basis function.
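For intuition, the distributed-lag part is just a weighted sum of current and past exposures, with the weights tracing out the lag-response curve. A toy base-R sketch (hand-picked weights rather than the spline basis that {dlnm} constructs):

```r
# Toy lag-response weights: effect peaks at lag 2 and decays to zero by lag 5
w <- c(0.1, 0.3, 0.4, 0.15, 0.05, 0)

set.seed(42)
x <- rnorm(30)  # daily exposure series, e.g. temperature anomalies

# embed() lines up x_t, x_{t-1}, ..., x_{t-5} as columns; the distributed-lag
# contribution on day t is the weighted sum across lags
lagged <- embed(x, length(w))
effect <- as.vector(lagged %*% w)

length(effect)  # 30 - 5 = 25 days have a full lag history
```

The cross-basis generalizes this by also letting the lag weights vary smoothly with the exposure level, which is what the tensor-product construction in crossbasis() provides.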

As datasets become larger and more complex (e.g., studies with different regions and longer time periods), classical approaches show limitations. Bayesian DLNMs extend this framework by:

    • Supporting more flexible model structures
    • Providing full posterior distributions
    • Enabling richer uncertainty quantification

The new {bdlnm} package extends the framework of the {dlnm} package to a Bayesian setting, using Integrated Nested Laplace Approximation (INLA), a fast alternative to MCMC for Bayesian inference.

Installing and loading the package

As of March 2026, the package is available on CRAN:

install.packages("bdlnm")
library(bdlnm)

INLA (stable version 23.4.24 or newer) must be installed beforehand. You can install the newest stable INLA version with:

install.packages(
  "INLA",
  repos = c(
    getOption("repos"),
    INLA = "https://inla.r-inla-download.org/R/stable"
  ),
  dep = TRUE
)

Now, let’s load all the libraries we will need for this short tutorial:

Load required libraries
# DLNMs and splines
library(dlnm)
library(splines)

# Data manipulation
library(dplyr)
library(reshape2)
library(stringr)
library(lubridate)

# Visualization
library(ggplot2)
library(gganimate)
library(ggnewscale)
library(patchwork)
library(scales)
library(plotly)

# Tables
library(gt)

# Execution time
library(tictoc)

Hands-on example

We use the built-in london dataset with daily temperature and mortality (age 75+) from 2000-2012.

Before fitting any model, it is useful to explore the raw data. This plot shows daily mean temperature and mortality for the 75+ age group in London from 2000 to 2012, providing a first look at the time series we are trying to model:

col_mort <- "#2f2f2f"
col_temp <- "#8e44ad"

# Scaling parameters
a <- (max(london$mort_75plus) - min(london$mort_75plus)) /
  (max(london$tmean) - min(london$tmean))
b <- min(london$mort_75plus) - min(london$tmean) * a

p <- ggplot(london, aes(x = yday(date))) +
  geom_line(
    aes(y = a * tmean + b, color = "Mean Temperature"),
    linewidth = 0.4
  ) +
  geom_line(
    aes(y = mort_75plus, color = "Daily Mortality (+75 years)"),
    linewidth = 0.4
  ) +
  facet_wrap(~year, ncol = 3) +
  scale_y_continuous(
    name = "Daily Mortality (+75 years)",
    breaks = seq(0, 225, by = 50),
    sec.axis = sec_axis(
      name = "Mean Temperature (°C)",
      transform = ~ (. - b) / a,
      breaks = seq(-10, 30, by = 10)
    )
  ) +
  scale_x_continuous(
    breaks = yday(as.Date(paste0(
      "2000-",
      c("01", "03", "05", "07", "09", "11"),
      "-01"
    ))),
    labels = c("Jan", "Mar", "May", "Jul", "Sep", "Nov"),
    expand = c(0.01, 0)
  ) +
  scale_color_manual(
    values = c(
      "Daily Mortality (+75 years)" = col_mort,
      "Mean Temperature" = col_temp
    )
  ) +
  labs(x = NULL, color = NULL) +
  guides(color = "none") +
  theme_minimal() +
  theme(
    axis.title.y.left = element_text(
      color = col_mort,
      face = "bold",
      margin = margin(r = 8)
    ),
    axis.title.y.right = element_text(
      color = col_temp,
      face = "bold",
      margin = margin(l = 8)
    ),
    axis.text.y.left = element_text(color = col_mort),
    axis.text.y.right = element_text(color = col_temp)
  ) +
  transition_reveal(as.numeric(date))

animate(p, nframes = 300, fps = 10, end_pause = 100)



Model overview

Conceptually, DLNMs model:

    • Exposure-response: how risk changes with exposure level

    • Lag-response: how risk unfolds over time

A typical model is:

Y_t ~ Poisson(μ_t)
log(μ_t) = α + cb(x_t, …, x_{t−L}) · β + Σ_k γ_k u_{kt}

where:

    • α is the intercept
    • cb(·) is the cross-basis function, defining both the exposure-response and lag-response relationships
    • β are the coefficients associated with the cross-basis terms
    • u_{kt} are time-varying covariates with corresponding coefficients γ_k

Model specification & setup

Before fitting the model, we have to define the spline-based exposure-response and lag-response functions using the {dlnm} package.

For our example, we will use specifications that are common in the temperature-mortality literature:

    • Exposure-response: natural spline with three knots placed at the 10th, 75th, and 90th percentiles of daily mean temperature

    • Lag-response: natural spline with three knots equally spaced on the log scale up to a maximum lag of 21 days

# Exposure-response and lag-response spline parameters
dlnm_var <- list(
  var_prc = c(10, 75, 90),
  var_fun = "ns",
  lag_fun = "ns",
  max_lag = 21,
  lagnk = 3
)

# Cross-basis parameters
argvar <- list(
  fun = dlnm_var$var_fun,
  knots = quantile(london$tmean, dlnm_var$var_prc / 100, na.rm = TRUE),
  Bound = range(london$tmean, na.rm = TRUE)
)

arglag <- list(
  fun = dlnm_var$lag_fun,
  knots = logknots(dlnm_var$max_lag, nk = dlnm_var$lagnk)
)

# Create crossbasis
cb <- crossbasis(london$tmean, lag = dlnm_var$max_lag, argvar, arglag)

As is commonly done in these scenarios, we will also control for seasonality in the mortality time series using a natural spline with 8 degrees of freedom per year, which flexibly captures long-term and seasonal trends in mortality:

seas <- ns(london$date, df = round(8 * length(london$date) / 365.25))

Finally, we also have to define the temperature values for which predictions will be generated:

temp <- round(seq(min(london$tmean), max(london$tmean), by = 0.1), 1)

Fit the model

We fit the previously defined Bayesian DLNM using the function bdlnm(), drawing 1000 samples from the posterior distribution and setting a seed for reproducibility:

tictoc::tic()
mod <- bdlnm(
  mort_75plus ~ cb + factor(dow) + seas,
  data = london,
  family = "poisson",
  sample.arg = list(n = 1000, seed = 5243)
)
tictoc::toc()
8.33 sec elapsed

Internally, bdlnm():

    • fits the model using INLA

    • returns posterior samples for all parameters

Minimum mortality temperature

We estimate the minimum mortality temperature (MMT), defined as the temperature at which the overall mortality risk is minimized. This optimal value will later be used to center the estimated relative risks.

tictoc::tic()
mmt <- optimal_exposure(mod, exp_at = temp)
tictoc::toc()
7.3 sec elapsed

Unlike the frequentist approach, the Bayesian framework directly provides the full posterior distribution of the MMT. It is useful to inspect this distribution to assess whether multiple candidate optimal exposure values exist and to verify that the median provides a reasonable centering value:

ggplot(as.data.frame(mmt$est), aes(x = mmt$est)) +
  geom_histogram(
    fill = "#A8C5DA",
    bins = length(unique(mmt$est)),
    alpha = 0.6,
    color = "white"
  ) +
  geom_density(
    aes(y = after_stat(density) * length(mmt$est) / length(unique(mmt$est))),
    color = "#2E5E7E",
    linewidth = 1.2,
    adjust = 2 # <-- key change: higher = smoother
  ) +
  geom_vline(
    xintercept = mmt$summary["0.5quant"],
    color = "#2E5E7E",
    linewidth = 1.1,
    linetype = "dashed"
  ) +
  scale_x_continuous(breaks = seq(min(mmt$est), max(mmt$est), by = 0.1)) +
  labs(x = "Temperature (°C)", y = "Frequency") +
  theme_minimal()

The posterior distribution of the MMT is concentrated around 18.9ºC and is unimodal, so the median is a stable centering value for the relative risk estimates.

The posterior distribution of the MMT can also be visualized directly using the package’s plot() method: plot(mmt).

Predict exposure-lag-response effects

We predict the exposure-lag-response association between temperature and mortality from the fitted model at the supplied temperature grid:

cen <- mmt$summary[["0.5quant"]]
tictoc::tic()
cpred <- bcrosspred(mod, exp_at = temp, cen = cen)
tictoc::toc()
6.83 sec elapsed

Centering at the MMT means that relative risks (RR) are interpreted relative to this optimal temperature with minimum mortality.
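The centering itself is simple to illustrate. In a toy sketch (not the package's internals), if f(x) is a fitted log-risk curve, centering at cen turns it into relative risks via RR(x) = exp(f(x) - f(cen)), so RR equals 1 exactly at the centering temperature:

```r
# Hypothetical log-risk curve with its minimum at 19 °C
f <- function(x) 0.002 * (x - 19)^2

cen <- 19                       # centering value (plays the role of the MMT)
rr <- function(x) exp(f(x) - f(cen))

rr(cen)   # exactly 1 at the centering temperature
rr(28)    # above 1 away from it
```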

Several visualizations can be produced from these predictions. While simpler plots can be created using the package’s plot() method, here we will build more polished ggplot2 versions:

3D exposure-lag-response surface

We can plot the full exposure-lag-response association using a 3-D surface:

matRRfit_median <- cpred$matRRfit.summary[,, "0.5quant"]
x <- rownames(matRRfit_median)
y <- colnames(matRRfit_median)
z <- t(matRRfit_median)

zmin <- min(z, na.rm = TRUE)
zmax <- max(z, na.rm = TRUE)
mid <- (1 - zmin) / (zmax - zmin)

plot_ly() |>
  add_surface(
    x = x,
    y = y,
    z = z,
    surfacecolor = z,
    cmin = zmin,
    cmax = zmax,
    colorscale = list(
      c(0, "#00696e"),
      c(mid * 0.5, "#80c8c8"),
      c(mid, "#f5f0e8"),
      c(mid + (1 - mid) * 0.5, "#c2714f"),
      c(1, "#6b1c1c")
    ),
    colorbar = list(title = "RR")
  ) |>
  add_surface(
    x = x,
    y = y,
    z = matrix(1, nrow = length(y), ncol = length(x)),
    colorscale = list(c(0, "black"), c(1, "black")),
    opacity = 0.4,
    showscale = FALSE
  ) |>
  layout(
    title = "Exposure-Lag-Response Surface",
    scene = list(
      xaxis = list(title = "Temperature (°C)"),
      yaxis = list(title = "Lag", tickvals = y, ticktext = gsub("lag", "", y)),
      zaxis = list(title = "RR"),
      camera = list(eye = list(x = 1.5, y = -1.8, z = 0.8))
    )
  )
Click the image to explore the interactive Plotly version

The surface reveals two distinct risk regions. Hot temperatures produce a sharp, acute risk concentrated at the first lags, peaking at lag 0 and dissipating rapidly thereafter. Cold temperatures produce a more modest and gradual increase over the first lags that does not fully disappear at longer lags. Intermediate temperatures near the MMT sit close to the RR = 1 reference plane across all lags.

The differential lag structure observed for heat- and cold-related mortality is consistent with known physiological mechanisms. Heat-related mortality tends to occur rapidly after exposure due to acute physiological stress, whereas cold-related mortality develops more gradually through delayed cardiovascular and respiratory effects, leading to increasing risk over longer lag periods.

Lag-response curves

We can also visualize slices of the previous surface. For example, the lag-response relationship for different temperature values:

matRRfit <- cbind(
  melt(cpred$matRRfit.summary[,, "0.5quant"], value.name = "RR"),
  RR_lci = melt(
    cpred$matRRfit.summary[,, "0.025quant"],
    value.name = "RR_lci"
  )$RR_lci,
  RR_uci = melt(
    cpred$matRRfit.summary[,, "0.975quant"],
    value.name = "RR_uci"
  )$RR_uci
) |>
  rename(temperature = Var1, lag = Var2) |>
  mutate(
    lag = as.numeric(gsub("lag", "", lag))
  )

temps <- cpred$exp_at

p <- ggplot() +
  # Lag-responses curves colored by temperature
  geom_line(
    data = matRRfit,
    aes(x = lag, y = RR, group = temperature, color = temperature),
    alpha = 0.35,
    linewidth = 0.35
  ) +
  scale_color_gradientn(
    colours = c(
      "#2166ac",
      "#4393c3",
      "#92c5de",
      "#d1e5f0",
      "#f7f7f7",
      "#fddbc7",
      "#f4a582",
      "#d6604d",
      "#b2182b"
    ),
    name = "Temperature"
  ) +
  # Start a new color scale for highlighted curves
  ggnewscale::new_scale_color() +
  # RR = 1 reference
  geom_hline(
    yintercept = 1,
    linetype = "dashed",
    color = "grey30",
    linewidth = 0.5
  ) +
  scale_x_continuous(breaks = cpred$lag_at) +
  scale_y_continuous(trans = "log10", breaks = pretty_breaks(6)) +
  labs(
    title = "Lag-response curves by temperature",
    x = "Lag (days)",
    y = "Relative Risk (RR)"
  ) +
  theme_minimal() +
  theme(legend.position = "top", panel.grid.minor.x = element_blank()) +
  transition_states(
    temperature,
    transition_length = 1,
    state_length = 0
  ) +
  shadow_mark(past = TRUE, future = FALSE, alpha = 0.6)

animate(p, nframes = 300, fps = 15, end_pause = 100)

Cold temperatures (blue) increase gradually over the initial lags and then decline slowly without fully disappearing at longer lags. Hot temperatures (red) show a different pattern: a higher risk immediately after lag 0, which drops rapidly and largely dissipates after the first lags:



Exposure-response curves

We can also plot the exposure-response curves by lag, together with the overall cumulative curve across the full lag period:

allRRfit <- data.frame(
  temperature = as.numeric(rownames(cpred$allRRfit.summary)),
  lag = "overall",
  RR = cpred$allRRfit.summary[, "0.5quant"],
  RR_lci = cpred$allRRfit.summary[, "0.025quant"],
  RR_uci = cpred$allRRfit.summary[, "0.975quant"]
)

RRfit <- rbind(matRRfit, allRRfit)

# Split data
RRfit_lags <- RRfit |>
  filter(!lag %in% c("overall")) |>
  mutate(lag = as.numeric(lag))
RRfit_overall <- RRfit |>
  filter(lag %in% c("overall"))

temps <- cpred$exp_at
t_cold <- temps[which.min(abs(temps - quantile(temps, 0.01)))]
t_hot <- temps[which.min(abs(temps - quantile(temps, 0.99)))]

# Top plot: exposure-response curves for each lag and overall
p_main <- ggplot() +
  # Background: all lags, fading from vivid (small) to pale (large)
  geom_line(
    data = RRfit_lags,
    aes(x = temperature, y = RR, group = lag, color = lag),
    linewidth = 0.8
  ) +
  scale_color_gradientn(
    colours = c(
      "black",
      "#2b1d2f",
      "#4a2f5e",
      "#6a4c93",
      "#8b6bb8",
      "#b39cdb",
      "#d8c9f1",
      "#f3eef5"
    ),
    values = scales::rescale(c(0, 0.5, 1, 2, 3, 4, 5, 10, 20))
  ) +
  new_scale_color() +
  new_scale_fill() +
  # Credible intervals
  geom_ribbon(
    data = RRfit_overall,
    aes(
      x = temperature,
      ymin = RR_lci,
      ymax = RR_uci,
      fill = "1"
    ),
    alpha = 0.2
  ) +
  # Highlighted curves
  geom_line(
    data = RRfit_overall,
    aes(x = temperature, y = RR, color = "1"),
    linewidth = 1.2
  ) +
  geom_hline(
    yintercept = 1,
    linetype = "dashed"
  ) +
  scale_color_manual(values = "#a6761d", labels = "Overall (CrI95%)") +
  scale_fill_manual(values = "#a6761d", labels = "Overall (CrI95%)") +
  scale_y_continuous(
    transform = "log10",
    breaks = sort(c(0.8, pretty_breaks(5)(c(0.8, 4))))
  ) +
  labs(
    x = NULL,
    y = "Relative Risk (RR)",
    color = NULL,
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.text.x = element_blank(),
    plot.margin = margin(8, 8, 0, 8)
  )

# Bottom plot: histogram with percentile lines
p_hist <- ggplot(london, aes(x = tmean)) +
  geom_histogram(
    aes(y = after_stat(density), fill = after_stat(x)),
    binwidth = 0.5,
    color = "black",
    linewidth = 0.2
  ) +
  geom_vline(
    xintercept = t_cold,
    linetype = "dashed",
    color = "#053061",
    linewidth = 0.6
  ) +
  geom_vline(
    xintercept = t_hot,
    linetype = "dashed",
    color = "#67001f",
    linewidth = 0.6
  ) +
  geom_vline(
    xintercept = cen,
    linetype = "dashed",
    color = "grey20",
    linewidth = 0.6
  ) +
  annotate(
    "text",
    x = t_cold,
    y = Inf,
    label = "1st pct",
    vjust = 1.4,
    hjust = 1.1,
    size = 3.2,
    color = "#053061"
  ) +
  annotate(
    "text",
    x = t_hot,
    y = Inf,
    label = "99th pct",
    vjust = 1.4,
    hjust = -0.1,
    size = 3.2,
    color = "#67001f"
  ) +
  annotate(
    "text",
    x = cen,
    y = Inf,
    label = "MMT",
    vjust = 1.4,
    hjust = -0.1,
    size = 3.2,
    color = "grey20"
  ) +
  scale_x_continuous(limits = range(cpred$exp_at)) +
  scale_fill_gradientn(
    colours = c(
      "#053061",
      "#2166ac",
      "#4393c3",
      "#92c5de",
      "#d1e5f0",
      "#f7f7f7",
      "#fddbc7",
      "#f4a582",
      "#d6604d",
      "#b2182b",
      "#67001f"
    ),
    name = "Temperature"
  ) +
  labs(x = "Temperature (°C)", y = "Density") +
  theme_minimal() +
  theme(
    plot.margin = margin(20, 8, 8, 8),
    legend.position = "bottom"
  )

# Combine them:
p_main / p_hist + plot_layout(heights = c(3, 1))


The overall cumulative curve (mustard) is clearly asymmetric: risk increases on both sides of the MMT, but the rise is much steeper for hot temperatures than for cold temperatures. The lag-0 curve (black), which reflects the immediate effect, behaves differently for cold than heat: it is below 1 at cold temperatures (reflecting the delayed nature of cold temperature effects) and increases approximately linearly for heat. The histogram confirms that most London days fall between 5°C and 20°C, so extreme temperatures, despite their high individual risks, are relatively rare events.

Attributable risk

We can also calculate attributable numbers and fractions from a B-DLNM, which allows us to quantify the impact of all observed exposures on mortality in the 75+ population. We compute the number of mortality events attributable to the temperature exposures (the attributable number) and the fraction of all mortality events this constitutes (the attributable fraction).

Two different perspectives can be used:

    • Backward (dir = "back"): how many of today’s deaths are explained by past temperature exposures?

    • Forward (dir = "forw"): how many future deaths will today’s temperature exposure cause?

Let’s use the forward perspective, which is the more commonly used:

tictoc::tic()
attr_forw <- attributable(
  mod,
  london,
  name_date = "date",
  name_exposure = "tmean",
  name_cases = "mort_75plus",
  cen = cen,
  dir = "forw"
)
tictoc::toc()
110.12 sec elapsed
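The forward logic can be sketched with a toy calculation, following the attributable-risk formulation of Gasparrini & Leone (2014); this is an illustration with made-up numbers, not the package's implementation. The lag-specific log-RRs that one day's exposure contributes over the lag window are summed, and the attributable fraction is one minus the exponential of the negative sum:

```r
# Hypothetical lag-specific log-RRs contributed by one day's exposure
log_rr_lag <- c(0.05, 0.03, 0.01)

# Forward attributable fraction for that exposure
af <- 1 - exp(-sum(log_rr_lag))

# Hypothetical count of deaths over which this exposure acts
deaths <- 40
an <- af * deaths   # attributable number
```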

Attributable fraction evolution

We can plot the time series of daily attributable fractions (AF):

col_af <- "black"

temp_colours <- c(
  "#053061",
  "#2166ac",
  "#4393c3",
  "#92c5de",
  "#d1e5f0",
  "#f7f7f7",
  "#fddbc7",
  "#f4a582",
  "#d6604d",
  "#b2182b",
  "#67001f"
)

af_med <- attr_forw$af.summary[, "0.5quant"]

# Pre-compute range once
af_min <- min(af_med, na.rm = TRUE) - 0.01
af_max <- max(af_med, na.rm = TRUE) + 0.01

df <- data.frame(
  date = london$date,
  x = yday(london$date),
  year = year(london$date),
  tmean = london$tmean,
  af = af_med
)

ggplot(df, aes(x = x)) +
  # Full-height temperature background per day
  geom_rect(
    aes(
      xmin = x - 0.5,
      xmax = x + 0.5,
      ymin = af_min,
      ymax = af_max,
      fill = tmean
    )
  ) +
  scale_fill_gradientn(
    colours = temp_colours,
    name = "Temperature (°C)"
  ) +
  # AF line on top
  geom_line(
    aes(y = af),
    color = col_af,
    linewidth = 0.7
  ) +
  scale_y_continuous(
    name = "Attributable Fraction (AF)",
    breaks = seq(0, 1, by = 0.1),
    limits = c(af_min, af_max),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = yday(as.Date(paste0(
      "2000-",
      c("01", "03", "05", "07", "09", "11"),
      "-01"
    ))),
    labels = c("Jan", "Mar", "May", "Jul", "Sep", "Nov"),
    expand = c(0, 0)
  ) +
  facet_wrap(~year, ncol = 3, axes = "all_x") +
  labs(x = NULL) +
  theme_minimal(base_size = 11) +
  theme(
    panel.spacing.x = unit(0, "pt"),
    strip.text = element_text(face = "bold", size = 10),
    legend.position = "top",
    legend.key.width = unit(2.5, "cm")
  )



Sharp spikes in AF exceeding 60% are visible in summer 2003 and 2006, coinciding with the major European heatwaves. In general, summer episodes produce higher and more abrupt peaks in AF, whereas cold winter days are associated with more sustained elevations over time, though less pronounced in magnitude.

Total attributable burden

Summing across the full study period, the table quantifies the total mortality burden attributable to non-optimal temperature exposures in the 75+ population:

rbind(
  "Attributable fraction" = attr_forw$aftotal.summary,
  "Attributable number" = attr_forw$antotal.summary
) |>
  as.data.frame() |>
  round(3) |>
  gt(rownames_to_stub = TRUE)
                           mean        sd  0.025quant   0.5quant  0.975quant       mode
Attributable fraction     0.174     0.018       0.139      0.175       0.207      0.176
Attributable number   68857.597  7131.526   55071.066  69178.391   81995.459  69842.155
Over the full 2000-2012 period, approximately 17.5% (95% CrI: 13.9%-20.7%) of all deaths in the London 75+ population were attributable to non-optimal temperatures, corresponding to roughly 69,178 deaths (95% CrI: 55,071-81,996).
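As a rough consistency check (using the rounded posterior medians from the table, so the result is only approximate, and the ratio of medians is not exactly the median of the ratio), dividing the attributable number by the attributable fraction recovers the implied total number of 75+ deaths over the period:

```r
an_med <- 69178.391   # posterior median attributable number (from the table)
af_med <- 0.175       # posterior median attributable fraction (rounded)

round(an_med / af_med)   # implied total deaths, roughly 395,000
```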

Conclusions

The {bdlnm} package provides a powerful and accessible implementation of Bayesian Distributed Lag Non-Linear Models in R. By combining the flexibility of DLNMs with full Bayesian inference via INLA, it enables researchers to better quantify uncertainty and fit complex exposure-lag-response relationships. This makes it a valuable tool for studying the health impacts of climate change and other environmental risks in increasingly data-rich settings.

This framework is not limited to environmental epidemiology. In fact, it can be applied to any setting involving time-varying exposures and delayed effects (e.g., market shocks may affect asset prices over several days), making it a powerful and general tool for time series analysis.

Development is ongoing. Upcoming features include:

    • Multi-location analyses: pooling exposure-lag-response curves across different cities or regions within a single model
    • Spatial B-DLNMs (SB-DLNM): explicitly modelling spatial heterogeneity in the exposure-lag-response curves of different regions

The package is on CRAN. Bug reports and contributions are welcome via GitHub.

References

    • Gasparrini A. (2011). Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software, 43(8), 1-20. doi:10.18637/jss.v043.i08.

    • Quijal-Zamorano M., Martinez-Beneito M.A., Ballester J., Marí-Dell’Olmo M. (2024). Spatial Bayesian distributed lag non-linear models (SB-DLNM) for small-area exposure-lag-response epidemiological modelling. International Journal of Epidemiology, 53(3), dyae061. doi:10.1093/ije/dyae061.

    • Rue H., Martino S., Chopin N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2), 319-392. doi:10.1111/j.1467-9868.2008.00700.x.

    • Gasparrini A., Leone M. (2014). Attributable risk from distributed lag models. BMC Medical Research Methodology, 14, 55. doi:10.1186/1471-2288-14-55.


New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA was first posted on April 10, 2026 at 5:14 pm.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: New R Package {bdlnm} Released on CRAN: Bayesian Distributed Lag Non-Linear Models in R via INLA]]>
TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics https://www.r-bloggers.com/2026/04/theseusplot-0-2-0-visualizing-decomposition-of-differences-in-rate-metrics/ Fri, 10 Apr 2026 00:00:00 +0000 https://hoxo-m.github.io/blog/posts/TheseusPlot-0-2-0/


Continue reading: TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics]]>
[This article was first published on HOXO-M Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

TheseusPlot is an R package that decomposes differences in rate metrics between two groups into contributions from individual subgroups and visualizes the results as a “Theseus Plot”.

The package is inspired by the Ship of Theseus thought experiment. It replaces subgroup data step by step, recalculates the overall metric at each step, and interprets each change as that subgroup’s contribution to the overall difference.

Suppose you notice that the click-through rate is lower in 2025 than in 2024 and want to examine how a particular attribute, such as gender, contributed to the change. If you obtain a Theseus Plot like the one below, it suggests that men contributed more to the decline in click-through rate than women.
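The replacement idea can be sketched with a toy CTR example (illustrative numbers and base R only, not TheseusPlot's API): swapping one subgroup's data for the other period's and recomputing the overall rate gives that subgroup's contribution to the change:

```r
# 2024 vs 2025 clicks and views by subgroup (made-up numbers)
clicks24 <- c(men = 300, women = 250); views24 <- c(men = 1000, women = 1000)
clicks25 <- c(men = 200, women = 240); views25 <- c(men = 1000, women = 1000)

ctr <- function(clicks, views) sum(clicks) / sum(views)

base  <- ctr(clicks24, views24)                       # overall 2024 CTR
step1 <- ctr(c(clicks25["men"], clicks24["women"]),   # swap in men's 2025 data
             c(views25["men"], views24["women"]))

contrib_men <- step1 - base   # men's contribution to the change
```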

What’s new in 0.2.0

Version 0.2.0 includes the following changes:

  • a fix for continuous-variable discretization with split = "rate", where bin boundaries for the second group could previously be computed from the first group’s data
  • a fix for the size bar of "Sum of ... other attributes", which could incorrectly use the first group’s counts for both groups
  • a fix for warnings in plot() and plot_flip() when multiple subgroups were tied for the largest absolute contribution
  • suppression of warnings generated internally by waterfalls::waterfall() during plot creation.

I would like to thank Kazuyuki Sano for reporting the first two issues and contributing to their fixes.

Installation

You can install TheseusPlot from CRAN with:

install.packages("TheseusPlot")

Try it out

TheseusPlot may be useful when you want to understand why metrics such as conversion rate, retention rate, or click-through rate changed.

For details on how to use it, please see the package website: https://hoxo-m.github.io/TheseusPlot/.

To leave a comment for the author, please follow the link and comment on their blog: HOXO-M Blog.

Continue reading: TheseusPlot 0.2.0: Visualizing Decomposition of Differences in Rate Metrics]]>
Using R to Teach R: Lessons for Software Development https://www.r-bloggers.com/2026/04/using-r-to-teach-r-lessons-for-software-development/ Thu, 09 Apr 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/teaching-r-packages-reporting-gitlab/


Continue reading: Using R to Teach R: Lessons for Software Development]]>
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As we approach the decennial (10-year) anniversary since Jumping Rivers was founded in 2016, it’s a good time to reflect on what we have achieved in that time and share some lessons learned.

If you have read our blogs previously then you will be aware that Jumping Rivers is a consultancy and training provider in all things data science. But did you know that we offer over 50 different courses spanning R, Python, Git, SQL and more?

In this blog we will provide a glimpse into our internal process and share how we have streamlined the task of maintaining so many courses. Along the way we will share some good practices applicable to any big coding project, including packaging of source code and automated CI/CD.

The challenge

Let’s start by laying out the key challenges which face us.

1. Multilingual support

Our course catalogue consists of over 50 courses. The majority of these are either based on R or Python or both:

  • 50% R
  • 30% Python
  • 5% R and Python
  • 15% other (Git, SQL, Tableau, Posit and more)

At the very least, any solution that we come up with for standardising our courses must be compatible with both R and Python. Ideally it should also support some less commonly taught languages, including SQL and Git.

2. Maintenance

The world of R and Python is constantly changing. The languages themselves receive frequent updates, as do publicly available R packages on CRAN and Python packages on PyPI.

This has the consequence that code which worked one year ago (or even one day) may no longer be functional with the latest package versions. We will need some way to track this and ensure that the code examples covered in our courses remain relevant and error-free.
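One lightweight way to help with this (a simple sketch; fuller solutions such as {renv} lockfiles exist) is to snapshot the versions the materials were last tested against:

```r
# Record the installed version of each package a course depends on
# (base packages used here so the example is self-contained)
pkgs <- c("stats", "utils")
vers <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
data.frame(package = pkgs, version = vers, row.names = NULL)
```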

3. Demand

We deliver over 100 courses per year. For a relatively small team of data scientists, this can be a lot to juggle!

In an ideal world, the process of building the course materials, setting up the cloud environment for training, and managing all of the administration that goes along with this should be automated. That way, the trainer can focus on providing the highest quality experience for the attendees without having to worry about things going wrong on the day.

The solution

Our team is used to setting up data science workflows for clients, including automated reporting and migration of source code into packages. We have therefore applied these techniques in our internal processes, including training.

Automated reporting

Imagine you maintain a document that has to be updated on a regular basis, such as a monthly presentation showing the latest company revenues. Does this scenario sound familiar?

We could regenerate the plots and data tables and manually copy and paste them into the report document. Better still, we can take advantage of free-to-use automated reporting frameworks such as R Markdown and Quarto.

R Markdown and Quarto both work as follows:

  • We provide a “YAML header” at the top of the report document with configuration and formatting options:

    ---
    title: "Introduction to Python"
    authors:
    - "Myles Mitchell"
    date: "2026-04-02"
    output: pdf
    ---
    
  • The report body is formatted as Markdown and supports a mixture of plain text and code:

    ## Introduction
    At its most basic, Python is essentially a calculator.
    We can run basic calculations as follows:
    ```{python}
    2 + 1
    ```
    We can also assign the output of a calculation to a
    variable so that it can be reused later:
    ```{python}
    x = 2 + 1
    print(x)
    ```
    

Notice that we have included chunks of Python code. By making use of chunk options we can configure code chunks to be executed when rendering the report. Any outputs from the code (plots, tables, summary statistics) can then be displayed.
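For example, in Quarto, chunk execution options are written as `#|` comment lines at the top of a chunk; `eval` controls whether the chunk runs when rendering and `echo` whether its source is shown (a representative subset of the available options):

    ```{python}
    #| eval: true
    #| echo: false
    x = 2 + 1
    print(x)
    ```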

By migrating the code logic into the report itself, we can update our report assets at the click of a button whenever the data changes.

We have taken inspiration from this approach with our course notes and presentation slides. This forces us to be rigorous with the code examples. Any runtime errors that are produced by faulty or outdated code would be visible in the course notes and by extension to the attendees of our courses.

Crucially for us, R Markdown and Quarto are both compatible with R and Python. They also support syntax highlighting for languages like Git and SQL, as well as a variety of output formats including HTML and PDF.

Flow chart illustrating the automated reporting workflow with Quarto. Starting with a text-based .qmd file, this is converted into a Markdown format using Jupyter or knitr. Pandoc is then used to convert this into a variety of output formats including HTML, PDF and Word.

Internal R packages

So we have settled on a solution for building our course notes. But we have 50 different courses, and setting these up from scratch each time is going to get tedious!

A good practice in any coding project is to avoid duplication as much as possible. Instead of copying and pasting code, we should really be migrating code into functions which are self contained, reusable and easy to test. This will mean fewer places to debug when things inevitably go wrong.
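As a toy illustration (the helper name is hypothetical), logic that would otherwise be pasted into several course repositories becomes a single, testable function:

```r
# Reusable helper instead of copy-pasted summary code
summarise_feedback <- function(scores) {
  stopifnot(is.numeric(scores))
  c(mean = mean(scores), median = median(scores), n = length(scores))
}

summarise_feedback(c(4, 5, 3, 5))
```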

Following a similar philosophy for our training infrastructure, we have migrated any reusable assets for our courses—including logos, template files and styling—into a collection of internal R packages.

When building a new course, the developer can now focus on the aspects that are unique to that course:

  • Code examples
  • Notes
  • Exercises
  • Presentation slides

Everything else is taken care of automatically:

  • The appearance of the course notes and presentation slides.
  • Build routines including converting the R Markdown / Quarto text files into HTML.

In addition to course templates, we also have internal packages for managing the administrative side of training, including:

  • Calculating pricing quotes for clients.
  • Generating post-course certificates.
  • Spinning up a bespoke Posit Workbench environment for the course.
  • Summarising attendee feedback.

And the list goes on!

GitLab CI/CD

With automated reporting and packaging of source code, we have created standardised routines that can be applied to any of our courses.

This does not change the fact that we have over 50 courses to maintain. We still need a way of testing our courses and tracking issues. This is where CI/CD (Continuous Integration / Continuous Delivery and Deployment) comes in.

CI/CD defines a framework for software development, including:

  • Automated unit testing.
  • Branching of source code and code review.
  • Versioning and deployment of software.

If you maintain software then you have likely come across version control with Git. Cloud platforms like GitLab and GitHub provide tools for collaborative code development. Not only do they provide a cloud backup of your source code, they also provide the following features:

  • CI/CD tools for automated testing, build and deployment.
  • Branch rules for enforcing good practices like code review and unit testing.
  • Versioning and tagging of source code.

Each of our courses is maintained in its own GitLab repository. The CI/CD pipelines for our courses are defined in a separate repository along with the internal R packages mentioned above.

Flow chart illustrating how we have standardised our GitLab training repositories. The templates are defined in a central repository and pushed downstream to our course repositories.

When setting up a new course, the course repository will be automatically populated with the template CI/CD rules. All courses are therefore subject to the same stringent checks, including:

  • Ensuring that the course notes build without errors.
  • Enforcing code review of any course updates before these are merged into the main branch.
  • Building and storing the artifacts (the rendered HTML notes and coding scripts) for the latest version of the course.

These checks are triggered by any updates to a course. We also schedule monthly CI/CD pipelines for all courses, with any issues immediately flagged to our trainers.
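A scheduled check of this kind might look roughly as follows in a `.gitlab-ci.yml`. This is a hedged sketch: the job name, container image, and render command are illustrative, not our actual configuration.

```yaml
build-notes:
  image: rocker/verse            # container with R, Quarto and LaTeX
  script:
    - quarto render .            # fail the pipeline if any chunk errors
  artifacts:
    paths:
      - _site/                   # rendered HTML notes
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"             # monthly check
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"  # on course updates
```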

We have also taken advantage of GitLab’s folder-like structure for organising code repositories. Within the Jumping Rivers project on GitLab, we have a subproject called “training”. All of our course-related repositories are located “downstream” from this project. This means that any settings or environment variables defined at the “training” level are automatically applied to all of our courses.

In summary

The take-home lessons from this blog are applicable to any big coding project:

  • Avoid duplication: migrate any reusable logic or assets into standalone packages.
  • Utilise CI/CD workflows using GitLab, GitHub or similar.
  • Focus on what matters by automating as much of the process as possible.

Our training infrastructure has taken 10 years to build and is still constantly evolving; we have not even covered the full process in this blog! For a deeper dive, check out this talk by Myles at SatRdays London 2024.

For more on automated reporting, check out:

For more on packaging of source code, check out:

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Continue reading: Using R to Teach R: Lessons for Software Development]]>
Hold On Hope: publication lag times at cell biology journals https://www.r-bloggers.com/2026/04/hold-on-hope-publication-lag-times-at-cell-biology-journals/ Thu, 09 Apr 2026 11:22:53 +0000 https://quantixed.org/?p=3723
Continue reading: Hold On Hope: publication lag times at cell biology journals]]>
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve posted about publication lag times previously. The “lag” refers to the time between submitting a paper and its appearing in a journal.

Publication lag times are still a frustration for researchers. Although preprints circumvent the delay in sharing science with others, publication is still king when it comes to evaluation. Contracts are short and publication delays can be long…

I recently saw a post comparing median publication lag times for genetics journals. This motivated me to update my code and rerun the analysis for cell biology journals to see what, if anything, has changed.

Methodology

I wrote an R package, PubMedLagR, which uses {rentrez} to retrieve the publication data from PubMed. Once we have these data, it is a matter of producing some plots, which I have covered previously. To ensure that we are only looking at research papers, we use "journal article"[pt] as a search term, and also remove anything else (Reviews, Commentaries, etc.) from the data. Feel free to use it to look at other journals.
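As a rough sketch of the kind of {rentrez} query that sits underneath this (the journal, year, and exact search term here are illustrative, not PubMedLagR's internals):

```r
# Illustrative {rentrez} query counting research articles for one
# journal-year; the real retrieval logic lives inside PubMedLagR
library(rentrez)

res <- entrez_search(
  db     = "pubmed",
  term   = '"J Cell Biol"[ta] AND 2025[pdat] AND "journal article"[pt]',
  retmax = 0  # we only want the hit count here, not the records
)
res$count
```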

Caveats

Before we get started, there are some caveats. The analysis is only as good as the PubMed data. Not all journals submit their date information to PubMed, while for others it is incomplete (as we’ll see below). There are inaccuracies for sure. I found a paper that was supposedly submitted on 1970-01-01, more than 40 years before the journal started. Also, it’s well known that some journals “restart the clock” on a paper by rejecting it and allowing resubmission, then only counting the resubmitted version. So, comparison between journals is a little tricky, but it allows us to look at trends.

Let’s dive in

We can use the following code to grab the data we are interested in.

library(PubMedLagR)

# Journals and years to analyse
jrnl_list <- c("J Cell Sci", "Mol Biol Cell", "J Cell Biol", "Nat Cell Biol",
               "EMBO J", "Biochem J", "Dev Cell", "FASEB J", "J Biol Chem",
               "Cells", "Front Cell Dev Biol", "Nature Communications",
               "Cell Reports", "Mol Cell", "Autophagy", "Cell Death Differ",
               "Cell Death Dis", "Cell Res", "Sci Adv", "Cell")
yrs <- 2006:2026

# Fetch PubMed records in batches of 250, then parse the XML into a data frame
retrieve_journal_year_records(jrnl_list, yrs, batch_size = 250)
pprs <- pubmed_xmls_to_df()

The list of journals is somewhat arbitrary. I have included Nature Communications and Science Advances although they carry many other papers besides cell biology. I included Cell (even though there’s not much cell biology in there these days) and left out Nature and Science.

How many papers are in each of these journals and how has that changed over time?

There’s a huge increase in the number of papers published by Nature Communications, Science Advances and Cell Reports. Nature Communications is a behemoth, publishing ~12,500 papers in 2025.

There was a boom and bust in publications at Front Cell Dev Biol and Cells, which could be due to reputational problems (like this). Other journals have declined, most noticeably J Biol Chem, but others have taken a hit too. The reasons behind these dynamics are discussed elsewhere.

Median publication lag time

We’ll use received-to-published as our measure of publication lag time. This is the time from submission to it appearing in the journal “in print”. It’s measured here in days.
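In R terms, the lag for a single paper is simply the difference between two Date objects (the dates here are invented for illustration):

```r
# Received-to-published lag in days for one (invented) paper
received  <- as.Date("2024-03-01")
published <- as.Date("2025-01-15")
lag_days  <- as.numeric(published - received)
lag_days  # 320
```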

For the last 5 years, the median lag time at Nature Cell Biology has been over one year. Other outlets are close to this, e.g. Cell and Dev Cell, whereas others linger around 200 days or lower, as with J Cell Sci and FASEB J. The journal Autophagy is missing here because there are no data for it. Others, like Sci Adv and MBoC, have only minimal data available. The shortest lag times were for Cells (which might not surprise some people).

The lion’s share of this lag time is the time from submission to acceptance (received-accepted), with a small contribution from the time taken to formally publish the article (accepted-published).

Journal              Year  Accepted-Published  Received-Published  Received-Accepted
Biochem J            2025                   6                 125                116
Cell                 2025                  29               308.5              275.5
Cell Death Differ    2025                  13                 235                221
Cell Death Dis       2025                  18                 218                193
Cell Rep             2025                  23                 233                209
Cell Res             2025                  34               208.5                172
Cells                2025                  16                  57                 41
Dev Cell             2025                  27                 322                296
EMBO J               2025                  27                 222                196
FASEB J              2025                  13                 131                117
Front Cell Dev Biol  2025                  35                 111                 71
J Biol Chem          2025                   9                 128                117
J Cell Biol          2025                  35                 259                218
J Cell Sci           2025                  13               168.5                155
Mol Cell             2025                  27                 258                231
Nat Cell Biol        2025                  48               381.5                327
Nat Commun           2025                  20                 266                241

This is using the median time. Obviously, some papers whizz straight in, whereas others… don’t. Let’s have a look.

You can see many examples of papers that have lag times of 1, 2 or 3 years. I scaled all the plots to a maximum lag time of 3 years. There were several examples of papers with lag times of up to 7 years that looked to be genuine, but they distorted the view.

So what trends can we see?

The lag times are creeping up at some journals, but not at others. It’s possible to see some very short lag times (in amongst the longer ones) recently at Dev Cell, Mol Cell and EMBO J. I assume these are transfers into the journal, leading to rapid publication. Cell also has many of those, though, so perhaps there are other explanations.

To click through in more detail, I made some graphics for each journal, which use ridgelines to show the profile of lag times for each year.
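A minimal version of such a ridgeline plot might look like this with {ggridges}; the data frame and column names (`pprs`, `lag_days`, `year`) are assumed here, not taken from the original code:

```r
library(ggplot2)
library(ggridges)

# One ridge per year, showing the distribution of lag times
ggplot(pprs, aes(x = lag_days, y = factor(year))) +
  geom_density_ridges(alpha = 0.7) +
  labs(x = "Received-to-published lag (days)", y = "Year")
```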

And finally

Incidentally, nine of the top ten papers with the longest lag times were published in Nature Communications. The longest was this one (3263 days). Almost nine years! All papers have their battle stories and I’m sure this one has a tale to tell.

The post title comes from “Hold On Hope” by Guided by Voices.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Hold On Hope: publication lag times at cell biology journals]]>
400426
Developer Engagement and Bioconductor https://www.r-bloggers.com/2026/04/developer-engagement-and-bioconductor/ Thu, 09 Apr 2026 00:00:00 +0000 https://blog.bioconductor.org/posts/2026-04-09-developer-engagement/ Introduction
During the Chan Zuckerberg Institute’s Essential Open Source Software for Science cycle 6 funding round, the Bioconductor Community Manager, Maria Doyle, secured a grant to fund a developer engagement position for Bioconductor, and...

Continue reading: Developer Engagement and Bioconductor]]>
[This article was first published on Bioconductor community blog, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

During the Chan Zuckerberg Initiative’s Essential Open Source Software for Science cycle 6 funding round, the Bioconductor Community Manager, Maria Doyle, secured a grant to fund a developer engagement position for Bioconductor, and I was fortunate enough to be offered that role. I am Nick Cooley, and I’m excited to see what this role can bring to Bioconductor. My background is relatively diverse: I received my PhD in organic chemistry from the University of Missouri, and I worked on prokaryotic genomics and functional genomics at the University of Pittsburgh from 2017 to 2025.

Role responsibilities

The mandate of this role is somewhat broad. Bioconductor, and academic computing generally, face a myriad of distinct and interrelated challenges as hardware, computing paradigms, and education environments change rapidly. Improving developer resources for tackling new and existing challenges, modernizing Bioconductor developer onboarding materials (particularly for early career researchers), and improving recognition mechanisms for community members who volunteer time and effort to the Bioconductor project are all general themes within the role’s scope.

Some Specific Efforts

A few of the specific efforts I’ll be working on in this role include:

Developer Forum

The Developer Forum had previously been run on a volunteer basis, and served as a community resource for discussing technical and infrastructure issues, concerns, and opportunities. The creation of the Developer Engagement Lead allowed us to include the Forum as a direct responsibility of this role.

Developer Champions Program

Bioconductor working groups have been a pillar of Bioconductor for a while, and represent a considerable amount of volunteer work towards the project. Improving the visibility of the working groups themselves, and the recognition that project contributors receive for their participation in them, can go a long way towards ensuring that this work is valued by contributors’ home institutions and funding mechanisms. The Champions Program aims to create a clear recognition mechanism for those volunteer efforts.

Bioconductor hackathon events

Community and collaboration are irreplaceable engines of strong research. Many Bioconductor contributors find community and collaboration within their own disciplines or institutions. Providing an avenue for collaborative and technical events within Bioconductor can fill persistent gaps in the research tooling present in the project, and create networking opportunities for early career researchers. Part of this role is planning and running these events.

Bioconductor documentation and LLMs

The ways that researchers search for information, tools, and workflow examples are changing with the rise of large language models and their interfaces. There are opportunities for improving how bioinformaticians, especially those outside of the Bioconductor community, find and familiarize themselves with research solutions within the Bioconductor project, including through improvements to website search and documentation discoverability. A long-term goal of this role is to work on documentation templates and checking tools to improve their searchability by LLMs, and to explore the feasibility of Bioconductor-sanctioned and -managed LLMs.

How to get in touch

For developer discussions and ideas, the Bioconductor Zulip is the best place to connect.

© 2026 Bioconductor. Content is published under Creative Commons CC-BY-4.0 License for the text and BSD 3-Clause License for any code. | R-Bloggers

To leave a comment for the author, please follow the link and comment on their blog: Bioconductor community blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Developer Engagement and Bioconductor]]>
400428
Collaborating between Bioconductor and R-universe on Development of Common Infrastructure https://www.r-bloggers.com/2026/04/collaborating-between-bioconductor-and-r-universe-on-development-of-common-infrastructure-2/ Wed, 08 Apr 2026 00:00:00 +0000 https://blog.bioconductor.org/posts/2026-04-08-r-universe-collaboration/

This article is cross-posted on rOpenSci and R-Consortium blogs.
For more than two decades, the Bioconductor project has been a cornerstone of the R ecosystem, providing high-quality, peer-reviewed tools for bioinformatics and computational biol...

Continue reading: Collaborating between Bioconductor and R-universe on Development of Common Infrastructure]]>
[This article was first published on Bioconductor community blog, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article is cross-posted on rOpenSci and R-Consortium blogs.

For more than two decades, the Bioconductor project has been a cornerstone of the R ecosystem, providing high-quality, peer-reviewed tools for bioinformatics and computational biology. Its curated repository model, rigorous review standards, and tightly coordinated release process have helped establish Bioconductor as one of the most trusted distribution channels in scientific computing.

However, the infrastructure that supports such a long-standing and large-scale project inevitably accumulates technical debt. Legacy build systems, bespoke tooling, and historically grown workflows add up to costly and unsustainable maintenance work. For this reason, Bioconductor is collaborating with R-universe to gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes. In turn, Bioconductor is helping R-universe expand and refine its features as we learn to serve the complex needs of the Bioconductor community.

This collaboration reflects a core principle of R-universe as an R Consortium Infrastructure Steering Committee (ISC) top-level project: supporting reviewed package repositories such as rOpenSci and Bioconductor, and providing modern, open, and reusable infrastructure that strengthens the broader R ecosystem.

A Shared Mission: Tooling for Managed Repositories

R-universe was designed as a next-generation package distribution and build system for R. It provides:

  • Continuous building and checking of R packages across platforms
  • Binary packages for Windows, macOS, Linux, and WebAssembly
  • Transparent and reproducible build environments managed via GitHub Actions
  • Dashboards and metadata APIs for monitoring ecosystem health and activity
  • CRAN-like package repositories with discoverable metrics and documentation

From the outset, a key objective has been to support curated and reviewed communities — such as rOpenSci and Bioconductor — by offering modern infrastructure without requiring them to redesign their governance model or review processes.

For Bioconductor, this means incrementally introducing piece-wise functionality, with consideration for established release cycles and quality control mechanisms:

  1. Setting up independent build and dashboard tooling, replicating processes from the current Bioconductor build systems on R-universe infrastructure
  2. Mirroring Windows and macOS binaries produced on R-universe to Bioconductor
  3. Exploring further integration of results and metadata produced by R-universe for Bioconductor health/activity monitoring and aiding the curation processes
  4. Potential future steps toward deeper automation and harmonization

By taking small gradual steps towards adopting R-universe components, everyone gets the opportunity to experiment with new tooling and evaluate where adjustments may be needed in order to minimize disruption to existing practices.

An important milestone in this venture is that Bioconductor now uses R-universe to build the Windows and macOS binaries, which significantly reduces costs and the maintenance load on the Bioconductor team. Beyond binary distribution, we are currently exploring deeper integration of R-universe’s continuous check results into Bioconductor’s quality control and release processes.

Two Universes: Release and Development

Bioconductor maintains two distinct repositories:

  • A release branch for stable packages
  • A devel branch for ongoing development and the next release cycle

To mirror this structure, we currently operate two dedicated R-universe instances:

These universes integrate directly with Bioconductor’s existing Git infrastructure and provide continuous builds for packages in both branches.

Through the R-universe dashboard, package maintainers and users can:

  • Inspect cross-platform check results
  • Review extended BiocCheck diagnostics
  • Monitor build logs and dependency graphs
  • Explore rich package metadata and metrics
  • Publish binary packages for Windows, macOS, and Linux

This provides a familiar yet modern interface for Bioconductor contributors, aligned with what users increasingly expect from contemporary R package infrastructure.

Information about each package is available on https://bioc.r-universe.dev/{pkgname}. For example, https://bioc.r-universe.dev/DESeq2 provides details on the DESeq2 package as shown below:

screenshot of r-universe

If this is your first time visiting R-universe, we recommend clicking the “Website Tour” button which will walk you through the most important information in 1 or 2 minutes.
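Each universe also behaves as a CRAN-like repository, so packages can be installed from it directly. The snippet below follows the standard R-universe pattern, with the repository URL taken from the post:

```r
# Install the R-universe build of a Bioconductor package by placing
# the universe's repository URL ahead of CRAN
install.packages(
  "DESeq2",
  repos = c("https://bioc.r-universe.dev", "https://cloud.r-project.org")
)
```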

Technical Documentation for Bioconductor Maintainers

The R-universe project maintains comprehensive technical documentation at https://docs.r-universe.dev. For Bioconductor specifically, we created a dedicated section summarizing the most relevant topics for developers to get started with R-universe: https://docs.r-universe.dev/bioconductor/

As the collaboration evolves and new components get introduced, the documentation will continue to be expanded. The goal is to provide Bioconductor maintainers with a clear reference point for understanding how R-universe fits into their development workflow, while maintaining compatibility with the established practices that have made Bioconductor a successful project within the R community.

Looking Ahead

Adopting new infrastructure inevitably involves adjustments. For Bioconductor developers, integrating with a new build and distribution system will likely require some changes to workflows, and time to become familiar with new or different package checks, build diagnostics, and binary distribution.

However, by gradually moving toward common infrastructure, the Bioconductor project will benefit from improvements that are being continuously developed and maintained for the broader R ecosystem. A system based on modern continuous integration (CI) will provide developers with improved tooling, and will give the core team more time to focus on community coordination and quality control, rather than on maintaining costly infrastructure. At the same time, the shared platform provided by R-universe can help to increase the visibility and accessibility of Bioconductor software to the greater R community.

We look forward to continuing this alliance and to working with the Bioconductor community to ensure that the next generation of infrastructure supports the project for many years to come.

© 2025 Bioconductor. Content is published under Creative Commons CC-BY-4.0 License for the text and BSD 3-Clause License for any code. | R-Bloggers

To leave a comment for the author, please follow the link and comment on their blog: Bioconductor community blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Collaborating between Bioconductor and R-universe on Development of Common Infrastructure]]>
400403
EM-DAT, the world’s disaster memory, is at risk https://www.r-bloggers.com/2026/04/em-dat-the-worlds-disaster-memory-is-at-risk/ Tue, 07 Apr 2026 00:00:00 +0000 https://statsandr.com/blog/em-dat-the-world-s-disaster-memory-is-at-risk/

I do not usually write posts that are calls to action. But sometimes, something important enough comes along that it would feel wrong to stay silent. This is one of those times.

What is EM-DAT?
EM-DAT, the Emergency Events Database, is the world’s...

Continue reading: EM-DAT, the world’s disaster memory, is at risk]]>
[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I do not usually write posts that are calls to action. But sometimes, something important enough comes along that it would feel wrong to stay silent. This is one of those times.

What is EM-DAT?

EM-DAT, the Emergency Events Database, is the world’s most widely used and trusted global database for tracking natural and technological disasters. It has been maintained since 1988 by the Centre for Research on the Epidemiology of Disasters (CRED), which is part of UCLouvain.

The database currently contains data on the occurrence and impacts of over 27,000 mass disasters worldwide, from 1900 to the present day. It covers floods, storms, earthquakes, droughts, wildfires, extreme temperatures, landslides, volcanic activity, and technological accidents, across virtually every country on earth.

Crucially, it is:

  • Open access (for non-commercial use)
  • Globally comparable, using transparent and consistent inclusion criteria
  • Cross-verified across multiple sources (UN agencies, NGOs, reinsurance companies, research institutes, press agencies)
  • The reference dataset for thousands of peer-reviewed studies, national risk assessments, and international policy processes

If you have ever read a paper or report about global disaster trends, the probability is high that EM-DAT was the data source behind it.

Why is it at risk?

For more than 25 years, EM-DAT was primarily funded by the United States Agency for International Development (USAID). Following the recent dismantling of USAID, that funding is gone, and no sustainable alternative has been secured.

This is not a minor budget shortfall. Without a replacement funding mechanism, EM-DAT risks shutting down entirely.

Why does it matter?

The open letter drafted in support of EM-DAT puts it well: in an era of intensifying climate extremes, cascading risks, and compounding crises, reliable data are not a luxury. They are the infrastructure for informed decision-making.

Concretely, EM-DAT underpins:

  • Disaster risk reduction and prevention policies, used by governments to assess national risks and prioritise investments
  • Humanitarian operations, relied upon by multilateral agencies and NGOs to plan and forecast needs
  • Climate research, providing historical baselines for understanding trends in extreme weather events
  • Monitoring of global commitments, such as the Sendai Framework for Disaster Risk Reduction, the SDGs, and the Paris Agreement
  • Insurance and risk modelling, used by the private sector alongside other data to benchmark losses and refine exposure models

EM-DAT’s value is not just in the quantity of records. It lies in the rigour and consistency of its methodology over time and across countries. That is exactly what makes it irreplaceable. In a world awash with data, curated and quality-controlled datasets of this kind are rare. If EM-DAT were to close, the result would not be a smooth substitution. It would be fragmentation, proprietary data silos, and reduced access, particularly for lower-income countries that are already under-represented in global evidence.

A personal note

I signed the open letter after being informed of the issue by my colleague Prof. Niko Speybroeck, a leading epidemiologist at UCLouvain and program director of the CRED.

I do not have direct expertise in disaster epidemiology. But I do care about open data, open science, and the integrity of global research infrastructure. And EM-DAT is exactly the kind of resource that the whole scientific community relies on, often without fully realising it.

How you can help

If you share these values, I encourage you to sign the open letter: “The World’s collective disaster memory must be preserved”.

The letter calls on governments, multilateral development banks, philanthropic foundations, and international organisations to step forward with a coordinated and sustainable funding arrangement for EM-DAT. The cost of maintaining the world’s primary disaster database is modest set against the billions spent on disaster response and recovery each year. The cost of losing it would be profound.

Please also consider sharing this post or the open letter with your own network (researchers, policymakers, students, practitioners, or anyone who cares about data-driven approaches to global challenges).

More information

As always, if you have any thoughts or questions related to this post, feel free to leave a comment below.

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: EM-DAT, the world’s disaster memory, is at risk]]>
Marathon Man: how to pace a marathon https://www.r-bloggers.com/2026/04/marathon-man-how-to-pace-a-marathon/ Mon, 06 Apr 2026 13:36:35 +0000 https://quantixed.org/?p=3654 How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question. The ideal strategy would be to “even split” the race. This is where you run continually at the ...
Continue reading: Marathon Man: how to pace a marathon]]>
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question.

The ideal strategy would be to “even split” the race. This is where you run continually at the same pace from kilometre 0 to the finish. Let’s forget about “negative splitting”. This is where you speed up through the race, usually by running at a constant pace for the first half or three-quarters and then increasing the pace. Negative splits are for the pros, not mere mortals! The difficulty with even-splitting the race is that it is very hard to know what pace you can maintain. The marathon gets hard for everyone after 30 km, so a slow down is almost inevitable. Certainly, if you have started too fast, you will fade. This situation is known as “positive splitting”.

Why is it so hard to know what pace you can maintain? Well, you can predict a pace based on existing races, e.g. a half marathon, and there are various ways to do this, but it is difficult to tell if you can hold that pace for the marathon. It’s such a brutal event that training up to run one takes time, and it equally takes a while to recover, so experimentation is limited. Running a full marathon (at pace) in training is not advised. So determining an ideal pace involves quite a bit of guesswork.

Let’s take a look at a big dataset of marathon times – we’ll use the New York City Marathon from 2025 – to see if we can understand how to pace a marathon. There’s an available dataset of chip times (meaning we don’t have to worry about dodgy GPS data) and the course has similar first and second half profiles, allowing us to use these times to understand negative/even/positive splitting. Let’s dive in.

You can skip to the code to play along or just see the analysis here.

First, using histograms of the difference between the second and first halves of the marathon, we can see that most runners positive-split the race. There are very few runners who run a negative split (blue bars, left of the dashed line). More runners even-split (yellow), but the majority run positive (red) split times.

For marathoners finishing in under 3 h, the modal split is only +2 minutes. Over 21.1 km this is a loss of just 6 s per km. For marathoners finishing in over three hours, the loss gets more severe. Those finishing outside of 5 h ship 20 minutes or more in the second half.

At first glance this looks like better pace management by the faster runners, but these positive splits could be proportional to the paces being run. In other words, a slower runner should ship more time in the second half, because they’re running more slowly.

We can look at this data a different way and directly compare the first and second half times for each runner. Again this highlights just how few runners negative- or even-split the marathon. Most are positive splitting and are in the upper left half of the plot. We can also see that the data veers away from the ideal even-split (dashed line) with the slower paces. This veering looks linear (straight line).

We can fit a line to this data, and constrain it to go through (1,1) i.e. a 2 h marathoner even-splitting the race. To do this in R we can use lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting) and this gives the coefficient for I(x - 60) as 1.239. This is essentially the fade coefficient for the average runner in the 2025 edition of this race.

What does that mean? Well, for a runner achieving a 90 minute first half, their second half would most likely be: 60 + 1.239 * (90 - 60) = 97.17 minutes, so this would be a finish time of 3:07:10.

For anyone looking to run a 3 h New York Marathon, the average runner would therefore need to run 60 / 2.239 + 60 = 86.8 minutes for the first half (where 2.239 = 1 + 1.239) to anticipate the fade. So 1:26:48 for the first half, and then 1:33:12 for the second half.

A simpler calculation is to take the mean of the ratio between the two half times for everyone in the dataset. This gives a fade coefficient of 1.13. The difference between these two fade coefficients is due to the lack of constraint used in the fit. The ratio predicts a positive split being inevitable for the fastest runners, which is probably not true. Anyhow, this puts the first half time at 88 minutes for folks looking to run 3 h. These fade coefficients are good predictors for a range of times, and I suspect they would be similar at other marathon events with a similar profile. You can use them to calculate your ideal pace for a target finish time.
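To see the difference between the two coefficients numerically, we can compare the second-half times each model predicts. This is a quick sketch using the values reported above (1.128 is the mean ratio before rounding):

```r
fit_fade   <- 1.239  # constrained lm through (60, 60)
ratio_fade <- 1.128  # mean of second-half / first-half ratios

first_half   <- c(60, 75, 90, 105)                 # first-half times (min)
second_fit   <- 60 + fit_fade * (first_half - 60)  # even split at 60 min
second_ratio <- ratio_fade * first_half            # positive split everywhere

round(rbind(first_half, second_fit, second_ratio), 1)
```

The constrained fit allows an even split at the fast end, while the ratio model predicts a positive split at every pace, which is exactly the point about the fastest runners.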

Finally, for the most accurate answer about sub-3 h pacing, we can look directly at runners finishing between 02:50:00 and 03:00:00 and see what they actually ran. The median first half time was 86.3 min (IQR 84.42-87.87) and the median second half was 89.62 min (IQR 88.07-91.12). This gives a median finish time of 2:56:00. So running a 1:26:18 first half would give someone their best chance of finishing in under 3 h, allowing for the inevitable fade.

The takeaway message is: to finish within a goal time, do not assume even splits. That is, if you want to run 3 hours 30 min and bank on 105 minutes per half (4:59/km), you will most likely fail to hit the target. Build in a buffer of time to allow for the inevitable fade. A pace of 4:45/km is a better target pace (see below).

Good luck!

| Finish Time | Even split pace | Target pace |
|-------------|-----------------|-------------|
| 03:00:00    | 00:04:16        | 00:04:07    |
| 03:30:00    | 00:04:59        | 00:04:45    |
| 04:00:00    | 00:05:41        | 00:05:23    |
| 04:30:00    | 00:06:24        | 00:06:01    |
| 05:00:00    | 00:07:07        | 00:06:39    |
| 06:00:00    | 00:08:32        | 00:07:55    |
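The target paces in this table can be reproduced from the constrained fade coefficient. Here is a sketch, assuming the 1.239 coefficient from the fit above and the standard 42.195 km distance (the min:sec formatting is simplified and would need a carry for seconds that round to 60):

```r
fade <- 1.239           # fade coefficient from the constrained fit
half_km <- 42.195 / 2   # half-marathon distance in km

target_pace <- function(goal_min) {
  # solve x + (60 + fade * (x - 60)) = goal_min for the first-half time x
  x <- (goal_min - 60 + fade * 60) / (1 + fade)
  pace <- x / half_km   # min/km to hold through the first half
  sprintf("%02d:%02d", as.integer(pace), as.integer(round((pace %% 1) * 60)))
}

target_pace(180)  # 3 h goal
target_pace(210)  # 3 h 30 goal
```

Running this for the finish times in the table recovers the "Target pace" column.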

The code

This analysis was possible thanks to the uploader for making the chip time data available. Also, a shoutout to Nicola Rennie for sharing how to style social media handles in {ggplot2} graphics. This part of my code requires my {qBrand} library and should be skipped if you are running the code yourself (remove the caption = cap argument in the ggplot calls).

library(ggplot2)
library(ggtext)

sysfonts::font_add_google("Roboto", "roboto")
showtext::showtext_auto()

## data wrangling ----

# load csv file from url
url <- paste0("https://huggingface.co/datasets/donaldye8812/",
              "nyc-2025-marathon-splits/resolve/main/",
              "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
df <- read.csv(url)

# the data frame is a long table
# we need to grab the time values where splitCode is "HALF" or "MAR"
df <- df[df$splitCode %in% c("HALF", "MAR"), c("RunnerID", "splitCode", "time")]
# reshape to wide format, values are in time
df <- reshape(df, idvar = "RunnerID", timevar = "splitCode", direction = "wide")
# calculate the split times in minutes
df$split_HALF <- as.numeric(
  as.difftime(df$time.HALF, format = "%H:%M:%S", units = "mins"))
df$split_MAR <- as.numeric(
  as.difftime(df$time.MAR, format = "%H:%M:%S", units = "mins"))
# calculate the second half time
df$split_SECOND_HALF <- df$split_MAR - df$split_HALF
# remove rows with NA values
df <- df[!is.na(df$split_SECOND_HALF), ]
# calculate the difference
df$Difference <- df$split_SECOND_HALF - df$split_HALF
# difference as a fraction of first half
df$Difference_Fraction <- df$Difference / df$split_HALF * 100
# classify by finish time: sub 3 h, 3:00-3:30, 3:30-4:00, 4:00-5:00, over 5 h
df$Category <- cut(df$split_MAR,
                           breaks = c(0, 180, 210, 240, 300, Inf),
                           labels = c("Sub 3 h", "3:00-3:30", "3:30-4:00",
                                      "4:00-5:00", "Over 5 h"))

## plot styling ----

social <- qBrand::qSocial()
cap <-  paste0(
  "**Data:** New York City Marathon 2025 Results<br>**Graphic:** ",social
)

my_palette <- c("Sub 3 h" = "#cb2029",
                "3:00-3:30" = "#147f77",
                "3:30-4:00" = "#cf6d21",
                "4:00-5:00" = "#28a91b",
                "Over 5 h" = "#a31a6d")

## make the plots ----

ggplot(df, aes(x = Difference, fill = after_stat(x))) +
  # vertical line at x = 0
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  geom_histogram(breaks = seq(
    from = -59.5, to = 81.5, by = 1), color = "black") +
  scale_colour_gradient2(
    low = "#2b83ba",
    mid = "#ffffbf",
    high = "#d7191c",
    midpoint = 0,
    limits = c(-15,15),
    na.value = "#ffffffff",
    guide = "colourbar",
    aesthetics = "fill",
    oob = scales::squish
  ) +
  scale_x_continuous(breaks = seq(-45,90,15), limits = c(-40, 80)) +
  facet_wrap(~ Category, ncol = 1, scales = "free_y") +
  labs(title = "Most runners positive split the marathon",
       x = "Difference in minutes (Second Half - First Half)",
       y = "Number of Runners",
       caption = cap) +
  theme_classic() +
  # hide legend
  theme(legend.position = "none") +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 5),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16),
    plot.title = element_text(size = rel(1.2),
                              face = "bold")
  )

ggsave("Output/Plots/nyc_marathon_2025_split_difference_histogram.png",
       width = 900, height = 1200, dpi = 72, units = "px", bg = "white")

ggplot() +
  geom_abline(slope = 1, linetype = "dashed", color = "black") +
  geom_point(data = df,
             aes(x = split_HALF, y = split_SECOND_HALF, colour = Category),
             shape = 16, size = 1.5, alpha = 0.1) +
  scale_x_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_y_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_colour_manual(values = my_palette) +
  labs(x = "First half time (h)",
       y = "Second half time (h)",
       caption = cap) +
  theme_bw() +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 10),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16)
  ) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

ggsave("Output/Plots/nyc_marathon_2025_split_difference_scatter.png",
       width = 1000, height = 800, dpi = 72, units = "px", bg = "white")

From this data we can also make some calculations to understand…

## fitting ----

# to fit, we'll constrain the line to go through (60,60), i.e. a
# 2 h marathoner who runs even splits
fitting <- data.frame(x = df$split_HALF,y = df$split_SECOND_HALF)
lm( I(y-60) ~ I(x-60) + 0, data = fitting)


# Call:
#   lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting)
# 
# Coefficients:
#   I(x - 60)  
# 1.239  

# so for a 90 minute first half, second half would be:
# 60 + 1.239 * (90 - 60) = 97.17 minutes, a finish time of 3:07:10

# to run a 3 h New York Marathon, the average runner needs to run
# 60 / 2.239 + 60 = 86.8 minutes for the first half
# so 1:26:48 for the first half, and 1:33:12 for the second half

# a more simple approach is to calculate the mean of the ratios
mean_ratio <- mean(df$split_SECOND_HALF / df$split_HALF)
mean_ratio
# [1] 1.127581

# filter the df for finish times between 170 and 180 minutes
target <- df[df$split_MAR > 170 & df$split_MAR < 180,]
summary(target)


    RunnerID         time.HALF           time.MAR           split_HALF      split_MAR     split_SECOND_HALF   Difference     
 Min.   :48819892   Length:1289        Length:1289        Min.   :70.25   Min.   :170.0   Min.   : 82.70    Min.   :-7.6833  
 1st Qu.:48834548   Class :character   Class :character   1st Qu.:84.42   1st Qu.:173.6   1st Qu.: 88.07    1st Qu.: 0.7167  
 Median :48849752   Mode  :character   Mode  :character   Median :86.30   Median :176.0   Median : 89.62    Median : 3.0500  
 Mean   :48849498                                         Mean   :85.98   Mean   :175.7   Mean   : 89.73    Mean   : 3.7585  
 3rd Qu.:48864551                                         3rd Qu.:87.87   3rd Qu.:178.2   3rd Qu.: 91.12    3rd Qu.: 5.9000  
 Max.   :48878979                                         Max.   :92.87   Max.   :180.0   Max.   :106.02    Max.   :35.7667  
 Difference_Fraction      Category   
 Min.   :-8.5008     Sub 3 h  :1289  
 1st Qu.: 0.8051     3:00-3:30:   0  
 Median : 3.5390     3:30-4:00:   0  
 Mean   : 4.5178     4:00-5:00:   0  
 3rd Qu.: 7.0055     Over 5 h :   0  
 Max.   :50.9134   

The post title comes from “Marathon Man” by Ian Brown from his “My Way” album. He’s wearing a track suit on the cover but that’s not optimal wear for running a marathon.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Continue reading: Marathon Man: how to pace a marathon]]>
One interface, (Almost) Every Classifier: unifiedml v0.2.1 https://www.r-bloggers.com/2026/04/one-interface-almost-every-classifier-unifiedml-v0-2-1/ Sat, 04 Apr 2026 00:00:00 +0000 https://thierrymoudiki.github.io//blog/2026/04/04/r/more-unifiedml-classifiers A new version of `unifiedml` is out; available on CRAN. `unifiedml` is an effort to offer a unified interface to R's machine learning models.
Continue reading: One interface, (Almost) Every Classifier: unifiedml v0.2.1]]>
[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new version of unifiedml is out; available on CRAN. unifiedml is an effort to offer a unified interface to R’s machine learning models.

The main change in version 0.2.1 is the removal of the type (of prediction) argument from predict, and the use of ... instead, which is more generic and flexible.

This post contains advanced examples of use of unifiedml for classification, with ranger and xgboost. More examples have been added to the package vignettes too.

install.packages("unifiedml")

install.packages(c("ranger"))

library("unifiedml")

Loading required package: doParallel

Loading required package: foreach

Loading required package: iterators

Loading required package: parallel

Loading required package: R6

1 – ranger example

library(ranger)



# 2 - 'ranger' classification ---------------------------

# -------------------------------
# S3 wrapper for ranger
# -------------------------------

# Fit function remains the same
my_ranger <- function(x, y, ...) {
  if (!is.data.frame(x)) x <- as.data.frame(x)
  y <- as.factor(y)
  colnames(x) <- paste0("X", seq_len(ncol(x)))
  df <- data.frame(y = y, x)
  fit <- ranger::ranger(y ~ ., data = df, probability = TRUE, ...)
  structure(list(fit = fit), class = "my_ranger")
}

# Predict only with newdata
predict.my_ranger <- function(object, newdata = NULL, newx = NULL, ...) {
  if (!is.null(newx)) newdata <- newx
  if (is.null(newdata)) stop("No data provided for prediction")
#  misc::debug_print(newx)
#  misc::debug_print(newdata)
  if (is.matrix(newdata)) newdata <- as.data.frame(newdata)
#  misc::debug_print(newdata)
  # Unconditionally rename to match training
  colnames(newdata) <- paste0("X", seq_len(ncol(newdata)))
#  misc::debug_print(newdata)
  preds <- predict(object$fit, data = newdata)$predictions
#  misc::debug_print(newdata)
  if (is.matrix(preds) && ncol(preds) == 2) {
    lvls <- colnames(preds)
    return(ifelse(preds[, 2] > 0.5, lvls[2], lvls[1]))
  }

  preds
}

# Print method
print.my_ranger <- function(x, ...) {
  cat("my_ranger model\n")
  print(x$fit)
}

# -------------------------------
# Example: Iris binary classification
# -------------------------------

set.seed(123)
iris_binary <- iris[iris$Species %in% c("setosa", "versicolor"), ]
X_binary <- iris_binary[, 1:4]
y_binary <- as.factor(as.character(iris_binary$Species))

# Train/test split
train_idx <- sample(seq_len(nrow(X_binary)), size = 0.7 * nrow(X_binary))
X_train <- X_binary[train_idx, ]
y_train <- y_binary[train_idx]
X_test <- X_binary[-train_idx, ]
y_test <- y_binary[-train_idx]

# Initialize model
mod <- Model$new(my_ranger)

# Fit on training data only
mod$fit(X_train, y_train, num.trees = 150L)

# Predict on test set
preds <- mod$predict(X_test)

# Evaluate
table(Predicted = preds, True = y_test)
mean(preds == y_test)  # Accuracy



# 5-fold cross-validation on training set
cv_scores <- cross_val_score(
  mod,
  X_train,
  y_train,
  num.trees = 150L,
  cv = 5L
)

cv_scores
mean(cv_scores)  # average CV accuracy


            True
Predicted    setosa versicolor
  setosa         15          0
  versicolor      0         15

[1] 1

  |======================================================================| 100%

[1] 1 1 1 1 1

[1] 1

2 - xgboost example

library(xgboost)

my_xgboost <- function(x, y, ...) {
  
  # Convert to matrix safely
  if (!is.matrix(x)) {
    x <- as.matrix(x)
  }
  
  # Handle factors
  if (is.factor(y)) {
    y <- as.numeric(y) - 1
  }
  
  fit <- xgboost::xgboost(
    data = x,
    label = y,
    ...
  )
  
  structure(list(fit = fit), class = "my_xgboost")
}


predict.my_xgboost <- function(object, newdata = NULL, newx = NULL, ...) {
  
  # Accept both conventions
  if (!is.null(newx)) {
    newdata <- newx
  }
  
  newdata <- as.matrix(newdata)
  
  preds <- predict(object$fit, newdata)
  
  # Binary classification → class labels
  if (!is.null(object$fit$params$objective) &&
      grepl("binary", object$fit$params$objective)) {
    
    return(ifelse(preds > 0.5, 1, 0))
  }
  
  preds
}

print.my_xgboost <- function(x, ...) {
  cat("my_xgboost model\n")
  print(x$fit)
}


set.seed(123)  # for reproducibility

# Binary subset
iris_binary <- iris[iris$Species %in% c("setosa", "versicolor"), ]
X_binary <- as.matrix(iris_binary[, 1:4])
y_binary <- as.factor(as.character(iris_binary$Species))

# Split indices: 70% train, 30% test
train_idx <- sample(seq_len(nrow(X_binary)), size = 0.7 * nrow(X_binary))
X_train <- X_binary[train_idx, ]
y_train <- y_binary[train_idx]
X_test <- X_binary[-train_idx, ]
y_test <- y_binary[-train_idx]

# Initialize model
mod <- Model$new(my_xgboost)

# Fit on training data only
mod$fit(X_train, y_train, nrounds = 50, objective = "binary:logistic")

# Predict on test set
preds <- mod$predict(X_test)

# Evaluate
table(Predicted = preds, True = y_test)
mean(preds == y_test)  # Accuracy



# 5-fold cross-validation on training set
cv_scores <- cross_val_score(
  mod, 
  X_train, 
  y_train, 
  nrounds = 50, 
  objective = "binary:logistic", 
  cv = 5L
)

cv_scores
mean(cv_scores)  # average CV accuracy
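The same three-function pattern (fit wrapper, predict method accepting both newdata and newx, print method) carries over to any classifier. As a hypothetical sketch, not part of unifiedml itself, here is the analogous wrapper for base R's stats::glm logistic regression; the resulting my_glm function could then be passed to Model$new() exactly like my_ranger and my_xgboost above:

```r
# Fit wrapper: same conventions as the wrappers above
my_glm <- function(x, y, ...) {
  if (!is.data.frame(x)) x <- as.data.frame(x)
  y <- as.factor(y)
  colnames(x) <- paste0("X", seq_len(ncol(x)))
  fit <- stats::glm(y ~ ., data = data.frame(y = y, x),
                    family = stats::binomial(), ...)
  structure(list(fit = fit, levels = levels(y)), class = "my_glm")
}

# Predict method: accepts both newdata and newx, returns class labels
predict.my_glm <- function(object, newdata = NULL, newx = NULL, ...) {
  if (!is.null(newx)) newdata <- newx
  if (is.null(newdata)) stop("No data provided for prediction")
  if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  colnames(newdata) <- paste0("X", seq_len(ncol(newdata)))
  probs <- predict(object$fit, newdata = newdata, type = "response")
  object$levels[ifelse(probs > 0.5, 2L, 1L)]
}
```

With these two functions defined, mod <- Model$new(my_glm) followed by mod$fit(X_train, y_train) would slot into the train/test and cross_val_score workflow shown above.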


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

Continue reading: One interface, (Almost) Every Classifier: unifiedml v0.2.1]]>
You can just build your own programming language https://www.r-bloggers.com/2026/04/you-can-just-build-your-own-programming-language/ Fri, 03 Apr 2026 00:00:00 +0000 https://b-rodrigues.github.io/posts/2026-04-03-tproject.html

Last summer, while relaxing on the beaches of Berck, a French town known for treating tuberculosis in kids by exposing them to the fresh maritime air (back in 19th century, they have antibiotics these days), I found myself daydreaming abou...

Continue reading: You can just build your own programming language]]>
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last summer, while relaxing on the beaches of Berck, a French town known for treating tuberculosis in kids by exposing them to the fresh maritime air (back in the 19th century; they have antibiotics these days), I found myself daydreaming about building my own programming language.

Spoiler alert: I don’t know how to build programming languages, but I have developed extremely strong opinions over the years about the features a modern data science language should have. So could I use them fancy LLMs to build one?

Also, let’s get one question answered straight away: why create a new language instead of contributing to existing ones? I certainly do contribute, I maintain several R packages like {rix}, {rixpress}, and {chronicler}, and even have two Python packages (cronista and ryxpress), but I wanted a clean slate to build a system centered around a few non-negotiable principles and features I’ve implemented over the years in R:

  • Reproducibility-First: A language where reproducibility isn’t a bolt-on afterthought managed by external tools, but the very foundation of the runtime.
  • Aggressive Re-use: Instead of reinventing the wheel, this language would stand on the shoulders of giants. It’d use Nix for package management and environment isolation, and Apache Arrow as its high-performance backbone for data frames. R, Python, Julia and other languages would provide the algorithms and models.
  • First-Class Pipelines: Scripts shouldn’t be a sequence of side-effects. In this language, pipelines would be mandatory and first-class citizens.
  • Fail Early and Loudly: No silent type conversions or hidden NAs. If something is wrong, the language breaks immediately so you can fix it.
  • Errors as Objects: Inspired by functional programming, errors are first-class values that can be inspected and handled gracefully.
  • Two Pipes: I want two pipes, one for linear transformations, |>, and a maybe-pipe, ?|> for error recovery. Unlike the standard pipe, ?|> always forwards its value, including Errors, to the next function, allowing you to write handlers that inspect and potentially recover from them. Since Errors are just values, this composes naturally with the rest of the language.
  • Polyglot by Design: Rather than re-implementing every statistical algorithm, this language would be designed to orchestrate and bridge R, Python, and Julia seamlessly.

Also, we’re in a post LLM world, and like them or not, they’re here to stay. They’re pretty useful to write boilerplate code and so any new language would be dead on arrival if it didn’t play nicely with LLMs. So such a new language would need to be written for LLMs primarily, because I don’t expect anyone to learn any new language. This is where the declarative nature of Nix is a huge advantage. Because environments are precisely described, it is much easier for LLMs to focus on generating code and not have to fight with environment setup. This is also the reason I took another radical decision: since Nix would be mandatory for setting up the environment, why bother building OS-specific binaries? I’d just build a Nix package for this language and let Nix handle the rest.

This architecture results in a DSL for orchestration, making it trivial to transfer data objects between different ecosystems without the usual FFI (Foreign Function Interface) friction.

With these ideas in mind, I started prompting Gemini to brainstorm and started by generating specification files. Very broad at first, but as days went by, more and more focused. The way I went about it (and still do) is that I first brainstorm an idea with an LLM, then I ask it to generate a specification file, then I refine it, ask it to generate a new specification file, and so on. Once I’m happy with the spec, I ask an LLM to generate a minimal implementation of the spec. Usually writing the spec and a first implementation is a task shared between Claude and Gemini (through Antigravity). Then I open a pull request and ask GitHub Copilot to review it (usually with GPT-5.x). I repeat this process until I’m happy with the implementation. I always ask for documentation and unit tests (and golden tests when relevant, more on this later).

I started to really believe that I had something interesting, so I gave it a shot, and called it T. I had long joked that the natural successor to R should be called T (because R is the successor to S… and no, I’m not going to call it Q because that sounds like the word for ass in French).

Something else that made me confident I could succeed, besides my own hubris, was that I am pretty familiar with unit testing, test-driven development, trunk-based development and Nix. When you combine all these elements, it makes developing with LLMs quite safe.

So I just started prompting. And now I’m quite happy to announce that there is a beta version of T that you can use today!

By leveraging Nix as a build engine, T can treat complex data science workflows as buildable derivations. A typical T pipeline looks like this:

p = pipeline {
  -- 1. Python node: read data with pandas
  mtcars_pl = pyn(
    command = <{
import pandas as pd
pd.read_csv("data/mtcars.csv", sep="|")
    }>,
    include = ["data/mtcars.csv"],
    serializer = ^csv
  )

  -- 2. Python node: filter and serialize as CSV
  mtcars_pl_am = pyn(
    command = <{
mtcars_pl[mtcars_pl['am'] == 1]
    }>,
    deserializer = ^csv,
    serializer = ^csv
  )

  -- 3. R node: read CSV and take head using functions.R
  mtcars_head = rn(
    command = <{
my_head(mtcars_pl_am)
    }>,
    functions = ["src/functions.R"],
    deserializer = ^csv,
    serializer = ^csv
  )

  -- 4. R node: select column with dplyr
  mtcars_mpg = rn(
    command = <{
library(dplyr)
mtcars_head %>% select(mpg)
    }>,
    deserializer = ^csv,
    serializer = ^csv
  )

  -- Render Quarto report
  report = node(script = "src/report.qmd", runtime = Quarto)
}

-- Materialize the pipeline
populate_pipeline(p, build = true)
pipeline_copy() -- Copy the outputs from the Nix store to your working directory

As you can see, each node has a command argument where you can write literal R or Python code. It is also possible to provide the path to a script instead. If packages need to be loaded for the code to work, you can just write the calls to load the required packages in the command argument as well.

While T is heavily inspired by the {targets} package in R, it takes the concept a step further by making pipelines first-class objects within the language itself. This means you can:

  • Compose Pipelines: You can define small, modular pipelines and then merge them into larger ones using standard operators.
  • Static Analysis: Because the DAG (Directed Acyclic Graph) is defined within the language, T can validate your entire workflow (checking for circular dependencies or missing data) before a single line of code even runs.
  • Heterogeneous Execution: A single pipeline can effortlessly mix R, Python, and native T code. Data is passed between these nodes using built-in serializers like ^csv, ^arrow, or even specialized formats like ^pmml for traditional models and ^onnx for deep learning architectures. It is also possible to define your own serializers.
  • Immutable State: Each node output is managed by Nix, meaning if you haven’t changed the code or the data for a specific node, T (via Nix) will simply pull the cached result from previous runs.

But don’t let the “orchestrator” label fool you; T is also a capable language in its own right. It features a selection of built-in packages inspired by the tidyverse for data manipulation. Thanks to its Arrow backend, it is surprisingly fast. I even maintain a CI benchmark running on NYC Taxi data to ensure performance remains competitive.

I made sure that T is pretty easy to use with LLMs by providing a file called summary.md in the root of the GitHub repository. This file is meant to be used by LLMs to quickly learn the language’s syntax and generate code accordingly. You could also provide the whole help documentation to the LLM (found in the repository under help/docs.json), but I found that a summary is usually enough. There is also another experimental feature I’m thinking about, called intent blocks. These blocks would essentially be first-class structured comments used to anchor an LLM’s behaviour and make it more deterministic. These blocks would be parsed by T and used to generate code accordingly. I have some ideas about how these could look, something like this:

intent {
  description: "Customer churn prediction",
  assumptions: ["Age > 18", "NA imputed with multiple imputation"],
  requires: ["dataset.csv"]
}

Is this slop?

There’s a lot of skepticism about building your own language using LLMs, and I get it. I was pretty skeptical myself. So let me tell you what actually gives me confidence in T’s correctness: as of writing, 1753 unit tests, 122 golden tests, 13 end-to-end tests, and 18 full project demos are executed on every push and PR, on both Linux and macOS via GitHub Actions. That’s the verification regime, and it has to be rigorous precisely because I can’t audit the OCaml implementation by eye. This is actually one of the more interesting lessons from this project: when you can’t rely on code review, you have to over-invest in tests and specifications. The spec files, the enriched changelog, the summary.md, all of that context makes the LLM’s output more predictable, and the test suite tells you immediately when it isn’t.

From personal experience, when I generate R or Python code, the output looks a lot like what I would have written myself. The main failure mode I’ve noticed is lack of context: the more you give the model, the better the result. Letting separate LLMs review PRs and iterating through several loops helps catch what any single model misses.

I’m also confident in T’s safety from a different angle: it’s ultimately orchestrating Python and R code you write yourself, and that you can test independently.

Interested?

If you’re interested in trying it out or contributing, check out the official repository or the website, and don’t hesitate to open an issue or a PR or contact me on the dedicated Matrix (https://matrix.to/#/#tproject:matrix.org) channel.

Appendix

For the interested reader, here’s how to get started with T.

How to get started

If you have Nix installed, getting started with a new project is just a single command away:

# 1. Initialize a new project
nix run github:b-rodrigues/tlang -- init --project my_t_project
cd my_t_project

There will be no other way to start a T project. As explained above, I don’t want to deal with providing OS-specific binaries, and since Nix is T’s build engine, you’ll need Nix installed on your system anyway. Might as well reuse it to manage the installation of T itself!

Inside the project’s folder, you’ll find a tproject.toml file. This is where you list the R and Python packages you’ll need. For example:

[project]
name = "r_py_xgboost_t"
description = "A T data analysis project"

[dependencies]
# T packages this project depends on
# Format: package = { git = "repository-url", tag = "version" }
# Example:
# stats = { git = "https://github.com/t-lang/stats", tag = "v0.5.0" }

[r-dependencies]
packages = ["dplyr", "yardstick"]

[py-dependencies]
version = "python313"
packages = ["numpy", "pandas", "scikit-learn", "xgboost"]

[additional-tools]
packages = ["quarto"]

[t]
# Minimum T language version required
min_version = "0.51.2"

Under [additional-tools] you can add any package that is available in nixpkgs. If you need LaTeX, you can also add a dedicated section:

[tex-dependencies]
packages = ["amsmath", "geometry", "hyperref", "biblatex"]
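
Since anything in nixpkgs is fair game for these sections, it can help to check that a tool is actually packaged before listing it. A quick sketch using the standard nix CLI (the package name quarto here is just an example):

```shell
# Search nixpkgs for a package before adding it to [additional-tools]
nix search nixpkgs quarto
```

If the search returns a match, the attribute name it reports is what you put in the packages list.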

You may have noticed that there is also a section for T packages: that’s right, T supports user-defined packages. Instead of initializing a project, you’d initialize a package:

nix run github:b-rodrigues/tlang -- init --package my_package
cd my_package

Instead of a tproject.toml file, you’ll have to fill in a DESCRIPTION.toml file:

[package]
name = "my_package"
version = "0.1.0"
description = "A brief description of what my_package does"
authors = ["brodriguesco"]
license = "EUPL-1.2"
homepage = ""
repository = ""

[dependencies]
# T packages this package depends on
# Format: package = { git = "repository-url", tag = "version" }

[t]
# Minimum T language version required
min_version = "0.5.0"

Another important file is flake.nix, which is generated automatically. You shouldn’t have to touch it: this flake.nix is what provides the reproducible development environment for running your project. To enter that environment, simply run:

nix develop

This will install T and activate the environment. If you’ve added packages to tproject.toml, you’ll have to run t update to sync them to the flake, then rebuild the environment (exit the development shell with exit and rebuild it with nix develop again). Oh, and by the way: T requires a Linux-like environment, so if you’re on Windows, you’ll have to run T inside WSL2 (Windows Subsystem for Linux).
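
Putting that update cycle together, a typical session after editing tproject.toml looks roughly like this (a sketch of the workflow described above):

```shell
# Inside the development shell: sync tproject.toml changes to the flake
t update
# Leave the current shell, then rebuild the environment with the new packages
exit
nix develop
```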

Once inside the nix develop shell, everything you need (the T interpreter, your specific versions of R and Python, and all project tools) is ready to use. You don’t need to manage virtual environments or Docker containers manually; T handles the heavy lifting via Nix under the hood.

You can browse examples on this repository.

Tooling and Editor Support

A language is only as good as its developer experience. I politely asked LLMs to implement a full Language Server Protocol (LSP) server for T, which provides autocompletion, real-time diagnostics, and “Go to Definition” support.

  • For VS Code / Positron: A dedicated extension providing syntax highlighting and LSP integration.
  • For Vim / Emacs: Detailed configuration guides and syntax files are available.
  • For Quarto: T is fully compatible with Quarto for literate programming, allowing you to run executable {t} chunks directly in your documents.

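As a sketch of the Quarto integration, a minimal .qmd document with an executable T chunk might look like this (the chunk contents are hypothetical):

````markdown
---
title: "My T report"
---

```{t}
# a hypothetical T chunk, executed when the document is rendered
print("Hello from T")
```
````

When the document is rendered, the chunk runs inside the project’s Nix environment, so the same package versions apply.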
For detailed setup instructions, check out the Editor Support guide in the official documentation.

There’s much more I haven’t covered here, so check out the official repository or the website.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
