I have a dataset with nine cities. I trained and tested four different machine learning models for each city. The results are in the tibble below:

set.seed(1)

result <- 
  tibble::tibble(city = letters[1:9],
                 m1_train = runif(9),
                 m1_test = runif(9),
                 m2_train = runif(9),
                 m2_test = runif(9),
                 m3_train = runif(9),
                 m3_test = runif(9),
                 m4_train = runif(9),
                 m4_test = runif(9))

result
#> # A tibble: 9 × 9
#>   city  m1_train m1_test m2_train m2_test m3_train m3_test m4_train m4_test
#>   <chr>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
#> 1 a        0.266  0.0618   0.380    0.382    0.794  0.789    0.0707  0.332 
#> 2 b        0.372  0.206    0.777    0.870    0.108  0.0233   0.0995  0.651 
#> 3 c        0.573  0.177    0.935    0.340    0.724  0.477    0.316   0.258 
#> 4 d        0.908  0.687    0.212    0.482    0.411  0.732    0.519   0.479 
#> 5 e        0.202  0.384    0.652    0.600    0.821  0.693    0.662   0.766 
#> 6 f        0.898  0.770    0.126    0.494    0.647  0.478    0.407   0.0842
#> 7 g        0.945  0.498    0.267    0.186    0.783  0.861    0.913   0.875 
#> 8 h        0.661  0.718    0.386    0.827    0.553  0.438    0.294   0.339 
#> 9 i        0.629  0.992    0.0134   0.668    0.530  0.245    0.459   0.839

In this tibble m1_train is the RMSE obtained by model 1 for the train set, m1_test is the RMSE obtained by model 1 for the test set and so on.

I'd like to create two new columns in my tibble:

  1. min_train is the minimum RMSE only for the columns that end with _train
  2. min_test is the minimum RMSE only for the columns that end with _test

I've tried many different approaches (rowwise(), mutate(vars(ends_with("_train"))) and others) without success.

How can I approach this problem?

4 Comments

  • e.g. min_train = do.call(pmin, result[endsWith(names(result), 'train')]); min_test = do.call(pmin, result[endsWith(names(result), 'test')]). This task does not require any external libraries. Commented Oct 28 at 14:36
  • Inexperienced users tend to prefer dplyr (due to its syntax) no matter the cost. I recommend never using rowwise(); when going with dplyr::mutate I would opt for mutate(min_train = Rfast::rowMins(as.matrix(across(ends_with('train'))), value=TRUE), ...) Commented Oct 28 at 15:04
  • Note that the docs of rowwise state "[...] This is most useful when a vectorised function doesn't exist. [...]". Commented Oct 28 at 15:10
  • base R, tidy (add column selection either with ends_with or grepl) or dt Commented Oct 28 at 15:32

3 Answers

For the record, combining Friede's pmin with dplyr is straightforward.

library(dplyr)
result |>
  mutate(
    min_train = do.call(pmin, pick(ends_with("train"))),
    min_test = do.call(pmin, pick(ends_with("test")))
  )
# # A tibble: 9 × 11
#   city  m1_train m1_test m2_train m2_test m3_train m3_test m4_train m4_test min_train min_test
#   <chr>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>     <dbl>    <dbl>
# 1 a        0.266  0.0618   0.380    0.382    0.794  0.789    0.0707  0.332     0.0707   0.0618
# 2 b        0.372  0.206    0.777    0.870    0.108  0.0233   0.0995  0.651     0.0995   0.0233
# 3 c        0.573  0.177    0.935    0.340    0.724  0.477    0.316   0.258     0.316    0.177 
# 4 d        0.908  0.687    0.212    0.482    0.411  0.732    0.519   0.479     0.212    0.479 
# 5 e        0.202  0.384    0.652    0.600    0.821  0.693    0.662   0.766     0.202    0.384 
# 6 f        0.898  0.770    0.126    0.494    0.647  0.478    0.407   0.0842    0.126    0.0842
# 7 g        0.945  0.498    0.267    0.186    0.783  0.861    0.913   0.875     0.267    0.186 
# 8 h        0.661  0.718    0.386    0.827    0.553  0.438    0.294   0.339     0.294    0.339 
# 9 i        0.629  0.992    0.0134   0.668    0.530  0.245    0.459   0.839     0.0134   0.245 

rowwise() makes some things easier and is perfectly fine for small data. As the data grows it can become significantly slower, though I wouldn't worry about that until you have many more rows. As an example, if result has 10,000 rows, the rowwise() method takes over 3 seconds, while the pmin code above is nearly instantaneous.
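
A rough way to check that timing claim on your own machine is sketched below. The 10,000-row tibble `big` and its two-model layout are assumptions made for illustration (they are not the data from the question), and the timings will vary with hardware and package versions, so no expected numbers are given.

```r
library(dplyr)

set.seed(1)
n <- 10000

# Illustrative data only: same column naming scheme as `result`, more rows
big <- tibble(
  city     = as.character(seq_len(n)),
  m1_train = runif(n), m1_test = runif(n),
  m2_train = runif(n), m2_test = runif(n)
)

# Row-by-row approach: min() is called once per row
t_rowwise <- system.time(
  out_rowwise <- big |>
    rowwise() |>
    mutate(
      min_train = min(c_across(ends_with("train"))),
      min_test  = min(c_across(ends_with("test")))
    ) |>
    ungroup()
)

# Vectorised approach: pmin() works on whole columns at once
t_pmin <- system.time(
  out_pmin <- big |>
    mutate(
      min_train = do.call(pmin, pick(ends_with("train"))),
      min_test  = do.call(pmin, pick(ends_with("test")))
    )
)

# Compare elapsed times (seconds)
c(rowwise = t_rowwise[["elapsed"]], pmin = t_pmin[["elapsed"]])
```

Both approaches should produce the same two columns; only the runtime differs.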

If you don't like do.call and really need to stick with tidyverse functions, you can replace it with purrr::invoke for the same results. The runtime is still fast, though with 10K-row data do.call is almost 50% faster than invoke (not sure why), and both are much faster than rowwise(). Edit: invoke is deprecated in favor of rlang::exec.
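
As a minimal sketch of the rlang::exec replacement, the snippet below splices a list of columns into pmin with `!!!`. The tibble `toy` and its values are made up for illustration. The call is made outside mutate() on purpose, to avoid any interaction with mutate()'s own handling of `!!!`; treat that placement as an assumption of this sketch rather than the only way to do it.

```r
library(dplyr)

# Toy data (made up for illustration), same layout as `result`
toy <- tibble(
  city     = c("a", "b", "c"),
  m1_train = c(0.30, 0.50, 0.90),
  m1_test  = c(0.40, 0.20, 0.70),
  m2_train = c(0.10, 0.80, 0.60),
  m2_test  = c(0.35, 0.25, 0.95)
)

# Select the columns first, then splice them into pmin() as arguments
train_cols <- as.list(select(toy, ends_with("train")))
test_cols  <- as.list(select(toy, ends_with("test")))

toy$min_train <- rlang::exec(pmin, !!!train_cols)
toy$min_test  <- rlang::exec(pmin, !!!test_cols)

toy
```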


6 Comments

Friede, perhaps instead of "perfectly fine" I could use "acceptable", but more-so as a distinction between "always implement best-practices and the fastest code" and "welcome to R, this is one way".
Friede, I don't understand your issue here. Are you taking so much issue with my use of "perfectly fine" that you are soap-boxing base-R on a question tagged with dplyr? Or are you defending against perceived criticism? For the latter, I think nobody is suggesting anything from your comment or answer. For the former, there is definitely value in showing different "dialects" in answers, but I do think that the original requested dialect should be addressed.
@Friede, I apologize for misinterpreting your objection. Other than the potential for hyperbole in "perfectly fine", is there another point?
Just curious: weren't invoke and lift deprecated in favor of exec?
Yes it was, thanks @Onyambu

The various other answers are good. Since you were initially trying rowwise, and I find people often stumble over getting the syntax right using rowwise, this will work using c_across:

result |> 
  rowwise() |> 
  mutate(
    min_train = min(c_across(ends_with("train"))),
    min_test = min(c_across(ends_with("test")))
  ) |> 
  ungroup()


Sticking to your preferred library, sometimes what we are really looking for is which model achieved the minimum for each city and set:

result |>
  tidyr::pivot_longer(cols=-city, names_to=c('mod', 'set'),                 
                      names_pattern='(m\\d+)_(train|test)', values_to='val') |>
  dplyr::filter(val==min(val), .by=c(city, set)) 

-output

# A tibble: 18 × 4
   city  mod   set      val
   <chr> <chr> <chr>  <dbl>
 1 a     m1    test  0.0618
 2 a     m4    train 0.0707
 3 b     m3    test  0.0233
 4 b     m4    train 0.0995
 5 c     m1    test  0.177 
 6 c     m4    train 0.316 
 7 d     m2    train 0.212 
 8 d     m4    test  0.479 
 9 e     m1    train 0.202 
10 e     m1    test  0.384 
11 f     m2    train 0.126 
12 f     m4    test  0.0842
13 g     m2    train 0.267 
14 g     m2    test  0.186 
15 h     m4    train 0.294 
16 h     m4    test  0.339 
17 i     m2    train 0.0134
18 i     m3    test  0.245 
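
If the two wide columns min_train and min_test are still wanted, the long format can be summarised and pivoted back. This is a sketch on made-up toy data (`toy`, `mins` and `toy_with_min` are illustrative names, not from the thread), and it uses names_sep instead of the names_pattern shown above.

```r
library(dplyr)
library(tidyr)

# Toy data (made up for illustration), same layout as `result`
toy <- tibble(
  city     = c("a", "b", "c"),
  m1_train = c(0.30, 0.50, 0.90),
  m1_test  = c(0.40, 0.20, 0.70),
  m2_train = c(0.10, 0.80, 0.60),
  m2_test  = c(0.35, 0.25, 0.95)
)

mins <- toy |>
  # "m1_train" becomes mod = "m1", set = "train"
  pivot_longer(cols = -city, names_to = c("mod", "set"),
               names_sep = "_", values_to = "val") |>
  # one minimum per city and set
  summarise(min = min(val), .by = c(city, set)) |>
  # back to one row per city, with columns min_train and min_test
  pivot_wider(names_from = set, values_from = min, names_prefix = "min_")

toy_with_min <- left_join(toy, mins, by = "city")
toy_with_min
```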

EDIT to add the suggestion from comment below question:

result |> 
  transform(
    min_train = do.call('pmin', result[endsWith(names(result), 'train')]),    
    min_test = do.call('pmin', result[endsWith(names(result), 'test')])
    )
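
The two nearly identical do.call lines can be folded into a small base R helper. `add_row_min` is a hypothetical function written for this sketch (it is not part of any package), and the data frame `toy` is made up for illustration.

```r
# Hypothetical helper: add a min_<suffix> column holding the row-wise
# minimum over all columns whose names end with "_<suffix>"
add_row_min <- function(df, suffix) {
  cols <- endsWith(names(df), paste0("_", suffix))
  stopifnot(any(cols))  # fail early if no column matches
  df[[paste0("min_", suffix)]] <- do.call(pmin, unname(as.list(df[cols])))
  df
}

# Toy data (made up for illustration)
toy <- data.frame(
  city     = c("a", "b", "c"),
  m1_train = c(0.30, 0.50, 0.90),
  m1_test  = c(0.40, 0.20, 0.70),
  m2_train = c(0.10, 0.80, 0.60),
  m2_test  = c(0.35, 0.25, 0.95)
)

toy <- add_row_min(toy, "train")
toy <- add_row_min(toy, "test")
toy
```

The suffix is matched with a leading underscore so that a column such as min_test is only picked up by the "test" call, never by the "train" call.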
