Using purrr::map2() with dbplyr

Question

I'm trying to select rows from one table ("positions") whose values in a particular column ("position") fall within the ranges defined in another table ("my_ranges"), and then add a grouping tag from the "my_ranges" table.

I can do this using tibbles and a couple of purrr::map2 calls, but the same approach doesn't work with dbplyr database-backed tibbles. Is this expected behavior, and if so, is there a different approach that I should take to use dbplyr for this kind of task?

Here's my example:

library("tidyverse")
set.seed(42)

my_ranges <-
  tibble(
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)
    )

positions <-
  tibble(
    position = as.integer(runif(n = 100, min = 0, max = 30)),
    annotation = stringi::stri_rand_strings(n = 100, length = 10)
  )

# note: this works as I expect and returns a tibble with 106 obs of 3 variables:
result <- map2(.x = my_ranges$start, .y = my_ranges$end,
             .f = function(x, y) {between(positions$position, x, y)}) %>%
  map2(.y = my_ranges$group_id,
              .f = function(x, y){
                positions %>%
                  filter(x) %>%
                  mutate(group_id = y)}
) %>% bind_rows()

# next, make an in-memory db for testing:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# copy data to db
copy_to(con, my_ranges, "my_ranges", temporary = FALSE)
copy_to(con, positions, "positions", temporary = FALSE)

# get db-backed tibbles:
my_ranges_db <- tbl(con, "my_ranges")
positions_db <- tbl(con, "positions")

# note: this does not work as I expect, and instead returns a tibble with 0 observations of 0 variables:
# database range-based query:
db_result <- map2(.x = my_ranges_db$start, .y = my_ranges_db$end,
                  .f = function(x, y) {
                    between(positions_db$position, x, y)
                    }) %>%
  map2(.y = my_ranges_db$group_id,
       .f = function(x, y){
         positions_db %>%
           filter(x) %>%
           mutate(group_id = y)}
  ) %>% bind_rows()

I can do this in straight SQL:

range_query <- paste0("
  SELECT ranges.group_id as group_id,
         positions.position as position,
         positions.annotation as annotation
  FROM (SELECT * FROM my_ranges) AS ranges, positions
  WHERE positions.position BETWEEN ranges.start AND ranges.end;
")
results <- DBI::dbGetQuery(con, range_query)
– bheavner, Commented Mar 26, 2018 at 18:36

edgararuiz · Accepted Answer · 2018-09-22 21:52:17Z

As long as each iteration returns a table with the same columns, there may be a neat way to push the entire operation to the database. The idea is to use both map() and reduce() from purrr. Each tbl_sql() operation is lazy, so we can iterate through them without worrying about sending a bunch of queries to the database. We can then combine them with union(), which appends the SQL generated by each iteration to the next using the database's UNION clause. Here's an example:

library(dbplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(DBI, warn.conflicts = FALSE)
library(rlang, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

db_mtcars <- copy_to(con, mtcars)

cyls <- c(4, 6, 8)

all <- cyls %>%
  map(~{
    db_mtcars %>%
      filter(cyl == .x) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE))
  }) %>%
  reduce(function(x, y) union(x, y)) 

all
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.22.0 []
#>     mpg
#>   <dbl>
#> 1  15.1
#> 2  19.7
#> 3  26.7

show_query(all)
#> <SQL>
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 4.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 6.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 8.0))

dbDisconnect(con)
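
The same pattern can be sketched for the range problem in the question. This is only an illustration, not something tested against the asker's database: it assumes the ranges table is small enough to keep locally, uses a five-row stand-in for positions, and swaps union() for union_all() so that duplicate rows are kept. The names range_query and range_result are made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Local ranges (assumed small); only positions lives in the database
my_ranges <- tibble(
  group_id = c("a", "b", "c", "d"),
  start = c(1, 7, 2, 25),
  end = c(5, 23, 7, 29)
)
positions_db <- copy_to(con, tibble(position = c(0L, 3L, 7L, 24L, 26L)), "positions")

# One lazy query per range; !! inlines the local values into the SQL
range_query <- pmap(my_ranges, function(group_id, start, end) {
  positions_db %>%
    filter(between(position, !!start, !!end)) %>%
    mutate(group_id = !!group_id)
}) %>%
  reduce(union_all)

# Nothing is sent to the database until collect()
range_result <- collect(range_query)

DBI::dbDisconnect(con)
```

Because the queries are glued together with UNION ALL, the SQL grows with the number of ranges, so this is probably only reasonable for a handful of them.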

High quality answer right here, does what you'd like it to do!

moodymudskipper · Accepted Answer · 2018-03-27 22:19:55Z

dbplyr translates R into SQL. Lists don't exist in SQL. map creates lists. Thus it's impossible to translate map into SQL.

Mainly dplyr functions and some base functions are translated; as I understand it, they're working on tidyr functions too. When using dbplyr, try to follow SQL logic in your approach, or it will easily break.
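
As a rough sketch of what "SQL logic" could look like for the range question: join the two tables inside the database and filter on the range, instead of mapping over columns. The join_key column is a made-up helper that emulates a cross join, and the five-row positions table is a stand-in for the real data, so treat this as one possible approach under those assumptions rather than the definitive way to do it.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

my_ranges_db <- copy_to(con, tibble(
  group_id = c("a", "b", "c", "d"),
  start = c(1, 7, 2, 25),
  end = c(5, 23, 7, 29)
), "my_ranges")

positions_db <- copy_to(con, tibble(
  position = c(0L, 3L, 7L, 24L, 26L),
  annotation = c("v", "w", "x", "y", "z")
), "positions")

# Constant key on both sides pairs every position with every range;
# the filter then keeps only pairs where the position is in the range
joined <- positions_db %>%
  mutate(join_key = 1L) %>%
  inner_join(mutate(my_ranges_db, join_key = 1L), by = "join_key") %>%
  filter(between(position, start, end)) %>%
  select(group_id, position, annotation)

show_query(joined)            # inspect the generated SQL
join_result <- collect(joined)

DBI::dbDisconnect(con)
```

Everything up to collect() is a single lazy query, so the matching is done by the database rather than in R.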

edited Mar 27, 2018 at 22:19

answered Mar 27, 2018 at 20:22


5 Comments

bheavner Over a year ago

Thanks! That helps clarify my thinking!

user63230 Over a year ago

so is there no way of running something simple like this remotely iris_db %>% sapply(unique) (iris_db %>% map(unique))?

moodymudskipper Over a year ago

Try something like map(names(iris_db), ~select_at(iris_db,.x) %>% distinct() %>% collect()). It will run a separate sql query on each column

user63230 Over a year ago

great thanks, I was only able to get it to work though using this type of approach based on the vignette example here: map(dbListFields(con, "flights"), ~select_at(flights_db,.x) %>% distinct() %>% collect()) I had to replace names(flights_db) with dbListFields(con, "flights"). Is there a way to use the names command on a db dataset to print column names? It's a lot more intuitive than typing dbListFields

moodymudskipper Over a year ago

Oh yes sorry. I think colnames() would work too.
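
Putting the comments above together, a minimal sketch with colnames() might look like the following. It uses two columns of mtcars instead of iris or flights, and select(all_of(.x)) with pull() in place of select_at() and collect(); those substitutions and the name distinct_values are choices made for this example.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
mtcars_db <- copy_to(con, mtcars[, c("cyl", "gear")], "mtcars_small")

# colnames() returns the remote table's column names
cols <- colnames(mtcars_db)

# One SELECT DISTINCT query per column; pull() fetches the values as a vector
distinct_values <- cols %>%
  set_names() %>%
  map(~ mtcars_db %>% select(all_of(.x)) %>% distinct() %>% pull())

DBI::dbDisconnect(con)
```

The iteration itself still happens in R; only the per-column distinct() is translated to SQL.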
