3

I'm trying to select lines from one table ("positons") with values for a particular column ("position") that fall within the ranges defined in another ("my_ranges") table, and then to add a grouping tag from the "my_ranges" table.

I can do this using tibbles and a couple purrr::map2 calls, but the same approach doesn't work with dbplyr database-tibbles. Is this expected behavior, and if so, is there a different approach that I should take to use dbplyr for this kind of task?

Here's my example:

library("tidyverse")
set.seed(42)

my_ranges <-
  tibble(
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)
    )

positions <-
  tibble(
    position = as.integer(runif(n = 100, min = 0, max = 30)),
    annotation = stringi::stri_rand_strings(n = 100, length = 10)
  )

# note: this works as I expect and returns a tibble with 106 obs of 3 variables:
result <- map2(.x = my_ranges$start, .y = my_ranges$end,
             .f = function(x, y) {between(positions$position, x, y)}) %>%
  map2(.y = my_ranges$group_id,
              .f = function(x, y){
                positions %>%
                  filter(x) %>%
                  mutate(group_id = y)}
) %>% bind_rows()

# next, make an in-memory db for testing:
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")

# copy data to db
copy_to(con, my_ranges, "my_ranges", temporary = FALSE)
copy_to(con, positions, "positions", temporary = FALSE)

# get db-backed tibbles:
my_ranges_db <- tbl(con, "my_ranges")
positions_db <- tbl(con, "positions")

# note: this does not work as I expect, and instead returns a tibble with 0 obsevations of 0 variables:
# database range-based query:
db_result <- map2(.x = my_ranges_db$start, .y = my_ranges_db$end,
                  .f = function(x, y) {
                    between(positions_db$position, x, y)
                    }) %>%
  map2(.y = my_ranges_db$group_id,
       .f = function(x, y){
         positions_db %>%
           filter(x) %>%
           mutate(group_id = y)}
  ) %>% bind_rows()
1
  • I can do this in straight SQL: range_query <- paste0(" SELECT ranges.group_id as group_id, positions.position as position, positions.annotation as annotation FROM (SELECT * FROM my_ranges) AS ranges, positions WHERE positions.position BETWEEN ranges.start AND ranges.end; ") results <- DBI::dbGetQuery(con, range_query) Commented Mar 26, 2018 at 18:36

2 Answers 2

6

As long as each iteration creates a table of the same dimensions, then there may be a neat way of pushing the entire operation to the database. The idea is to use both map() and reduce() from purrr. Each tbl_sql() operation is lazy, so we can iterate through them without worrying of sending a bunch of queries, and then we can use union() which will basically append the resulting SQL from each iteration to the next using the UNION clause from the given database. Here's an example:

library(dbplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(DBI, warn.conflicts = FALSE)
library(rlang, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")

db_mtcars <- copy_to(con, mtcars)

cyls <- c(4, 6, 8)

all <- cyls %>%
  map(~{
    db_mtcars %>%
      filter(cyl == .x) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE)
      )
  }) %>%
  reduce(function(x, y) union(x, y)) 

all
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.22.0 []
#>     mpg
#>   <dbl>
#> 1  15.1
#> 2  19.7
#> 3  26.7

show_query(all)
#> <SQL>
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 4.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 6.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 8.0))

dbDisconnect(con)
Sign up to request clarification or add additional context in comments.

1 Comment

High quality answer right here, does what you'd like it to do!
3

dbplyr translates R into SQL. Lists don't exist in SQL. map creates lists. Thus it's impossible to translate map into SQL.

Mainly dplyr functions and some base functions are translated, they're working on tidyr functions too as I understood. When using dbplyr try to have an SQL logic in your approach or it will easily break.

5 Comments

Thanks! That helps clarify my thinking!
so is there no way of running something simple like this remotely iris_db %>% sapply(unique) (iris_db %>% map(unique))?
Try something like map(names(iris_db), ~select_at(iris_db,.x) %>% distinct() %>% collect()). It will run a separate sql query on each column
great thanks, I was only able to get it to work though using this type of approach based on the vignette example here: map(dbListFields(con, "flights"), ~select_at(flights_db,.x) %>% distinct() %>% collect()) I had to replace names(flights_db) with dbListFields(con, "flights"). Is there a way to use names command on a db dataset to print column names? Its a lot more intuitive than typing dbListFields
Oh yes sorry. I think colnames() would work too.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.