ARROW-7641: [R] Make dataset vignette have executable code

nealrichardson · hadley · nealrichardson · commit 5bdb3afaccb6 · 2020-04-01T16:30:16.000-07:00
This patch makes the dataset vignette executable, yet it also leaves in the expected output and conditionally shows it if the taxi data is not found locally. That way, the rendered vignette always looks useful. In addition, the dataset print method is improved to report the number and format of the files it contains.

Closes apache#6247 from hadley/vignette-eval

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
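
To illustrate the improved print title described above, here is a minimal standalone sketch of the same string-building logic, written in base R so it runs without `arrow` installed. The function name `dataset_title()` and its arguments are hypothetical stand-ins for `self$files` and `self$format$type`; the `%||%` helper is defined locally here as an assumption about its usual "default if NULL" meaning.

```r
# Local definition of the null-default operator (assumed semantics:
# return `y` when `x` is NULL, otherwise `x`).
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical helper mirroring the title logic in this patch.
# `class_name`, `nfiles`, and `file_type` are illustrative arguments,
# not part of the arrow API.
dataset_title <- function(class_name, nfiles, file_type) {
  # Look up a display name; `[[` on a list returns NULL for unknown names
  pretty_file_type <- list(
    parquet = "Parquet",
    ipc = "Feather"
  )[[file_type]]

  paste(
    class_name,
    "with",
    nfiles,
    pretty_file_type %||% file_type,  # fall back to the raw type name
    ifelse(nfiles == 1, "file", "files")
  )
}

dataset_title("FileSystemDataset", 2, "parquet")
dataset_title("FileSystemDataset", 1, "ipc")
```

The fallback to the raw `file_type` means a format without a display name still produces a readable title rather than an error.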
diff --git a/r/.gitignore b/r/.gitignore
@@ -13,5 +13,6 @@ src/Makevars
 src/Makevars.win
 windows/
 libarrow/
+vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R
@@ -39,6 +39,8 @@
       s3_register(m, cl)
     }
   }
+
+  s3_register("dplyr::tbl_vars", "arrow_dplyr_query")
   s3_register("reticulate::py_to_r", "pyarrow.lib.Array")
   s3_register("reticulate::py_to_r", "pyarrow.lib.RecordBatch")
   s3_register("reticulate::r_to_py", "Array")
@@ -81,8 +83,14 @@ ArrowObject <- R6Class("ArrowObject",
       }
       assign(".:xp:.", xp, envir = self)
     },
-    print = function(...){
-      cat(class(self)[[1]], "\n", sep = "")
+    print = function(...) {
+      if (!is.null(self$.class_title)) {
+        # Allow subclasses to override just printing the class name first
+        class_title <- self$.class_title()
+      } else {
+        class_title <- class(self)[[1]]
+      }
+      cat(class_title, "\n", sep = "")
       if (!is.null(self$ToString)){
         cat(self$ToString(), "\n", sep = "")
       }
diff --git a/r/R/dataset.R b/r/R/dataset.R
@@ -153,6 +153,24 @@ dim.Dataset <- function(x) c(x$num_rows, x$num_cols)
 #' @rdname Dataset
 #' @export
 FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
+  public = list(
+    .class_title = function() {
+      nfiles <- length(self$files)
+      file_type <- self$format$type
+      pretty_file_type <- list(
+        parquet = "Parquet",
+        ipc = "Feather"
+      )[[file_type]]
+
+      paste(
+        class(self)[[1]],
+        "with",
+        nfiles,
+        pretty_file_type %||% file_type,
+        ifelse(nfiles == 1, "file", "files")
+      )
+    }
+  ),
   active = list(
     #' @description
     #' Return the files contained in this `FileSystemDataset`
diff --git a/r/R/dplyr.R b/r/R/dplyr.R
@@ -90,6 +90,8 @@ dim.arrow_dplyr_query <- function(x) {
 }
 
 # The following S3 methods are registered on load if dplyr is present
+tbl_vars.arrow_dplyr_query <- function(x) names(x$selected_columns)
+
 select.arrow_dplyr_query <- function(.data, ...) {
   column_select(arrow_dplyr_query(.data), !!!enquos(...))
 }
diff --git a/r/tests/testthat/test-dataset.R b/r/tests/testthat/test-dataset.R
@@ -336,7 +336,7 @@ test_that("Dataset and query print methods", {
   expect_output(
     print(ds),
     paste(
-      "Dataset",
+      "FileSystemDataset with 2 Parquet files",
       "int: int32",
       "dbl: double",
       "lgl: bool",
diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd
@@ -23,49 +23,65 @@ is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
 of about 10 years of the trip data in a public S3 bucket.
 
+The total file size is around 37 gigabytes, even in the efficient Parquet file format.
+That's bigger than memory on most people's computers,
+so we can't just read it all in and stack it into a single data frame.
+
 In a future release, you'll be able to point your R session at S3 and query
 the dataset from there. For now, datasets need to be on your local file system.
 To download the files,
 
-```r
+```{r, eval = FALSE}
 bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
 dir.create("nyc-taxi")
 for (year in 2009:2019) {
-  for (month in 1:12) {
+  dir.create(file.path("nyc-taxi", year))
+  if (year == 2019) {
+    # We only have through June 2019 there
+    months <- 1:6
+  } else {
+    months <- 1:12
+  }
+  for (month in months) {
     if (month < 10) {
       month <- paste0("0", month)
     }
-    try(download.file(
+    dir.create(file.path("nyc-taxi", year, month))
+    download.file(
       paste(bucket, year, month, "data.parquet", sep = "/"),
       file.path("nyc-taxi", year, month, "data.parquet")
-    ))
+    )
   }
 }
 ```
 
-It is expected that some files will not download because they do not exist--December 2019,
-for example--hence the `try()`.
-The total file size is around 37 gigabytes, even in the efficient Parquet file format.
-That's bigger than memory on most people's computers,
-so we can't just read it all in and stack it into a single data frame.
-
+Note that the vignette will not execute that code chunk: if you want to run
+with live data, you'll have to do it yourself separately.
 Given the size, if you're running this locally and don't have a fast connection,
 feel free to grab only a year or two of data.
 
+If you don't have the taxi data downloaded, the vignette will still run and will
+yield previously cached output for reference. To be explicit about which version
+is running, let's check whether we're running with live data:
+
+```{r}
+dir.exists("nyc-taxi")
+```
+
 ## Getting started
 
 Because `dplyr` is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
 we need to load both `arrow` and `dplyr`.
 
-```r
-library(arrow)
-library(dplyr)
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
 The first step is to create our Dataset object, pointing at the directory of data.
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
@@ -92,10 +108,12 @@ and 1 for "month", even though those columns may not actually be present in the
 Indeed, when we look at the dataset, we see that in addition to the columns present
 in every file, there are also columns "year" and "month".
 
-```
+```{r, eval = file.exists("nyc-taxi")}
 ds
-
-## Dataset
+```
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
+## FileSystemDataset with 125 Parquet files
 ## vendor_id: string
 ## pickup_at: timestamp[us]
 ## dropoff_at: timestamp[us]
@@ -122,6 +140,7 @@ ds
 ## month: int32
 
 See $metadata for additional Schema metadata
+")
 ```
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
@@ -160,7 +179,7 @@ Here's an example. Suppose I was curious about tipping behavior among the
 longest taxi rides. Let's find the median tip percentage for rides with
 fares greater than $100 in 2015, broken down by the number of passengers:
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 system.time(ds %>%
   filter(total_amount > 100, year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
@@ -173,7 +192,8 @@ system.time(ds %>%
   print())
 ```
 
-```
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
 ## # A tibble: 10 x 3
 ##    passenger_count tip_pct      n
 ##              <int>   <dbl>  <int>
@@ -189,31 +209,34 @@ system.time(ds %>%
 ## 10               9   16.7      42
 ##
 ##    user  system elapsed
-##  25.227   1.162   3.767
+##   4.436   1.012   1.402
+")
 ```
 
 We just selected a window out of a dataset with around 2 billion rows
-and aggregated on it in under 4 seconds on my laptop. How does this work?
+and aggregated on it in under 2 seconds on my laptop. How does this work?
 
 First, `select()`/`rename()`, `filter()`, and `group_by()`
 record their actions but don't evaluate on the data until you run `collect()`.
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 ds %>%
   filter(total_amount > 100, year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
   group_by(passenger_count)
 ```
 
-```
-## Dataset (query)
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
+## FileSystemDataset (query)
 ## tip_amount: float
 ## total_amount: float
 ## passenger_count: int8
 ##
 ## * Filter: ((total_amount > 100:double) and (year == 2015:double))
 ## * Grouped by passenger_count
 ## See $.data for the source Arrow object
+")
 ```
 
 This returns instantly and shows the window selection you've made, without
@@ -257,4 +280,4 @@ In the future, when there is support for cloud storage and other file formats,
 this would mean you could point to an S3 bucked of Parquet data and a directory
 of CSVs on the local file system and query them together as a single dataset.
 To create a multi-source dataset, provide a list of sources to `open_dataset()`
-instead of a file path. See `?open_source` for creating data sources. 
+instead of a file path. See `?dataset_factory` for creating data sources.
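
The vignette changes in this diff pair each live chunk (`eval = file.exists("nyc-taxi")`) with a cached-output chunk (`eval = !file.exists("nyc-taxi")`). As a rough plain-R analogue of that pattern, and not what the vignette itself does (it relies on knitr chunk options), the sketch below uses a hypothetical helper `show_live_or_cached()` that runs a live computation when the data directory exists and otherwise prints stored text.

```r
# Hypothetical helper (not part of arrow or knitr): run `live()` when
# `data_dir` exists, otherwise print the `cached` text instead.
# Invisibly returns TRUE if the live branch ran, FALSE otherwise.
show_live_or_cached <- function(data_dir, live, cached) {
  has_data <- dir.exists(data_dir)
  if (has_data) {
    live()
  } else {
    cat(cached, "\n", sep = "")
  }
  invisible(has_data)
}

# Usage: with no "nyc-taxi" directory present, the cached text is printed.
show_live_or_cached(
  "nyc-taxi",
  live = function() cat("(live output would be computed here)\n"),
  cached = "## FileSystemDataset with 125 Parquet files"
)
```

Checking for the directory once and branching on the result keeps the two code paths mutually exclusive, just as the complementary `eval` conditions do in the vignette.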
