Commit 5bdb3af

ARROW-7641: [R] Make dataset vignette have executable code

This patch makes the dataset vignette executable, yet it also leaves in the expected output and conditionally shows it if the taxi data is not found locally. That way, the rendered vignette always looks useful. In addition, the dataset print method is improved to report the number and format of the files it contains.

Closes apache#6247 from hadley/vignette-eval

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
1 parent e381a72 commit 5bdb3af

6 files changed

Lines changed: 80 additions & 28 deletions

r/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -13,5 +13,6 @@ src/Makevars
 src/Makevars.win
 windows/
 libarrow/
+vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz

r/R/arrow-package.R

Lines changed: 10 additions & 2 deletions
@@ -39,6 +39,8 @@
       s3_register(m, cl)
     }
   }
+
+  s3_register("dplyr::tbl_vars", "arrow_dplyr_query")
   s3_register("reticulate::py_to_r", "pyarrow.lib.Array")
   s3_register("reticulate::py_to_r", "pyarrow.lib.RecordBatch")
   s3_register("reticulate::r_to_py", "Array")
@@ -81,8 +83,14 @@ ArrowObject <- R6Class("ArrowObject",
       }
       assign(".:xp:.", xp, envir = self)
     },
-    print = function(...){
-      cat(class(self)[[1]], "\n", sep = "")
+    print = function(...) {
+      if (!is.null(self$.class_title)) {
+        # Allow subclasses to override just printing the class name first
+        class_title <- self$.class_title()
+      } else {
+        class_title <- class(self)[[1]]
+      }
+      cat(class_title, "\n", sep = "")
       if (!is.null(self$ToString)){
         cat(self$ToString(), "\n", sep = "")
       }
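
The reworked `print()` method above relies on an optional hook: if the object defines `.class_title()`, its return value is printed as the first line; otherwise the class name is used. A minimal sketch of that pattern, assuming the R6 package is installed. The classes `Base`, `Plain`, and `Titled` are hypothetical illustrations and are not part of arrow:

```r
library(R6)

# Hypothetical base class: prints a title line and lets subclasses
# override it by defining a `.class_title()` method.
Base <- R6Class("Base",
  public = list(
    print = function(...) {
      if (!is.null(self$.class_title)) {
        # Hook is present: use the subclass-provided title
        class_title <- self$.class_title()
      } else {
        # No hook: fall back to the most specific class name
        class_title <- class(self)[[1]]
      }
      cat(class_title, "\n", sep = "")
      invisible(self)
    }
  )
)

# No hook defined, so the class name is printed.
Plain <- R6Class("Plain", inherit = Base)

# Hook defined, so the custom title replaces the class name.
Titled <- R6Class("Titled", inherit = Base,
  public = list(
    .class_title = function() "Titled with 3 widgets"
  )
)

plain_out <- capture.output(Plain$new()$print())
titled_out <- capture.output(Titled$new()$print())
```

Because an R6 object is an environment, `self$.class_title` evaluates to `NULL` when no such member exists, which is what makes the `is.null()` check a cheap way to detect the hook.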

r/R/dataset.R

Lines changed: 18 additions & 0 deletions
@@ -153,6 +153,24 @@ dim.Dataset <- function(x) c(x$num_rows, x$num_cols)
 #' @rdname Dataset
 #' @export
 FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
+  public = list(
+    .class_title = function() {
+      nfiles <- length(self$files)
+      file_type <- self$format$type
+      pretty_file_type <- list(
+        parquet = "Parquet",
+        ipc = "Feather"
+      )[[file_type]]
+
+      paste(
+        class(self)[[1]],
+        "with",
+        nfiles,
+        pretty_file_type %||% file_type,
+        ifelse(nfiles == 1, "file", "files")
+      )
+    }
+  ),
   active = list(
     #' @description
     #' Return the files contained in this `FileSystemDataset`
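
The title logic in `.class_title()` can be tried without a dataset. The sketch below is a stand-alone approximation: `dataset_title()` is a hypothetical helper, not an arrow function, and `%||%` is defined locally on the assumption that it behaves as the usual "use the right-hand side if the left is `NULL`" operator.

```r
# Null-default operator, defined here so the sketch is self-contained
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical stand-alone version of the title-building logic
dataset_title <- function(class_name, nfiles, file_type) {
  # Look up a display name; unknown types yield NULL from the list lookup
  pretty_file_type <- list(
    parquet = "Parquet",
    ipc = "Feather"
  )[[file_type]]

  paste(
    class_name,
    "with",
    nfiles,
    pretty_file_type %||% file_type,  # fall back to the raw type name
    if (nfiles == 1) "file" else "files"
  )
}

dataset_title("FileSystemDataset", 2, "parquet")
# "FileSystemDataset with 2 Parquet files"
dataset_title("FileSystemDataset", 1, "ipc")
# "FileSystemDataset with 1 Feather file"
dataset_title("FileSystemDataset", 3, "csv")
# "FileSystemDataset with 3 csv files"
```

Indexing a list with `[[` and a name it does not contain returns `NULL`, so formats without a display name fall through to the raw type string.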

r/R/dplyr.R

Lines changed: 2 additions & 0 deletions
@@ -90,6 +90,8 @@ dim.arrow_dplyr_query <- function(x) {
 }
 
 # The following S3 methods are registered on load if dplyr is present
+tbl_vars.arrow_dplyr_query <- function(x) names(x$selected_columns)
+
 select.arrow_dplyr_query <- function(.data, ...) {
   column_select(arrow_dplyr_query(.data), !!!enquos(...))
 }

r/tests/testthat/test-dataset.R

Lines changed: 1 addition & 1 deletion
@@ -336,7 +336,7 @@ test_that("Dataset and query print methods", {
   expect_output(
     print(ds),
     paste(
-      "Dataset",
+      "FileSystemDataset with 2 Parquet files",
       "int: int32",
       "dbl: double",
       "lgl: bool",

r/vignettes/dataset.Rmd

Lines changed: 48 additions & 25 deletions
@@ -23,49 +23,65 @@ is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
 of about 10 years of the trip data in a public S3 bucket.
 
+The total file size is around 37 gigabytes, even in the efficient Parquet file format.
+That's bigger than memory on most people's computers,
+so we can't just read it all in and stack it into a single data frame.
+
 In a future release, you'll be able to point your R session at S3 and query
 the dataset from there. For now, datasets need to be on your local file system.
 To download the files,
 
-```r
+```{r, eval = FALSE}
 bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
 dir.create("nyc-taxi")
 for (year in 2009:2019) {
-  for (month in 1:12) {
+  dir.create(file.path("nyc-taxi", year))
+  if (year == 2019) {
+    # We only have through June 2019 there
+    months <- 1:6
+  } else {
+    months <- 1:12
+  }
+  for (month in months) {
     if (month < 10) {
       month <- paste0("0", month)
     }
-    try(download.file(
+    dir.create(file.path("nyc-taxi", year, month))
+    download.file(
       paste(bucket, year, month, "data.parquet", sep = "/"),
       file.path("nyc-taxi", year, month, "data.parquet")
-    ))
+    )
   }
 }
 ```
 
-It is expected that some files will not download because they do not exist--December 2019,
-for example--hence the `try()`.
-The total file size is around 37 gigabytes, even in the efficient Parquet file format.
-That's bigger than memory on most people's computers,
-so we can't just read it all in and stack it into a single data frame.
-
+Note that the vignette will not execute that code chunk: if you want to run
+with live data, you'll have to do it yourself separately.
 Given the size, if you're running this locally and don't have a fast connection,
 feel free to grab only a year or two of data.
 
+If you don't have the taxi data downloaded, the vignette will still run and will
+yield previously cached output for reference. To be explicit about which version
+is running, let's check whether we're running with live data:
+
+```{r}
+dir.exists("nyc-taxi")
+```
+
 ## Getting started
 
 Because `dplyr` is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
 we need to load both `arrow` and `dplyr`.
 
-```r
-library(arrow)
-library(dplyr)
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
 The first step is to create our Dataset object, pointing at the directory of data.
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
@@ -92,10 +108,12 @@ and 1 for "month", even though those columns may not actually be present in the
 Indeed, when we look at the dataset, we see that in addition to the columns present
 in every file, there are also columns "year" and "month".
 
-```
+```{r, eval = file.exists("nyc-taxi")}
 ds
-
-## Dataset
+```
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
+## FileSystemDataset with 125 Parquet files
 ## vendor_id: string
 ## pickup_at: timestamp[us]
 ## dropoff_at: timestamp[us]
@@ -122,6 +140,7 @@ ds
 ## month: int32
 
 See $metadata for additional Schema metadata
+")
 ```
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
@@ -160,7 +179,7 @@ Here's an example. Suppose I was curious about tipping behavior among the
 longest taxi rides. Let's find the median tip percentage for rides with
 fares greater than $100 in 2015, broken down by the number of passengers:
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 system.time(ds %>%
   filter(total_amount > 100, year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
@@ -173,7 +192,8 @@ system.time(ds %>%
   print())
 ```
 
-```
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
 ## # A tibble: 10 x 3
 ## passenger_count tip_pct n
 ## <int> <dbl> <int>
@@ -189,31 +209,34 @@ system.time(ds %>%
 ## 10 9 16.7 42
 ##
 ## user system elapsed
-## 25.227 1.162 3.767
+## 4.436 1.012 1.402
+")
 ```
 
 We just selected a window out of a dataset with around 2 billion rows
-and aggregated on it in under 4 seconds on my laptop. How does this work?
+and aggregated on it in under 2 seconds on my laptop. How does this work?
 
 First, `select()`/`rename()`, `filter()`, and `group_by()`
 record their actions but don't evaluate on the data until you run `collect()`.
 
-```r
+```{r, eval = file.exists("nyc-taxi")}
 ds %>%
   filter(total_amount > 100, year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
   group_by(passenger_count)
 ```
 
-```
-## Dataset (query)
+```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
+cat("
+## FileSystemDataset (query)
 ## tip_amount: float
 ## total_amount: float
 ## passenger_count: int8
 ##
 ## * Filter: ((total_amount > 100:double) and (year == 2015:double))
 ## * Grouped by passenger_count
 ## See $.data for the source Arrow object
+")
 ```
 
 This returns instantly and shows the window selection you've made, without
@@ -257,4 +280,4 @@ In the future, when there is support for cloud storage and other file formats,
 this would mean you could point to an S3 bucked of Parquet data and a directory
 of CSVs on the local file system and query them together as a single dataset.
 To create a multi-source dataset, provide a list of sources to `open_dataset()`
-instead of a file path. See `?open_source` for creating data sources.
+instead of a file path. See `?dataset_factory` for creating data sources.
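
The download loop in the vignette zero-pads the month by hand and hard-codes the June 2019 cutoff. As a rough sketch of the same directory layout, the expected relative paths can be enumerated without downloading anything. `taxi_paths()` is a hypothetical helper introduced here for illustration, and the cutoff is taken from the comment in the diff rather than checked against the bucket:

```r
# Hypothetical helper: list the year/month/data.parquet paths that the
# vignette's download loop would create, without touching the network.
taxi_paths <- function(years = 2009:2019) {
  unlist(lapply(years, function(year) {
    # Assumption from the vignette comment: 2019 only runs through June
    months <- if (year == 2019) 1:6 else 1:12
    # sprintf("%02d", ...) zero-pads months, replacing the paste0("0", month) step
    file.path(year, sprintf("%02d", months), "data.parquet")
  }))
}

paths <- taxi_paths()
head(paths, 3)
# "2009/01/data.parquet" "2009/02/data.parquet" "2009/03/data.parquet"
```

Each path can be appended to the bucket URL for `download.file()` or to the local `nyc-taxi` directory, which keeps the two sides of the copy in step.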
