@@ -23,49 +23,65 @@ is widely used in big data exercises and competitions.
2323For demonstration purposes, we have hosted a Parquet-formatted version
2424of about 10 years of the trip data in a public S3 bucket.
2525
26+ The total file size is around 37 gigabytes, even in the efficient Parquet file format.
27+ That's bigger than memory on most people's computers,
28+ so we can't just read it all in and stack it into a single data frame.
29+
2630In a future release, you'll be able to point your R session at S3 and query
2731the dataset from there. For now, datasets need to be on your local file system.
2832To download the files,
2933
30- ``` r
34+ ``` {r, eval = FALSE}
3135bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
3236dir.create("nyc-taxi")
3337for (year in 2009:2019) {
34- for (month in 1:12) {
38+ dir.create(file.path("nyc-taxi", year))
39+ if (year == 2019) {
40+ # We only have through June 2019 there
41+ months <- 1:6
42+ } else {
43+ months <- 1:12
44+ }
45+ for (month in months) {
3546 if (month < 10) {
3647 month <- paste0("0", month)
3748 }
38- try(download.file(
49+ dir.create(file.path("nyc-taxi", year, month))
50+ download.file(
3951 paste(bucket, year, month, "data.parquet", sep = "/"),
4052 file.path("nyc-taxi", year, month, "data.parquet")
41- ))
53+ )
4254 }
4355}
4456```
4557
46- It is expected that some files will not download because they do not exist--December 2019,
47- for example--hence the `try()`.
48- The total file size is around 37 gigabytes, even in the efficient Parquet file format.
49- That's bigger than memory on most people's computers,
50- so we can't just read it all in and stack it into a single data frame.
51-
58+ Note that the vignette will not execute that code chunk: if you want to run
59+ with live data, you'll have to do it yourself separately.
5260Given the size, if you're running this locally and don't have a fast connection,
5361feel free to grab only a year or two of data.
5462
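As a rough sketch of grabbing only a year or two, the loop above can be split into a step that lists the files and a step that downloads them. The helper name `taxi_files()` is hypothetical and not part of `arrow`; it assumes the same bucket layout (`year/month/data.parquet`) and the same June 2019 cutoff used in the chunk above.

```r
# Hypothetical helper (not part of arrow): list the source URL and the local
# destination for every monthly file in the requested years.
taxi_files <- function(years,
                       bucket = "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com",
                       root = "nyc-taxi") {
  do.call(rbind, lapply(years, function(year) {
    # The bucket only has data through June 2019
    months <- if (year == 2019) 1:6 else 1:12
    month <- sprintf("%02d", months)  # zero-pad: 1 becomes "01"
    data.frame(
      url = paste(bucket, year, month, "data.parquet", sep = "/"),
      dest = file.path(root, year, month, "data.parquet"),
      stringsAsFactors = FALSE
    )
  }))
}

files <- taxi_files(2018:2019)
nrow(files)  # 18 rows: 12 months of 2018 plus 6 months of 2019

# Uncomment to download; files that already exist locally are skipped.
# for (i in seq_len(nrow(files))) {
#   if (!file.exists(files$dest[i])) {
#     dir.create(dirname(files$dest[i]), recursive = TRUE, showWarnings = FALSE)
#     download.file(files$url[i], files$dest[i], mode = "wb")
#   }
# }
```

Listing the files first makes it easy to inspect what would be fetched before using any bandwidth, and `mode = "wb"` keeps the binary Parquet files intact on Windows.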
63+ If you don't have the taxi data downloaded, the vignette will still run and will
64+ yield previously cached output for reference. To be explicit about which version
65+ is running, let's check whether we're running with live data:
66+
67+ ``` {r}
68+ dir.exists("nyc-taxi")
69+ ```
70+
5571## Getting started
5672
5773Because `dplyr` is not necessary for many Arrow workflows,
5874it is an optional (`Suggests`) dependency. So, to work with Datasets,
5975we need to load both `arrow` and `dplyr`.
6076
61- ``` r
62- library(arrow)
63- library(dplyr)
77+ ``` {r}
78+ library(arrow, warn.conflicts = FALSE)
79+ library(dplyr, warn.conflicts = FALSE)
6480```
6581
6682The first step is to create our Dataset object, pointing at the directory of data.
6783
68- ``` r
84+ ``` {r, eval = file.exists("nyc-taxi")}
6985ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
7086```
7187
@@ -92,10 +108,12 @@ and 1 for "month", even though those columns may not actually be present in the
92108Indeed, when we look at the dataset, we see that in addition to the columns present
93109in every file, there are also columns "year" and "month".
94110
95- ```
111+ ``` {r, eval = file.exists("nyc-taxi")}
96112ds
97-
98- ## Dataset
113+ ```
114+ ``` {r, echo = FALSE, eval = !file.exists("nyc-taxi")}
115+ cat("
116+ ## FileSystemDataset with 125 Parquet files
99117## vendor_id: string
100118## pickup_at: timestamp[us]
101119## dropoff_at: timestamp[us]
122140## month: int32
123141
124142See $metadata for additional Schema metadata
143+ ")
125144```
126145
127146The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
@@ -160,7 +179,7 @@ Here's an example. Suppose I was curious about tipping behavior among the
160179longest taxi rides. Let's find the median tip percentage for rides with
161180fares greater than $100 in 2015, broken down by the number of passengers:
162181
163- ``` r
182+ ``` {r, eval = file.exists("nyc-taxi")}
164183system.time(ds %>%
165184 filter(total_amount > 100, year == 2015) %>%
166185 select(tip_amount, total_amount, passenger_count) %>%
@@ -173,7 +192,8 @@ system.time(ds %>%
173192 print())
174193```
175194
176- ```
195+ ``` {r, echo = FALSE, eval = !file.exists("nyc-taxi")}
196+ cat("
177197## # A tibble: 10 x 3
178198## passenger_count tip_pct n
179199## <int> <dbl> <int>
@@ -189,31 +209,34 @@ system.time(ds %>%
189209## 10 9 16.7 42
190210##
191211## user system elapsed
192- ## 25.227 1.162 3.767
212+ ## 4.436 1.012 1.402
213+ ")
193214```
194215
195216We just selected a window out of a dataset with around 2 billion rows
196- and aggregated on it in under 4 seconds on my laptop. How does this work?
217+ and aggregated on it in under 2 seconds on my laptop. How does this work?
197218
198219First, `select()`/`rename()`, `filter()`, and `group_by()`
199220record their actions but don't evaluate on the data until you run `collect()`.
200221
201- ``` r
222+ ``` {r, eval = file.exists("nyc-taxi")}
202223ds %>%
203224 filter(total_amount > 100, year == 2015) %>%
204225 select(tip_amount, total_amount, passenger_count) %>%
205226 group_by(passenger_count)
206227```
207228
208- ```
209- ## Dataset (query)
229+ ``` {r, echo = FALSE, eval = !file.exists("nyc-taxi")}
230+ cat("
231+ ## FileSystemDataset (query)
210232## tip_amount: float
211233## total_amount: float
212234## passenger_count: int8
213235##
214236## * Filter: ((total_amount > 100:double) and (year == 2015:double))
215237## * Grouped by passenger_count
216238## See $.data for the source Arrow object
239+ ")
217240```
218241
219242This returns instantly and shows the window selection you've made, without
@@ -257,4 +280,4 @@ In the future, when there is support for cloud storage and other file formats,
257280this would mean you could point to an S3 bucket of Parquet data and a directory
258281of CSVs on the local file system and query them together as a single dataset.
259282To create a multi-source dataset, provide a list of sources to `open_dataset()`
260- instead of a file path. See `?open_source` for creating data sources.
283+ instead of a file path. See `?dataset_factory` for creating data sources.