Commit b77e8ae

ARROW-9854: [R] Support reading/writing data to/from S3

- [x] read_parquet/feather/etc. from S3 (use FileSystem->OpenInputFile(path))
- [x] write_$FORMAT via FileSystem->OpenOutputStream(path)
- [x] write_dataset (done? at least via URI)
- [x] ~~for linux, an argument to install_arrow to help, assuming you've installed aws-sdk-cpp already (turn on ARROW_S3, AWSSDK_SOURCE=SYSTEM)~~ Turns out there's no official deb/rpm packages for aws-sdk-cpp so there's no value in making this part easier; would be more confusing than helpful actually
- [x] set up a real test bucket and user for e2e testing (credentials available on request)
- [x] add a few tests that use s3, if credentials are set (which I'll set locally)
- [x] add vignette showing how to use s3 (via URI)
- [x] update docs, news

Out of the current scope:

- [ ] testing with minio on CI
- [ ] download dataset, i.e. copy files/directory recursively (needs ARROW-9867, ARROW-9868)
- [ ] friendlier methods for interacting with/viewing a filesystem (ls, mkdir, etc.) (ARROW-9870)
- [ ] direct construction of S3FileSystem object with S3Options (i.e. not only URI) (ARROW-9869)

Closes apache#8058 from nealrichardson/r-s3

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
1 parent 986eab4 commit b77e8ae

20 files changed

Lines changed: 191 additions & 52 deletions

r/NEWS.md

Lines changed: 5 additions & 4 deletions
@@ -25,17 +25,18 @@
 * Datasets now have `head()`, `tail()`, and take (`[`) methods. `head()` is optimized but the others may not be performant.
 * `collect()` gains an `as_data_frame` argument, default `TRUE` but when `FALSE` allows you to evaluate the accumulated `select` and `filter` query but keep the result in Arrow, not an R `data.frame`
 
+## AWS S3 support
+
+* S3 support is now enabled in binary macOS and Windows (Rtools40 only, i.e. R >= 4.0) packages. To enable it on Linux, you will need to build and install `aws-sdk-cpp` from source, then set the environment variable `EXTRA_CMAKE_FLAGS="-DARROW_S3=ON -DAWSSDK_SOURCE=SYSTEM"` prior to building the R package (with bundled C++ build, not with Arrow system libraries) from source.
+* File readers and writers (`read_parquet()`, `write_feather()`, et al.) now accept an `s3://` URI as the source or destination file, as do `open_dataset()` and `write_dataset()`. See `vignette("fs", package = "arrow")` for details.
+
 ## Computation
 
 * Comparison (`==`, `>`, etc.) and boolean (`&`, `|`, `!`) operations, along with `is.na`, `%in%` and `match` (called `match_arrow()`), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
 * Aggregation methods `min()`, `max()`, and `unique()` are implemented for Arrays and ChunkedArrays.
 * `dplyr` filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
 * `dim()` (`nrow`) for dplyr queries on Table/RecordBatch is now supported
 
-## Packaging
-
-* S3 support is now enabled in binary macOS and Windows (Rtools40 only, i.e. R >= 4.0) packages
-
 ## Other improvements
 
 * `arrow` now depends on [`cpp11`](https://cpp11.r-lib.org/), which brings more robust UTF-8 handling and faster compilation
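
To illustrate the NEWS entry on URI support, here is a minimal sketch of passing a URI instead of a plain path to a writer and a reader. The `s3://` bucket name in the comments is hypothetical. So that the sketch runs without credentials, it round-trips through a `file://` URI, on the assumption that local URIs take the same URI-handling route as `s3://` ones. It also assumes a Unix-like absolute temp path, and it does nothing if `arrow` is not installed.

```r
# Round-trip a small data frame through a URI rather than a plain path.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
roundtrip_ok <- NA  # stays NA when the arrow package is not installed

if (requireNamespace("arrow", quietly = TRUE)) {
  # Assumption: tempfile() yields an absolute path that starts with "/".
  path <- tempfile(fileext = ".parquet")
  uri <- paste0("file://", path)

  arrow::write_parquet(df, uri)
  back <- as.data.frame(arrow::read_parquet(uri))
  roundtrip_ok <- isTRUE(all.equal(back, df, check.attributes = FALSE))

  # With S3 support compiled in and credentials configured, the same calls
  # would take an s3:// URI (hypothetical bucket name):
  # arrow::write_parquet(df, "s3://my-bucket/data/df.parquet")
  # arrow::read_parquet("s3://my-bucket/data/df.parquet")
  unlink(path)
}
```

Whether a given build has S3 enabled depends on how the package was compiled, as the first bullet of the NEWS entry describes.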

r/R/csv.R

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@
 #' `parse_options`, `convert_options`, or `read_options` arguments, or you can
 #' use [CsvTableReader] directly for lower-level access.
 #'
-#' @param file A character file name, `raw` vector, or an Arrow input stream.
+#' @param file A character file name or URI, `raw` vector, or an Arrow input stream.
 #' If a file name, a memory-mapped Arrow [InputStream] will be opened and
 #' closed when finished; compression will be detected from the file extension
 #' and handled automatically. If an input stream is provided, it will be left

r/R/dataset-factory.R

Lines changed: 3 additions & 12 deletions
@@ -48,17 +48,8 @@ DatasetFactory$create <- function(x,
     stop("'x' must be a string or a list of DatasetFactory", call. = FALSE)
   }
 
-  if (!inherits(filesystem, "FileSystem")) {
-    if (grepl("://", x)) {
-      fs_from_uri <- FileSystem$from_uri(x)
-      filesystem <- fs_from_uri$fs
-      x <- fs_from_uri$path
-    } else {
-      filesystem <- LocalFileSystem$create()
-      x <- clean_path_abs(x)
-    }
-  }
-  selector <- FileSelector$create(x, allow_not_found = FALSE, recursive = TRUE)
+  path_and_fs <- get_path_and_filesystem(x, filesystem)
+  selector <- FileSelector$create(path_and_fs$path, allow_not_found = FALSE, recursive = TRUE)
 
   if (is.character(format)) {
     format <- FileFormat$create(match.arg(format), ...)
@@ -74,7 +65,7 @@ DatasetFactory$create <- function(x,
       partitioning <- DirectoryPartitioningFactory$create(partitioning)
     }
   }
-  FileSystemDatasetFactory$create(filesystem, selector, format, partitioning)
+  FileSystemDatasetFactory$create(path_and_fs$fs, selector, format, partitioning)
 }
 
 #' Create a DatasetFactory

r/R/dataset.R

Lines changed: 2 additions & 11 deletions
@@ -164,17 +164,8 @@ Dataset <- R6Class("Dataset", inherit = ArrowObject,
     NewScan = function() unique_ptr(ScannerBuilder, dataset___Dataset__NewScan(self)),
     ToString = function() self$schema$ToString(),
     write = function(path, filesystem = NULL, schema = self$schema, format, partitioning, ...) {
-      if (!inherits(filesystem, "FileSystem")) {
-        if (grepl("://", path)) {
-          fs_from_uri <- FileSystem$from_uri(path)
-          filesystem <- fs_from_uri$fs
-          path <- fs_from_uri$path
-        } else {
-          filesystem <- LocalFileSystem$create()
-          path <- clean_path_abs(path)
-        }
-      }
-      dataset___Dataset__Write(self, schema, format, filesystem, path, partitioning)
+      path_and_fs <- get_path_and_filesystem(path, filesystem)
+      dataset___Dataset__Write(self, schema, format, path_and_fs$fs, path_and_fs$path, partitioning)
       invisible(self)
     }
   ),

r/R/feather.R

Lines changed: 3 additions & 3 deletions
@@ -24,7 +24,7 @@
 #' and the version 2 specification, which is the Apache Arrow IPC file format.
 #'
 #' @param x `data.frame`, [RecordBatch], or [Table]
-#' @param sink A string file path or [OutputStream]
+#' @param sink A string file path, URI, or [OutputStream]
 #' @param version integer Feather file version. Version 2 is the current.
 #' Version 1 is the more limited legacy format.
 #' @param chunk_size For V2 files, the number of rows that each chunk of data
@@ -106,7 +106,7 @@ write_feather <- function(x,
   assert_is(x, "Table")
 
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   }
   assert_is(sink, "OutputStream")
@@ -142,7 +142,7 @@ write_feather <- function(x,
 #' df <- read_feather(tf, col_select = starts_with("d"))
 #' }
 read_feather <- function(file, col_select = NULL, as_data_frame = TRUE, ...) {
-  if (!inherits(file, "InputStream")) {
+  if (!inherits(file, "RandomAccessFile")) {
     file <- make_readable_file(file)
     on.exit(file$close())
   }

r/R/filesystem.R

Lines changed: 21 additions & 1 deletion
@@ -228,7 +228,7 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
       shared_ptr(InputStream, fs___FileSystem__OpenInputStream(self, clean_path_rel(path)))
     },
     OpenInputFile = function(path) {
-      shared_ptr(InputStream, fs___FileSystem__OpenInputFile(self, clean_path_rel(path)))
+      shared_ptr(RandomAccessFile, fs___FileSystem__OpenInputFile(self, clean_path_rel(path)))
     },
     OpenOutputStream = function(path) {
       shared_ptr(OutputStream, fs___FileSystem__OpenOutputStream(self, clean_path_rel(path)))
@@ -242,11 +242,31 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
   )
 )
 FileSystem$from_uri <- function(uri) {
+  assert_that(is.string(uri))
   out <- fs___FileSystemFromUri(uri)
   out$fs <- shared_ptr(FileSystem, out$fs)$..dispatch()
   out
 }
 
+get_path_and_filesystem <- function(x, filesystem = NULL) {
+  # Wrapper around FileSystem$from_uri that handles local paths
+  # and an optional explicit filesystem
+  assert_that(is.string(x))
+  if (is_url(x)) {
+    if (!is.null(filesystem)) {
+      # Stop? Can't have URL (which yields a fs) and another fs
+    }
+    FileSystem$from_uri(x)
+  } else {
+    list(
+      fs = filesystem %||% LocalFileSystem$create(),
+      path = clean_path_abs(x)
+    )
+  }
+}
+
+is_url <- function(x) grepl("://", x)
+
 #' @usage NULL
 #' @format NULL
 #' @rdname FileSystem
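
The dispatch added in `get_path_and_filesystem()` can be sketched in base R as follows. This is an illustration only: `fake_from_uri()` and `resolve()` are hypothetical stand-ins, the filesystem objects are replaced by plain strings, and `normalizePath()` stands in for `clean_path_abs()`. Only `is_url()` is copied from the diff. The sketch also shows a consequence of the empty `if (!is.null(filesystem))` branch: when a URI is given, an explicit `filesystem` argument is silently ignored rather than rejected.

```r
# Same test as in the diff: anything containing "://" is treated as a URI.
is_url <- function(x) grepl("://", x)

# Hypothetical stand-in for FileSystem$from_uri(): split "scheme://rest" into
# a filesystem label and a path. The real function returns a FileSystem object.
fake_from_uri <- function(uri) {
  parts <- strsplit(uri, "://", fixed = TRUE)[[1]]
  list(fs = paste0(parts[1], "-filesystem"), path = parts[2])
}

# Hypothetical mirror of get_path_and_filesystem().
resolve <- function(x, filesystem = NULL) {
  stopifnot(is.character(x), length(x) == 1)
  if (is_url(x)) {
    # As in the diff, an explicit `filesystem` is ignored on this branch.
    fake_from_uri(x)
  } else {
    list(
      fs = if (is.null(filesystem)) "local-filesystem" else filesystem,
      path = normalizePath(x, winslash = "/", mustWork = FALSE)
    )
  }
}

from_uri <- resolve("s3://my-bucket/data/file.parquet")
from_both <- resolve("s3://my-bucket/data/file.parquet", filesystem = "explicit-filesystem")
from_path <- resolve("data/file.parquet")
```

Here `from_both` is identical to `from_uri`, which is the case the "Stop?" comment in the diff leaves open.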

r/R/io.R

Lines changed: 21 additions & 2 deletions
@@ -224,15 +224,25 @@ mmap_open <- function(path, mode = c("read", "write", "readwrite")) {
 #' with this compression codec, either a [Codec] or the string name of one.
 #' If `NULL` (default) and `file` is a string file name, the function will try
 #' to infer compression from the file extension.
+#' @param filesystem If not `NULL`, `file` will be opened via the
+#' `filesystem$OpenInputFile()` filesystem method, rather than the `io` module's
+#' `MemoryMappedFile` or `ReadableFile` constructors.
 #' @return An `InputStream` or a subclass of one.
 #' @keywords internal
-make_readable_file <- function(file, mmap = TRUE, compression = NULL) {
+make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem = NULL) {
   if (is.string(file)) {
+    if (is_url(file)) {
+      fs_and_path <- FileSystem$from_uri(file)
+      filesystem <- fs_and_path$fs
+      file <- fs_and_path$path
+    }
     if (is.null(compression)) {
       # Infer compression from the file path
       compression <- detect_compression(file)
     }
-    if (isTRUE(mmap)) {
+    if (!is.null(filesystem)) {
+      file <- filesystem$OpenInputFile(file)
+    } else if (isTRUE(mmap)) {
       file <- mmap_open(file)
     } else {
       file <- ReadableFile$create(file)
@@ -247,6 +257,15 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL) {
   file
 }
 
+make_output_stream <- function(x) {
+  if (is_url(x)) {
+    fs_and_path <- FileSystem$from_uri(x)
+    fs_and_path$fs$OpenOutputStream(fs_and_path$path)
+  } else {
+    FileOutputStream$create(x)
+  }
+}
+
 detect_compression <- function(path) {
   assert_that(is.string(path))
   switch(tools::file_ext(path),
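
In `make_readable_file()` the URI is resolved to a path before `detect_compression()` runs. One plausible benefit of that ordering, sketched below in base R, is that a query string on the URI can no longer hide the file extension from `tools::file_ext()`. The helper `strip_uri()` is a hypothetical stand-in for the `path` element returned by `FileSystem$from_uri()`, and the bucket name and `region` query parameter are made up for illustration.

```r
# Hypothetical stand-in for FileSystem$from_uri(uri)$path: drop the scheme
# and any query string, leaving "bucket/key".
strip_uri <- function(uri) {
  no_scheme <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", uri)
  sub("\\?.*$", "", no_scheme)
}

uri <- "s3://my-bucket/data/table.csv.gz?region=us-east-2"

# file_ext() only matches a trailing ".<alphanumerics>", so the query string
# prevents it from seeing ".gz" on the raw URI.
ext_from_uri <- tools::file_ext(uri)

# After resolving the URI to a path, the extension is visible again.
ext_from_path <- tools::file_ext(strip_uri(uri))
```

With these inputs `ext_from_uri` is the empty string and `ext_from_path` is `"gz"`, so only the resolved path would select a compression codec in the `switch()`.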

r/R/ipc_stream.R

Lines changed: 4 additions & 4 deletions
@@ -41,7 +41,7 @@ write_ipc_stream <- function(x, sink, ...) {
     x <- Table$create(x)
   }
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   }
   assert_is(sink, "OutputStream")
@@ -82,10 +82,10 @@ write_to_raw <- function(x, format = c("stream", "file")) {
 #' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
 #' is deprecated. You should explicitly choose
 #' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either.
+#' a file or `InputStream` may contain either.
 #'
-#' @param file A character file name, `raw` vector, or an Arrow input stream.
-#' If a file name, a memory-mapped Arrow [InputStream] will be opened and
+#' @param file A character file name or URI, `raw` vector, or an Arrow input stream.
+#' If a file name or URI, an Arrow [InputStream] will be opened and
 #' closed when finished. If an input stream is provided, it will be left
 #' open.
 #' @param as_data_frame Should the function return a `data.frame` (default) or

r/R/parquet.R

Lines changed: 3 additions & 2 deletions
@@ -59,7 +59,8 @@ read_parquet <- function(file,
 #' This function enables you to write Parquet files from R.
 #'
 #' @param x An [arrow::Table][Table], or an object convertible to it.
-#' @param sink an [arrow::io::OutputStream][OutputStream] or a string which is interpreted as a file path
+#' @param sink an [arrow::io::OutputStream][OutputStream] or a string
+#' interpreted as a file path or URI
 #' @param chunk_size chunk size in number of rows. If NULL, the total number of rows is used.
 #' @param version parquet version, "1.0" or "2.0". Default "1.0". Numeric values
 #' are coerced to character.
@@ -129,7 +130,7 @@ write_parquet <- function(x,
   }
 
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   } else if (!inherits(sink, "OutputStream")) {
     abort("sink must be a file path or an OutputStream")

r/man/make_readable_file.Rd

Lines changed: 5 additions & 1 deletion
Some generated files are not rendered by default.