.. default-domain:: cpp
.. cpp:namespace:: arrow::io

==============================
Input / output and filesystems
==============================

Arrow provides a range of C++ interfaces abstracting the concrete details of input / output operations. They operate on streams of untyped binary data. Those abstractions are used for various purposes such as reading CSV or Parquet data, transmitting IPC streams, and more.

.. seealso::
   :doc:`API reference for input/output facilities <api/io>`.

Reading binary data
===================

Interfaces for reading binary data come in two flavours:

* Sequential reading: the :class:`InputStream` interface provides ``Read`` methods; it is recommended to ``Read`` to a ``Buffer`` as it may in some cases avoid a memory copy.

* Random access reading: the :class:`RandomAccessFile` interface provides additional facilities for positioning and, most importantly, the ``ReadAt`` methods which allow parallel reading from multiple threads.

Concrete implementations are available for :class:`in-memory reads <BufferReader>`, :class:`unbuffered file reads <ReadableFile>`, :class:`memory-mapped file reads <MemoryMappedFile>`, :class:`buffered reads <BufferedInputStream>`, and :class:`compressed reads <CompressedInputStream>`.

Writing binary data
===================

Writing binary data is mostly done through the :class:`OutputStream` interface.

Concrete implementations are available for :class:`in-memory writes <BufferOutputStream>`, :class:`unbuffered file writes <FileOutputStream>`, :class:`memory-mapped file writes <MemoryMappedFile>`, :class:`buffered writes <BufferedOutputStream>`, and :class:`compressed writes <CompressedOutputStream>`.

.. cpp:namespace:: arrow::fs

Filesystems
===========

The :class:`filesystem interface <FileSystem>` allows abstracted access over various data storage backends such as the local filesystem or an S3 bucket. It provides input and output streams as well as directory operations.

The filesystem interface exposes a simplified view of the underlying data storage. Data paths are represented as *abstract paths*, which are ``/``-separated, even on Windows, and shouldn't include special path components such as ``.`` and ``..``. Symbolic links, if supported by the underlying storage, are automatically dereferenced. Only basic :class:`metadata <FileInfo>` about file entries, such as the file size and modification time, is made available.
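
To make the rule about special path components concrete, here is a small standalone check written with the standard library only. ``HasNoSpecialComponents`` is a hypothetical helper for illustration; it is not Arrow's own validation logic and does not cover other constraints a filesystem may impose.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Arrow): returns true if no "/"-separated
// component of `path` is "." or "..".
bool HasNoSpecialComponents(std::string_view path) {
  std::size_t start = 0;
  while (start <= path.size()) {
    // Locate the end of the current component.
    std::size_t end = path.find('/', start);
    if (end == std::string_view::npos) end = path.size();

    std::string_view component = path.substr(start, end - start);
    if (component == "." || component == "..") return false;

    start = end + 1;  // Continue after the separator.
  }
  return true;
}
```

Under this check, ``data/2024/file.parquet`` and ``data/.hidden`` pass, while ``./file`` and ``data/../secret`` are rejected.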

Concrete implementations are available for :class:`local filesystem access <LocalFileSystem>`, :class:`HDFS <HadoopFileSystem>`, :class:`Amazon S3-compatible storage <S3FileSystem>` and :class:`Google Cloud Storage <GcsFileSystem>`.

.. note::
   Tasks that use filesystems will typically run on the :ref:`I/O thread pool<io_thread_pool>`. For filesystems that support high levels of concurrency, you may benefit from increasing the size of the I/O thread pool.