.. currentmodule:: pyarrow

Arrow defines two types of binary formats for serializing record batches:

- Streaming format: for sending an arbitrary-length sequence of record batches.
  The format must be processed from start to end and does not support random
  access.
- File or Random Access format: for serializing a fixed number of record
  batches. It supports random access, and is thus very useful when used with
  memory maps.

To follow this section, make sure to first read the section on :ref:`Memory and IO <io>`.
First, let's create a small record batch:

.. ipython:: python

   import pyarrow as pa

   data = [
       pa.array([1, 2, 3, 4]),
       pa.array(['foo', 'bar', 'baz', None]),
       pa.array([True, None, False, True])
   ]

   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
   batch.num_rows
   batch.num_columns

Now, we can begin writing a stream containing some number of these batches. For
this we use :class:`~pyarrow.RecordBatchStreamWriter`, which can write to a writeable
NativeFile object or a writeable Python object:

.. ipython:: python

   sink = pa.BufferOutputStream()
   writer = pa.RecordBatchStreamWriter(sink, batch.schema)

Here we used an in-memory Arrow buffer stream, but this could have been a socket or some other IO sink.
When creating the StreamWriter, we pass the schema, since the schema
(column names and types) must be the same for all of the batches sent in this
particular stream. Now we can do:

.. ipython:: python

   for i in range(5):
       writer.write_batch(batch)
   writer.close()

   buf = sink.get_result()
   buf.size

Now buf contains the complete stream as an in-memory byte buffer. We can
read such a stream with :class:`~pyarrow.RecordBatchStreamReader` or the
convenience function pyarrow.open_stream:

.. ipython:: python

   reader = pa.open_stream(buf)
   reader.schema

   batches = [b for b in reader]
   len(batches)

We can check the returned batches are the same as the original input:

.. ipython:: python

   batches[0].equals(batch)

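
As a brief aside on the schema requirement mentioned when the writer was
created, the sketch below tries to append a batch whose schema differs from the
one declared for the stream. It is an illustration rather than part of the
walkthrough: the names ``demo_sink``, ``demo_writer``, ``expected`` and
``other`` are made up, it uses the ``pyarrow.ipc.new_stream`` factory, and it
assumes the mismatch is reported as a subclass of ``pyarrow.ArrowException``.

.. code-block:: python

   import pyarrow as pa

   # Hypothetical batches: same column name, different column types.
   expected = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['f0'])
   other = pa.RecordBatch.from_arrays([pa.array(['a', 'b', 'c'])], ['f0'])

   demo_sink = pa.BufferOutputStream()
   demo_writer = pa.ipc.new_stream(demo_sink, expected.schema)
   demo_writer.write_batch(expected)  # matches the declared schema

   try:
       demo_writer.write_batch(other)  # schema differs from the stream's
       mismatch_rejected = False
   except pa.ArrowException as exc:
       # Assumption: the writer refuses the batch instead of writing it.
       mismatch_rejected = True
       print("rejected:", exc)
   finally:
       demo_writer.close()

Batches with a different layout would need a separate stream, opened with their
own schema.
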
An important point is that if the input source supports zero-copy reads
(e.g. a memory map or pyarrow.BufferReader), then the returned
batches are also zero-copy and do not allocate any new memory on read.
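
One way to observe this, sketched here under stated assumptions, is to compare
memory addresses: if the read is zero-copy, the data buffer of a restored
column should lie inside the buffer that holds the serialized stream. The
variable names are illustrative, ``getvalue`` is used to obtain the written
buffer, and the check assumes an uncompressed stream read through
``pyarrow.BufferReader``.

.. code-block:: python

   import pyarrow as pa

   zc_batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3, 4])], ['f0'])

   zc_sink = pa.BufferOutputStream()
   zc_writer = pa.ipc.new_stream(zc_sink, zc_batch.schema)
   zc_writer.write_batch(zc_batch)
   zc_writer.close()
   stream_buf = zc_sink.getvalue()

   zc_reader = pa.ipc.open_stream(pa.BufferReader(stream_buf))
   restored = zc_reader.read_next_batch()

   # For an int64 column, buffers()[0] is the validity bitmap (may be None)
   # and buffers()[1] holds the values.
   values = restored.column(0).buffers()[1]
   start = stream_buf.address
   end = start + stream_buf.size
   is_zero_copy = start <= values.address < end
   print(is_zero_copy)
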
The :class:`~pyarrow.RecordBatchFileWriter` has the same API as :class:`~pyarrow.RecordBatchStreamWriter`:

.. ipython:: python

   sink = pa.BufferOutputStream()
   writer = pa.RecordBatchFileWriter(sink, batch.schema)

   for i in range(10):
       writer.write_batch(batch)
   writer.close()

   buf = sink.get_result()
   buf.size

The difference between :class:`~pyarrow.RecordBatchFileReader` and
:class:`~pyarrow.RecordBatchStreamReader` is that the file reader's input
source must have a seek method for random access. The stream reader only
requires read operations. We can also use the pyarrow.open_file function to
open a file:

.. ipython:: python

   reader = pa.open_file(buf)

Because we have access to the entire payload, we know the number of record batches in the file, and can read any at random:

.. ipython:: python

   reader.num_record_batches
   b = reader.get_batch(3)
   b.equals(batch)

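
Random access pairs naturally with memory maps, as noted in the description of
the file format. The following is a minimal sketch of that combination, not
part of the original walkthrough: it writes a few batches to a temporary file
(the file name ``example.arrow`` is made up), maps the file with
``pyarrow.memory_map``, and reads only the last batch. It uses the
``pyarrow.ipc.new_file`` and ``pyarrow.ipc.open_file`` functions.

.. code-block:: python

   import os
   import tempfile

   import pyarrow as pa

   mm_batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3, 4])], ['f0'])

   with tempfile.TemporaryDirectory() as tmpdir:
       path = os.path.join(tmpdir, 'example.arrow')  # hypothetical file name

       # Write ten identical batches in the file (random access) format.
       with pa.OSFile(path, 'wb') as file_sink:
           file_writer = pa.ipc.new_file(file_sink, mm_batch.schema)
           for _ in range(10):
               file_writer.write_batch(mm_batch)
           file_writer.close()

       # Map the file and jump straight to the last batch.
       with pa.memory_map(path, 'r') as source:
           file_reader = pa.ipc.open_file(source)
           count = file_reader.num_record_batches
           last_values = file_reader.get_batch(count - 1).to_pydict()

   print(count, last_values)

The batch is converted to plain Python objects inside the ``with`` block so that
nothing refers to the mapped file once it has been closed.
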
The stream and file reader classes have a special read_pandas method to
simplify reading multiple record batches and converting them to a single
DataFrame output:

.. ipython:: python

   df = pa.open_file(buf).read_pandas()
   df[:5]

In pyarrow we are able to serialize and deserialize many kinds of Python
objects. While not a complete replacement for the pickle module, these
functions can be significantly faster, particularly when dealing with
collections of NumPy arrays.
As an example, consider a dictionary containing NumPy arrays:

.. ipython:: python

   import numpy as np

   data = {
       i: np.random.randn(500, 500)
       for i in range(100)
   }

We use the pyarrow.serialize function to convert this data to a byte
buffer:

.. ipython:: python

   buf = pa.serialize(data).to_buffer()
   type(buf)
   buf.size

pyarrow.serialize creates an intermediate object which can be converted to
a buffer (the to_buffer method) or written directly to an output stream.
pyarrow.deserialize converts a buffer-like object back to the original
Python object:

.. ipython:: python

   restored_data = pa.deserialize(buf)
   restored_data[0]

When dealing with NumPy arrays, pyarrow.deserialize can be significantly
faster than pickle because the resulting arrays are zero-copy references
into the input buffer. The larger the arrays, the larger the performance
savings.

Consider this example. For pyarrow.deserialize, we have:

.. ipython:: python

   %timeit restored_data = pa.deserialize(buf)

And for pickle:

.. ipython:: python

   import pickle
   pickled = pickle.dumps(data)
   %timeit unpickled_data = pickle.loads(pickled)

We aspire to make these functions a high-speed alternative to pickle for transient serialization in Python big data applications.
If an unrecognized data type is encountered when serializing an object,
pyarrow will fall back on using pickle for converting that type to a
byte string. There may be a more efficient way, though.
Consider a class with two members, one of which is a NumPy array:

.. code-block:: python

   class MyData:
       def __init__(self, name, data):
           self.name = name
           self.data = data

We write functions to convert this to and from a dictionary with simpler types:


.. code-block:: python

   def _serialize_MyData(val):
       return {'name': val.name, 'data': val.data}

   def _deserialize_MyData(data):
       return MyData(data['name'], data['data'])

Then, we must register these functions in a SerializationContext so that
MyData can be recognized:


.. code-block:: python

   context = pa.SerializationContext()
   context.register_type(MyData, 'MyData',
                         custom_serializer=_serialize_MyData,
                         custom_deserializer=_deserialize_MyData)

Lastly, we use this context as an additional argument to pyarrow.serialize:


.. code-block:: python

   # val is an instance of MyData
   buf = pa.serialize(val, context=context).to_buffer()
   restored_val = pa.deserialize(buf, context=context)

Feather is a lightweight file format for data frames that uses the Arrow memory
layout for data representation on disk. It was created early in the Arrow
project as a proof of concept for fast, language-agnostic data frame storage for
Python (pandas) and R.

Compared with Arrow streams and files, Feather has some limitations:

- Only non-nested data types and categorical (dictionary-encoded) types are
  supported
- Supports only a single batch of rows, whereas general Arrow streams support
  an arbitrary number
- Supports limited scalar value types, adequate only for representing typical
  data found in R and pandas

We would like to continue to innovate in the Feather format, but we must wait for an R implementation for Arrow to mature.
The pyarrow.feather module contains the read and write functions for the
format. The input and output are pandas.DataFrame objects:

.. code-block:: python

   import pyarrow.feather as feather

   feather.write_feather(df, '/path/to/file')
   read_df = feather.read_feather('/path/to/file')

read_feather supports multithreaded reads, and may yield faster performance
on some files:


.. code-block:: python

   read_df = feather.read_feather('/path/to/file', nthreads=4)

These functions can read and write with file-like objects. For example:


.. code-block:: python

   with open('/path/to/file', 'wb') as f:
       feather.write_feather(df, f)

   with open('/path/to/file', 'rb') as f:
       read_df = feather.read_feather(f)

A file input to read_feather must support seeking.
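
As a sketch of the file-like object support and the seeking requirement, an
in-memory ``io.BytesIO`` object, which is seekable, can stand in for a file on
disk. The example rests on assumptions beyond the text above: it relies on a
pyarrow version in which ``write_feather`` also accepts a ``pyarrow.Table`` and
in which ``feather.read_table`` is available, so that pandas is not needed, and
the table contents are made up.

.. code-block:: python

   import io

   import pyarrow as pa
   import pyarrow.feather as feather

   # Hypothetical data; a pandas.DataFrame would be passed the same way.
   table = pa.table({'f0': [1, 2, 3], 'f1': ['foo', 'bar', None]})

   mem_file = io.BytesIO()
   feather.write_feather(table, mem_file)

   # Rewind before handing the object to the reader, which needs to seek.
   mem_file.seek(0)
   restored_table = feather.read_table(mem_file)
   print(restored_table.equals(table))
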