.. default-domain:: cpp.. cpp:namespace:: arrow::json
Line-separated JSON files can either be read as a single Arrow Table with a :class:`~TableReader` or streamed as RecordBatches with a :class:`~StreamingReader`.
Both of these readers require an :class:`arrow::io::InputStream` instance representing the input file. Their behavior can be customized using a combination of :class:`~ReadOptions`, :class:`~ParseOptions`, and other parameters.
.. seealso::
:ref:`JSON reader API reference <cpp-api-json>`.
:class:`~TableReader` reads an entire file in one shot as a :class:`~arrow::Table`. Each independent JSON object in the input file is converted to a row in the output table.
#include "arrow/json/api.h"
{
// ...
arrow::Status st;
arrow::MemoryPool* pool = default_memory_pool();
std::shared_ptr<arrow::io::InputStream> input = ...;
auto read_options = arrow::json::ReadOptions::Defaults();
auto parse_options = arrow::json::ParseOptions::Defaults();
// Instantiate TableReader from input stream and options
std::shared_ptr<arrow::json::TableReader> reader;
st = arrow::json::TableReader::Make(pool, input, read_options,
parse_options, &reader);
if (!st.ok()) {
// Handle TableReader instantiation error...
}
std::shared_ptr<arrow::Table> table;
// Read table from JSON file
st = reader->Read(&table);
if (!st.ok()) {
// Handle JSON read error
// (for example a JSON syntax error or failed type conversion)
}
}:class:`~StreamingReader` reads a file incrementally from blocks of a roughly equal byte size, each yielding a :class:`~arrow::RecordBatch`. Each independent JSON object in a block is converted to a row in the output batch.
All batches adhere to a consistent :class:`~arrow::Schema`, which is derived from the first loaded batch. Alternatively, an explicit schema may be passed via :class:`~ParseOptions`.
#include "arrow/json/api.h"
{
// ...
auto read_options = arrow::json::ReadOptions::Defaults();
auto parse_options = arrow::json::ParseOptions::Defaults();
std::shared_ptr<arrow::io::InputStream> stream;
auto result = arrow::json::StreamingReader::Make(stream,
read_options,
parse_options);
if (!result.ok()) {
// Handle instantiation error
}
std::shared_ptr<arrow::json::StreamingReader> reader = *result;
for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
if (!maybe_batch.ok()) {
// Handle read/parse error
}
std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
// Operate on each batch...
}
}Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:
- if the name is in :member:`ParseOptions::explicit_schema`, conversion of the JSON value to the corresponding Arrow data type is attempted;
- otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order.
The following tables show the possible combinations for each of those two modes.
| JSON value type | Allowed Arrow data types |
|---|---|
| Null | Any (including Null) |
| Number | All Integer types, Float32, Float64, Date32, Date64, Time32, Time64 |
| Boolean | Boolean |
| String | Binary, LargeBinary, String, LargeString, Timestamp |
| Array | List |
| Object (nested) | Struct |
| JSON value type | Inferred Arrow data types (in order) |
|---|---|
| Null | Null, any other |
| Number | Int64, Float64 |
| Boolean | Boolean |
| String | Timestamp (with seconds unit), String |
| Array | List |
| Object (nested) | Struct |