Commit bc987cd

jorgecarleitao authored and andygrove committed
ARROW-9922: [Rust] Add StructArray::TryFrom (+40%)
The core problem that this PR addresses is the construction of a `StructArray`, whose spec can be found [here](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).

The current API to build a `StructArray` of 4 entries of fixed type is (part of a test):

```rust
let string_builder = StringBuilder::new(4);
let int_builder = Int32Builder::new(4);

let mut fields = Vec::new();
let mut field_builders = Vec::new();
fields.push(Field::new("f1", DataType::Utf8, false));
field_builders.push(Box::new(string_builder) as Box<ArrayBuilder>);
fields.push(Field::new("f2", DataType::Int32, false));
field_builders.push(Box::new(int_builder) as Box<ArrayBuilder>);

let mut builder = StructBuilder::new(fields, field_builders);
assert_eq!(2, builder.num_fields());

let string_builder = builder
    .field_builder::<StringBuilder>(0)
    .expect("builder at field 0 should be string builder");
string_builder.append_value("joe").unwrap();
string_builder.append_null().unwrap();
string_builder.append_null().unwrap();
string_builder.append_value("mark").unwrap();

let int_builder = builder
    .field_builder::<Int32Builder>(1)
    .expect("builder at field 1 should be int builder");
int_builder.append_value(1).unwrap();
int_builder.append_value(2).unwrap();
int_builder.append_null().unwrap();
int_builder.append_value(4).unwrap();

builder.append(true).unwrap();
builder.append(true).unwrap();
builder.append_null().unwrap();
builder.append(true).unwrap();

let arr = builder.finish();
```

This PR's proposal for the same array:

```rust
let strings: ArrayRef = Arc::new(StringArray::from(vec![
    Some("joe"),
    None,
    None,
    Some("mark"),
]));
let ints: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)]));

let arr = StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())]).unwrap();
```

Note that:

* There is no `Field`, only a name: the attributes (type and nullability) are obtained from the `ArrayData` itself, so the field's attributes are guaranteed to be aligned with the data.
* The implementation is dynamically typed: the type is obtained from `Array::data_type`, instead of having to match each `Field`'s datatype to its field builder.
* `Option` is used to specify whether the quantity is null or not.

The construction uses a bitwise OR of the fields' null bitmaps to decide whether the struct is null at a given index. For example, the third entry of the example in [the spec](https://arrow.apache.org/docs/format/Columnar.html#struct-layout) is null because all fields are null at that index.

There is an edge case that this constructor is unable to build (the user needs to use the other `From`): a struct with a `0` at position X while every field's bitmap has a `1` at position X:

```
# array of 1 entry:
bitmap struct = [0]
bitmap field1 = [1]
bitmap field2 = [1]
```

This is because, in this `TryFrom`, the bitmap of the struct is computed from a bitwise `or` of the fields' entries. IMO this is a non-issue because a `null` in the struct already implies an `unspecified` value on every field, and thus that field's value is already assumed to be undefined. However, this is important to mention as a round-trip with this case will fail: in the example above, `bitmap struct` will have a `1`.

Finally, this improves performance by about 40% or more: the benchmark below shows a 38% to 53% reduction in time, growing with the array size.

<details>
<summary>Benchmark results</summary>

```
git checkout HEAD^ && cargo bench --bench array_from_vec -- struct_array_from_vec && git checkout no_builder1 && cargo bench --bench array_from_vec -- struct_array_from_vec
```

```
struct_array_from_vec 128
                        time:   [7.7464 us 7.7586 us 7.7731 us]
                        change: [-39.227% -38.313% -37.128%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

struct_array_from_vec 256
                        time:   [9.3386 us 9.3611 us 9.3896 us]
                        change: [-45.035% -44.498% -43.914%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

struct_array_from_vec 512
                        time:   [13.107 us 13.148 us 13.199 us]
                        change: [-49.213% -48.705% -48.208%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

struct_array_from_vec 1024
                        time:   [20.036 us 20.061 us 20.087 us]
                        change: [-54.254% -53.479% -52.776%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
```

</details>

## Final note:

The general direction that I am heading with this is to minimize the usage of builders. My issue with builders is that they are statically typed and perform incremental changes, but almost all our operations are dynamically typed and in bulk: batch read, batch write, etc. As such, it is often faster (and much simpler from a UX perspective) to create a `Vec<Option<_>>` and use it to create an Arrow Array.

FYI @nevi-me @andygrove @alamb

Closes apache#8118 from jorgecarleitao/no_builder1

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
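
The bitmap rule described in the commit message can be illustrated without the Arrow crate. The following is a minimal sketch, assuming validity bitmaps are plain byte slices with least-significant-bit-first ordering (consistent with the `9_u8` and `11_u8` bitmaps asserted in the test of this commit). The helper names `or_validity` and `null_count` are hypothetical and are not part of the Arrow API.

```rust
/// Bitwise OR of two validity bitmaps of equal byte length (1 = valid, 0 = null).
/// Hypothetical helper: a stand-in for the buffer-level OR used in the commit.
fn or_validity(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "bitmaps must have the same byte length");
    a.iter().zip(b).map(|(x, y)| x | y).collect()
}

/// Number of null slots among the first `len` entries.
/// Assumes the padding bits beyond `len` are zero.
fn null_count(validity: &[u8], len: usize) -> usize {
    let set: usize = validity.iter().map(|byte| byte.count_ones() as usize).sum();
    len - set
}

fn main() {
    // Example from the commit message:
    // f1 = [joe, null, null, mark], f2 = [1, 2, null, 4].
    let f1 = [0b0000_1001u8];
    let f2 = [0b0000_1011u8];
    let struct_validity = or_validity(&f1, &f2);
    // Only index 2 is null in every field, so only index 2 is null in the struct.
    assert_eq!(struct_validity, vec![0b0000_1011]);
    assert_eq!(null_count(&struct_validity, 4), 1);

    // Edge case: a struct-level null over fields that are all valid
    // cannot be reproduced, because the OR of [1] and [1] is [1].
    let rebuilt = or_validity(&[0b1], &[0b1]);
    assert_eq!(rebuilt, vec![0b1]);
}
```

The last assertion shows why the round-trip mentioned above fails: the struct-level `0` is not stored in any field, so the OR cannot recover it.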
1 parent 8cb15e6 commit bc987cd

3 files changed

Lines changed: 258 additions & 97 deletions

rust/arrow/benches/array_from_vec.rs

Lines changed: 53 additions & 0 deletions
@@ -24,6 +24,7 @@ extern crate arrow;
 use arrow::array::*;
 use arrow::buffer::Buffer;
 use arrow::datatypes::*;
+use std::{convert::TryFrom, sync::Arc};

 fn array_from_vec(n: usize) {
     let mut v: Vec<u8> = Vec::with_capacity(n);
@@ -48,6 +49,38 @@ fn array_string_from_vec(n: usize) {
     criterion::black_box(StringArray::from(v));
 }

+fn struct_array_values(
+    n: usize,
+) -> (
+    &'static str,
+    Vec<Option<&'static str>>,
+    &'static str,
+    Vec<Option<i32>>,
+) {
+    let mut strings: Vec<Option<&str>> = Vec::with_capacity(n);
+    let mut ints: Vec<Option<i32>> = Vec::with_capacity(n);
+    for _ in 0..n / 4 {
+        strings.extend_from_slice(&[Some("joe"), None, None, Some("mark")]);
+        ints.extend_from_slice(&[Some(1), Some(2), None, Some(4)]);
+    }
+    ("f1", strings, "f2", ints)
+}
+
+fn struct_array_from_vec(
+    field1: &str,
+    strings: &Vec<Option<&str>>,
+    field2: &str,
+    ints: &Vec<Option<i32>>,
+) {
+    let strings: ArrayRef = Arc::new(StringArray::from(strings.clone()));
+    let ints: ArrayRef = Arc::new(Int32Array::from(ints.clone()));
+
+    criterion::black_box(
+        StructArray::try_from(vec![(field1.clone(), strings), (field2.clone(), ints)])
+            .unwrap(),
+    );
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("array_from_vec 128", |b| b.iter(|| array_from_vec(128)));
     c.bench_function("array_from_vec 256", |b| b.iter(|| array_from_vec(256)));
@@ -62,6 +95,26 @@ fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("array_string_from_vec 512", |b| {
         b.iter(|| array_string_from_vec(512))
     });
+
+    let (field1, strings, field2, ints) = struct_array_values(128);
+    c.bench_function("struct_array_from_vec 128", |b| {
+        b.iter(|| struct_array_from_vec(&field1, &strings, &field2, &ints))
+    });
+
+    let (field1, strings, field2, ints) = struct_array_values(256);
+    c.bench_function("struct_array_from_vec 256", |b| {
+        b.iter(|| struct_array_from_vec(&field1, &strings, &field2, &ints))
+    });
+
+    let (field1, strings, field2, ints) = struct_array_values(512);
+    c.bench_function("struct_array_from_vec 512", |b| {
+        b.iter(|| struct_array_from_vec(&field1, &strings, &field2, &ints))
+    });
+
+    let (field1, strings, field2, ints) = struct_array_values(1024);
+    c.bench_function("struct_array_from_vec 1024", |b| {
+        b.iter(|| struct_array_from_vec(&field1, &strings, &field2, &ints))
+    });
 }

 criterion_group!(benches, criterion_benchmark);
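
To see what data the new benchmark cases feed into `StructArray::try_from`, the fixture can be reproduced with the standard library alone. This is a sketch under the assumption that only the shape of the input matters here; `pattern_values` is a hypothetical name mirroring `struct_array_values` from the diff, without the field names.

```rust
/// Repeats a 4-entry pattern `n / 4` times, as the benchmark fixture in the diff does.
/// Hypothetical stand-in for `struct_array_values`.
fn pattern_values(n: usize) -> (Vec<Option<&'static str>>, Vec<Option<i32>>) {
    let mut strings = Vec::with_capacity(n);
    let mut ints = Vec::with_capacity(n);
    for _ in 0..n / 4 {
        strings.extend_from_slice(&[Some("joe"), None, None, Some("mark")]);
        ints.extend_from_slice(&[Some(1), Some(2), None, Some(4)]);
    }
    (strings, ints)
}

fn main() {
    for &n in &[128usize, 256, 512, 1024] {
        let (strings, ints) = pattern_values(n);
        // Every benchmarked size is a multiple of 4, so both columns hold n entries.
        assert_eq!(strings.len(), n);
        assert_eq!(ints.len(), n);
        // One entry in four is null in both columns, i.e. null at the struct level.
        let all_null = strings
            .iter()
            .zip(&ints)
            .filter(|(s, i)| s.is_none() && i.is_none())
            .count();
        assert_eq!(all_null, n / 4);
    }
    // A size that is not a multiple of 4 is rounded down to the previous multiple.
    assert_eq!(pattern_values(10).0.len(), 8);
}
```

Because the loop runs `n / 4` times, the requested size is only honoured exactly for multiples of 4, which holds for all four sizes registered in `criterion_benchmark`.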

rust/arrow/src/array/array.rs

Lines changed: 166 additions & 15 deletions
@@ -16,7 +16,7 @@
 // under the License.

 use std::any::Any;
-use std::convert::From;
+use std::convert::{From, TryFrom};
 use std::fmt;
 use std::io::Write;
 use std::iter::{FromIterator, IntoIterator};
@@ -28,11 +28,14 @@ use chrono::prelude::*;
 use super::*;
 use crate::array::builder::StringDictionaryBuilder;
 use crate::array::equal::JsonEqual;
-use crate::buffer::{Buffer, MutableBuffer};
+use crate::buffer::{buffer_bin_or, Buffer, MutableBuffer};
 use crate::datatypes::DataType::Struct;
 use crate::datatypes::*;
 use crate::memory;
-use crate::util::bit_util;
+use crate::{
+    error::{ArrowError, Result},
+    util::bit_util,
+};

 /// Number of seconds in a day
 const SECONDS_IN_DAY: i64 = 86_400;
@@ -360,6 +363,13 @@ fn slice_data(data: &ArrayDataRef, mut offset: usize, length: usize) -> ArrayDat
     Arc::new(new_data)
 }

+// creates a new MutableBuffer initializes all falsed
+// this is useful to populate null bitmaps
+fn make_null_buffer(len: usize) -> MutableBuffer {
+    let num_bytes = bit_util::ceil(len, 8);
+    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false)
+}
+
 /// ----------------------------------------------------------------------------
 /// Implementations of different array types

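
The `make_null_buffer` helper added in this hunk allocates one bit per entry, rounded up to whole bytes, with every bit cleared. A minimal standard-library sketch of the same idea follows, assuming a `Vec<u8>` in place of `MutableBuffer` and least-significant-bit-first ordering. All names here (`bytes_for_bits`, `make_validity`, `set_valid`, `validity_of`) are hypothetical.

```rust
/// Bytes needed to hold `len` bits (ceiling division by 8).
fn bytes_for_bits(len: usize) -> usize {
    (len + 7) / 8
}

/// A bitmap for `len` entries with every bit cleared, i.e. every entry null.
fn make_validity(len: usize) -> Vec<u8> {
    vec![0u8; bytes_for_bits(len)]
}

/// Marks entry `i` as valid (bit order: least significant bit first).
fn set_valid(bitmap: &mut [u8], i: usize) {
    bitmap[i / 8] |= 1 << (i % 8);
}

/// Builds the validity bitmap of a slice of optional values.
fn validity_of<T>(values: &[Option<T>]) -> Vec<u8> {
    let mut bitmap = make_validity(values.len());
    for (i, v) in values.iter().enumerate() {
        if v.is_some() {
            set_valid(&mut bitmap, i);
        }
    }
    bitmap
}

fn main() {
    assert_eq!(bytes_for_bits(0), 0);
    assert_eq!(bytes_for_bits(8), 1);
    assert_eq!(bytes_for_bits(9), 2);
    // [1, 2, null, 4] has valid bits at indices 0, 1 and 3.
    assert_eq!(validity_of(&[Some(1), Some(2), None, Some(4)]), vec![0b0000_1011]);
}
```

Starting from an all-zero bitmap means the `From<Vec<Option<_>>>` constructors only need to set bits for the `Some` entries, which is the pattern the refactored call sites in the following hunks rely on.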
@@ -703,9 +713,7 @@ macro_rules! def_numeric_from_vec {
         {
             fn from(data: Vec<Option<<$ty as ArrowPrimitiveType>::Native>>) -> Self {
                 let data_len = data.len();
-                let num_bytes = bit_util::ceil(data_len, 8);
-                let mut null_buf =
-                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut null_buf = make_null_buffer(data_len);
                 let mut val_buf = MutableBuffer::new(
                     data_len * mem::size_of::<<$ty as ArrowPrimitiveType>::Native>(),
                 );
@@ -780,8 +788,7 @@ impl<T: ArrowTimestampType> PrimitiveArray<T> {
     pub fn from_opt_vec(data: Vec<Option<i64>>, timezone: Option<Arc<String>>) -> Self {
         // TODO: duplicated from def_numeric_from_vec! macro, it looks possible to convert to generic
         let data_len = data.len();
-        let num_bytes = bit_util::ceil(data_len, 8);
-        let mut null_buf = MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+        let mut null_buf = make_null_buffer(data_len);
         let mut val_buf = MutableBuffer::new(data_len * mem::size_of::<i64>());

         {
@@ -812,8 +819,7 @@ impl<T: ArrowTimestampType> PrimitiveArray<T> {
 /// Constructs a boolean array from a vector. Should only be used for testing.
 impl From<Vec<bool>> for BooleanArray {
     fn from(data: Vec<bool>) -> Self {
-        let num_byte = bit_util::ceil(data.len(), 8);
-        let mut mut_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
+        let mut mut_buf = make_null_buffer(data.len());
         {
             let mut_slice = mut_buf.data_mut();
             for (i, b) in data.iter().enumerate() {
@@ -834,7 +840,7 @@ impl From<Vec<Option<bool>>> for BooleanArray {
     fn from(data: Vec<Option<bool>>) -> Self {
         let data_len = data.len();
         let num_byte = bit_util::ceil(data_len, 8);
-        let mut null_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);
+        let mut null_buf = make_null_buffer(data.len());
         let mut val_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, false);

         {
@@ -1642,9 +1648,7 @@ macro_rules! def_string_from_vec {
             fn from(v: Vec<Option<&'a str>>) -> Self {
                 let mut offsets = Vec::with_capacity(v.len() + 1);
                 let mut values = Vec::new();
-                let num_bytes = bit_util::ceil(v.len(), 8);
-                let mut null_buf =
-                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut null_buf = make_null_buffer(v.len());
                 let mut length_so_far = 0;
                 offsets.push(length_so_far);
                 for (i, s) in v.iter().enumerate() {
@@ -2002,6 +2006,67 @@ impl From<ArrayDataRef> for StructArray {
     }
 }

+impl TryFrom<Vec<(&str, ArrayRef)>> for StructArray {
+    type Error = ArrowError;
+
+    /// builds a StructArray from a vector of names and arrays.
+    /// This errors if the values have a different length.
+    /// An entry is set to Null when all values are null.
+    fn try_from(values: Vec<(&str, ArrayRef)>) -> Result<Self> {
+        let values_len = values.len();
+
+        // these will be populated
+        let mut fields = Vec::with_capacity(values_len);
+        let mut child_data = Vec::with_capacity(values_len);
+
+        // len: the size of the arrays.
+        let mut len: Option<usize> = None;
+        // null: the null mask of the arrays.
+        let mut null: Option<Buffer> = None;
+        for (field_name, array) in values {
+            let child_datum = array.data();
+            if let Some(len) = len {
+                if len != child_datum.len() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        format!("Array of field \"{}\" has length {}, but previous elements have length {}.
+                        All arrays in every entry in a struct array must have the same length.", field_name, child_datum.len(), len)
+                    ));
+                }
+            } else {
+                len = Some(child_datum.len())
+            }
+            child_data.push(child_datum.clone());
+            fields.push(Field::new(
+                field_name,
+                array.data_type().clone(),
+                child_datum.null_buffer().is_some(),
+            ));
+
+            if let Some(child_null_buffer) = child_datum.null_buffer() {
+                null = Some(if let Some(null_buffer) = &null {
+                    buffer_bin_or(null_buffer, 0, child_null_buffer, 0, null_buffer.len())
+                } else {
+                    child_null_buffer.clone()
+                });
+            } else if null.is_some() {
+                // when one of the fields has no nulls, them there is no null in the array
+                null = None;
+            }
+        }
+        let len = len.unwrap();
+
+        let mut builder = ArrayData::builder(DataType::Struct(fields))
+            .len(len)
+            .child_data(child_data);
+        if let Some(null_buffer) = null {
+            let null_count = len - bit_util::count_set_bits(null_buffer.data());
+            builder = builder.null_count(null_count).null_bit_buffer(null_buffer);
+        }
+
+        Ok(StructArray::from(builder.build()))
+    }
+}
+
 impl Array for StructArray {
     fn as_any(&self) -> &Any {
         self
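
The length validation at the top of the `try_from` loop can be sketched in isolation. The example below is a simplified stand-in, assuming each child array is represented only by its name and length; `common_len` is a hypothetical helper and `String` replaces `ArrowError`. Unlike the diff, which calls `len.unwrap()` and would therefore panic on an empty input vector, this sketch returns an error in that case; that is a design choice of the sketch, not behaviour of the commit.

```rust
/// Returns the common length of all named columns, or an error describing
/// the first column whose length differs from the ones seen before it.
fn common_len(columns: &[(&str, usize)]) -> Result<usize, String> {
    let mut len: Option<usize> = None;
    for (name, column_len) in columns {
        match len {
            // A length was already recorded and this column disagrees with it.
            Some(expected) if expected != *column_len => {
                return Err(format!(
                    "Array of field \"{}\" has length {}, but previous elements have length {}.",
                    name, column_len, expected
                ));
            }
            // Same length as before: nothing to do.
            Some(_) => {}
            // First column: record its length.
            None => len = Some(*column_len),
        }
    }
    // Assumption of this sketch: an empty input is reported as an error.
    len.ok_or_else(|| "at least one column is required".to_string())
}

fn main() {
    assert_eq!(common_len(&[("f1", 4), ("f2", 4)]), Ok(4));

    let err = common_len(&[("f1", 3), ("f2", 4)]).unwrap_err();
    assert!(err.starts_with(
        "Array of field \"f2\" has length 4, but previous elements have length 3."
    ));

    assert!(common_len(&[]).is_err());
}
```

The error text mirrors the prefix checked by `test_struct_array_from_vec_error`, so the mismatch is always attributed to the later field.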
@@ -2382,7 +2447,7 @@ mod tests {

     use crate::buffer::Buffer;
     use crate::datatypes::{DataType, Field};
-    use crate::memory;
+    use crate::{bitmap::Bitmap, memory};

     #[test]
     fn test_primitive_array_from_vec() {
@@ -3858,6 +3923,92 @@ mod tests {
         assert_eq!(0, struct_array.offset());
     }

+    /// validates that the in-memory representation follows [the spec](https://arrow.apache.org/docs/format/Columnar.html#struct-layout)
+    #[test]
+    fn test_struct_array_from_vec() {
+        let strings: ArrayRef = Arc::new(StringArray::from(vec![
+            Some("joe"),
+            None,
+            None,
+            Some("mark"),
+        ]));
+        let ints: ArrayRef =
+            Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)]));
+
+        let arr =
+            StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())])
+                .unwrap();
+
+        let struct_data = arr.data();
+        assert_eq!(4, struct_data.len());
+        assert_eq!(1, struct_data.null_count());
+        assert_eq!(
+            // 00001011
+            &Some(Bitmap::from(Buffer::from(&[11_u8]))),
+            struct_data.null_bitmap()
+        );
+
+        let expected_string_data = ArrayData::builder(DataType::Utf8)
+            .len(4)
+            .null_count(2)
+            .null_bit_buffer(Buffer::from(&[9_u8]))
+            .add_buffer(Buffer::from(&[0, 3, 3, 3, 7].to_byte_slice()))
+            .add_buffer(Buffer::from("joemark".as_bytes()))
+            .build();
+
+        let expected_int_data = ArrayData::builder(DataType::Int32)
+            .len(4)
+            .null_count(1)
+            .null_bit_buffer(Buffer::from(&[11_u8]))
+            .add_buffer(Buffer::from(&[1, 2, 0, 4].to_byte_slice()))
+            .build();
+
+        assert_eq!(expected_string_data, arr.column(0).data());
+
+        // TODO: implement equality for ArrayData
+        assert_eq!(expected_int_data.len(), arr.column(1).data().len());
+        assert_eq!(
+            expected_int_data.null_count(),
+            arr.column(1).data().null_count()
+        );
+        assert_eq!(
+            expected_int_data.null_bitmap(),
+            arr.column(1).data().null_bitmap()
+        );
+        let expected_value_buf = expected_int_data.buffers()[0].clone();
+        let actual_value_buf = arr.column(1).data().buffers()[0].clone();
+        for i in 0..expected_int_data.len() {
+            if !expected_int_data.is_null(i) {
+                assert_eq!(
+                    expected_value_buf.data()[i * 4..(i + 1) * 4],
+                    actual_value_buf.data()[i * 4..(i + 1) * 4]
+                );
+            }
+        }
+    }
+
+    #[test]
+    fn test_struct_array_from_vec_error() {
+        let strings: ArrayRef = Arc::new(StringArray::from(vec![
+            Some("joe"),
+            None,
+            None,
+            // 3 elements, not 4
+        ]));
+        let ints: ArrayRef =
+            Arc::new(Int32Array::from(vec![Some(1), Some(2), None, Some(4)]));
+
+        let arr =
+            StructArray::try_from(vec![("f1", strings.clone()), ("f2", ints.clone())]);
+
+        match arr {
+            Err(ArrowError::InvalidArgumentError(e)) => {
+                assert!(e.starts_with("Array of field \"f2\" has length 4, but previous elements have length 3."));
+            }
+            _ => assert!(false, "This test got an unexpected error type"),
+        };
+    }
+
     #[test]
     #[should_panic(
         expected = "the field data types must match the array data in a StructArray"
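
The expected string buffers in `test_struct_array_from_vec` (offsets `[0, 3, 3, 3, 7]` and values `"joemark"`) follow from how a variable-length string column is laid out: one offset per entry plus a leading zero, with null entries contributing no bytes. A minimal sketch of that layout, assuming plain vectors instead of Arrow buffers and a hypothetical helper named `utf8_buffers`:

```rust
/// Builds the offsets and value bytes for a column of optional strings.
/// A null entry repeats the previous offset and adds no bytes.
fn utf8_buffers(values: &[Option<&str>]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut data = Vec::new();
    offsets.push(0);
    for v in values {
        if let Some(s) = v {
            data.extend_from_slice(s.as_bytes());
        }
        // The offset after entry i is the total number of bytes written so far.
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}

fn main() {
    let (offsets, data) = utf8_buffers(&[Some("joe"), None, None, Some("mark")]);
    assert_eq!(offsets, vec![0, 3, 3, 3, 7]);
    assert_eq!(data, b"joemark".to_vec());
}
```

Entry `i` spans `data[offsets[i]..offsets[i + 1]]`, so the two null entries in the middle are empty ranges and are distinguished from empty strings only by the validity bitmap.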
