Fory row format is heavily inspired by spark tungsten row format, but with changes:
- Use arrow schema to describe meta.
- The implementation support java/C++/python/etc..
- String support latin/utf16/utf8 encoding.
- Decimal use arrow decimal format.
- Variable-size field can be inline in fixed-size region if small enough.
- Allow skip padding by generate Row using aot to put offsets in generated code.
The initial Fory java row data structure implementation is modified from spark unsafe row/writer.
See Encoders.bean Javadoc for a list built-in supported types.
To begin using the row format from Java, start with the Encoders class:
// Many built-in types and collections are supported
public record MyRecord(int key, String value) {}
// The encoder supplier is relatively expensive to create
// It is thread-safe and should be re-used
Supplier<RowEncoder<MyRecord>> encoderFactory =
Encoders.buildBeanCodec(MyRecord.class)
.build();
// Each individual encoder is relatively cheap to create
// It is not thread-safe, but may be reused by the same thread
var encoder = encoderFactory.get();
byte[] encoded = encoder.encode(new MyRecord(42, "Test"));
MyRecord deserialized = encoder.decode(encoded);
The default row format is cross-language compatible and alignment-padded for maximum performance. When data size is a greater concern, the compact format provides an alternate encoding that uses significantly less space.
Enable the compact codec on the encoder builder:
Supplier<RowEncoder<MyRecord>> encoderFactory =
Encoders.buildBeanCodec(MyRecord.class)
.compactEncoding()
.build();
Optimizations include:
- struct stores fixed-size fields (e.g. Int128. FixedSizeBinary) inline in fixed-data area without offset + size
- struct of all fixed-sized fields is itself considered fixed-size to store in other struct or array
- struct skips null bitmap if all fields are non-nullable
- struct sorts fields by fixed-size for best-effort (but not guaranteed) alignment
- struct can use less than 8 bytes for small data (int, short, etc)
- struct null bitmap stored at end of struct to borrow alignment padding if possible
- array stores fixed-size fields inline in fixed-data area without offset+size
- array header uses 4 bytes for size (since Collection and array are only int-sized) and leaves remaining 4 bytes for start of null bitmap
It is possible to register custom type handling and collection factories for the row format - see Encoders.registerCustomCodec and Encoders.registerCustomCollectionFactory. For an interface, Fory can synthesize a simple value implementation, such as the UuidType below.
A short example:
public interface UuidType {
UUID f1();
UUID[] f2();
SortedSet<UUID> f3();
}
static class UuidEncoder implements CustomCodec.MemoryBufferCodec<UUID> {
@Override
public MemoryBuffer encode(final UUID value) {
final MemoryBuffer result = MemoryBuffer.newHeapBuffer(16);
result.putInt64(0, value.getMostSignificantBits());
result.putInt64(8, value.getLeastSignificantBits());
return result;
}
@Override
public UUID decode(final MemoryBuffer value) {
return new UUID(value.readInt64(), value.readInt64());
}
}
static class SortedSetOfUuidDecoder implements CustomCollectionFactory<UUID, SortedSet<UUID>> {
@Override
public SortedSet<UUID> newCollection(final int size) {
return new TreeSet<>();
}
}
Encoders.registerCustomCodec(UUID.class, new UuidEncoder());
Encoders.registerCustomCollectionFactory(
SortedSet.class, UUID.class, new SortedSetOfUuidDecoder());
RowEncoder<UuidType> encoder = Encoders.bean(UuidType.class);