Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

README.md

Fory Row Format

Fory row format is heavily inspired by spark tungsten row format, but with changes:

  • Use arrow schema to describe meta.
  • The implementation support java/C++/python/etc..
  • String support latin/utf16/utf8 encoding.
  • Decimal use arrow decimal format.
  • Variable-size field can be inline in fixed-size region if small enough.
  • Allow skip padding by generate Row using aot to put offsets in generated code.

The initial Fory java row data structure implementation is modified from spark unsafe row/writer.

See Encoders.bean Javadoc for a list built-in supported types.

Row Format Java

To begin using the row format from Java, start with the Encoders class:

// Many built-in types and collections are supported
public record MyRecord(int key, String value) {}

// The encoder supplier is relatively expensive to create
// It is thread-safe and should be re-used
Supplier<RowEncoder<MyRecord>> encoderFactory =
    Encoders.buildBeanCodec(MyRecord.class)
            .build();

// Each individual encoder is relatively cheap to create
// It is not thread-safe, but may be reused by the same thread
var encoder = encoderFactory.get();
byte[] encoded = encoder.encode(new MyRecord(42, "Test"));

MyRecord deserialized = encoder.decode(encoded);

Compact Format

The default row format is cross-language compatible and alignment-padded for maximum performance. When data size is a greater concern, the compact format provides an alternate encoding that uses significantly less space.

Enable the compact codec on the encoder builder:

Supplier<RowEncoder<MyRecord>> encoderFactory =
    Encoders.buildBeanCodec(MyRecord.class)
            .compactEncoding()
            .build();

Optimizations include:

  • struct stores fixed-size fields (e.g. Int128. FixedSizeBinary) inline in fixed-data area without offset + size
  • struct of all fixed-sized fields is itself considered fixed-size to store in other struct or array
  • struct skips null bitmap if all fields are non-nullable
  • struct sorts fields by fixed-size for best-effort (but not guaranteed) alignment
  • struct can use less than 8 bytes for small data (int, short, etc)
  • struct null bitmap stored at end of struct to borrow alignment padding if possible
  • array stores fixed-size fields inline in fixed-data area without offset+size
  • array header uses 4 bytes for size (since Collection and array are only int-sized) and leaves remaining 4 bytes for start of null bitmap

Custom Type Registration

It is possible to register custom type handling and collection factories for the row format - see Encoders.registerCustomCodec and Encoders.registerCustomCollectionFactory. For an interface, Fory can synthesize a simple value implementation, such as the UuidType below.

A short example:

public interface UuidType {
  UUID f1();
  UUID[] f2();
  SortedSet<UUID> f3();
}

static class UuidEncoder implements CustomCodec.MemoryBufferCodec<UUID> {
  @Override
  public MemoryBuffer encode(final UUID value) {
    final MemoryBuffer result = MemoryBuffer.newHeapBuffer(16);
    result.putInt64(0, value.getMostSignificantBits());
    result.putInt64(8, value.getLeastSignificantBits());
    return result;
  }

  @Override
  public UUID decode(final MemoryBuffer value) {
    return new UUID(value.readInt64(), value.readInt64());
  }
}

static class SortedSetOfUuidDecoder implements CustomCollectionFactory<UUID, SortedSet<UUID>> {
  @Override
  public SortedSet<UUID> newCollection(final int size) {
    return new TreeSet<>();
  }
}

Encoders.registerCustomCodec(UUID.class, new UuidEncoder());
Encoders.registerCustomCollectionFactory(
    SortedSet.class, UUID.class, new SortedSetOfUuidDecoder());

RowEncoder<UuidType> encoder = Encoders.bean(UuidType.class);