Skip to content

Latest commit

 

History

History
278 lines (246 loc) · 13.2 KB

File metadata and controls

278 lines (246 loc) · 13.2 KB

Arrow JDBC Adapter

The Arrow JDBC Adapter assists with working with JDBC and Arrow data. Currently, it supports reading JDBC ResultSets into Arrow VectorSchemaRoots.

ResultSet to VectorSchemaRoot Conversion

This can be accessed via the JdbcToArrow class. The resulting ArrowVectorIterator will convert a ResultSet to Arrow data in batches of rows.

try (ArrowVectorIterator it = JdbcToArrow.sqlToArrowVectorIterator(resultSet, allocator)) {
  while (it.hasNext()) {
    VectorSchemaRoot root = it.next();
    // Consume the root…
  }
}

The batch size and type mapping can both be customized:

JdbcToArrowConfig config = new JdbcToArrowConfigBuilder(allocator, /*calendar=*/null)
    .setReuseVectorSchemaRoot(reuseVectorSchemaRoot)
    .setJdbcToArrowTypeConverter((jdbcFieldInfo -> {
      switch (jdbcFieldInfo.getJdbcType()) {
        case Types.BIGINT:
          // Assume actual value range is SMALLINT
          return new ArrowType.Int(16, true);
        default:
          return null;
      }
    }))
    .build();
try (ArrowVectorIterator iter = JdbcToArrow.sqlToArrowVectorIterator(rs, config)) {
  while (iter.hasNext()) {
    VectorSchemaRoot root = iter.next();
    // Consume the root…
  }
}

The JDBC type can be explicitly specified, which is useful since JDBC drivers can give spurious type information. For example, the Postgres driver has been observed to use Decimal types with scale and precision 0; these cases can be handled by specifying the type explicitly before reading. Also, some JDBC drivers may return BigDecimal values with inconsistent scale. A RoundingMode can be set to handle these cases:

Map<Integer, JdbcFieldInfo> mapping = new HashMap<>();
mapping.put(1, new JdbcFieldInfo(Types.DECIMAL, 20, 7));
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder(allocator, /*calendar=*/null)
    .setBigDecimalRoundingMode(RoundingMode.UNNECESSARY)
    .setExplicitTypesByColumnIndex(mapping)
    .build();
try (ArrowVectorIterator iter = JdbcToArrow.sqlToArrowVectorIterator(rs, config)) {
  while (iter.hasNext()) {
    VectorSchemaRoot root = iter.next();
    // Consume the root…
  }
}

The mapping from JDBC type to Arrow type can be overridden via the JdbcToArrowConfig, but it is not possible to customize the conversion from JDBC value to Arrow value itself, nor is it possible to define a conversion for an unsupported type.

Type Mapping

The JDBC to Arrow type mapping can be obtained at runtime from JdbcToArrowUtils.getArrowTypeFromJdbcType.

JDBC Type Arrow Type Notes
ARRAY List (1)
BIGINT Int64  
BINARY Binary  
BIT Bool  
BLOB Binary  
BOOLEAN Bool  
CHAR Utf8  
CLOB Utf8  
DATE Date32  
DECIMAL Decimal128 (2)
DOUBLE Double  
FLOAT Float32  
INTEGER Int32  
LONGVARBINARY Binary  
LONGNVARCHAR Utf8  
LONGVARCHAR Utf8  
NCHAR Utf8  
NULL Null  
NUMERIC Decimal128  
NVARCHAR Utf8  
REAL Float32  
SMALLINT Int16  
STRUCT Struct (3)
TIME Time32[ms]  
TIMESTAMP Timestamp[ms] (4)
TINYINT Int8  
VARBINARY Binary  
VARCHAR Utf8  
  • (1) The list value type must be explicitly configured and cannot be inferred. Use setArraySubTypeByColumnIndexMap or setArraySubTypeByColumnNameMap.
  • (2) By default, the scale of decimal values must match the scale in the type exactly; precision is allowed to be any value greater or equal to the type precision. If there is a mismatch, by default, an exception will be thrown. This can be configured by setting a different RoundingMode with setBigDecimalRoundingMode.
  • (3) Not fully supported: while the type conversion is defined, the value conversion is not. See ARROW-17006.
  • (4) If a Calendar is provided, then the timestamp will have the timezone of the calendar, else it will be a timestamp without timezone.

VectorSchemaRoot to PreparedStatement Parameter Conversion

The adapter can bind rows of Arrow data from a VectorSchemaRoot to parameters of a JDBC PreparedStatement. This can be accessed via the JdbcParameterBinder class. Each call to next() will bind parameters from the next row of data, and then the application can execute the statement, call addBatch(), etc. as desired. Null values will lead to a setNull call with an appropriate JDBC type code (listed below).

final JdbcParameterBinder binder =
    JdbcParameterBinder.builder(statement, root).bindAll().build();
while (binder.next()) {
    statement.executeUpdate();
}
// Use a VectorLoader to update the root
binder.reset();
while (binder.next()) {
    statement.executeUpdate();
}

The mapping of vectors to parameters, the JDBC type code used by the converters, and the type conversions themselves can all be customized:

final JdbcParameterBinder binder =
    JdbcParameterBinder.builder(statement, root)
        .bind(/*parameterIndex*/2, /*columnIndex*/0)
        .bind(/*parameterIndex*/1, customColumnBinderInstance)
        .build();

Type Mapping

The Arrow to JDBC type mapping can be obtained at runtime via a method on ColumnBinder.

Arrow Type JDBC Type Notes
Binary VARBINARY (setBytes)  
Bool BOOLEAN (setBoolean)  
Date32 DATE (setDate)  
Date64 DATE (setDate)  
Decimal128 DECIMAL (setBigDecimal)  
Decimal256 DECIMAL (setBigDecimal)  
FixedSizeBinary BINARY (setBytes)  
Float32 REAL (setFloat)  
Int8 TINYINT (setByte)  
Int16 SMALLINT (setShort)  
Int32 INTEGER (setInt)  
Int64 BIGINT (setLong)  
LargeBinary LONGVARBINARY (setBytes)  
LargeUtf8 LONGVARCHAR (setString) (1)
Time[s] TIME (setTime)  
Time[ms] TIME (setTime)  
Time[us] TIME (setTime)  
Time[ns] TIME (setTime)  
Timestamp[s] TIMESTAMP (setTimestamp) (2)
Timestamp[ms] TIMESTAMP (setTimestamp) (2)
Timestamp[us] TIMESTAMP (setTimestamp) (2)
Timestamp[ns] TIMESTAMP (setTimestamp) (2)
Utf8 VARCHAR (setString)  
  • (1) Strings longer than Integer.MAX_VALUE bytes (the maximum length of a Java byte[]) will cause a runtime exception.
  • (2) If the timestamp has a timezone, the JDBC type defaults to TIMESTAMP_WITH_TIMEZONE. If the timestamp has no timezone, technically there is not a correct conversion from Arrow value to JDBC value, because a JDBC Timestamp is in UTC, and we have no timezone information. In this case, the default binder will call setTimestamp(int, Timestamp), which will lead to the driver using the "default timezone" (that of the Java VM).