Using Trino SQL (specifically the AWS Athena implementation of Trino), I want to compute safe hashes of arbitrary MAP columns. By "arbitrary" I mean a MAP whose values may themselves be MAPs for some keys. In the example below, I SELECT two rows containing the same MAP, defined with different key orderings:
    WITH test_data(some_map) AS (
        VALUES
            ( -- Map with key order: 'a', 'b'
                MAP(ARRAY['a', 'b'], ARRAY[ARRAY['x'], ARRAY['y']])
            ),
            ( -- Map with key order: 'b', 'a'
                MAP(ARRAY['b', 'a'], ARRAY[ARRAY['y'], ARRAY['x']])
            )
    )
    SELECT some_map
    FROM test_data
Both rows represent the same object, so I want to compute the same hash for them.
Here is my current solution. It first casts the MAP to JSON, then serializes it with json_format, and only then hashes the result:
    -- Reuses the test_data CTE defined above
    SELECT
        some_map,
        xxhash64(
            CAST(
                json_format(CAST(some_map AS JSON))
                AS VARBINARY
            )
        ) AS hashed
    FROM test_data
Empirically this trick seems to work, because Trino's internal representation of a MAP appears to be key-ordered. However, I cannot be sure that the hash will stay consistent with these empirical observations.
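For reference, this is roughly how I check the observation on the sample data: a self-contained sketch that counts the distinct hashes over the two rows. I use to_utf8 here instead of CAST(... AS VARBINARY) only because it makes the byte encoding explicit; I assume the two are interchangeable for this purpose. If the key ordering really is normalized, the query should return a single distinct hash, but that is only what I observe, not something I can point to a guarantee for.

```sql
-- Sketch: count distinct hashes over two maps that differ only in key order.
WITH test_data(some_map) AS (
    VALUES
        (MAP(ARRAY['a', 'b'], ARRAY[ARRAY['x'], ARRAY['y']])),
        (MAP(ARRAY['b', 'a'], ARRAY[ARRAY['y'], ARRAY['x']]))
)
SELECT
    COUNT(*) AS row_count,
    -- Same pipeline as above: MAP -> JSON -> VARCHAR -> VARBINARY -> hash
    COUNT(DISTINCT xxhash64(to_utf8(json_format(CAST(some_map AS JSON))))) AS distinct_hashes
FROM test_data
```

A passing check on a handful of rows says nothing about nested MAPs or about future engine versions, which is why I am asking.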
My question is twofold:
- Is it in any way guaranteed that the keys of a MAP are ordered?
- If not, what would be a safe way to compute the hash of such a MAP?