These examples reproduce the problems listed in the Transaction Processing Performance Council (TPC) TPC-H benchmark. They are meant to demonstrate how to use different aspects of DataFusion, not to produce the most performant queries possible. Each example includes a description of the problem it addresses. If you are familiar with SQL, you can compare the approaches in these examples with the queries listed in the specification.
The examples provided are based on version 2.18.0 of the TPC-H specification.
To run these examples, you must first generate a dataset. The dbgen tool
provided by TPC can create datasets of arbitrary scale. For testing it is
typically sufficient to create a 1 gigabyte dataset. For convenience, this
repository has a script that uses Docker to create this dataset. From the
benchmarks/tpch directory, execute the following script.

    ./tpch-gen.sh 1

The examples provided use Parquet files for the tables generated by dbgen.
A Python script is provided to convert the text files from dbgen into the Parquet
files expected by the examples. From the examples/tpch directory, you can
execute the following command to create the necessary Parquet files.

    python convert_data_to_parquet.py

For easier access, a description of the techniques demonstrated in each file
is in the README.md file in the examples directory.
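
To make the conversion step described above more concrete, here is a minimal sketch of the text layout dbgen produces: fields separated by `|`, with a trailing `|` at the end of every line. This is not the repository's convert_data_to_parquet.py. The sample rows (in particular the comment text) and the `read_tbl` helper are hypothetical and only illustrate why the trailing delimiter needs handling before the data is written to another format. Writing Parquet itself would need an additional library and is not shown.

```python
import csv
import io

# Hypothetical two-row sample in dbgen's text layout for the TPC-H "region"
# table: fields are separated by "|" and every line ends with a trailing "|".
SAMPLE_TBL = (
    "0|AFRICA|example comment|\n"
    "1|AMERICA|another example comment|\n"
)

# Column names of the "region" table as named in the TPC-H specification.
REGION_COLUMNS = ["r_regionkey", "r_name", "r_comment"]


def read_tbl(handle, columns):
    """Yield one dict per row, dropping the empty field left by the trailing '|'."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if fields[-1] == "":
            fields = fields[:-1]  # the trailing delimiter yields one extra empty field
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        yield dict(zip(columns, fields))


rows = list(read_tbl(io.StringIO(SAMPLE_TBL), REGION_COLUMNS))
for row in rows:
    print(row["r_regionkey"], row["r_name"])
```

All values are read as strings here. A real conversion would also assign a type to each column, which this sketch leaves out.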