This repository provides code for TASTE, a two-phase semantic type detection framework. Semantic type detection can drive a wide range of applications, such as table understanding, data cataloging and search, data quality validation, data transformation, and data wrangling. TASTE is particularly effective and efficient when detecting semantic types over a large number of tables from diverse customers in the cloud.
You can use conda to create a virtual environment and install the requirements as follows:
$ conda create --name taste python=3.6.9
$ conda activate taste
$ pip install -r requirements.txt
In addition, you need to set up a MySQL server (8.0.x is preferred).
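Once the server is running, you can optionally verify that it is reachable. The snippet below is a minimal sketch that assumes a MySQL client library such as pymysql is available (it is not necessarily listed in requirements.txt); substitute your own host, port, and credentials.
import pymysql  # assumed to be installed separately; any MySQL client library works

# Connect with the same credentials you will pass to the scripts below.
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="<mysql_password>")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())  # e.g. ('8.0.34',)
conn.close()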
Download the following files from the data directory of this link and put them in the data/wikitable directory at the root of the project.
├── data
└── wikitable
├── train.table_col_type.json
├── dev.table_col_type.json
└── test.table_col_type.json
Download all the zip files from this link and put them in the data/gittables/src_zip directory.
├── data
└── gittables
├── src_zip
├── abstraction_tables_licensed.zip
├── allegro_con_spirito_tables_licensed.zip
└── ...
And then run unzip_gittables.sh to unzip them:
$ ./unzip_gittables.sh
The directory structure after unzipping:
├── data
└── gittables
└── unzipped
├── abstraction_tables_licensed
├── _1.parquet
└── ...
├── allegro_con_spirito_tables_licensed
└── ...
Download the pre-trained checkpoint from the checkpoint/pretrained directory of this link and put it in the checkpoints/pretrained_hybrid_model directory at the root of the project.
├── checkpoints
└── pretrained_hybrid_model
└── pytorch_model.bin
Run the following command to construct the GitTables-100k datasets:
python data_process/gittables_selector.py
After running successfully, it will generate three JSON files in the data/gittables directory.
├── data
└── gittables
├── train.gittables_100k.json
├── dev.gittables_100k.json
└── test.gittables_100k.json
Run the following command to split tables with a large number of columns into smaller ones for training:
python data_process/split_table.py \
--col_split_threshold=20 \
--train_dataset="data/wikitable/train.table_col_type.json" \
--dev_dataset="data/wikitable/dev.table_col_type.json"Parameter explanations:
--col_split_threshold: The column splitting threshold. If a table has more columns than the specified threshold, it will be split accordingly.
--train_dataset: The path of the training dataset. Value should be "data/wikitable/train.table_col_type.json" for WikiTable, or "data/gittables/train.gittables_100k.json" for GitTables-100k.
--dev_dataset: The path of the validation dataset. Value should be "data/wikitable/dev.table_col_type.json" for WikiTable, or "data/gittables/dev.gittables_100k.json" for GitTables-100k.
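To illustrate what the splitting threshold means, here is a minimal sketch of splitting a wide table column-wise into sub-tables of at most col_split_threshold columns; the repository's split_table.py may handle headers, cell sampling, and labels differently.
def split_columns(columns, col_split_threshold=20):
    # Split one wide table (a list of columns) into sub-tables with at most
    # `col_split_threshold` columns each.
    return [columns[i:i + col_split_threshold]
            for i in range(0, len(columns), col_split_threshold)]

# A table with 45 columns becomes three sub-tables of 20, 20, and 5 columns.
sub_tables = split_columns(list(range(45)), col_split_threshold=20)
print([len(t) for t in sub_tables])  # [20, 20, 5]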
After running successfully, it will generate two JSON files in the data/wikitable directory.
├── data
└── wikitable
├── train.table_col_type_20c.json
└── dev.table_col_type_20c.json
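The same script can be pointed at the GitTables-100k files generated in the previous step, for example:
python data_process/split_table.py \
--col_split_threshold=20 \
--train_dataset="data/gittables/train.gittables_100k.json" \
--dev_dataset="data/gittables/dev.gittables_100k.json"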
Run the following command to fine-tune the ADTD model:
CUDA_VISIBLE_DEVICES="0" python finetuning.py \
--do_train \
--train_dataset="data/wikitable/train.table_col_type_20c.json" \
--dev_dataset="data/wikitable/dev.table_col_type_20c.json" \
--type_vocab="type_vocab/wikitable/type_vocab.txt" \
--hybrid_model_path="checkpoints/pretrained_hybrid_model/pytorch_model.bin" \
--output_dir="<adtd_model_dir>" \
--evaluate_during_training \
--overwrite_output_dir \
--max_cell_per_col 10 \
--use_histogram_feature
Parameter explanations:
--do_train: Run training.
--train_dataset: The path of the training dataset, generated by Step 2.
--dev_dataset: The path of the validation dataset, generated by Step 2.
--type_vocab: The path of the type_vocab file. Value should be "type_vocab/wikitable/type_vocab.txt" for WikiTable; "type_vocab/wikitable/type_vocab_{k}.txt" for WikiTable-Sk, where {k} is the number of types randomly reserved from WikiTable (type_vocab_{k}.txt is generated by type_vocab/vocab_util.py); or "type_vocab/gittables/type_vocab_1953.txt" for GitTables-100k.
--hybrid_model_path: The path of the pre-trained hybrid model.
--output_dir: The ADTD model output directory. Different datasets or model settings should use different directories.
--evaluate_during_training: Run evaluation during training at each logging step.
--overwrite_output_dir: Overwrite the content of the output directory.
--max_cell_per_col: The maximum number of cells to use per column.
--use_histogram_feature: Use the histogram feature to fine-tune the model.
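The last two flags control how column values are featurized. The sketch below is only a conceptual illustration, assuming a toy value-length histogram; the actual histogram feature used by TASTE is defined in the repository code.
from collections import Counter

def column_features(cells, max_cell_per_col=10, num_bins=5):
    # Keep at most `max_cell_per_col` cells, mirroring the --max_cell_per_col flag.
    kept = cells[:max_cell_per_col]
    # Toy "histogram feature": bucket cell string lengths into a fixed number of bins.
    counts = Counter(min(len(str(c)) // 4, num_bins - 1) for c in kept)
    return kept, [counts.get(i, 0) for i in range(num_bins)]

cells, hist = column_features(["New York", "Paris", "Tokyo", "Berlin"])
print(hist)  # [0, 3, 1, 0, 0]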
Run the following command to build MySQL tables for testing:
python build_mysql_table.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json"Parameter explanations:
--mysql_host: The hostname or IP address of the MySQL server.
--mysql_port: The port number of the MySQL server.
--mysql_user: The MySQL username for the connection.
--mysql_password: The password associated with the username.
--eval_database: An empty database used to store the testing dataset. If it doesn't exist, the program will create it automatically.
--test_dataset: The path of the testing dataset. Value should be "data/wikitable/test.table_col_type.json" for WikiTable, or "data/gittables/test.gittables_100k.json" for GitTables-100k.
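After the script finishes, you can optionally confirm that the tables were created, for example with the mysql command-line client (substitute your own connection details and database name):
$ mysql -h <mysql_host> -P <mysql_port> -u <mysql_user> -p \
    -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = '<mysql_database>';"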
Run the following command to get the evaluation results:
CUDA_VISIBLE_DEVICES="0" python evaluation.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json" \
--model_dir="<adtd_model_dir>" \
--threshold_alpha=0.1 \
--threshold_beta=0.9 \
--disable_pipeline \
--disable_cache \
--disable_phase2
Parameter explanations:
--mysql_host: The hostname or IP address of the MySQL server.
--mysql_port: The port number of the MySQL server.
--mysql_user: The MySQL username for the connection.
--mysql_password: The password associated with the username.
--eval_database: The database that stores the testing dataset. It should be consistent with the same parameter in Step 4.
--test_dataset: The path of the testing dataset. It should be consistent with the same parameter in Step 4.
--model_dir: The path of the fine-tuned ADTD model. It should be consistent with --output_dir in Step 3.
--threshold_alpha: The value of threshold α.
--threshold_beta: The value of threshold β.
--disable_pipeline: Disable pipelined execution.
--disable_cache: Disable caching.
--disable_phase2: Disable phase 2.
After running successfully, it will print the evaluation results for the selected dataset, including execution time, F1-score, and the ratio of scanned columns.
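To run the full two-phase TASTE pipeline with pipelining and caching enabled, omit the three --disable_* flags, for example:
CUDA_VISIBLE_DEVICES="0" python evaluation.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json" \
--model_dir="<adtd_model_dir>" \
--threshold_alpha=0.1 \
--threshold_beta=0.9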