This repository provides code for TASTE, a two-phase semantic type detection framework. Semantic type detection can drive a wide range of applications, such as table understanding, data cataloging and search, data quality validation, data transformation, and data wrangling. TASTE is particularly effective and efficient when detecting semantic types over a large number of tables from diverse customers in the cloud.
You can use conda to create a virtual environment and install the requirements as follows:
$ conda create --name taste python=3.6.9
$ conda activate taste
$ pip install -r requirements.txt
In addition, you need to set up a MySQL server (8.0.x is preferred).
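Once the server is running, you can optionally verify that it is reachable. The snippet below is a minimal sketch that assumes a MySQL client library such as pymysql is available (it is not necessarily listed in requirements.txt); substitute your own host, port, and credentials.
import pymysql  # assumed to be installed separately; any MySQL client library works

# Connect with the same credentials you will pass to the scripts below.
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="<mysql_password>")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())  # e.g. ('8.0.34',)
conn.close()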
Download the following files from the data directory of this link and put them in the data/wikitable directory at the root of the project.
├── data
└── wikitable
├── train.table_col_type.json
├── dev.table_col_type.json
└── test.table_col_type.json
Download all the zip files from this link and put them in the data/gittables/src_zip directory.
├── data
└── gittables
├── src_zip
├── abstraction_tables_licensed.zip
├── allegro_con_spirito_tables_licensed.zip
└── ...
And then run unzip_gittables.sh to unzip them:
$ ./unzip_gittables.sh
The directory structure after unzipping:
├── data
└── gittables
└── unzipped
├── abstraction_tables_licensed
├── _1.parquet
└── ...
├── allegro_con_spirito_tables_licensed
└── ...
Download the pre-trained checkpoint from the checkpoint/pretrained directory of this link and put it in the checkpoints/pretrained_hybrid_model directory at the root of the project.
├── checkpoints
└── pretrained_hybrid_model
└── pytorch_model.bin
Run the following command to construct the GitTables-100k datasets:
python data_process/gittables_selector.py
After running successfully, it will generate three JSON files in the data/gittables directory.
├── data
└── gittables
├── train.gittables_100k.json
├── dev.gittables_100k.json
└── test.gittables_100k.json
Run the following command to split tables with a large number of columns into smaller ones for training:
python data_process/split_table.py \
--col_split_threshold=20 \
--train_dataset="data/wikitable/train.table_col_type.json" \
--dev_dataset="data/wikitable/dev.table_col_type.json"Parameter explanations:
--col_split_threshold: The column splitting threshold. If a table has more columns than the specified threshold, it will be split accordingly.
--train_dataset: The path of the training dataset. Value should be "data/wikitable/train.table_col_type.json" for WikiTable, or "data/gittables/train.gittables_100k.json" for GitTables-100k.
--dev_dataset: The path of the validation dataset. Value should be "data/wikitable/dev.table_col_type.json" for WikiTable, or "data/gittables/dev.gittables_100k.json" for GitTables-100k.
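To illustrate what the splitting threshold means, here is a minimal sketch of splitting a wide table column-wise into sub-tables of at most col_split_threshold columns; the repository's split_table.py may handle headers, cell sampling, and labels differently.
def split_columns(columns, col_split_threshold=20):
    # Split one wide table (a list of columns) into sub-tables with at most
    # `col_split_threshold` columns each.
    return [columns[i:i + col_split_threshold]
            for i in range(0, len(columns), col_split_threshold)]

# A table with 45 columns becomes three sub-tables of 20, 20, and 5 columns.
sub_tables = split_columns(list(range(45)), col_split_threshold=20)
print([len(t) for t in sub_tables])  # [20, 20, 5]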
After running successfully, it will generate two JSON files in the data/wikitable directory.
├── data
└── wikitable
├── train.table_col_type_20c.json
└── dev.table_col_type_20c.json
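The same script can be pointed at the GitTables-100k files generated in the previous step, for example:
python data_process/split_table.py \
--col_split_threshold=20 \
--train_dataset="data/gittables/train.gittables_100k.json" \
--dev_dataset="data/gittables/dev.gittables_100k.json"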
Run the following command to fine-tune the ADTD model:
CUDA_VISIBLE_DEVICES="0" python finetuning.py \
--do_train \
--train_dataset="data/wikitable/train.table_col_type_20c.json" \
--dev_dataset="data/wikitable/dev.table_col_type_20c.json" \
--type_vocab="type_vocab/wikitable/type_vocab.txt" \
--hybrid_model_path="checkpoints/pretrained_hybrid_model/pytorch_model.bin" \
--output_dir="<adtd_model_dir>" \
--evaluate_during_training \
--overwrite_output_dir \
--max_cell_per_col 10 \
--use_histogram_feature
Parameter explanations:
--do_train: Run training.
--train_dataset: The path of the training dataset, generated by Step 2.
--dev_dataset: The path of the validation dataset, generated by Step 2.
--type_vocab: The path of the type_vocab file. Value should be "type_vocab/wikitable/type_vocab.txt" for WikiTable; "type_vocab/wikitable/type_vocab_{k}.txt" for WikiTable-Sk, where {k} is the number of types randomly reserved from WikiTable (type_vocab_{k}.txt is generated by type_vocab/vocab_util.py); or "type_vocab/gittables/type_vocab_1953.txt" for GitTables-100k.
--hybrid_model_path: The path of the pre-trained hybrid model.
--output_dir: The ADTD model output directory. Different datasets or model settings should use different directories.
--evaluate_during_training: Run evaluation during training at each logging step.
--overwrite_output_dir: Overwrite the content of the output directory.
--max_cell_per_col: The maximum number of cells to use per column.
--use_histogram_feature: Use the histogram feature to fine-tune the model.
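The last two flags control how column values are featurized. The sketch below is only a conceptual illustration, assuming a toy value-length histogram; the actual histogram feature used by TASTE is defined in the repository code.
from collections import Counter

def column_features(cells, max_cell_per_col=10, num_bins=5):
    # Keep at most `max_cell_per_col` cells, mirroring the --max_cell_per_col flag.
    kept = cells[:max_cell_per_col]
    # Toy "histogram feature": bucket cell string lengths into a fixed number of bins.
    counts = Counter(min(len(str(c)) // 4, num_bins - 1) for c in kept)
    return kept, [counts.get(i, 0) for i in range(num_bins)]

cells, hist = column_features(["New York", "Paris", "Tokyo", "Berlin"])
print(hist)  # [0, 3, 1, 0, 0]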
Run the following command to build MySQL tables for testing:
python build_mysql_table.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json"Parameter explanations:
--mysql_host: The hostname or IP address of the MySQL server.
--mysql_port: The port number of the MySQL server.
--mysql_user: The MySQL username for the connection.
--mysql_password: The password associated with the username.
--eval_database: An empty database used to store the testing dataset. If it doesn't exist, the program will create it automatically.
--test_dataset: The path of the testing dataset. Value should be "data/wikitable/test.table_col_type.json" for WikiTable, or "data/gittables/test.gittables_100k.json" for GitTables-100k.
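After the script finishes, you can optionally confirm that the tables were created, for example with the mysql command-line client (substitute your own connection details and database name):
$ mysql -h <mysql_host> -P <mysql_port> -u <mysql_user> -p \
    -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = '<mysql_database>';"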
Run the following command to get the evaluation results:
CUDA_VISIBLE_DEVICES="0" python evaluation.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json" \
--model_dir="<adtd_model_dir>" \
--threshold_alpha=0.1 \
--threshold_beta=0.9 \
--disable_pipeline \
--disable_cache \
--disable_phase2
Parameter explanations:
--mysql_host: The hostname or IP address of the MySQL server.
--mysql_port: The port number of the MySQL server.
--mysql_user: The MySQL username for the connection.
--mysql_password: The password associated with the username.
--eval_database: The database that stores the testing dataset. It should be consistent with the same parameter in Step 4.
--test_dataset: The path of the testing dataset. It should be consistent with the same parameter in Step 4.
--model_dir: The path of the fine-tuned ADTD model. It should be consistent with --output_dir in Step 3.
--threshold_alpha: The value of threshold α.
--threshold_beta: The value of threshold β.
--disable_pipeline: Disable pipelined execution.
--disable_cache: Disable caching.
--disable_phase2: Disable phase 2.
After running successfully, it will print the evaluation results for the selected dataset, including execution time, F1-score, and the ratio of scanned columns.
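To run the full two-phase TASTE pipeline with pipelining and caching enabled, omit the three --disable_* flags, for example:
CUDA_VISIBLE_DEVICES="0" python evaluation.py \
--mysql_host=<mysql_host> \
--mysql_port=<mysql_port> \
--mysql_user=<mysql_user> \
--mysql_password=<mysql_password> \
--eval_database=<mysql_database> \
--test_dataset="data/wikitable/test.table_col_type.json" \
--model_dir="<adtd_model_dir>" \
--threshold_alpha=0.1 \
--threshold_beta=0.9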