datacleaning.ipynb: cleans the dataset of enzyme sequences labeled with optimal growth temperature (OGT). The resulting dataset is 'data/cleaned_ogts.fasta'.
prepare_ogt_train_and_test_datasets.ipynb: splits the above dataset into train, validation, and test datasets. It also samples 10k sequences, following either the original or a uniform distribution, for hyperparameter optimization. The output files are written
Under data/
- cleaned_ogts_train.fasta
- cleaned_ogts_val.fasta
- cleaned_ogts_test.fasta
- ogt_for_hyperopt_original_distribution.fasta
- ogt_for_hyperopt_uniform_distribution.fasta
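
The split and sampling steps can be sketched roughly as below. This is a minimal illustration, not the code from the notebooks. Several details are assumptions, since they are not documented here: that each FASTA header carries the label as a field of the form `ogt=<number>`, that the split is a random 80/10/10 split, and that "uniform distribution" means drawing about the same number of sequences from each OGT bin (5 degrees wide in this sketch). All function names are hypothetical.

```python
import random
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ogt(header):
    """Extract the OGT label from a header.

    ASSUMPTION: the header contains a whitespace-separated field
    'ogt=<number>'. The real header format may differ.
    """
    for field in header.split():
        if field.startswith("ogt="):
            return float(field[len("ogt="):])
    raise ValueError(f"no ogt= field in header: {header!r}")


def split_records(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and split into (train, val, test). Fractions are assumed."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def sample_original(records, n, seed=0):
    """Random sample of up to n records; keeps the original OGT distribution."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def sample_uniform(records, n, bin_width=5.0, seed=0):
    """Draw about n records with roughly equal counts per OGT bin.

    Sparse bins contribute all of their members, so the result can hold
    fewer than n records.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for record in records:
        bins[int(parse_ogt(record[0]) // bin_width)].append(record)
    if not bins:
        return []
    per_bin = max(1, n // len(bins))
    sampled = []
    for key in sorted(bins):
        members = bins[key]
        sampled.extend(rng.sample(members, min(per_bin, len(members))))
    return sampled
```

To apply this to the real data, one would read 'data/cleaned_ogts.fasta' with `read_fasta(open(path))`, call `split_records` once, and then call `sample_original` and `sample_uniform` with `n=10000`. Whether the hyperopt samples are drawn from the full dataset or only from the training split is not stated here, so that choice is left open.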