Carbohydrates, the third essential molecules after DNA and proteins, interact with proteins, influencing diverse biological processes and providing protection against pathogens. They also serve as biomarkers or drug targets in the human body.
Due to challenges in experimental techniques, there is a pressing need for developing computational methods to predict protein-carbohydrate binding sites. Computational studies use diverse structure-based and sequence-based methods, with some structured-based techniques relying on frequently unavailable protein structures.
We introduced 'StackCBEmbed,' a computational sequence-based ensemble model, to classify protein-carbohydrate binding interactions at the residue level, utilizing benchmark and testing datasets with high-resolution data. Applying Incremental Feature Selection, we extracted essential sequence-based features and incorporated embedding features from the 'T5-XL-Uniref50' language model, marking the first attempt to apply a protein language model in predicting these interactions. In our ensemble method, base predictors (SVM, MLP, ET and PLS), guided by average information gain scores, were trained on selected sequence-based and protein language model features, with their outcomes merged into original features for input into the meta-layer predictor (SVM).
Programming Language : Python 3.10.12
Machine Learning based library : Scikitlearn 1.2.2
IDE for python : PyCharm 2023.3.2
All libraries : xgboost==2.0.3
scops==0.9.0
pickle5==0.0.11
scikit-learn==1.2.2
numpy==1.26.2
pandas==2.1.4
matplotlib
PyQt5==5.15.10
numpy==1.26.2
imblearn==0.0
OS used during running: Pop!_OS 22.04
Benchmark(Training and validation) set: https://drive.google.com/file/d/1lDSVqmTT2wNRjv4JKlwHIMDAZ1_9IOPD/view?usp=sharing
TS49 : https://drive.google.com/file/d/1RfaecHJGfSsp3ny9-TE4xcc_HULiSpnc/view?usp=drive_link
TS88 : https://drive.google.com/file/d/1EKf7vBQypdcLxFVi8etZAZOdkCUcfraU/view?usp=drive_link
Benchmark(Training and validation) set: https://drive.google.com/drive/folders/1aqwOzRwhjQ_dg2UPOeKcD7VAaYyZDJP1?usp=drive_link
TS49 : https://drive.google.com/drive/folders/10riCKVAfq4pdMH3G10iTh35e02PFiM0z
TS88 : https://drive.google.com/drive/folders/1WMNEPerLjyr4C8g-ARFt1lnNvrxlJ_na
Benchmark(Training and validation) set: https://drive.google.com/drive/folders/1pVsBDg6p0bMd6aKleMPhMMS1mIfEbRmF
TS49 : https://drive.google.com/drive/folders/1xWvgVNQNsfV6JM9GZPr9hv5rBVJ5yMsB
TS88 : https://drive.google.com/drive/folders/17jZS-ygWVD-D2mJsFLLdnQv6IcedHQ0k
- You need to place the test protein sequences, along with their IDs, into the text file (StackCBEmbed\input_files\input.txt) in FASTA format to obtain predictions for protein-carbohydrate binding sites. It's possible to include several protein sequences simultaneously. Examples of these protein sequences are provided in the 'input.txt' file.
- [Optional] The PSSM file (.PSSM) containing evolution-derived information for the protein sequence mentioned in the .txt file.
You need to obtain the PSSM file from 'PSI-BLAST'. Download link: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Details for how to retrieve the PSSM files: https://github.com/mrzResearchArena/BLAST
PSSM generation is costly. - You need to extract the protein sequence's features using the ProtT5-XL-UniRef50 pretrained model of ProtTrans paper and save the features in a .csv file.
Code link(Requires collab pro for its high RAM requirement): https://github.com/agemagician/ProtTrans/blob/master/Embedding/TensorFlow/Advanced/ProtT5-XL-UniRef50.ipynb. Link to paper: https://pubmed.ncbi.nlm.nih.gov/34232869/
(ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning - PubMed (nih.gov))
- Go to the command prompt.
- Write the following command in the command prompt to download the 'StackCBEmbed' repository.
git clone "https://github.com/farah5112github/StackCBEmbed.git" - Navigate to the 'output_csvs' folder within the 'StackCBEmbed' directories(table_2_generation, table_3_generation, etc.). There you will find the target csvs for the paper.
- Navigate to the 'all_required_csvs' folder within the 'StackCBEmbed' directories(table_2_generation, table_3_generation, etc.).Place all the required files into the 'all_required_csvs' folder if you want to run the main.py. main.py will generate the csv files. The "readme.txt" file in each 'all_required_csvs' folder has all the required csv list and download link.
- For prediction folder,place all the required input files into the 'input_files' folder.
- Substitute the existing protein sequence in the input.txt file with your own sequence.
- Proceed to the 'StackCBEmbed codes' folder and execute main.py from the command prompt using the following command.
python main.py - During execution, you will be prompted to choose between using the original model or the model with embeddings only.
- The output files will be created in the 'output_csvs' directory.

