StackCBEmbed

Introduction

Carbohydrates, the third essential molecules after DNA and proteins, interact with proteins, influencing diverse biological processes and providing protection against pathogens. They also serve as biomarkers or drug targets in the human body.

Due to challenges in experimental techniques, there is a pressing need for developing computational methods to predict protein-carbohydrate binding sites. Computational studies use diverse structure-based and sequence-based methods, with some structured-based techniques relying on frequently unavailable protein structures.

We introduced 'StackCBEmbed,' a computational sequence-based ensemble model, to classify protein-carbohydrate binding interactions at the residue level, utilizing benchmark and testing datasets with high-resolution data. Applying Incremental Feature Selection, we extracted essential sequence-based features and incorporated embedding features from the 'T5-XL-Uniref50' language model, marking the first attempt to apply a protein language model in predicting these interactions. In our ensemble method, base predictors (SVM, MLP, ET and PLS), guided by average information gain scores, were trained on selected sequence-based and protein language model features, with their outcomes merged into original features for input into the meta-layer predictor (SVM).

Graphical abstract of Training

Graphical abstract of Prediction

Environmental Setup :

Programming Language : Python 3.10.12
Machine Learning based library : Scikitlearn 1.2.2
IDE for python : PyCharm 2023.3.2
All libraries : xgboost==2.0.3 scops==0.9.0 pickle5==0.0.11 scikit-learn==1.2.2 numpy==1.26.2 pandas==2.1.4 matplotlib PyQt5==5.15.10 numpy==1.26.2 imblearn==0.0
OS used during running: Pop!_OS 22.04

Dataset (protein_id, no of sequences, sequences and target output):

Benchmark(Training and validation) set: https://drive.google.com/file/d/1lDSVqmTT2wNRjv4JKlwHIMDAZ1_9IOPD/view?usp=sharing
TS49 : https://drive.google.com/file/d/1RfaecHJGfSsp3ny9-TE4xcc_HULiSpnc/view?usp=drive_link
TS88 : https://drive.google.com/file/d/1EKf7vBQypdcLxFVi8etZAZOdkCUcfraU/view?usp=drive_link

Dataset PSSM:

Benchmark(Training and validation) set: https://drive.google.com/drive/folders/1aqwOzRwhjQ_dg2UPOeKcD7VAaYyZDJP1?usp=drive_link
TS49 : https://drive.google.com/drive/folders/10riCKVAfq4pdMH3G10iTh35e02PFiM0z
TS88 : https://drive.google.com/drive/folders/1WMNEPerLjyr4C8g-ARFt1lnNvrxlJ_na

Dataset Embedding:

Benchmark(Training and validation) set: https://drive.google.com/drive/folders/1pVsBDg6p0bMd6aKleMPhMMS1mIfEbRmF
TS49 : https://drive.google.com/drive/folders/1xWvgVNQNsfV6JM9GZPr9hv5rBVJ5yMsB
TS88 : https://drive.google.com/drive/folders/17jZS-ygWVD-D2mJsFLLdnQv6IcedHQ0k

Input Requirements

You need to place the test protein sequences, along with their IDs, into the text file (StackCBEmbed\input_files\input.txt) in FASTA format to obtain predictions for protein-carbohydrate binding sites. It's possible to include several protein sequences simultaneously. Examples of these protein sequences are provided in the 'input.txt' file.
[Optional] The PSSM file (.PSSM) containing evolution-derived information for the protein sequence mentioned in the .txt file.
You need to obtain the PSSM file from 'PSI-BLAST'. Download link: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Details for how to retrieve the PSSM files: https://github.com/mrzResearchArena/BLAST
PSSM generation is costly.
You need to extract the protein sequence's features using the ProtT5-XL-UniRef50 pretrained model of ProtTrans paper and save the features in a .csv file.
Code link(Requires collab pro for its high RAM requirement): https://github.com/agemagician/ProtTrans/blob/master/Embedding/TensorFlow/Advanced/ProtT5-XL-UniRef50.ipynb. Link to paper: https://pubmed.ncbi.nlm.nih.gov/34232869/
(ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning - PubMed (nih.gov))

Run the StackCBEmbed model

Go to the command prompt.
Write the following command in the command prompt to download the 'StackCBEmbed' repository.
```
git clone "https://github.com/farah5112github/StackCBEmbed.git"
```
Navigate to the 'output_csvs' folder within the 'StackCBEmbed' directories(table_2_generation, table_3_generation, etc.). There you will find the target csvs for the paper.
Navigate to the 'all_required_csvs' folder within the 'StackCBEmbed' directories(table_2_generation, table_3_generation, etc.).Place all the required files into the 'all_required_csvs' folder if you want to run the main.py. main.py will generate the csv files. The "readme.txt" file in each 'all_required_csvs' folder has all the required csv list and download link.
For prediction folder,place all the required input files into the 'input_files' folder.
Substitute the existing protein sequence in the input.txt file with your own sequence.
Proceed to the 'StackCBEmbed codes' folder and execute main.py from the command prompt using the following command.
```
python main.py
```
During execution, you will be prompted to choose between using the original model or the model with embeddings only.
The output files will be created in the 'output_csvs' directory.

Name		Name	Last commit message	Last commit date
Latest commit History 137 Commits
fig_1_generation		fig_1_generation
fig_2_generation		fig_2_generation
fig_3_generation		fig_3_generation
hyperparameter_tuning		hyperparameter_tuning
prediction		prediction
table_11-14		table_11-14
table_2_generation		table_2_generation
table_3_generation		table_3_generation
table_4_generation		table_4_generation
table_5_generation		table_5_generation
table_6_generation		table_6_generation
table_9-10		table_9-10
training		training
README.md		README.md
prediction.png		prediction.png
requirements.txt		requirements.txt
training.png		training.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StackCBEmbed

Introduction

Graphical abstract of Training

Graphical abstract of Prediction

Environmental Setup :

Dataset (protein_id, no of sequences, sequences and target output):

Dataset PSSM:

Dataset Embedding:

Input Requirements

Run the StackCBEmbed model

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

farah5112github/StackCBEmbed

Folders and files

Latest commit

History

Repository files navigation

StackCBEmbed

Introduction

Graphical abstract of Training

Graphical abstract of Prediction

Environmental Setup :

Dataset (protein_id, no of sequences, sequences and target output):

Dataset PSSM:

Dataset Embedding:

Input Requirements

Run the StackCBEmbed model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages