This project aims to classify email messages as Spam or Ham (Not Spam) using Natural Language Processing and machine learning models in Python.
We experimented with two different approaches:
1️⃣ LSTM Model built from scratch using PyTorch
2️⃣ Pretrained RoBERTa model from Hugging Face
Below are the accuracy results from each model:
| Model | Accuracy | Notes |
|---|---|---|
| LSTM (PyTorch) | ~88% | Trained on cleaned spam/ham dataset |
| RoBERTa (HuggingFace) | ~66% | Stronger performance on unseen emails |
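
For reference, accuracy here is taken to mean the fraction of emails whose predicted label matches the true label. The snippet below is a minimal sketch of that calculation; the `accuracy` helper and the toy labels are illustrative only and are not taken from the evaluation scripts in this repository.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (made up for illustration): 3 of 4 predictions are correct.
y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "spam"]
print(f"{accuracy(y_true, y_pred):.0%}")  # prints 75%
```
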
A labeled spam/ham dataset is included in the repository:
📄 spam_ham_dataset.csv
| File | Description |
|---|---|
| mail_classification_sample.py | Classify a single email |
| evaluate_LSTM.py | Evaluate LSTM model performance |
| evaluate_model_with_RoBERTa.py | Evaluate RoBERTa classifier |
| requirements.txt | Dependencies |
- Tokenization, embedding, LSTM layers
- Binary classification (spam vs ham)
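
The pieces listed above (tokenization, embedding, LSTM layers, binary output) might fit together roughly as follows. This is a minimal sketch, not the code in `evaluate_LSTM.py`: the names `tokenize`, `build_vocab`, `encode`, and `SpamLSTM`, the whitespace tokenizer, and all layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def tokenize(text):
    """Very simple whitespace tokenizer (assumed; the project may differ)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab, max_len=20):
    """Convert text to a fixed-length list of token ids, padded with 0."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))


class SpamLSTM(nn.Module):
    """Embedding -> LSTM -> linear layer producing one logit per email."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        # Padding steps are still processed here; a fuller version could
        # use pack_padded_sequence to skip them.
        return self.fc(hidden[-1]).squeeze(1)     # (batch,)


texts = ["win a free prize now", "meeting moved to friday"]
vocab = build_vocab(texts)
batch = torch.tensor([encode(t, vocab) for t in texts])

model = SpamLSTM(len(vocab))
logits = model(batch)
probs = torch.sigmoid(logits)  # probability of spam (untrained, so arbitrary)
print(probs.shape)
```

For training, a single-logit output like this is typically paired with `nn.BCEWithLogitsLoss`, which applies the sigmoid internally.
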
This approach uses a pretrained model from Hugging Face:
➡️ https://huggingface.co/roshana1s/spam-message-classifier
This model is fine-tuned for spam text detection.
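
A minimal sketch of querying this model through the `transformers` text-classification pipeline is shown below. It assumes the checkpoint loads with `pipeline("text-classification", ...)`; the helper names `load_classifier` and `classify_emails` are hypothetical and are not taken from `evaluate_model_with_RoBERTa.py`. The label strings returned depend on the model's own configuration, so they are passed through unchanged.

```python
MODEL_ID = "roshana1s/spam-message-classifier"  # model linked above


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so classify_emails can be used and tested
    # without the transformers package installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_emails(texts, classifier):
    """Return one (label, score) pair per email."""
    # truncation=True clips emails longer than the model's maximum input length.
    results = classifier(list(texts), truncation=True)
    return [(r["label"], float(r["score"])) for r in results]


# Example usage (requires `transformers` and network access on first run):
# clf = load_classifier()
# print(classify_emails(["Congratulations, you won a free cruise!"], clf))
```

Passing the classifier in as an argument keeps the formatting logic separate from model loading, so the same helper works with any callable that returns `label`/`score` dictionaries.
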
A large part of this project is inspired by:
- Detecting spam emails using TensorFlow (GeeksForGeeks): https://www.geeksforgeeks.org/nlp/detecting-spam-emails-using-tensorflow-in-python/
- RoBERTa Spam Classifier (HuggingFace): https://huggingface.co/roshana1s/spam-message-classifier
Thanks to the authors for their great tutorials and models.
conda activate mail_verific
pip install -r requirements.txt
python mail_classification_sample.py
