ScriptGuard is an LLM-based system for detecting malicious and dangerous scripts. It combines a fine-tuned code model, ZenML pipelines, a RAG architecture, and multiple data sources.
- Multi-Source Data Collection: GitHub, MalwareBazaar, Hugging Face, CVE Feeds
- Advanced Preprocessing: Syntax validation, quality filtering, feature extraction
- Intelligent Augmentation: Code obfuscation, polymorphic variant generation
- Few-Shot RAG: Code similarity search for context-aware classification (NEW - EXPERIMENTAL)
- Database Management: PostgreSQL-based dataset versioning and deduplication
- Production-Ready: FastAPI inference, Docker deployment, RAG with Qdrant
- Optimized Training: Unsloth & Flash Attention 2 support for faster fine-tuning
- Sources: GitHub API, MalwareBazaar, Hugging Face Datasets, NVD CVE Feeds
- Validation: AST syntax checking, encoding validation, quality metrics
- Augmentation: Base64/hex obfuscation, variable renaming, code mutation
- Features: Entropy analysis, API pattern detection, AST features
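The validation stage listed above can be pictured with a minimal sketch. It assumes Python samples and uses only the standard library; the function names `is_valid_python` and `is_valid_utf8` are hypothetical and are not taken from the ScriptGuard codebase.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python. Nothing is executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources with null bytes on some Python versions.
        return False


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A collector could run both checks before a sample is stored, dropping anything that fails either one.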
- Base Model: bigcode/starcoder2-3b (optimized for code analysis)
- Fine-tuning: Parameter-efficient fine-tuning using QLoRA (4-bit quantization) with Unsloth optimization
- Few-Shot RAG: Code similarity search using microsoft/unixcoder-base embeddings (NEW)
- Orchestration: ZenML manages the end-to-end ML lifecycle
- RAG: Qdrant stores embeddings of known CVEs and code samples
- Tracking: Comet.ml / WandB monitors experiments and metrics
- Inference: FastAPI provides high-performance REST API
- Containerization: Docker Compose orchestrates services
- Database: PostgreSQL for dataset management and versioning
- Language: Python 3.12
- Database: PostgreSQL 15 (with connection pooling)
- Vector DB: Qdrant (enhanced RAG)
- Package Manager: uv
- Orchestration: ZenML
- Fine-tuning: PEFT (LoRA/QLoRA), Unsloth, Flash Attention 2
- Experiment Tracking: WandB / Comet.ml
- Serving: FastAPI + Uvicorn
- Containerization: Docker (multistage builds)
- Monitoring: Prometheus + Grafana (optional)
├── docker/  # Containerization configs
├── src/
│   ├── scriptguard/
│   │   ├── api/  # FastAPI inference service
│   │   ├── data_sources/  # Multi-source data collectors
│   │   ├── database/  # Dataset management
│   │   ├── monitoring/  # Statistics & monitoring
│   │   ├── models/  # QLoRA fine-tuning logic
│   │   ├── pipelines/  # ZenML pipeline definitions
│   │   ├── rag/  # Qdrant RAG store
│   │   └── steps/  # ZenML steps
│   └── main.py  # Pipeline entry point
├── docs/  # Comprehensive documentation
├── config.yaml  # Central configuration
├── zenml_config.yaml  # ZenML step configuration
├── .env.example  # Environment variables template
├── pyproject.toml  # Dependency management
├── podrun-setup.sh  # RunPod setup script
├── dev-setup.sh  # Local development setup script
└── connect.sh  # SSH tunnel script
- Python 3.12
- GPU: NVIDIA GPU with 16GB+ VRAM (recommended for training)
- CUDA: 12.4
- uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
- Docker (optional for deployment)
git clone https://github.com/yourusername/ScriptGuard.git
cd ScriptGuard
We use uv for fast and reliable dependency management.
# Install dependencies (including PyTorch with CUDA 12.4)
uv sync
# Copy environment template
cp .env.example .env
# Edit .env and add your API keys
nano .env # or use your preferred editor

| Component | Minimum | Recommended |
|---|---|---|
| GPU | None (CPU) | NVIDIA RTX 3090/4090 (24GB VRAM) |
| RAM | 16GB | 32GB+ |
| Storage | 50GB | 100GB+ |
| CUDA | N/A | 12.4 |
Edit config.yaml to configure data sources, training parameters, and RAG settings. The default configuration is optimized for RunPod (RTX 3090/4090).
To run training pipelines on RunPod with ZenML, use the automated setup scripts:
Linux/macOS:
chmod +x podrun-setup.sh
./podrun-setup.sh
Windows (PowerShell):
.\podrun-setup.ps1
For local development with Dockerized infrastructure (Postgres, Qdrant):
Linux/macOS:
chmod +x dev-setup.sh
./dev-setup.sh
Windows:
dev-setup.bat
If you are deploying on a remote server and want to access services locally:
chmod +x connect.sh
./connect.sh
# Run advanced training pipeline
uv run python src/main.py
The pipeline will:
- Collect data from configured sources
- Validate and filter samples
- Extract features and augment data
- Train model with QLoRA (using Unsloth optimizations)
- Evaluate performance
Start inference API:
# Using Docker (Recommended for Production)
docker-compose up -d api
# Or directly (Local Development)
uvicorn scriptguard.api.main:app --host 0.0.0.0 --port 8000
curl -X POST "http://localhost:8000/analyze" \
-H "Content-Type: application/json" \
-d '{
"code": "import os; os.system(\"rm -rf /\")"
}'
Response:
{
"label": "malicious",
"confidence": 0.98,
"risk_score": 9.5,
"dangerous_patterns": ["os.system"],
"explanation": "Uses os.system for dangerous command execution"
}
- ARCHITECTURE.md - System architecture and component details
- TRAINING_GUIDE.md - Complete training guide
- USAGE_GUIDE.md - API usage and integration
- TUNING_GUIDE.md - Hyperparameter tuning
- DEPLOYMENT.md - Production deployment guide
- LOCAL_DEVELOPMENT.md - Local development guide
- QDRANT_SETUP.md - Qdrant RAG setup
- PODRUN_README.md - RunPod-specific documentation
ScriptGuard includes an experimental Code Similarity Search system intended to improve classification at inference time:
How it works:
- Vectorization: Code samples from PostgreSQL are embedded using microsoft/unixcoder-base
- Storage: Embeddings stored in Qdrant vector database
- Retrieval: During inference, finds k=3 most similar code examples
- Context: Similar examples added to prompt (Few-Shot Learning)
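The retrieval and context steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the in-memory `store` list stands in for Qdrant, the short vectors stand in for microsoft/unixcoder-base embeddings, and the names `cosine`, `top_k_similar`, and `build_few_shot_prompt` as well as the prompt layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, store, k=3):
    """Return the k entries of store most similar to query_vec.

    store is a list of (vector, code, label) tuples (stand-in for Qdrant).
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]


def build_few_shot_prompt(code, examples):
    """Prepend labeled examples to the code under analysis (assumed layout)."""
    parts = [f"Code:\n{ex_code}\nLabel: {label}\n" for _, ex_code, label in examples]
    parts.append(f"Code:\n{code}\nLabel:")
    return "\n".join(parts)


# Toy store with made-up 2-dimensional vectors.
store = [
    ([1.0, 0.0], "os.system(cmd)", "malicious"),
    ([0.0, 1.0], "print('hi')", "benign"),
    ([0.9, 0.1], "subprocess.run(cmd, shell=True)", "malicious"),
    ([0.1, 0.9], "x = 1", "benign"),
]
hits = top_k_similar([1.0, 0.0], store, k=3)
prompt = build_few_shot_prompt("os.popen(user_input)", hits)
```

In a real deployment the query vector would come from the same embedding model used to index the stored samples, so that similarity scores are comparable.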
ScriptGuard collects training data from multiple sources:
- GitHub
- MalwareBazaar
- Hugging Face
- CVE Feeds
- Additional Datasets: InQuest, dhuynh/malware-classification, malicious-urls
Automatically extracts:
- AST-based features (function calls, imports, patterns)
- Shannon entropy
- API call patterns
- Suspicious string patterns
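As a rough illustration of the first two features in the list above, the sketch below computes Shannon entropy over characters and collects imports and call targets from the AST. It uses only the standard library and assumes Python 3.9+ for `ast.unparse`; the function names and the returned dictionary shape are hypothetical.

```python
import ast
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def extract_ast_features(source: str) -> dict:
    """Collect imported module names and call targets from parsed source."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
        elif isinstance(node, ast.Call):
            # Render the callee expression back to text, e.g. "os.system".
            calls.add(ast.unparse(node.func))
    return {"imports": sorted(imports), "calls": sorted(calls)}


features = extract_ast_features('import os; os.system("rm -rf /")')
```

High entropy on its own is only a weak signal (compressed or encoded payloads score high, but so does some benign data), which is why it is usually combined with structural features like the ones above.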
Generates polymorphic variants using:
- Base64/hex encoding obfuscation
- Variable renaming
- String splitting
- Code mutation
- Qdrant CVE Pattern Augmentation
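Two of the transformations listed above, Base64 encoding and variable renaming, can be sketched as follows. This is a simplified illustration for generating training variants, not ScriptGuard's actual augmentation code: the function names and the `var_N` naming scheme are hypothetical, and the renamer is deliberately naive (it only touches names that are assigned somewhere in the script and ignores function parameters, attributes, and scoping).

```python
import ast
import base64


def base64_variant(source: str) -> str:
    """Wrap source in a stub that decodes and executes it at runtime."""
    encoded = base64.b64encode(source.encode("utf-8")).decode("ascii")
    return f"import base64\nexec(base64.b64decode({encoded!r}).decode('utf-8'))"


def rename_variant(source: str) -> str:
    """Rename every assigned variable to var_0, var_1, ... (naive scheme)."""
    tree = ast.parse(source)
    # First pass: collect names that appear as assignment targets.
    stored = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Name)
            and isinstance(node.ctx, ast.Store)
            and node.id not in stored
        ):
            stored.append(node.id)
    mapping = {name: f"var_{i}" for i, name in enumerate(stored)}
    # Second pass: rewrite all occurrences of those names.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


sample = "total = 1 + 2\nresult = total * 2"
encoded_variant = base64_variant(sample)
renamed_variant = rename_variant(sample)
```

Both variants keep the behavior of the sample while changing its surface form, which is the property that makes them useful as additional training examples carrying the same label.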
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE file
ScriptGuard is designed for defensive security purposes only. Do not use it to create, modify, or improve malicious code.
- GitHub Issues: Report bugs or request features
- Documentation: Full docs at docs/