I build production-grade ML and AI systems, not models in notebooks. Every project here is a complete platform: training, evaluation, governance, monitoring, serving, and fairness, with evidence artifacts that prove each component works.
My focus areas: agentic LLM pipelines with deterministic safety guarantees, recommendation and ranking systems, credit and risk decisioning, developer intelligence tooling, experiment analysis platforms, and ML data quality auditing.
Production-simulated recommendation platform with IPS debiasing, MMR diversity reranking, delayed attribution, exposure governance, and offline A/B simulation.
- IPS debiasing: click × (1/propensity(rank)), clip max_weight=5.0 · Naive NDCG@10=0.134 → IPS-weighted NDCG@10=0.522 (see the sketch after this list)
- MMR diversity: λ=0.70, max_per_seller=2, category_cap=50% · Seller Gini 0.74 → 0.582
- A/B simulation: 4,132 sessions, 650 items, 80 sellers, 33 JSON artifacts, 15 failure scenarios
- A/B decision: HOLD_SIMULATED (NDCG −7.9% relative, exceeds 5% threshold)
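A minimal sketch of the clipped IPS weighting above; the propensity table and function name are illustrative, not PulseRank's actual API:

```python
def ips_weight(rank: int, propensity: dict[int, float], max_weight: float = 5.0) -> float:
    """Clipped inverse-propensity weight for a click observed at `rank`.

    `propensity` maps rank -> estimated examination probability (e.g. from a
    position-bias model). Clipping at max_weight=5.0 bounds the variance that
    rare, low-propensity positions would otherwise inject into the estimator.
    """
    return min(1.0 / propensity[rank], max_weight)

# Toy example: clicks observed lower on the page are up-weighted, capped at 5.0.
propensity = {1: 1.00, 2: 0.60, 3: 0.40, 5: 0.15}
print(ips_weight(2, propensity))  # ≈1.67
print(ips_weight(5, propensity))  # 5.0 (clipped from ≈6.67)
```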
RAG + agentic platform for version-safe developer change decisions. Detects conflicting documentation before LLM synthesis (LLM-Last architecture).
- Hybrid retrieval: BM25 + vector + RRF with version pre-filtering · Recall@5 = 0.97
- Conflict detection: 4 conflict types, 3 verdict levels (BLOCKED/RISKY/SAFE) Β· Wrong-version answer rate = 0.0
- Goal Mode: 6 agentic components, retry cap = 2, recovery actions · Macro F1 = 0.966
- Semver utilities: parse_semver, semver_lt, semver_distance (≥2 → CRITICAL, =1 → HIGH)
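A minimal sketch of those utilities; the repo's implementations handle pre-release tags and richer edge cases, and measuring distance on the major version component is an assumption here:

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """'2.3.1' -> (2, 3, 1); pre-release and build metadata are ignored in this sketch."""
    core = version.split("-")[0].split("+")[0]
    major, minor, patch = (int(part) for part in core.split(".")[:3])
    return major, minor, patch

def semver_lt(a: str, b: str) -> bool:
    """True if version a is strictly older than version b."""
    return parse_semver(a) < parse_semver(b)

def semver_distance(a: str, b: str) -> int:
    """Version gap, measured here as the difference in major versions (assumption)."""
    return abs(parse_semver(a)[0] - parse_semver(b)[0])

def severity(pinned: str, documented: str) -> str:
    """Map the gap to the thresholds above: >=2 -> CRITICAL, =1 -> HIGH, else OK."""
    distance = semver_distance(pinned, documented)
    return "CRITICAL" if distance >= 2 else "HIGH" if distance == 1 else "OK"

print(severity("1.4.0", "3.0.0"))  # CRITICAL
print(severity("2.1.0", "3.2.5"))  # HIGH
```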
Full ML lifecycle credit scoring platform on Home Credit dataset. Champion/challenger governance, ECOA fairness auditing, PSI drift monitoring, 5-gate promotion framework.
- Champion XGBoost: PR AUC = 0.2611, ROC AUC = 0.7663, ECE = 0.0046, Brier = 0.0673
- Fairness: Disparate Impact = 1.059 (ECOA safe harbor), SHAP-generated adverse action codes
- Drift: Day 14 PSI = 0.2296 (DRIFT ALERT > 0.20) → Policy v1.0 → v1.1 deployed (PSI sketch after this list)
- Challenger: LightGBM 5/5 gates passed, HOLD (delta PR AUC = −0.0001, below material threshold)
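A minimal sketch of the PSI computation behind that drift alert, using decile bins on the baseline scores (RiskFrame's own binning and smoothing may differ):

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current score distribution."""
    cuts = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]  # interior decile cut points
    base_frac = np.bincount(np.digitize(baseline, cuts), minlength=bins) / len(baseline)
    curr_frac = np.bincount(np.digitize(current, cuts), minlength=bins) / len(current)
    base_frac = np.clip(base_frac, 1e-6, None)  # avoid log(0) on empty bins
    curr_frac = np.clip(curr_frac, 1e-6, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.0, 1.0, 10_000)    # score distribution at deployment
day14_scores = rng.normal(0.5, 1.25, 10_000)      # shifted distribution two weeks later
print(psi(baseline_scores, day14_scores) > 0.20)  # True -> DRIFT ALERT under the policy above
```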
Production experiment analysis platform with CUPED variance reduction, guardrail-first decisioning, A/A calibration, streaming early warning, and SRM detection.
- CUPED: Pre-experiment covariate adjustment · Variance reduction up to 40% (sketch after this list)
- Guardrail-first: Experiment blocked if any guardrail metric degrades beyond threshold
- A/A validation: 1,000-run calibration · False positive rate verified at α = 0.05
- Streaming: Early warning system with sequential testing (mSPRT)
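A minimal sketch of the CUPED adjustment, with θ = cov(X, Y) / var(X) for a pre-experiment covariate X (illustrative only, not MetaSignal's implementation):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Remove the variance in metric y that is explained by the pre-experiment covariate x_pre."""
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(42)
x_pre = rng.normal(10.0, 2.0, 5_000)              # same metric measured before the experiment
y = 0.6 * x_pre + rng.normal(0.0, 1.5, 5_000)     # in-experiment metric, correlated with x_pre
y_adj = cuped_adjust(y, x_pre)
print(round(1 - y_adj.var() / y.var(), 2))        # ≈0.39: reduction equals corr(x_pre, y)**2
```

The achievable reduction is the squared correlation between the covariate and the metric, which is where a figure like 40% comes from when that correlation is around 0.6.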
DataFrame-native library for decomposing metric movements into mix shift vs. rate shift. Answers "a metric moved: where did it come from?"
from metriclens import MetricLens
lens = MetricLens(df_before, df_after, segment_col="country", metric_col="conversion")
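# One standard way to define the three effects per segment s (assumed semantics, shown
# for orientation only), with w = segment share of volume and r = segment metric rate:
#   mix_effect[s]         = (w_after[s] - w_before[s]) * r_before[s]
#   rate_effect[s]        = w_before[s] * (r_after[s] - r_before[s])
#   interaction_effect[s] = (w_after[s] - w_before[s]) * (r_after[s] - r_before[s])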
lens.decompose()  # → mix_effect, rate_effect, interaction_effect per segment

Platform-agnostic A/B experiment auditor. Catches SRM, underpowered tests, peeking violations, and multiple comparison issues before you ship a wrong decision.
trialcheck audit experiment_results.json --alpha 0.05 --min-power 0.80
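The SRM check, for orientation, is a chi-square goodness-of-fit test on assignment counts; a minimal sketch rather than TrialCheck's implementation:

```python
from scipy.stats import chisquare

def srm_detected(control_n: int, treatment_n: int, expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Sample Ratio Mismatch: observed assignment counts deviate from the planned split."""
    total = control_n + treatment_n
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = chisquare(f_obs=[control_n, treatment_n], f_exp=expected)
    return bool(p_value < alpha)  # True -> randomization is suspect, results shouldn't be trusted

print(srm_detected(50_000, 50_400))  # False: within normal randomization noise
print(srm_detected(50_000, 52_000))  # True: a 2% imbalance at this scale is a red flag
```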
Pre-training tabular ML auditor for feature leakage. Detects target leakage, temporal leakage, train/test overlap, and near-duplicate features before you train.

from featureleakagelens import LeakageAuditor
report = LeakageAuditor(df_train, df_test, target="label").audit()
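Two of the simpler checks can be sketched directly in pandas (illustrative, not the library's internals; the correlation check assumes a numeric label):

```python
import pandas as pd

def train_test_overlap(df_train: pd.DataFrame, df_test: pd.DataFrame) -> float:
    """Fraction of test rows that appear verbatim in the training set (row-level leakage)."""
    train_hashes = set(pd.util.hash_pandas_object(df_train, index=False))
    test_hashes = pd.util.hash_pandas_object(df_test, index=False)
    return float(test_hashes.isin(train_hashes).mean())

def near_target_duplicates(df_train: pd.DataFrame, target: str, threshold: float = 0.99) -> list[str]:
    """Numeric features almost perfectly correlated with the label (likely target leakage)."""
    corr = df_train.corr(numeric_only=True)[target].drop(target).abs()
    return corr[corr > threshold].index.tolist()
```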
Evaluation dataset quality auditor for LLM and RAG applications. Validates golden sets for answer completeness, question ambiguity, context coverage, and contamination.

from goldensetauditor import GoldenSetAuditor
report = GoldenSetAuditor(golden_jsonl="eval_set.jsonl").audit()

Pre-indexing QA auditor for RAG document ingestion. Runs 11 deterministic checks on exported chunks: missing pages, OCR noise, duplicates, encoding corruption, poor split boundaries.
pip install docingestqa
docingestqa audit chunks.jsonl --output report.html

Audits AI inference routing decisions: profiles calls across model configurations, finds the cost/quality Pareto frontier, and flags dominated configs with routing recommendations.
from inferencelens import InferenceLens
report = InferenceLens(task_type="summarization").profile(prompts)
# → Pareto frontier, routing rules, PASS/WARN/FAIL verdict
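A minimal sketch of the dominated-configuration check behind a cost/quality Pareto frontier (lower cost and higher quality are both better; the config names and fields are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    cost_per_1k: float  # dollars per 1k calls
    quality: float      # task eval score, higher is better

def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep only configs that no other config beats on cost without losing on quality."""
    def dominated(c: Config) -> bool:
        return any(
            o.cost_per_1k <= c.cost_per_1k and o.quality >= c.quality
            and (o.cost_per_1k < c.cost_per_1k or o.quality > c.quality)
            for o in configs
        )
    return [c for c in configs if not dominated(c)]

configs = [
    Config("small-model", 0.20, 0.71),
    Config("large-model", 1.80, 0.88),
    Config("large-model-verbose-prompt", 2.40, 0.87),  # dominated: costs more, scores less
]
print([c.name for c in pareto_frontier(configs)])       # ['small-model', 'large-model']
```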
Three production pipelines applying the same principles under domain pressure:

| System | Domain | Primary Failure Mode |
|---|---|---|
| `lendflow` | Financial underwriting | When to stop or escalate |
| `agentreliabilitylab` | Cyber threat triage | When to stop or escalate |
| `nexussupply` | Supplier risk intelligence | Conflicting signal fusion |
Each uses LangGraph with a deterministic-first architecture: scores are computed before LLM synthesis, escalation paths are explicit, and graceful degradation is guaranteed.
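A condensed illustration of that pattern in plain Python (not the actual LangGraph graphs; the scorecard weights, review band, and names are placeholders):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    score: float     # computed deterministically, before any LLM call
    action: str      # "AUTO_APPROVE" | "ESCALATE" | "AUTO_DECLINE"
    rationale: str

def decide(features: dict, llm_explain: Optional[Callable[[dict, str], str]] = None) -> Decision:
    # 1. Deterministic score first: the decision never depends on LLM output.
    score = 0.4 * features["signal_a"] + 0.6 * features["signal_b"]  # placeholder scorecard

    # 2. Explicit escalation path: ambiguous scores always go to a human.
    if 0.35 <= score <= 0.65:
        return Decision(score, "ESCALATE", f"score {score:.2f} falls in the human-review band")

    action = "AUTO_APPROVE" if score > 0.65 else "AUTO_DECLINE"

    # 3. LLM runs last, only to explain an already-made decision; if it fails,
    #    degrade gracefully to a templated rationale instead of blocking the pipeline.
    try:
        rationale = llm_explain(features, action) if llm_explain else f"rule-based: score={score:.2f}"
    except Exception:
        rationale = f"rule-based fallback: score={score:.2f}"
    return Decision(score, action, rationale)

print(decide({"signal_a": 0.9, "signal_b": 0.8}).action)  # AUTO_APPROVE, no LLM required
```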
Every project here addresses one of three failure modes in AI systems:
① How does it know it's working correctly?
The system needs a verification signal independent of its own confidence.
TrialCheck · MetricLens · FeatureLeakageLens · DocIngestQA · GoldenSetAuditor · RiskFrame · PulseRank · InferenceLens
② When should it stop or escalate?
The system needs explicit rules for when automated decisions require human judgment.
MetaSignal · LendFlow · AgentReliabilityLab
③ How does it handle conflicting information?
The system needs a deterministic anchor when multiple signals disagree.
DevPulse · NexusSupply
The original nine are domain-agnostic auditors. The three applied systems test the same thesis under production pressure.
Raw Application Data ──────────────────────────────────► RiskFrame (risk scoring)
User Queries / Developer Questions ────────────────────► DevPulse (version-safe answers)
Marketplace Events / Clicks ───────────────────────────► PulseRank (ranking + debiasing)
A/B Experiment Results ────────────────────────────────► MetaSignal / TrialCheck (experiment validity)
ML Training Data ──────────────────────────────────────► FeatureLeakageLens (pre-training audit)
RAG Chunks / Eval Sets ────────────────────────────────► DocIngestQA / GoldenSetAuditor (data quality gates)
Metric Movement Explanation ───────────────────────────► MetricLens (decomposition)
Every platform feeds downstream governance: PulseRank's A/B decisions are validated by MetaSignal's experiment framework. RiskFrame's evaluation data is audited by FeatureLeakageLens pre-training and GoldenSetAuditor post-training. DevPulse's RAG corpus passes through DocIngestQA before indexing.
All repositories here follow a consistent production standard:
- Tests: pytest with golden scenarios, integrity tests, and property-based checks
- Artifacts: JSON evidence artifacts for all model and evaluation results
- CI/CD: GitHub Actions with lint, type-check, and test gates
- Versioning: Semantic versioning (major.minor.patch) with CHANGELOG
- Documentation: Defense documents, PRD documents, and README-level architecture diagrams