SidharthKriplani/README.md

Senior ML Engineer · Production Systems · Governance-First ML

LinkedIn Email


What I Build

I build production-grade ML and AI systems, not models in notebooks. Every project here is a complete platform: training, evaluation, governance, monitoring, serving, and fairness, with evidence artifacts that prove each component works.

My focus areas: agentic LLM pipelines with deterministic safety guarantees, recommendation and ranking systems, credit and risk decisioning, developer intelligence tooling, experiment analysis platforms, and ML data quality auditing.


Platforms

🎯 PulseRank – Marketplace Ranking & Recommendation

pulserank_platform

Production-simulated recommendation platform with IPS debiasing, MMR diversity reranking, delayed attribution, exposure governance, and offline A/B simulation.

  • IPS debiasing: click × (1/propensity(rank)), clip max_weight=5.0 · Naive NDCG@10=0.134 → IPS-weighted NDCG@10=0.522
  • MMR diversity: λ=0.70, max_per_seller=2, category_cap=50% · Seller Gini 0.74 → 0.582
  • A/B simulation: 4,132 sessions, 650 items, 80 sellers, 33 JSON artifacts, 15 failure scenarios
  • A/B decision: HOLD_SIMULATED (NDCG −7.9% relative, exceeds 5% threshold)
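
The IPS weighting in the first bullet can be sketched as follows. This is a minimal illustration of the formula, not PulseRank's actual API; `ips_weights` and its signature are my naming.

```python
import numpy as np

def ips_weights(clicks, propensities, max_weight=5.0):
    """Inverse-propensity weights: click x (1 / propensity(rank)), clipped to bound variance."""
    weights = clicks / propensities
    return np.minimum(weights, max_weight)

clicks = np.array([1.0, 0.0, 1.0])
propensities = np.array([0.9, 0.5, 0.1])  # lower-ranked items are observed less often
print(ips_weights(clicks, propensities))  # third weight clipped from 10.0 to 5.0
```

Clipping trades a little bias for a large variance reduction: a click at a rarely-shown rank would otherwise dominate the weighted metric.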

πŸ” DevPulse β€” Developer Migration Intelligence

devpulse_platform

RAG + agentic platform for version-safe developer change decisions. Detects conflicting documentation before LLM synthesis (LLM-Last architecture).

  • Hybrid retrieval: BM25 + vector + RRF with version pre-filtering · Recall@5 = 0.97
  • Conflict detection: 4 conflict types, 3 verdict levels (BLOCKED/RISKY/SAFE) · Wrong-version answer rate = 0.0
  • Goal Mode: 6 agentic components, retry cap = 2, recovery actions · Macro F1 = 0.966
  • Semver utilities: parse_semver, semver_lt, semver_distance (major-version distance ≥2 → CRITICAL, =1 → HIGH)
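
The semver utilities named above might look like this. The function names come from the bullets; the bodies are my assumptions (pre-release tags and build metadata are ignored, and the same-major label is illustrative):

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'major.minor.patch' into an ordered tuple."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def semver_lt(a: str, b: str) -> bool:
    """True if version a precedes version b."""
    return parse_semver(a) < parse_semver(b)

def semver_distance(a: str, b: str) -> int:
    """Major-version gap, used to grade migration risk."""
    return abs(parse_semver(a)[0] - parse_semver(b)[0])

def risk_grade(a: str, b: str) -> str:
    d = semver_distance(a, b)
    if d >= 2:
        return "CRITICAL"
    if d == 1:
        return "HIGH"
    return "LOW"  # label for same-major changes is my assumption

print(risk_grade("2.6.0", "5.0.1"))  # → CRITICAL (major gap of 3)
```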

💳 RiskFrame – Credit Default Risk Decisioning

riskframe_platform

Full ML lifecycle credit scoring platform on Home Credit dataset. Champion/challenger governance, ECOA fairness auditing, PSI drift monitoring, 5-gate promotion framework.

  • Champion XGBoost: PR AUC = 0.2611, ROC AUC = 0.7663, ECE = 0.0046, Brier = 0.0673
  • Fairness: Disparate Impact = 1.059 (ECOA safe harbor), SHAP-generated adverse action codes
  • Drift: Day 14 PSI = 0.2296 (DRIFT ALERT > 0.20) → Policy v1.0 → v1.1 deployed
  • Challenger: LightGBM 5/5 gates passed, HOLD (delta PR AUC = −0.0001, below material threshold)
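
The PSI signal above follows the standard Population Stability Index definition. This is a minimal sketch (decile bins on the baseline, clipped proportions), not RiskFrame's monitoring code:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: sum((a - e) * ln(a / e)) over baseline decile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the baseline range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0) on empty bins
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
drifted = rng.normal(1.0, 1.0, 10_000)  # one-sigma mean shift
print(psi(baseline, drifted) > 0.20)    # → True: crosses the alert threshold above
```

The 0.10/0.20 watch/alert cut-offs used in the bullets are the conventional PSI thresholds.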

📊 MetaSignal – A/B Experiment Analysis Platform

metasignal_platform

Production experiment analysis platform with CUPED variance reduction, guardrail-first decisioning, A/A calibration, streaming early warning, and SRM detection.

  • CUPED: Pre-experiment covariate adjustment · Variance reduction up to 40%
  • Guardrail-first: Experiment blocked if any guardrail metric degrades beyond threshold
  • A/A validation: 1,000-run calibration · False positive rate verified at α = 0.05
  • Streaming: Early warning system with sequential testing (mSPRT)
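
The CUPED adjustment in the first bullet is the standard one-covariate form; a sketch under that assumption (not MetaSignal's implementation):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: subtract the part of metric y predicted by pre-experiment covariate x."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # OLS slope of y on x
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5_000)              # pre-experiment value of the metric
y = 0.6 * x + rng.normal(0.0, 1.0, 5_000)    # in-experiment metric, correlated with x
adjusted = cuped_adjust(y, x)
print(np.var(adjusted) < np.var(y))  # → True: variance shrinks, the mean is unchanged
```

The achievable reduction is the squared correlation between x and y, which is where "up to 40%" figures come from on strongly autocorrelated metrics.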

Libraries & Tools

πŸ“ MetricLens β€” Metric Movement Decomposition

metriclens

DataFrame-native library for decomposing metric movements into mix shift vs. rate shift. Answers "a metric moved – where did it come from?"

from metriclens import MetricLens
lens = MetricLens(df_before, df_after, segment_col="country", metric_col="conversion")
lens.decompose()  # → mix_effect, rate_effect, interaction_effect per segment
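
Per segment, a mix/rate decomposition is the identity sketched below (w = segment's traffic share, r = segment's rate). This illustrates the idea; names and structure are mine, not MetricLens internals:

```python
def decompose_segment(w0, r0, w1, r1):
    """Split one segment's contribution to a metric change into mix, rate, interaction.
    w: segment weight (share of traffic), r: segment rate (e.g. conversion)."""
    mix = (w1 - w0) * r0               # traffic shifting into or out of the segment
    rate = w0 * (r1 - r0)              # the segment's own rate moving
    interaction = (w1 - w0) * (r1 - r0)  # cross term so the pieces sum exactly
    return mix, rate, interaction

# Sanity check: the three effects sum to the segment's total contribution change.
w0, r0, w1, r1 = 0.4, 0.10, 0.5, 0.12
mix, rate, inter = decompose_segment(w0, r0, w1, r1)
assert abs((mix + rate + inter) - (w1 * r1 - w0 * r0)) < 1e-12
```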

🧪 TrialCheck – A/B Experiment Readout Auditor

trialcheck_v0

Platform-agnostic A/B experiment auditor. Catches SRM, underpowered tests, peeking violations, and multiple comparison issues before you ship a wrong decision.

trialcheck audit experiment_results.json --alpha 0.05 --min-power 0.80
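
SRM detection is a chi-square test of the realized split against the planned one. A stdlib-only sketch (the α = 0.001 threshold is the common convention for SRM, not necessarily TrialCheck's default):

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Sample Ratio Mismatch: chi-square test (1 df) against the planned split."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    return p < alpha  # True -> SRM detected, the readout is untrustworthy

print(srm_check(50_400, 49_600))  # → False: normal sampling noise on 100k users
print(srm_check(52_000, 48_000))  # → True: a 4% skew is flagged
```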

🔬 FeatureLeakageLens – Pre-Training Leakage Detection

featureleakagelens_v0

Pre-training tabular ML auditor for feature leakage. Detects target leakage, temporal leakage, train/test overlap, and near-duplicate features before you train.

from featureleakagelens import LeakageAuditor
report = LeakageAuditor(df_train, df_test, target="label").audit()
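
One of the checks listed above, exact train/test row overlap, can be sketched with row hashing. `train_test_overlap` is my illustration, not the library's API:

```python
import pandas as pd

def train_test_overlap(df_train, df_test):
    """Fraction of test rows that also appear verbatim in train (duplicate leakage)."""
    train_hashes = set(pd.util.hash_pandas_object(df_train, index=False))
    test_hashes = pd.util.hash_pandas_object(df_test, index=False)
    return float(test_hashes.isin(train_hashes).mean())

train = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
test = pd.DataFrame({"a": [1, 9], "b": [4.0, 9.0]})
print(train_test_overlap(train, test))  # → 0.5: one of two test rows leaks from train
```

Hashing whole rows keeps the check O(n) and avoids a quadratic pairwise comparison; near-duplicate detection needs fuzzier matching than this.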

πŸ† GoldenSetAuditor β€” LLM/RAG Evaluation Dataset QA

goldensetauditor_v0

Evaluation dataset quality auditor for LLM and RAG applications. Validates golden sets for answer completeness, question ambiguity, context coverage, and contamination.

from goldensetauditor import GoldenSetAuditor
report = GoldenSetAuditor(golden_jsonl="eval_set.jsonl").audit()
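
The simplest of these checks, duplicated questions in the golden set, can be sketched over JSONL records. The record schema (`"question"` key) and function name are my assumptions:

```python
import json
from collections import Counter

def duplicate_questions(jsonl_lines):
    """Flag trivial contamination: the same question appearing more than once."""
    normalized = [json.loads(line)["question"].strip().lower() for line in jsonl_lines]
    counts = Counter(normalized)
    return {q: n for q, n in counts.items() if n > 1}

lines = [
    '{"question": "What is CUPED?", "answer": "..."}',
    '{"question": "what is cuped?  ", "answer": "..."}',  # same question after normalization
    '{"question": "Define SRM.", "answer": "..."}',
]
print(duplicate_questions(lines))  # → {'what is cuped?': 2}
```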

📄 DocIngestQA – RAG Document Ingestion QA

docingestqa

Pre-indexing QA auditor for RAG document ingestion. Runs 11 deterministic checks on exported chunks: missing pages, OCR noise, duplicates, encoding corruption, poor split boundaries.

pip install docingestqa
docingestqa audit chunks.jsonl --output report.html
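
A deterministic check of this kind, flagging encoding corruption, might look like the following. The marker list and names are my heuristics, not docingestqa's:

```python
# U+FFFD replacement char plus common UTF-8-decoded-as-cp1252 damage patterns.
MOJIBAKE_MARKERS = ("\ufffd", "\u00c3\u00a9", "\u00e2\u20ac\u2122", "\u00e2\u20ac\u201d")

def flag_encoding_corruption(chunks):
    """Return indices of chunks showing replacement characters or mojibake."""
    return [i for i, text in enumerate(chunks)
            if any(marker in text for marker in MOJIBAKE_MARKERS)]

chunks = [
    "clean paragraph",
    "caf\u00e9 menu",          # legitimate non-ASCII: not flagged
    "caf\u00c3\u00a9 menu",    # mojibake for the same text: flagged
    "bad \ufffd byte",         # replacement character: flagged
]
print(flag_encoding_corruption(chunks))  # → [2, 3]
```

Being deterministic, checks like this can gate a pipeline in CI without any model in the loop.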

βš–οΈ InferenceLens β€” Inference Cost/Quality Tradeoff Auditor

inferencelens

Audits AI inference routing decisions: profiles calls across model configurations, finds the cost/quality Pareto frontier, and flags dominated configs with routing recommendations.

from inferencelens import InferenceLens
report = InferenceLens(task_type="summarization").profile(prompts)
# → Pareto frontier, routing rules, PASS/WARN/FAIL verdict
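
The Pareto-frontier step reduces to a standard dominance check; a sketch with hypothetical configs, not InferenceLens's internals:

```python
def pareto_frontier(configs):
    """configs: list of (name, cost, quality). Keep configs not dominated by any other.
    Dominated = some other config is no more costly, at least as good, and strictly
    better on one of the two axes."""
    frontier = []
    for name, cost, quality in configs:
        dominated = any(
            c2 <= cost and q2 >= quality and (c2 < cost or q2 > quality)
            for n2, c2, q2 in configs if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

configs = [
    ("small", 1.0, 0.70),
    ("medium", 3.0, 0.82),
    ("large", 9.0, 0.84),
    ("bad", 5.0, 0.75),   # costs more than medium yet scores lower: dominated
]
print(pareto_frontier(configs))  # → ['small', 'medium', 'large']
```

Everything off the frontier is a pure waste of budget; routing rules only ever need to choose among frontier configs.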

Applied Systems

Three production pipelines applying the same principles under domain pressure:

System                 Domain                       Primary Failure Mode
lendflow               Financial underwriting       When to stop or escalate
agentreliabilitylab    Cyber threat triage          When to stop or escalate
nexussupply            Supplier risk intelligence   Conflicting signal fusion

Each uses LangGraph with deterministic-first architecture: scores are computed before LLM synthesis, escalation paths are explicit, and graceful degradation is guaranteed.
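
In outline, the deterministic-first pattern looks like this. A plain-Python sketch of the control flow, not the actual LangGraph wiring; all names are illustrative:

```python
def synthesize_with_llm(score: float) -> str:
    """Stand-in for the LLM synthesis step."""
    return f"auto-generated summary (risk score {score:.2f})"

def triage(score: float, escalate_at: float = 0.8) -> dict:
    """Deterministic-first: the score gate runs before any LLM call, and the
    escalation path is explicit rather than left to model judgment."""
    if score >= escalate_at:
        # The LLM is never consulted on cases that must go to a human.
        return {"route": "human_review", "llm_called": False}
    try:
        summary = synthesize_with_llm(score)
    except Exception:
        # Graceful degradation: synthesis failure never blocks the decision.
        return {"route": "degraded", "summary": "synthesis unavailable", "llm_called": False}
    return {"route": "auto", "summary": summary, "llm_called": True}

print(triage(0.93)["route"])  # → human_review
print(triage(0.30)["route"])  # → auto
```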


The Thesis

Every project here addresses one of three failure modes in AI systems:

① How does it know it's working correctly?
The system needs a verification signal independent of its own confidence.
TrialCheck · MetricLens · FeatureLeakageLens · DocIngestQA · GoldenSetAuditor · RiskFrame · PulseRank · InferenceLens

② When should it stop or escalate?
The system needs explicit rules for when automated decisions require human judgment.
MetaSignal · LendFlow · AgentReliabilityLab

③ How does it handle conflicting information?
The system needs a deterministic anchor when multiple signals disagree.
DevPulse · NexusSupply

The original nine are domain-agnostic auditors. The three applied systems test the same thesis under production pressure.


How the Projects Connect

Raw Application Data ──────────────────────────────────► RiskFrame
                                                          (risk scoring)

User Queries / Developer Questions ────────────────────► DevPulse
                                                          (version-safe answers)

Marketplace Events / Clicks ────────────────────────────► PulseRank
                                                          (ranking + debiasing)

A/B Experiment Results ────────────────────────────────► MetaSignal / TrialCheck
                                                          (experiment validity)

ML Training Data ──────────────────────────────────────► FeatureLeakageLens
                                                          (pre-training audit)

RAG Chunks / Eval Sets ─────────────────────────────────► DocIngestQA / GoldenSetAuditor
                                                           (data quality gates)

Metric Movement Explanation ───────────────────────────► MetricLens
                                                          (decomposition)

Every platform feeds downstream governance: PulseRank's A/B decisions are validated by MetaSignal's experiment framework. RiskFrame's evaluation data is audited by FeatureLeakageLens pre-training and GoldenSetAuditor post-training. DevPulse's RAG corpus passes through DocIngestQA before indexing.


Engineering Standards

All repositories here follow a consistent production standard:

  • Tests: pytest with golden scenarios, integrity tests, and property-based checks
  • Artifacts: JSON evidence artifacts for all model and evaluation results
  • CI/CD: GitHub Actions with lint, type-check, and test gates
  • Versioning: Semantic versioning (major.minor.patch) with CHANGELOG
  • Documentation: Defense documents, PRD documents, and README-level architecture diagrams

All platforms are solo-built production simulations: not deployed to production, but built to demonstrate production engineering depth.
