llm-as-judge

Here are 167 public repositories matching this topic...

lumen

Lumen: a learner-owned AI education platform. Tell the AI what you want to learn: it builds you a private course in about a minute, tutors you with course-scoped RAG and citations, and lets you share, clone, and remix courses via a moderated catalog. BYOK, custom no-LangChain multi-agent orchestrator, golden evals in CI, MCP server. Live demo + public /eval.

  • Updated Jun 7, 2026
  • Python

Evidence-first multi-agent debate skill: get a second opinion by pitting Codex × Claude Code (or GLM/DeepSeek/Qwen) to independently review, red-team & judge high-stakes code and architecture decisions.

  • Updated Jun 9, 2026

Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.

  • Updated Apr 21, 2026
  • TypeScript

dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

  • Updated May 2, 2026
  • TypeScript

This course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, and LLM-as-a-judge evaluation, then deploy jobs on the Predibase platform.

  • Updated Jun 13, 2025
  • Jupyter Notebook

llm-fullstack-ai-agentic-system

🚀 Production-grade, full-stack agentic ecosystem engineered with a DB-first approach using Turso & Drizzle, a Generative UI built with Next.js & Server-Sent Events (SSE), and a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, it leverages modern DevOps with GitHub Actions & GHCR.io scalable cloud-native deploy

  • Updated Mar 21, 2026
  • TypeScript

Multi-agent research copilot that separates LLM generation from deterministic citation verification — HALLMARK F1-H 0.747 (full dev_public, N=1119). LangGraph pipeline with a failure-driven controller loop, evidence-grounded Critic, and local PDF RAG.

  • Updated Jun 8, 2026
  • Python
