# llm-as-judge

Here are 167 public repositories matching this topic...

jerry609 / PaperBot

Academic Personal AI Infrastructure

research paper nextjs multi-agent arxiv scholar rag fastapi paper2code daily-paper llm llm-as-judge

Updated May 8, 2026
Python

ahmedEid1 / lumen

Lumen is a learner-owned AI education platform. Tell the AI what you want to learn, and it builds you a private course in about a minute, tutors you with course-scoped RAG and citations, and lets you share, clone, and remix via a moderated catalog. BYOK, custom no-LangChain multi-agent orchestrator, golden evals in CI, MCP server. Live demo + public /eval.

python docker postgres typescript mcp nextjs e-learning celery observability rag fastapi llm byok pgvector evals llm-as-judge ai-tutor agentic-ai

Updated Jun 7, 2026
Python

dokimos-dev / dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.

Updated Jun 9, 2026
Java

zhjai / agent-arena

Evidence-first multi-agent debate skill: get a second opinion by pitting Codex × Claude Code (or GLM/DeepSeek/Qwen) to independently review, red-team & judge high-stakes code and architecture decisions.

opencode multi-agent code-review red-team codex ai-agents rag architecture-review openai-codex prompt-engineering llm-as-judge claude-code claude-code-skill agent-skill openclaw hermes-agent agent-arena evidence-checking deliberative-analysis

Updated Jun 9, 2026

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

StanfordMIMI / MedVAL

Toward Expert-Level Medical Text Validation with Language Models

medical-text llm-as-judge

Updated Oct 23, 2025
Python

dynatrace-oss / dt-evals

AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability

ai evaluations agents observability evals llm-as-judge

Updated Jun 5, 2026
TypeScript

nshportun / BestTester

Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.

typescript mutation-testing mcp test-automation ci-cd allure-report api-testing owasp-zap security-testing page-object-model stryker e2e-testing github-actions qa-automation playwright jira-integration ai-testing aws-bedrock llm-as-judge

Updated Apr 21, 2026
TypeScript

johnsonfarmsus / openwebui-ab-mcts-pipeline

Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration

docker machine-learning ai pipeline multi-model monte-carlo-tree-search research-software sakana llm reasoning-engine open-webui llm-as-judge advanced-reasoning open-webui-tools ab-mcts

Updated Oct 10, 2025
Python

edholofy / dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

Updated May 2, 2026
TypeScript

WesleyPeng / agentic-taf

Agentic Extensible Test Automation Framework

docker bdd selenium atdd requests ui-automation automation-framework paramiko chaos-engineering multi-layer-architecture playwright llm-as-judge

Updated Jun 1, 2026
Python

ChantillyAn / homework-grader

Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.

education quality-control batch-processing claude excel-export rubric bias-mitigation anthropic llm-as-judge claude-code ai-grading claude-code-skill homework-grading

Updated Feb 22, 2026
Python

ksm26 / Reinforcement-Fine-Tuning-LLMs-with-GRPO

The course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO)—a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, LLM-as-a-judge evaluation, and deploy jobs on the Predibase platform.

reinforcement-learning machine-learning-algorithms language-model reward-design rft ai-training deeplearning-ai-courses ai-optimization multi-step-reasoning ai-evaluation rlhf llm-fine-tuning opensource-ai llm-as-judge predibase grpo llm-development token-level-control

Updated Jun 13, 2025
Jupyter Notebook

VinZCodz / llm-fullstack-ai-agentic-system

🚀 Production-grade, full-stack agentic ecosystem engineered with a DB-first approach using Turso & Drizzle, a Generative UI built with Next.js & Server-Sent Events (SSE), and a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, with GitHub Actions & GHCR.io for scalable cloud-native deployment.

kubernetes express tdd nextjs helm monorepo server-sent-events cloud-native next-js vercel vitest llm drizzle-orm generative-ui llm-as-judge agentic-ai turso-db langraph agentic-devops

Updated Mar 21, 2026
TypeScript

Ufonia / wer-is-unaware

A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.

healthcare-ai dspy llm-as-judge gepa

Updated Mar 5, 2026
Python

elizabethfuentes12 / how-to-evaluate-ai-agents-sample-for-aws

Demos for AI agent evaluation: LLM-as-judge, trajectory analysis, hallucination detection, cost benchmarks

evaluation ai-agents cost-optimization opentelemetry hallucination-detection llm-as-judge agent-evaluation strands-agents

Updated May 21, 2026
Jupyter Notebook

AronDaron / dataset-generator

No-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs — plan-then-execute pipeline, LLM-as-judge, HuggingFace upload.

desktop-app nextjs dataset-generation alpaca synthetic-data fine-tuning sft fastapi huggingface llm sharegpt openrouter chatml llm-fine-tuning llm-as-judge

Updated May 20, 2026
Python

2u39u4 / ResearchFlow

Multi-agent research copilot that separates LLM generation from deterministic citation verification — HALLMARK F1-H 0.747 (full dev_public, N=1119). LangGraph pipeline with a failure-driven controller loop, evidence-grounded Critic, and local PDF RAG.

python nlp multi-agent openai arxiv agents rag research-assistant semantic-scholar llm llm-evaluation hallucination-detection langgraph llm-as-judge citation-verification

Updated Jun 8, 2026
Python

SergeiNikolenko / SynthLadder

Synthesis-focused chemistry benchmark and evaluation package for agentic LLMs across reaction understanding, retrosynthesis, and route planning tasks.

benchmark chemistry cheminformatics mass-spectrometry retrosynthesis llm-as-judge agentic-ai pydantic-ai smolagents synthesis-planning openshell

Updated Jun 3, 2026
Python

provnai / vex-halt

VEX-HALT — Benchmark suite for AI verification systems. 443+ tests for calibration, robustness, honesty, and proof integrity.

testing rust benchmark cryptography ai merkle testing-tools ai-evaluation llm-as-judge ai-evaluation-framework

Updated Dec 23, 2025
Rust
