Academic Personal AI Infrastructure
-
Updated
May 8, 2026 - Python
Academic Personal AI Infrastructure
Lumen — learner-owned AI education platform. Tell the AI what you want to learn: it builds you a private course in ~a minute, tutors you with course-scoped RAG + citations, and lets you share, clone & remix via a moderated catalog. BYOK, custom no-LangChain multi-agent orchestrator, golden evals in CI, MCP server. Live demo + public /eval.
LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.
Evidence-first multi-agent debate skill: get a second opinion by pitting Codex × Claude Code (or GLM/DeepSeek/Qwen) to independently review, red-team & judge high-stakes code and architecture decisions.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability
Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.
Agentic Extensible Test Automation Framework
Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.
The course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO)—a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, LLM-as-a-judge evaluation, and deploy jobs on the Predibase platform.
🚀 Production-grade, full-stack agentic ecosystem engineered with DB-first approach using Turso & Drizzle, Generative UI built with Next.js & Server-Sent Events (SSE), a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, leverages modern DevOps with GitHub Actions & GHCR.io scalable cloud-native deploy
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
Demos for AI agent evaluation: LLM-as-judge, trajectory analysis, hallucination detection, cost benchmarks
No-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs — plan-then-execute pipeline, LLM-as-judge, HuggingFace upload.
Multi-agent research copilot that separates LLM generation from deterministic citation verification — HALLMARK F1-H 0.747 (full dev_public, N=1119). LangGraph pipeline with a failure-driven controller loop, evidence-grounded Critic, and local PDF RAG.
Synthesis-focused chemistry benchmark and evaluation package for agentic LLMs across reaction understanding, retrosynthesis, and route planning tasks.
VEX-HALT — Benchmark suite for AI verification systems. 443+ tests for calibration, robustness, honesty, and proof integrity.
Add a description, image, and links to the llm-as-judge topic page so that developers can more easily learn about it.
To associate your repository with the llm-as-judge topic, visit your repo's landing page and select "manage topics."