LLM applications are frustratingly difficult to test due to their probabilistic nature. However, testing is crucial for customer-facing applications to ensure the reliability of generated answers. So, how do you effectively test an LLM app? Enter Confident AI's DeepEval: a comprehensive open-source LLM evaluation framework with an excellent developer experience.

Key features of DeepEval:
- Ease of use: Very similar to writing unit tests with pytest (see the sketch below this post).
- Comprehensive suite of metrics: 14+ research-backed metrics for relevancy, hallucination, etc., including label-less standard metrics that can quantify your bot's performance even without labeled ground truth! All you need is the input and output from the bot. See the list of metrics and required data in the image below!
- Custom Metrics: Tailor your evaluation process by defining custom metrics as your business requires.
- Synthetic data generator: Create an evaluation dataset synthetically to bootstrap your tests.

My recommendations for LLM evaluation:
- Use OpenAI GPT-4 as the metric model as much as possible.
- Test Dataset Generation: Use the DeepEval Synthesizer to generate a comprehensive set of realistic questions!
- Bulk Evaluation: If you are running multiple metrics on multiple questions, generate the responses once, store them in a pandas DataFrame, and calculate all the metrics in bulk with parallelization.
- Quantify hallucination: I love the faithfulness metric, which indicates how much of the generated output is factually consistent with the context provided by the retriever in RAG!
- CI/CD: Run these tests automatically in your CI/CD pipeline to ensure every code change and prompt change doesn't break anything.
- Guardrails: Some high-speed tests can be run on every API call in a post-processor before responding to the user. Leave the slower tests for CI/CD.

🌟 DeepEval GitHub: https://lnkd.in/g9VzqPqZ
🔗 DeepEval Bulk evaluation: https://lnkd.in/g8DQ9JAh

Let me know in the comments if you have other ways to test LLM output systematically! Follow me for more tips on building successful ML and LLM products!
Medium: https://lnkd.in/g2jAJn5
X: https://lnkd.in/g_JbKEkM
#generativeai #llm #nlp #artificialintelligence #mlops #llmops
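To show what the pytest-style workflow looks like, here is a minimal sketch of a DeepEval test (assumptions: the `deepeval` package is installed, an API key is configured for the metric model, and the question/answer pair is made up; exact class and argument names may differ between DeepEval versions):

```python
# Minimal DeepEval sketch (verify names against your installed version).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # Label-less metric: only needs the input and the bot's actual output.
    metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4")
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return any unused item within 30 days for a full refund.",
    )
    # Fails the test if the relevancy score falls below the threshold.
    assert_test(test_case, [metric])
```

Such a file can typically be executed with plain pytest or via DeepEval's own test runner, and dropped straight into a CI/CD job.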
LLM Evaluation Tools
Explore top LinkedIn content from expert professionals.
-
Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).

Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁? You use a powerful LLM as an evaluator, not a generator. It’s given:
- The original question
- The generated answer
- And the retrieved context or gold answer

𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if phrased differently

𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality

📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.

To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly

A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo site: https://lnkd.in/gUSrV65s
- GitHub repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
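To make the judge setup concrete, here is a minimal LLM-as-a-Judge sketch using the OpenAI Python client (assumptions: `OPENAI_API_KEY` is set; the rubric, 1-5 scale, and JSON output format are illustrative choices, not EvalAssist's or any standard's):

```python
# Minimal LLM-as-a-Judge sketch; rubric and scale are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) on:
- faithfulness to the context
- factual accuracy
- semantic alignment with the question
Return only a JSON object like {{"faithfulness": 3, "accuracy": 4, "alignment": 5}}."""

def judge(question: str, context: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return response.choices[0].message.content  # JSON scores as a string

print(judge(
    "Who wrote Faust?",
    "Faust is a play by Johann Wolfgang von Goethe.",
    "Goethe wrote it.",
))
```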
-
Whether you're using RAG or AI agents, you want to make sure they respond with "I don't know" instead of answering incorrectly. Cleanlab has come up with "TLM", which does this pretty well:

- The Trustworthy Language Model (TLM) uses a scoring system to evaluate LLM responses based on their trustworthiness. It flags answers that may be incorrect, letting you know when to ignore them.
- TLM works in real time, assessing the responses of models like GPT-4o. When the trustworthiness score drops below a threshold of 0.25, TLM overrides the response with a standard "I don't know" answer to prevent misinformation.
- The system doesn't just stop at filtering. TLM also improves responses automatically, making the output less error-prone without modifying the LLM or its prompts, which saves time in the revision process.
- For high-stakes applications, a stricter threshold of 0.8 can be set, which cuts incorrect responses by over 84%. But this has to be balanced, because a higher threshold means that some correct responses will also be filtered out.
- This approach allows for more reliable interaction with LLMs, especially for fact-based queries, which helps maintain user trust and enhances the overall quality of responses.

Link to the article: https://lnkd.in/gdM5BE9M
#AI #LLMs #RAG
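The gating pattern described above can be pictured with a small sketch. This is not Cleanlab's API: `score_trustworthiness` is a hypothetical stand-in for whatever trust scorer you plug in, and the 0.25 / 0.8 thresholds simply mirror the numbers in the post.

```python
# Sketch of trust-score gating: refuse instead of risking a wrong answer.
from typing import Callable

FALLBACK = "I don't know."

def gated_answer(
    prompt: str,
    generate: Callable[[str], str],                       # your LLM call
    score_trustworthiness: Callable[[str, str], float],   # hypothetical scorer: (prompt, answer) -> [0, 1]
    threshold: float = 0.25,                              # raise to 0.8 for high-stakes use cases
) -> str:
    answer = generate(prompt)
    score = score_trustworthiness(prompt, answer)
    # Below the threshold, fall back rather than return a possibly wrong answer.
    return answer if score >= threshold else FALLBACK
```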
-
Here is how you can install an open-source, enterprise-grade RAG system on your server (with the best document understanding I've seen).

First, something obvious to anyone trying to sell RAG in the market: you are crazy if you think companies will let their data travel to a hosted model. No one wants to send their data anywhere (those who do haven't found an alternative). Every single company would rather have an air-gapped system with no internet access.

GroundX is an open-source RAG system that you can run on your servers (or any cloud provider, as long as you have access to GPUs) and works without a network. (If the military wants to do RAG, this is precisely what they will be looking for.)

I installed GroundX on my AWS account and recorded a video to show you how to use it. There are two services you can use:
1. Ingest: This service uses a pretrained vision model to ingest and understand your knowledge base.
2. Search: This service combines text and vector search with a fine-tuned re-ranker model to retrieve information from your knowledge base.

A quick note about the Ingest service: 99% of people think they need better "retrieval" mechanisms. I think they need better "ingestion." That's where this service comes in! Ingest "understands" your documents in a way I haven't seen before. After you try it, you'll realize why showing your LLM your raw documents is a bad idea.

In the video, I use a free tool called X-Ray to test a document and understand how the Ingest service breaks it down. You can access this tool by signing up for a free GroundX cloud account and uploading your documents. You'll see a bit more about this in the video.

This is a game-changer for anyone who wants world-class RAG performance with top-notch security.

Here is GroundX's on-prem website: https://lnkd.in/eCvCd_jv
Sign up for a cloud account to start testing your documents for free. Then download the open-source repository and follow the instructions on this GitHub repository: https://lnkd.in/eY8zNavm

Disclaimer: I've been working with the team behind GroundX for a year+ now. I believe they have built one of the best RAG ecosystems in the world.
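For orientation, here is a conceptual sketch of the ingest-then-search flow described above. The endpoints, payloads, and helper functions are hypothetical placeholders, not the GroundX SDK or API; follow the GitHub instructions for the real client.

```python
# Conceptual two-step flow: ingest documents, then query the search service.
# Endpoints below are hypothetical, for illustration only.
import requests

INGEST_URL = "http://localhost:8080/ingest"   # hypothetical endpoint
SEARCH_URL = "http://localhost:8080/search"   # hypothetical endpoint

def ingest_document(path: str) -> None:
    """Send a raw document to the ingest service, which parses and indexes it."""
    with open(path, "rb") as f:
        requests.post(INGEST_URL, files={"file": f}).raise_for_status()

def search(query: str, top_k: int = 5) -> list[dict]:
    """Query the search service (text + vector search plus re-ranking)."""
    resp = requests.post(SEARCH_URL, json={"query": query, "n": top_k})
    resp.raise_for_status()
    return resp.json()["results"]

ingest_document("quarterly_report.pdf")
print(search("What was Q3 revenue?"))
```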
-
A new study shows that even the best financial LLMs hallucinate 41% of the time when faced with unexpected inputs.

FailSafeQA, a new benchmark from Writer, tests LLM robustness in finance by simulating real-world mishaps, including misspelled queries, incomplete questions, irrelevant documents, and OCR-induced errors. Evaluating 24 top models revealed that:
* OpenAI’s o3-mini, the most robust, hallucinated in 41% of perturbed cases
* Palmyra-Fin-128k-Instruct, the model best at refusing irrelevant queries, still struggled 17% of the time

FailSafeQA uniquely measures:
(1) Robustness - performance across query perturbations (e.g., misspelled, incomplete)
(2) Context Grounding - the ability to avoid hallucinations when context is missing or irrelevant
(3) Compliance - balancing robustness and grounding to minimize false responses

Developers building financial applications should implement explicit error handling that gracefully addresses context issues, rather than solely relying on model robustness. Developing systems to proactively detect and respond to problematic queries can significantly reduce costly hallucinations and enhance trust in LLM-powered financial apps.

Benchmark details: https://lnkd.in/gq-mijcD
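One way to apply the "explicit error handling" recommendation is to screen the query and the retrieved context before the model ever answers. A minimal sketch, with illustrative checks and thresholds that are not part of FailSafeQA:

```python
# Screen queries and context up front instead of trusting the model to stay grounded.
def answer_with_guards(query: str, context: str, generate) -> str:
    query = query.strip()
    if len(query) < 5:
        return "Could you rephrase your question? It looks incomplete."
    if not context.strip():
        return "I couldn't find relevant documents for that question, so I'd rather not guess."
    # Cheap lexical-overlap check to catch clearly irrelevant context (illustrative only).
    overlap = len(set(query.lower().split()) & set(context.lower().split()))
    if overlap == 0:
        return "The documents I retrieved don't seem related to your question."
    return generate(f"Answer strictly from this context:\n{context}\n\nQuestion: {query}")
```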
-
Happy Friday! This week in #learnwithmz, I’m building on my recent post about running LLMs/SLMs locally: https://lnkd.in/gpz3kXhD

Since sharing that, the landscape has rapidly evolved: local LLM tooling is more capable and deployment-ready than ever. In fact, at a conference last week, I was asked twice about private model hosting. Clearly, the demand is real. So let's dive deeper into the frameworks making local inference faster, easier, and more scalable.

Ollama (Most User-Friendly)
Run models like llama3, phi-3, and deepseek with one command.
https://ollama.com/

llama.cpp (Lightweight & C++-based)
Fast inference engine for quantized models.
https://lnkd.in/ghxrSnY3

MLC LLM (Cross-Platform Compiler Stack)
Runs LLMs on iOS, Android, and the web via TVM.
https://mlc.ai/mlc-llm/

ONNX Runtime (Enterprise-Ready)
Cross-platform, hardware-accelerated inference from Microsoft.
https://onnxruntime.ai/

LocalAI (OpenAI API-Compatible Local Inference)
Self-hosted server with model conversion, Whisper integration, and multi-backend support.
https://lnkd.in/gi4N8v5H

LM Studio (Best UI for Desktop)
A polished desktop interface to chat with local models.
https://lmstudio.ai/

Qualcomm AI Hub (For Snapdragon-powered Devices)
Deploy LLMs optimized for mobile and edge hardware.
https://lnkd.in/geDVwRb7

LiteRT (short for Lite Runtime), formerly known as TensorFlow Lite
Still solid for embedded and mobile deployments.
https://lnkd.in/g2QGSt9H

Core ML (Apple)
Optimized for deploying LLMs on Apple devices using Apple Silicon + Neural Engine.
https://lnkd.in/gBvkj_CP

MediaPipe (Google)
Optimized for LLM inference on Android devices.
https://lnkd.in/gZJzTcrq

Nexa AI SDK (Nexa AI)
Cross-platform SDK for integrating LLMs directly into mobile apps.
https://lnkd.in/gaVwv7-5

Why do local LLMs matter?
- Edge AI and privacy-first features are rising
- Cost, latency, and sovereignty concerns are real
- Mobile + desktop + web apps need on-device capabilities
- Developers + PMs: This is your edge. Building products with LLMs doesn't always need the cloud. Start testing local-first workflows.

What stack are you using or exploring?

#AI #LLMs #EdgeAI #OnDeviceAI #AIInfra #ProductManagement #Privacy #AItools #learnwithmz
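As a quick sanity check that local inference is working, here is a minimal call to Ollama's local REST endpoint (assumptions: `ollama serve` is running and the `llama3` model has been pulled; verify the route and fields against your installed Ollama version):

```python
# Smoke test against Ollama's local /api/generate route; runs fully on-device.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain quantization in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # the model's completion
```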
-
Nicolas Yax just turned LLMs into DNA samples. 🧬 And discovered how to trace their family trees.

PhyloLM applies genetic analysis to language models, revealing hidden relationships even in closed-source models where training details are secret.

The framework is brilliantly simple:
▪️ LLMs = populations
▪️ Prompts = genes
▪️ Generated tokens = alleles

By calculating genetic distances between model outputs, PhyloLM creates evolutionary trees showing which models share "ancestry."

🔍 What They Found: Using only generated tokens, they proved NeuralHermes was based on OpenHermes. No access to weights. No training logs. Just outputs. Think about that for a second.

📊 Why This Matters:
1️⃣ Model Attribution: Finally, a way to detect when someone fine-tuned your model without credit
2️⃣ Architecture Detective: Reveals shared training data or methods between models
3️⃣ Closed-Source Analysis: Works even when companies hide their model details

🧪 The Method:
• Feed identical prompts to different models
• Analyze token generation patterns
• Calculate "genetic distance" between outputs
• Build phylogenetic trees showing relationships

This is a forensic tool for the AI age. When everyone's building on everyone else's work, PhyloLM shows the real family tree.

Nicolas released everything, find the links in the comments 👇
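Here is an illustrative simplification of the recipe above: sample outputs from each model on shared prompts, then compare them pairwise. The whitespace "tokens" and Jaccard distance are stand-ins, not the exact distance used in the PhyloLM paper.

```python
# Simplified pairwise "genetic distance" between models from their outputs.
def sample_alleles(generate, prompts, k=4):
    """Collect the first k output words ("alleles") for each prompt ("gene")."""
    return {p: set(generate(p).split()[:k]) for p in prompts}

def genetic_distance(alleles_a, alleles_b):
    """Average Jaccard distance across prompts; 0 means identical behaviour."""
    scores = []
    for p in alleles_a:
        a, b = alleles_a[p], alleles_b[p]
        scores.append(1 - len(a & b) / max(len(a | b), 1))
    return sum(scores) / len(scores)

# Usage sketch (each model is a callable mapping prompt -> generated text):
# profiles = {name: sample_alleles(gen, PROMPTS) for name, gen in models.items()}
# dists = {(m, n): genetic_distance(profiles[m], profiles[n])
#          for m, n in itertools.combinations(models, 2)}
# The resulting distance matrix can be fed to any phylogenetic tree builder.
```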
-
Testing and evaluating LLM outputs when updating your prompts is a daunting task. Starting with a simple trial-and-error approach, it quickly becomes both time-consuming and inefficient...

💡 What about test-driven LLM development?

I recently discovered promptfoo (https://lnkd.in/ewpgwgKk), an open-source framework for systematically improving LLM prompt quality, which helped me a lot:
• Systematic Testing: Predefined test cases bring consistency and comprehensiveness to the testing process.
• Quality Evaluation: Side-by-side comparisons of LLM outputs are now streamlined, making it easier to spot differences and improvements.
• Efficiency Boost: Caching and concurrency features significantly speed up the evaluation process.
• Automatic Scoring: Test cases can automatically score outputs, removing subjectivity and enhancing objectivity.
• Versatility: Use it as a command-line interface (CLI), a library, or integrated into your CI/CD pipeline.
• Compatibility: The tool supports a wide range of LLM APIs, from OpenAI to Hugging Face, and even custom API providers.

🔥 The ultimate aim is to shift from a trial-and-error methodology to a test-driven LLM development process. This saves valuable time and ensures higher quality and reliability in the models we develop and deploy.

For anyone involved in LLM development, promptfoo is an excellent library to put in place locally or in a CI/CD testing step!

🗞️ More information in the blog post: https://lnkd.in/edR--kfj

#ai #llm #machinelearning #testing #promptengineering
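promptfoo itself is driven by a YAML config and a CLI, so the snippet below is only a framework-agnostic sketch of the test-driven idea: predefined cases with automatic assertions that any prompt change must pass. The prompt, variables, and `generate` callable are made up for illustration.

```python
# Framework-agnostic sketch of test-driven prompt development (not promptfoo's config).
TEST_CASES = [
    {"vars": {"city": "Paris"}, "must_contain": "Eiffel"},
    {"vars": {"city": "Rome"}, "must_contain": "Colosseum"},
]

PROMPT = "Name one famous landmark in {city}. Answer in one sentence."

def run_suite(generate):
    """Run every test case through the prompt; return the failures."""
    failures = []
    for case in TEST_CASES:
        output = generate(PROMPT.format(**case["vars"]))
        if case["must_contain"].lower() not in output.lower():
            failures.append((case, output))
    return failures  # an empty list means the prompt change passed every test case
```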
-
Another great paper on reasoning LLM efficiency. This one focuses on the relationship between reasoning length and model performance using diverse compression instructions (e.g., "use 10 words or less"). These papers provide good tips on how to leverage reasoning LLMs.

Specifically, it investigates how LLMs balance chain-of-thought (CoT) reasoning length against accuracy. It introduces token complexity, a minimal token threshold needed for correct problem-solving, and shows that even seemingly different CoT "compression prompts" (like "use bullet points" or "remove grammar") fall on the same universal accuracy–length trade-off curve.

Key highlights include:
• Universal accuracy–length trade-off – Despite prompting LLMs in diverse ways to shorten reasoning (e.g., "be concise," "no spaces," "Chinese CoT"), all prompts cluster on a single trade-off curve. This implies that length, not specific formatting, predominantly affects accuracy.
• Token complexity as a threshold – For each question, there's a sharp cutoff in the number of tokens required to yield the correct answer. If the LLM's CoT is shorter than this "token complexity," it fails. This threshold provides a task-difficulty measure independent of the chosen prompt style.
• Information-theoretic upper bound – By treating CoT compression as a "lossy coding" problem, the authors derive theoretical limits on how short a correct reasoning chain can be. Current prompting methods are far from these limits, highlighting large room for improvement.
• Importance of adaptive compression – The best strategy would match CoT length to problem difficulty, using minimal tokens for easy questions and more thorough CoTs for harder ones. Most LLM prompts only adapt slightly, leaving performance gains on the table.
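If you want to probe this trade-off on your own tasks, a rough harness looks like the sketch below (not the paper's code; `generate` and `is_correct` are placeholders, and whitespace splitting is only a crude proxy for token counts):

```python
# Sweep compression instructions and record CoT length vs. correctness.
COMPRESSION_PROMPTS = [
    "",                                           # baseline: full chain of thought
    "Be concise.",
    "Use 10 words or less for your reasoning.",
    "Use bullet points only.",
]

def sweep(question, reference, generate, is_correct):
    rows = []
    for instruction in COMPRESSION_PROMPTS:
        output = generate(f"{instruction}\nThink step by step, then answer: {question}")
        rows.append({
            "instruction": instruction or "(none)",
            "cot_tokens": len(output.split()),    # crude proxy for token count
            "correct": is_correct(output, reference),
        })
    return rows  # aggregate over many questions to see the accuracy-length curve
```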
-
SumAsk, some don't: Solving one of the hardest NLP tasks without fine-tuning 😳

Relation Extraction (RE) is one of those things that may sound easy at first but is really, really hard. Recent advancements in Large Language Models (LLMs) like ChatGPT have raised the question: can we bypass traditional data labeling for RE?

This study explores how LLMs can serve as zero-shot relation extractors, optimizing the process via innovative prompt designs. An approach called SumAsk uses a summarize-and-ask technique that converts RE tasks into a question-answering format that LLMs can effectively address.

The results not only highlight a significant boost in the performance of LLMs on various benchmarks but also demonstrate that ChatGPT can achieve competitive or even superior outcomes compared to other zero-shot and even fully supervised methods.

While this doesn't spell the end of traditional methods, the implications for speed and accuracy in information extraction are significant. Picture more nuanced data analysis at a fraction of current turnaround times, without the need to fine-tune again and again for every single problem you're facing. Whether you like LLMs or not, they will definitely make data analysis more flexible than ever.

[arXiv] https://lnkd.in/dqFb69Xc

↓ Liked this post? Get weekly AI highlights and papers-of-the-week directly to your inbox 👉 llmwatch.com
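A rough sketch of the summarize-and-ask idea, assuming a generic `llm` callable: summarize the sentence around a candidate entity pair, then ask a yes/no question per relation type. The prompts and relation list are illustrative, not the paper's exact templates.

```python
# Summarize-and-ask style zero-shot relation extraction (illustrative prompts only).
RELATIONS = ["founded", "works_for", "born_in"]

def sumask(sentence, head, tail, llm):
    summary = llm(
        f"Summarize the relationship between '{head}' and '{tail}' "
        f"in this sentence in one short sentence:\n{sentence}"
    )
    for relation in RELATIONS:
        verdict = llm(
            f"Summary: {summary}\n"
            f"Question: does the relation '{relation}' hold between '{head}' and '{tail}'? "
            "Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return relation
    return "no_relation"

# Usage: sumask("Steve Jobs co-founded Apple in 1976.", "Steve Jobs", "Apple", llm=my_llm_call)
```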