Course content
- AI Evals (01:00)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 1 (10:50)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 2 (05:13)
- 2. HW 1&2 walkthrough with Phoenix (pre-recorded) (15:04)
- 3. HW 1&2 walkthrough with LangSmith (pre-recorded) (22:41)
- 1. HW 3 walkthrough with Braintrust (pre-recorded) (21:41)
- 2. HW 3 walkthrough with Phoenix (pre-recorded) (16:39)
- 1. HW 4 walkthrough with Braintrust (pre-recorded) (23:11)
- 2. HW 4 walkthrough with Phoenix (pre-recorded) (16:38)
- 1. HW 5 walkthrough with Braintrust (pre-recorded) (22:03)
- 2. HW 5 walkthrough with Phoenix (pre-recorded) (14:57)
- 1. Lesson 1: Fundamentals & Lifecycle LLM Application Evaluation (56:40)
- 2. Lesson 2: Systematic Error Analysis (1:01:39)
- 3. Braintrust Tutorial w/ Wayde Gilliam (43:02)
- 4. Optional: Office Hours (1:40:14)
- AIE Braintrust Intro (05:00)
- Lesson 1 (01:00)
- Lesson 2 (01:00)
- 5. Lesson 3: More Error Analysis & Collaborative Evaluation (59:34)
- 6. Lesson 4: Automated Evaluators (1:00:34)
- 7. Taming diffusion QR codes with evals and inference-time scaling w/ Charles Frye (44:43)
- 8. 10x Your RAG Evaluation by Avoiding these Pitfalls w/ Skylar Payne (28:25)
- 9. Optional: Office Hours (1:18:26)
- 10. Optional: Office Hours (47:11)
- Lesson 3 (01:00)
- Lesson 4 (01:00)
- 11. Lesson 5: More Automated Evaluators (05:13)
- 12. Lesson 6: RAG & Complex Architectures (59:46)
- 13. Scaling Inference-Time Compute for Better LLM Judges w/ Leonard Tang (31:08)
- 14. Building custom eval tools with coding agents w/ Isaac Flath (46:39)
- 15. From Vibe Checks to Evals to Feedback Loops: Case Studies in AI System Maturities w/ David Karam (30:02)
- 16. A Playbook For Building AI Agents You Can Trust w/ Udi Menkes (38:25)
- 17. AI Evals in Vertical Industries (such as healthcare, finance and law) w/ Dr Chris Lovejoy (34:15)
- 18. Arize Phoenix Tutorial w/ Mikyo King (49:02)
- 19. Optional: Office Hours (22:32)
- 20. Optional: Office Hours (24:20)
- 21. Optional: Office Hours (55:49)
- Building Custom Eval Tools with cod (01:00)
- Lesson 5 (01:00)
- Lesson 6 (01:00)
- 22. Lesson 7: Efficient Continuous Human Review Systems (59:02)
- 23. Lesson 8: Cost Optimization (1:03:11)
- 24. Techniques for evaluating agents w/ Sally Ann De Lucia (Arize) (33:37)
- 25. LangSmith Tutorial w/ Harrison Chase (48:24)
- 26. From Noob to 5 Automated Evals in 4 Weeks (as a PM) w/ Teresa Torres (1:10:21)
- 27. SolveIt: The Thinking Developer's Environment w/ Jeremy Howard & Johno Whitaker (1:42:26)
- 28. Testing Real AI Products LIVE w/ Robert Ta (1:00:49)
- 29. Fireside Chat with DSP Creator w/ Omar Khattab (44:59)
- 30. Optional: Office Hours (1:06:30)
- 31. Optional: Office Hours (Bonus) (1:05:26)
- Lesson 7 (01:00)
- Lesson 8 (01:00)
Requirements
- Basic understanding of large language models and AI applications.
- Familiarity with programming concepts and software development workflows.
- Experience working with or building AI products is helpful but not required.
- Access to a computer with an internet connection for hands-on exercises.
Description
Evaluating artificial intelligence systems has become one of the most critical skills for anyone building or managing AI products. As large language models and generative AI become embedded in production applications, the ability to systematically measure their performance, reliability, and safety is no longer optional. This comprehensive course provides engineers and product managers with the frameworks, methodologies, and practical tools needed to build robust evaluation systems for AI applications.
The course begins by establishing a solid foundation in evaluation theory and practice. You will learn why traditional software testing approaches fall short when applied to AI systems, and how probabilistic outputs, contextual dependencies, and emergent behaviors require fundamentally different evaluation strategies. This section introduces the core concepts of AI evaluation, including the distinction between intrinsic and extrinsic metrics, the role of human judgment in evaluation pipelines, and the trade-offs between automated and manual assessment methods.
Once the conceptual groundwork is laid, the course moves into designing evaluation frameworks tailored to specific AI use cases. You will explore how to define success criteria for different types of AI applications, from question-answering systems and content generation tools to retrieval-augmented generation pipelines and agentic systems. This involves identifying the dimensions of quality that matter most for your application, such as factual accuracy, relevance, coherence, safety, and latency. You will learn how to translate business requirements into measurable evaluation metrics and how to balance multiple competing objectives within a single evaluation framework.
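The weighted-criteria approach described above can be sketched in a few lines of Python. The dimension names, weights, and thresholds here are illustrative assumptions, not values taught in the course:

```python
from dataclasses import dataclass

@dataclass
class QualityDimension:
    name: str
    weight: float     # relative importance in the overall score
    threshold: float  # minimum acceptable score on a 0..1 scale

# Hypothetical criteria for a question-answering application.
CRITERIA = [
    QualityDimension("factual_accuracy", weight=0.5, threshold=0.9),
    QualityDimension("relevance",        weight=0.3, threshold=0.8),
    QualityDimension("coherence",        weight=0.2, threshold=0.7),
]

def aggregate(scores: dict[str, float]) -> tuple[float, bool]:
    """Weighted overall score plus pass/fail against per-dimension thresholds."""
    overall = sum(d.weight * scores[d.name] for d in CRITERIA)
    passed = all(scores[d.name] >= d.threshold for d in CRITERIA)
    return overall, passed

score, ok = aggregate({"factual_accuracy": 0.95, "relevance": 0.85, "coherence": 0.9})
# score = 0.5*0.95 + 0.3*0.85 + 0.2*0.9 = 0.91; all thresholds met, so ok is True
```

Keeping per-dimension thresholds alongside the weighted total prevents one strong dimension from masking an unacceptable failure in another, which is one way to balance competing objectives in a single framework.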
The next phase focuses on building evaluation datasets and test suites. Creating high-quality evaluation data is often the most time-consuming and challenging aspect of AI evaluation. You will learn strategies for generating diverse test cases that cover edge cases, adversarial inputs, and representative real-world scenarios. The course covers techniques for synthetic data generation, data augmentation, and crowdsourced evaluation, as well as best practices for versioning and maintaining evaluation datasets as your AI system evolves. You will also explore how to design golden datasets with ground truth labels and how to handle situations where ground truth is ambiguous or unavailable.
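As a rough illustration of the golden-dataset design described above, one common pattern is to store every acceptable answer when ground truth is ambiguous, plus a version tag so results stay comparable as the dataset evolves. The schema and field names here are hypothetical, not the course's:

```python
# Hypothetical schema for one versioned golden-dataset entry.
golden_example = {
    "id": "qa-0042",
    "dataset_version": "2024-06-01",
    "input": "What is the capital of Australia?",
    "acceptable_answers": ["Canberra"],  # list multiple when truth is ambiguous
    "tags": ["geography", "edge_case:commonly_confused"],
}

def is_correct(model_output: str, example: dict) -> bool:
    """Exact-match grading against any acceptable answer (case-insensitive)."""
    normalized = model_output.strip().lower()
    return any(normalized == a.lower() for a in example["acceptable_answers"])

print(is_correct("canberra", golden_example))  # True
print(is_correct("Sydney", golden_example))    # False
```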
With evaluation frameworks and datasets in place, the course turns to implementation. You will learn how to build automated evaluation pipelines that run continuously as part of your development workflow. This includes integrating evaluation into CI/CD systems, setting up monitoring dashboards, and creating alerting mechanisms for performance degradation. You will work with common evaluation libraries and tools used in industry, understanding how to configure them for your specific needs and how to extend them with custom metrics. The course also covers prompt evaluation, showing you how to systematically compare different prompt formulations and select the most effective ones based on quantitative evidence.
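The prompt-comparison loop described above can be sketched as follows. `run_model`, the grader, and the prompt templates are all stand-ins for your application's own pieces, and a real pipeline would typically delegate the bookkeeping to a tool such as Braintrust, Phoenix, or LangSmith; the control flow is the same:

```python
def evaluate_prompt(prompt_template, test_cases, run_model, is_correct):
    """Pass rate of one prompt variant over a fixed test set."""
    passed = 0
    for case in test_cases:
        output = run_model(prompt_template.format(**case["input"]))
        if is_correct(output, case["expected"]):
            passed += 1
    return passed / len(test_cases)

# Fake model for illustration: only answers one question correctly.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

cases = [
    {"input": {"question": "2 + 2"}, "expected": "4"},
    {"input": {"question": "3 + 3"}, "expected": "6"},
]
grader = lambda out, expected: out == expected

v1 = evaluate_prompt("Answer: {question}", cases, fake_model, grader)
v2 = evaluate_prompt("Compute exactly: {question}", cases, fake_model, grader)
print(v1, v2)  # compare pass rates and keep the better-scoring variant
```

Running the same loop on every commit, and alerting when the pass rate drops, is the CI/CD integration the course describes.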
Model comparison and selection receive focused attention in a dedicated section. You will learn rigorous methods for comparing the performance of different models, including statistical significance testing, confidence intervals, and handling variance in evaluation results. The course teaches you how to conduct fair comparisons that account for differences in model size, cost, and latency, and how to make informed trade-off decisions when no single model dominates across all metrics. You will also explore multi-model evaluation scenarios and ensemble approaches.
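A bootstrap confidence interval is one standard way to implement the significance testing mentioned above. This sketch uses made-up paired pass/fail scores for two models on the same test cases:

```python
import random

def bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean difference (A - B) on paired per-example scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented paired results (1 = pass) for two models on the same 20 test cases.
a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
lo, hi = bootstrap_ci(a, b)
# If the interval excludes 0, the observed gap is unlikely to be sampling noise.
```

Pairing the scores per example, rather than comparing two independent averages, removes per-question difficulty as a source of variance and usually tightens the interval.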
The course addresses advanced evaluation challenges that arise in production systems. You will learn how to evaluate retrieval quality in RAG systems, including metrics for measuring relevance, coverage, and ranking performance. For agentic AI systems, the course covers evaluating multi-step reasoning, tool use, and decision-making capabilities. You will explore techniques for evaluating safety, bias, and fairness, and learn how to implement guardrails based on evaluation results. The course also discusses evaluating AI systems at scale, including sampling strategies, cost management, and balancing evaluation thoroughness with practical constraints.
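Two of the standard retrieval metrics mentioned above, recall@k and mean reciprocal rank, can be computed in a few lines; the document IDs below are invented for illustration:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

retrieved = ["doc7", "doc2", "doc9", "doc1", "doc4"]  # ranked retriever output
relevant = ["doc2", "doc4", "doc8"]                   # ground-truth labels

print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant docs in top 5
print(mrr(retrieved, relevant))               # first hit at rank 2 -> 0.5
```

Recall measures coverage of the relevant set while MRR rewards ranking relevant documents early; evaluating both separates "the retriever never finds it" failures from "it finds it too late" failures.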
Throughout the course, emphasis is placed on evaluation-driven development as a mindset and methodology. You will learn how to use evaluation as a feedback loop for continuous improvement, how to prioritize development efforts based on evaluation insights, and how to communicate evaluation results effectively to stakeholders. By the end, you will have the skills and confidence to design, implement, and maintain comprehensive evaluation systems that ensure your AI applications meet quality standards and deliver reliable value in production environments.
Who this course is for:
AI Evals For Engineers, PMs is designed for engineers building LLM applications who need to measure and improve model performance, product managers overseeing AI products who want to make informed decisions based on systematic evaluation, technical leaders responsible for ensuring quality and reliability in AI systems, developers transitioning into AI engineering who need practical evaluation skills, and anyone working with AI who wants to move beyond subjective assessment to rigorous, data-driven evaluation methods.
Instructor
Parlance Labs
About Me
We are a specialized AI research and education organization focused on advancing the practical application of evaluation methodologies in artificial intelligence systems. Our work sits at the intersection of machine learning engineering, product development, and quality assurance, where we explore the critical question of how to measure and improve AI system performance in real-world applications.
Our expertise emerged from years of working directly with production AI systems and confronting the challenges that traditional software testing approaches cannot address. We recognized early that as AI models become more capable and complex, the need for rigorous evaluation frameworks becomes paramount. This insight drove us to develop comprehensive methodologies for assessing large language models, retrieval systems, and agentic AI applications.
Our approach combines theoretical rigor with practical applicability. We believe that effective evaluation requires understanding both the mathematical foundations of metrics and the messy realities of production systems. We work extensively with automated evaluation pipelines, custom metric design, and the integration of human judgment into scalable assessment workflows. Our methods have been refined through collaboration with engineering teams building diverse AI applications, from customer service chatbots to complex reasoning systems.
We place particular emphasis on evaluation-driven development as a practice and mindset. Rather than treating evaluation as an afterthought or checkpoint, we advocate for making it central to the development process. Our frameworks help teams identify failure modes early, compare alternatives systematically, and make data-informed decisions about model selection, prompt engineering, and system architecture.
Our commitment extends to making evaluation knowledge accessible and actionable. We translate cutting-edge research into practical guidance that engineers and product managers can implement immediately. We focus on real challenges like handling ambiguous ground truth, managing evaluation costs at scale, and communicating results to non-technical stakeholders. Through our educational programs, we aim to elevate evaluation practices across the AI industry and help teams build more reliable, trustworthy AI systems.