Course content
- AI Evals (01:00)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 1 (10:50)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 2 (05:13)
- 2. HW 1&2 walkthrough with Phoenix (pre-recorded) (15:04)
- 3. HW 1&2 walkthrough with LangSmith (pre-recorded) (22:41)
- 1. HW 3 walkthrough with Braintrust (pre-recorded) (21:41)
- 2. HW 3 walkthrough with Phoenix (pre-recorded) (16:39)
- 1. HW 4 walkthrough with Braintrust (pre-recorded) (23:11)
- 2. HW 4 walkthrough with Phoenix (pre-recorded) (16:38)
- 1. HW 5 walkthrough with Braintrust (pre-recorded) (22:03)
- 2. HW 5 walkthrough with Phoenix (pre-recorded) (14:57)
- 1. Lesson 1: Fundamentals & Lifecycle LLM Application Evaluation (56:40)
- 2. Lesson 2: Systematic Error Analysis (1:01:39)
- 3. Braintrust Tutorial w/ Wayde Gilliam (43:02)
- 4. Optional: Office Hours (1:40:14)
- AIE Braintrust Intro (05:00)
- Lesson 1 (01:00)
- Lesson 2 (01:00)
- 5. Lesson 3: More Error Analysis & Collaborative Evaluation (59:34)
- 6. Lesson 4: Automated Evaluators (1:00:34)
- 7. Taming diffusion QR codes with evals and inference-time scaling w/ Charles Frye (44:43)
- 8. 10x Your RAG Evaluation by Avoiding these Pitfalls w/ Skylar Payne (28:25)
- 9. Optional: Office Hours (1:18:26)
- 10. Optional: Office Hours (47:11)
- Lesson 3 (01:00)
- Lesson 4 (01:00)
- 11. Lesson 5: More Automated Evaluators (05:13)
- 12. Lesson 6: RAG & Complex Architectures (59:46)
- 13. Scaling Inference-Time Compute for Better LLM Judges w/ Leonard Tang (31:08)
- 14. Building custom eval tools with coding agents w/ Isaac Flath (46:39)
- 15. From Vibe Checks to Evals to Feedback Loops: Case Studies in AI System Maturities w/ David Karam (30:02)
- 16. A Playbook For Building AI Agents You Can Trust w/ Udi Menkes (38:25)
- 17. AI Evals in Vertical Industries (such as healthcare, finance and law) w/ Dr Chris Lovejoy (34:15)
- 18. Arize Phoenix Tutorial w/ Mikyo King (49:02)
- 19. Optional: Office Hours (22:32)
- 20. Optional: Office Hours (24:20)
- 21. Optional: Office Hours (55:49)
- Building Custom Eval Tools with cod (01:00)
- Lesson 5 (01:00)
- Lesson 6 (01:00)
- 22. Lesson 7: Efficient Continuous Human Review Systems (59:02)
- 23. Lesson 8: Cost Optimization (1:03:11)
- 24. Techniques for evaluating agents w/ Sally Ann De Lucia (Arize) (33:37)
- 25. LangSmith Tutorial w/ Harrison Chase (48:24)
- 26. From Noob to 5 Automated Evals in 4 Weeks (as a PM) w/ Teresa Torres (1:10:21)
- 27. SolveIt: The Thinking Developer's Environment w/ Jeremy Howard & Johno Whitaker (1:42:26)
- 28. Testing Real AI Products LIVE w/ Robert Ta (1:00:49)
- 29. Fireside Chat with DSP Creator w/ Omar Khattab (44:59)
- 30. Optional: Office Hours (1:06:30)
- 31. Optional: Office Hours (Bonus) (1:05:26)
- Lesson 7 (01:00)
- Lesson 8 (01:00)
Requirements
- Basic understanding of large language models and AI applications.
- Familiarity with programming concepts and software development workflows.
- Experience working with or building AI products is helpful but not required.
- Access to a computer with an internet connection for hands-on exercises.
Description
Evaluating artificial intelligence systems has become one of the most critical skills for anyone building or managing AI products. As large language models and generative AI become embedded in production applications, the ability to systematically measure their performance, reliability, and safety is no longer optional. This comprehensive course provides engineers and product managers with the frameworks, methodologies, and practical tools needed to build robust evaluation systems for AI applications.
The course begins by establishing a solid foundation in evaluation theory and practice. You will learn why traditional software testing approaches fall short when applied to AI systems, and how probabilistic outputs, contextual dependencies, and emergent behaviors require fundamentally different evaluation strategies. This section introduces the core concepts of AI evaluation, including the distinction between intrinsic and extrinsic metrics, the role of human judgment in evaluation pipelines, and the trade-offs between automated and manual assessment methods.
Once the conceptual groundwork is laid, the course moves into designing evaluation frameworks tailored to specific AI use cases. You will explore how to define success criteria for different types of AI applications, from question-answering systems and content generation tools to retrieval-augmented generation pipelines and agentic systems. This involves identifying the dimensions of quality that matter most for your application, such as factual accuracy, relevance, coherence, safety, and latency. You will learn how to translate business requirements into measurable evaluation metrics and how to balance multiple competing objectives within a single evaluation framework.
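The weighted-criteria approach described above can be sketched in a few lines of Python. The dimension names, weights, and thresholds here are illustrative assumptions, not values taught in the course:

```python
from dataclasses import dataclass

@dataclass
class QualityDimension:
    name: str
    weight: float     # relative importance in the overall score
    threshold: float  # minimum acceptable score on a 0..1 scale

# Hypothetical criteria for a question-answering application.
CRITERIA = [
    QualityDimension("factual_accuracy", weight=0.5, threshold=0.9),
    QualityDimension("relevance",        weight=0.3, threshold=0.8),
    QualityDimension("coherence",        weight=0.2, threshold=0.7),
]

def aggregate(scores: dict[str, float]) -> tuple[float, bool]:
    """Weighted overall score plus pass/fail against per-dimension thresholds."""
    overall = sum(d.weight * scores[d.name] for d in CRITERIA)
    passed = all(scores[d.name] >= d.threshold for d in CRITERIA)
    return overall, passed

score, ok = aggregate({"factual_accuracy": 0.95, "relevance": 0.85, "coherence": 0.9})
# score = 0.5*0.95 + 0.3*0.85 + 0.2*0.9 = 0.91; all thresholds met, so ok is True
```

Keeping per-dimension thresholds alongside the weighted total prevents one strong dimension from masking an unacceptable failure in another, which is one way to balance competing objectives in a single framework.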
The next phase focuses on building evaluation datasets and test suites. Creating high-quality evaluation data is often the most time-consuming and challenging aspect of AI evaluation. You will learn strategies for generating diverse test cases that cover edge cases, adversarial inputs, and representative real-world scenarios. The course covers techniques for synthetic data generation, data augmentation, and crowdsourced evaluation, as well as best practices for versioning and maintaining evaluation datasets as your AI system evolves. You will also explore how to design golden datasets with ground truth labels and how to handle situations where ground truth is ambiguous or unavailable.
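As a rough illustration of the golden-dataset design described above, one common pattern is to store every acceptable answer when ground truth is ambiguous, plus a version tag so results stay comparable as the dataset evolves. The schema and field names here are hypothetical, not the course's:

```python
# Hypothetical schema for one versioned golden-dataset entry.
golden_example = {
    "id": "qa-0042",
    "dataset_version": "2024-06-01",
    "input": "What is the capital of Australia?",
    "acceptable_answers": ["Canberra"],  # list multiple when truth is ambiguous
    "tags": ["geography", "edge_case:commonly_confused"],
}

def is_correct(model_output: str, example: dict) -> bool:
    """Exact-match grading against any acceptable answer (case-insensitive)."""
    normalized = model_output.strip().lower()
    return any(normalized == a.lower() for a in example["acceptable_answers"])

print(is_correct("canberra", golden_example))  # True
print(is_correct("Sydney", golden_example))    # False
```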
With evaluation frameworks and datasets in place, the course turns to implementation. You will learn how to build automated evaluation pipelines that run continuously as part of your development workflow. This includes integrating evaluation into CI/CD systems, setting up monitoring dashboards, and creating alerting mechanisms for performance degradation. You will work with common evaluation libraries and tools used in industry, understanding how to configure them for your specific needs and how to extend them with custom metrics. The course also covers prompt evaluation, showing you how to systematically compare different prompt formulations and select the most effective ones based on quantitative evidence.
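The prompt-comparison loop described above can be sketched as follows. `run_model`, the grader, and the prompt templates are all stand-ins for your application's own pieces, and a real pipeline would typically delegate the bookkeeping to a tool such as Braintrust, Phoenix, or LangSmith; the control flow is the same:

```python
def evaluate_prompt(prompt_template, test_cases, run_model, is_correct):
    """Pass rate of one prompt variant over a fixed test set."""
    passed = 0
    for case in test_cases:
        output = run_model(prompt_template.format(**case["input"]))
        if is_correct(output, case["expected"]):
            passed += 1
    return passed / len(test_cases)

# Fake model for illustration: only answers one question correctly.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

cases = [
    {"input": {"question": "2 + 2"}, "expected": "4"},
    {"input": {"question": "3 + 3"}, "expected": "6"},
]
grader = lambda out, expected: out == expected

v1 = evaluate_prompt("Answer: {question}", cases, fake_model, grader)
v2 = evaluate_prompt("Compute exactly: {question}", cases, fake_model, grader)
print(v1, v2)  # compare pass rates and keep the better-scoring variant
```

Running the same loop on every commit, and alerting when the pass rate drops, is the CI/CD integration the course describes.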
Model comparison and selection receive focused attention in a dedicated section. You will learn rigorous methods for comparing the performance of different models, including statistical significance testing, confidence intervals, and handling variance in evaluation results. The course teaches you how to conduct fair comparisons that account for differences in model size, cost, and latency, and how to make informed trade-off decisions when no single model dominates across all metrics. You will also explore multi-model evaluation scenarios and ensemble approaches.
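A bootstrap confidence interval is one standard way to implement the significance testing mentioned above. This sketch uses made-up paired pass/fail scores for two models on the same test cases:

```python
import random

def bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean difference (A - B) on paired per-example scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented paired results (1 = pass) for two models on the same 20 test cases.
a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
lo, hi = bootstrap_ci(a, b)
# If the interval excludes 0, the observed gap is unlikely to be sampling noise.
```

Pairing the scores per example, rather than comparing two independent averages, removes per-question difficulty as a source of variance and usually tightens the interval.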
The course addresses advanced evaluation challenges that arise in production systems. You will learn how to evaluate retrieval quality in RAG systems, including metrics for measuring relevance, coverage, and ranking performance. For agentic AI systems, the course covers evaluating multi-step reasoning, tool use, and decision-making capabilities. You will explore techniques for evaluating safety, bias, and fairness, and learn how to implement guardrails based on evaluation results. The course also discusses evaluating AI systems at scale, including sampling strategies, cost management, and balancing evaluation thoroughness with practical constraints.
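Two of the standard retrieval metrics mentioned above, recall@k and mean reciprocal rank, can be computed in a few lines; the document IDs below are invented for illustration:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

retrieved = ["doc7", "doc2", "doc9", "doc1", "doc4"]  # ranked retriever output
relevant = ["doc2", "doc4", "doc8"]                   # ground-truth labels

print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant docs in top 5
print(mrr(retrieved, relevant))               # first hit at rank 2 -> 0.5
```

Recall measures coverage of the relevant set while MRR rewards ranking relevant documents early; evaluating both separates "the retriever never finds it" failures from "it finds it too late" failures.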
Throughout the course, emphasis is placed on evaluation-driven development as a mindset and methodology. You will learn how to use evaluation as a feedback loop for continuous improvement, how to prioritize development efforts based on evaluation insights, and how to communicate evaluation results effectively to stakeholders. By the end, you will have the skills and confidence to design, implement, and maintain comprehensive evaluation systems that ensure your AI applications meet quality standards and deliver reliable value in production environments.
Who this course is for:
AI Evals For Engineers, PMs is designed for engineers building LLM applications who need to measure and improve model performance, product managers overseeing AI products who want to make informed decisions based on systematic evaluation, technical leaders responsible for ensuring quality and reliability in AI systems, developers transitioning into AI engineering who need practical evaluation skills, and anyone working with AI who wants to move beyond subjective assessment to rigorous, data-driven evaluation methods.
Instructor
Parlance Labs
About Me
We are a specialized AI research and education organization focused on advancing the practical application of evaluation methodologies in artificial intelligence systems. Our work sits at the intersection of machine learning engineering, product development, and quality assurance, where we explore the critical question of how to measure and improve AI system performance in real-world applications.
Our expertise emerged from years of working directly with production AI systems and confronting the challenges that traditional software testing approaches cannot address. We recognized early that as AI models become more capable and complex, the need for rigorous evaluation frameworks becomes paramount. This insight drove us to develop comprehensive methodologies for assessing large language models, retrieval systems, and agentic AI applications.
Our approach combines theoretical rigor with practical applicability. We believe that effective evaluation requires understanding both the mathematical foundations of metrics and the messy realities of production systems. We work extensively with automated evaluation pipelines, custom metric design, and the integration of human judgment into scalable assessment workflows. Our methods have been refined through collaboration with engineering teams building diverse AI applications, from customer service chatbots to complex reasoning systems.
We place particular emphasis on evaluation-driven development as a practice and mindset. Rather than treating evaluation as an afterthought or checkpoint, we advocate for making it central to the development process. Our frameworks help teams identify failure modes early, compare alternatives systematically, and make data-informed decisions about model selection, prompt engineering, and system architecture.
Our commitment extends to making evaluation knowledge accessible and actionable. We translate cutting-edge research into practical guidance that engineers and product managers can implement immediately. We focus on real challenges like handling ambiguous ground truth, managing evaluation costs at scale, and communicating results to non-technical stakeholders. Through our educational programs, we aim to elevate evaluation practices across the AI industry and help teams build more reliable, trustworthy AI systems.