Join us for a live webinar on evaluating LLMs and agentic systems, with a practical end-to-end framework that shows how to combine qualitative review, structured human evaluation, and benchmarks to measure what matters in production.
Large language models and agentic systems are moving quickly from prototypes into production, but knowing how to evaluate them effectively remains one of the biggest challenges teams face.
In this webinar, we’ll explore the full spectrum of LLM and agent evaluation approaches, from lightweight qualitative reviews and “gut checks” to structured human evaluations and automated benchmarks. Rather than framing these methods as tradeoffs, we’ll show how they work best together across different stages of development.
We’ll dig into where human judgment is essential: evaluating usefulness, reasoning quality, safety, and alignment with real user needs. You’ll learn why benchmarks alone often fall short, how to avoid common evaluation pitfalls, and how to incorporate human review at scale without slowing teams down.
You’ll walk away with:
Micaela Kaplan
Machine Learning Evangelist, HumanSignal
Micaela Kaplan is the Machine Learning Evangelist at HumanSignal. With a background in applied data science and a master’s degree in Computational Linguistics, she loves helping others understand AI tools and practices.