How do you select LLMs to deploy on a GPU cluster for hundreds of engineers? Benchmark results are proudly presented for every new model, but what is behind the numbers? We guide you through designing benchmarks that actually measure the real-world problem-solving skills of LLMs.
In this talk, we discuss methods for measuring business-relevant LLM skills such as coding, tool calling, and performance in agentic workflows. After exposing some of the issues with popular benchmarks, we show you different ways to evaluate LLMs, including evaluation against ground truth, LLM-as-a-judge, and testing LLM-generated code. We also highlight common pitfalls and show how prompt variations and generation parameters can silently skew your benchmark. Finally, we provide a decision framework for selecting LLMs that balances latency, cost, and output quality.
You will leave equipped to understand the advertised performance of LLMs - beyond the score. Get ready to evaluate LLMs like an engineer!
Lennard Schiefelbein
TNG Technology Consulting
Lennard Schiefelbein is a Senior Consultant at TNG Technology Consulting. In addition to his work as an engineer on various AI projects, he contributes to multiple topics in TNG's Innovation Hacking team, where he has worked in fields including reinforcement learning, computer vision, and large language models. Prior to joining TNG, he studied mathematics in Bonn and at TU Munich.
Jonathan du Mesnil de Rochemont
TNG Technology Consulting
Jonathan du Mesnil de Rochemont is a Software Consultant at TNG Technology Consulting. As part of the Innovation Hacking team, his main focus is the benchmarking and evaluation of large language models. He is also involved in various machine learning and robotics projects. Before joining TNG, he studied computer science at RWTH Aachen with a focus on graph learning.
