Speaker - Devoxx UK

Lennard Schiefelbein

TNG Technology Consulting

Lennard Schiefelbein is a Senior Consultant at TNG Technology Consulting. In addition to his work as an engineer on various AI projects, he contributes to multiple topics in TNG's Innovation Hacking team, where he has worked in fields including reinforcement learning, computer vision, and large language models. Prior to joining TNG, he studied mathematics in Bonn and at TU Munich.

Beyond the Score: Your Guide to Benchmarking LLMs

Conference (INTERMEDIATE level)

Wednesday from 17:00 to 17:50

Room D

Benchmark Performance LLM Evaluation

How do you select LLMs to deploy on a GPU cluster for hundreds of engineers? LLM benchmark results are proudly presented for every new model, but what's behind the numbers? We guide you through designing benchmarks that actually measure the real-world problem-solving skills of LLMs.

In this talk, we discuss methods for measuring business-relevant LLM skills such as coding, tool calling, and performance in agentic workflows. After exposing some of the issues with popular benchmarks, we show you different methods to evaluate LLMs. Discover techniques such as evaluations with ground truth, LLM-as-a-judge, and testing LLM-generated code. Additionally, we highlight common pitfalls and show you how prompt variations and generation parameters can silently impact your benchmark. Finally, we provide a decision framework that balances latency, cost, and output quality when selecting LLMs.

You will be equipped to better understand the advertised performance of LLMs - beyond the score. Get ready to evaluate LLMs like an engineer!
