Legare Kerrison is an Open Source Engineer and Developer Advocate on Red Hat's AI team. She focuses on open source tools for building and deploying AI, currently working with projects like vLLM and Podman Desktop, and aims to make technical complexity digestible. She loves matcha and the outdoors. Based in Boston.
In the past year, thanks to open source, we’ve seen a fundamental shift: developers and enterprises are moving away from proprietary, closed models. To cut costs, protect privacy, and enable customization, they are building, testing, and deploying their own open models. This journey can feel overwhelming, though. Which foundation model should I use? How do I connect my model to existing data sources, or build agentic capabilities, to start seeing real value from AI, especially in an existing Java application?
The key to navigating this emerging path is adopting the flexibility, transparency, and collaboration of open source that many of us already know. We'll walk through the critical aspects of implementing AI features with LangChain4j, along with observability (OpenTelemetry), testing (Promptfoo), CI/CD (Tekton), and more. Join us as we get hands-on with language models and use open technologies to take control of our own AI journey!
Generative AI models are impressive, but the moment you try to run one behind a real app, the bill (and the latency) can get out of hand. A lot of that comes down to inference: what you serve the model on, how well it uses GPU memory, and how it behaves when multiple requests hit at once.
This talk is a demo-driven walkthrough of serving LLMs efficiently with vLLM, an open source inference engine that exposes an OpenAI-compatible API, so you can wire it into an app without inventing a new protocol. We’ll start by serving a baseline model and driving traffic from a simple client while watching the numbers that matter: latency (including tail latency), throughput, and memory use.
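To give a flavor of the kind of measurement the demo runs, here is a minimal sketch of summarizing tail latency. The `fake_request` function is a stand-in for a real call to the server's OpenAI-compatible endpoint (the actual demo drives a live vLLM instance):

```python
import random
import statistics
import time

def fake_request() -> float:
    """Stand-in for a call to a vLLM server's OpenAI-compatible
    endpoint; returns the observed latency in seconds."""
    start = time.perf_counter()
    # Simulate variable generation time; a real client would POST
    # to the server's /v1 completions or chat endpoint instead.
    time.sleep(random.uniform(0.001, 0.005))
    return time.perf_counter() - start

def percentile(samples, p):
    """p-th percentile via a simple nearest-rank rule."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(p / 100 * len(ranked)))
    return ranked[idx]

# Drive a burst of requests and report the numbers that matter.
latencies = [fake_request() for _ in range(200)]
print(f"p50:  {percentile(latencies, 50) * 1000:.1f} ms")
print(f"p99:  {percentile(latencies, 99) * 1000:.1f} ms")
print(f"mean: {statistics.mean(latencies) * 1000:.1f} ms")
```

Comparing p50 against p99 on the same workload is what makes tail latency visible: a serving setup can look fine on averages while the slowest requests tell a very different story.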
Then we’ll change one thing: we’ll swap in quantized versions of the same model. You’ll see what improves, what doesn’t, and where the trade-offs show up in practice. We’ll repeat the exact same workload so it’s obvious what changed and why, and we’ll cover a few practical tuning knobs in vLLM that can make or break performance under load.
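As a back-of-the-envelope illustration (not part of the demo itself), most of the memory savings from quantization come from weight storage scaling with bits per parameter. The model size here is a hypothetical example:

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, ignoring
    activations, KV cache, and framework overhead."""
    return n_params * bits_per_param / 8 / 2**30

# A hypothetical 7B-parameter model at different precisions.
for bits, name in [(16, "fp16/bf16"), (8, "int8"), (4, "int4")]:
    print(f"{name:>9}: ~{weight_memory_gib(7e9, bits):.1f} GiB")
```

Halving the bits roughly halves the weight footprint, which is why an int4 variant of the same model can fit on a much smaller GPU; whether quality and latency hold up under load is exactly what the repeated workload in the demo is designed to show.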
You’ll walk away with a clear mental model of “efficient inference,” a repeatable way to test your own setup, and a pragmatic sense of when quantization is the right move for your app, without needing an ML background.
