Optimal Data Lake for analytics: Apache Kafka and ClickHouse

Data & AI

Apache Kafka is amazing at handling real-time data feeds. However, in certain cases we need to come back to older records to analyse and process them at a later time. This is challenging because storing records indefinitely and running large scans over data in Apache Kafka is not optimal.

ClickHouse, on the other hand, is a scalable and reliable storage system designed to handle petabytes of data and, at the same time, a powerful open-source tool for fast online analytical processing. It was initially developed by Yandex and is now used by many companies for data analytics. What's more, ClickHouse has a built-in table engine to publish and subscribe to Apache Kafka data feeds.

Hence, we can use Apache Kafka and ClickHouse in tandem: older records transition to a data lake where we can perform analytics at scale, while Kafka keeps providing fresh data.
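As a rough sketch of how this pipeline can be wired up with ClickHouse's Kafka table engine (table names, columns, broker address, and topic are hypothetical, and the exact schema would depend on your data):

```sql
-- 1. A Kafka engine table acts as a consumer of the topic.
CREATE TABLE events_queue
(
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',   -- assumed broker address
    kafka_topic_list  = 'events',           -- assumed topic name
    kafka_group_name  = 'clickhouse_consumer',
    kafka_format      = 'JSONEachRow';

-- 2. A MergeTree table stores the records long-term for analytics.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = MergeTree
ORDER BY (event_type, event_time);

-- 3. A materialized view continuously moves rows from the Kafka
--    consumer into long-term storage.
CREATE MATERIALIZED VIEW events_consumer TO events
AS SELECT event_time, user_id, event_type FROM events_queue;
```

With this setup, records flow from the Kafka topic into the MergeTree table automatically, so Kafka retention can stay short while the full history remains queryable in ClickHouse.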

In this talk you’ll learn how to use Apache Kafka together with ClickHouse and how to query the data stored in the data lake. This session is for those who want to perform analytics with fast response times over huge volumes of data without the need to downsample them.
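To give a flavour of the kind of queries this enables (the table and column names here are hypothetical), an aggregation over months of history can be expressed directly, with no downsampling step:

```sql
-- Hourly event counts over the last 30 days, straight from full-resolution data.
SELECT
    event_type,
    toStartOfHour(event_time) AS hour,
    count() AS events
FROM events
WHERE event_time >= now() - INTERVAL 30 DAY
GROUP BY event_type, hour
ORDER BY hour;
```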

Olena Kutsenko


Olena is a software engineer and a developer advocate currently working at Aiven. She is passionate about open source, data, sustainable software development and teamwork. Her knowledge is shaped by the expertise she acquired working at companies such as Nokia, HERE Technologies and AWS, and by the countries she was lucky to live in: Ukraine, Sweden, Spain and Germany.