What is ksqlDB?
ksqlDB lets you write stream processing logic using SQL instead of Java/Python code. It runs on top of Kafka Streams but exposes a familiar SQL interface. You can filter, join, aggregate, and window Kafka data with simple SQL statements.
-- Create a stream from a Kafka topic
CREATE STREAM page_views (
user_id VARCHAR KEY,
page VARCHAR,
timestamp BIGINT
) WITH (
KAFKA_TOPIC = 'page-views',
VALUE_FORMAT = 'JSON'
);
-- Filter: only homepage views
CREATE STREAM homepage_views AS
SELECT * FROM page_views WHERE page = '/home';
-- Aggregate: count views per page in 5-minute windows
CREATE TABLE page_view_counts AS
SELECT page, COUNT(*) AS view_count
FROM page_views
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY page;
-- Query the materialized table
SELECT * FROM page_view_counts WHERE page = '/home';
Streams vs Tables in ksqlDB
| Concept | SQL Analogy | Kafka Analogy | Example |
|---|---|---|---|
| STREAM | Unbounded INSERT-only table | KStream (event log) | Page views, clicks, orders |
| TABLE | Materialized view with primary key | KTable (latest per key) | User profiles, account balances |
Push vs Pull Queries
Pull queries (like traditional SQL) return current state: SELECT * FROM users WHERE id = 42;
Push queries stream continuous results: SELECT * FROM page_views EMIT CHANGES; — stays open and pushes every new event.
Key Takeaway: ksqlDB is the fastest way to build stream processing without writing code. Use it for real-time dashboards, alerting, data enrichment, and ETL. For complex logic, use Kafka Streams directly.
Practice Exercises
Hard Production Scenario
Design a solution using these concepts for a real-world production system.
Hard Performance Analysis
Benchmark two different approaches and explain which is better and why.