SQL Warehouses vs Clusters
Why SQL Warehouses Exist
The Problem: Data analysts need fast, reliable SQL access to lakehouse data, but all-purpose Spark clusters are overkill for SQL queries -- they are expensive, slow to start, and designed for general-purpose computing.
The Solution: SQL Warehouses are purpose-built compute endpoints optimized exclusively for SQL workloads, offering faster startup, better query performance, and a familiar SQL IDE experience.
Key Insight: SQL Warehouses use the Photon engine by default; Databricks cites up to 12x better price/performance compared to standard Spark SQL on all-purpose clusters.
Real-World Analogy
Think of it like kitchen appliances:
- All-Purpose Cluster = A full kitchen with oven, stove, grill, and prep area -- can do anything but takes time to set up
- SQL Warehouse = A specialized espresso machine -- does one thing (SQL) but does it faster and more efficiently
| Feature | SQL Warehouse | All-Purpose Cluster |
|---|---|---|
| Purpose | SQL queries, dashboards, BI tools | General compute (Python, Scala, ML) |
| Engine | Photon (native C++) | Spark (JVM-based) |
| Startup Time | Seconds (serverless) to minutes | 5-10 minutes typical |
| Concurrency | High (auto-scales query slots) | Limited by cluster size |
| Pricing | SQL Compute DBU rate | All-Purpose Compute DBU rate |
| Interface | SQL Editor, JDBC/ODBC, BI tools | Notebooks, REPL, APIs |
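The routing decision the table implies can be captured as a toy rule. This is purely illustrative -- the function name, inputs, and thresholds are invented for this sketch, not an official Databricks heuristic:

```python
def recommend_compute(languages: set, workload: str) -> str:
    """Toy routing rule distilled from the comparison table above (illustrative only)."""
    # SQL-only workloads aimed at BI/dashboards fit a warehouse; anything needing
    # Python/Scala/ML capabilities falls back to an all-purpose cluster.
    if languages <= {"sql"} and workload in {"dashboard", "bi", "ad-hoc-sql"}:
        return "SQL Warehouse"
    return "All-Purpose Cluster"

recommend_compute({"sql"}, "dashboard")      # -> "SQL Warehouse"
recommend_compute({"python", "sql"}, "ml")   # -> "All-Purpose Cluster"
```

The point of the sketch: the choice hinges on *what* runs (SQL vs general code), not on data size.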
Serverless, Pro, and Classic Warehouses
Databricks offers three tiers of SQL Warehouses, each with different trade-offs between cost, performance, and management overhead.
Serverless SQL Warehouse
Fully managed by Databricks. Starts in seconds, scales instantly, zero infrastructure management. Highest DBU rate but eliminates idle costs completely.
Pro SQL Warehouse
Runs in your cloud account. Includes the Photon engine and advanced features like query federation and predictive I/O.
Classic SQL Warehouse
Basic warehouse in your cloud account. Lower cost per DBU but fewer optimization features. Being phased out in favor of Pro and Serverless.
| Feature | Serverless | Pro | Classic |
|---|---|---|---|
| Startup Time | 5-10 seconds | 2-5 minutes | 5-10 minutes |
| Infrastructure | Databricks-managed | Your cloud account | Your cloud account |
| Photon Engine | Yes | Yes | Yes |
| Intelligent Workload Mgmt | Yes | No | No |
| Query Federation | Yes | Yes | No |
| Predictive I/O | Yes | Yes | No |
| DBU Cost | Highest | Medium | Lowest |
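The "DBU Cost" row only tells half the story: a higher rate that bills zero idle time can beat a lower rate that bills an idle tail until auto-stop fires. A back-of-envelope sketch -- the DBU rates, sizes, and hours below are placeholders, not published Databricks prices:

```python
# Back-of-envelope tier cost model. Rates here are PLACEHOLDERS --
# substitute the actual numbers from your contract and region.
def monthly_cost(dbu_rate_usd, dbus_per_hour, billable_hours):
    """Spend = DBU rate x DBUs burned per hour x hours billed."""
    return dbu_rate_usd * dbus_per_hour * billable_hours

ACTIVE_HOURS = 160        # ~8 h/day x 20 business days of actual queries
IDLE_BILLED_HOURS = 60    # pro/classic keep billing until auto-stop kicks in

# Serverless bills only while queries run; classic also bills the idle tail.
serverless = monthly_cost(0.70, 12, ACTIVE_HOURS)
classic = monthly_cost(0.22, 12, ACTIVE_HOURS + IDLE_BILLED_HOURS)
```

With these made-up rates the cheaper tier flips depending on how long the warehouse sits idle -- which is exactly the trade-off the table is hinting at.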
# Create a SQL Warehouse using the SQL Warehouses API
# POST /api/2.0/sql/warehouses
import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}  # personal access token

config = {
    "name": "analytics-warehouse",
    "cluster_size": "Medium",            # 2X-Small to 4X-Large
    "min_num_clusters": 1,
    "max_num_clusters": 3,               # Auto-scale clusters for concurrency
    "auto_stop_mins": 15,                # Stop after 15 idle minutes (0 disables auto-stop)
    "warehouse_type": "PRO",             # PRO or CLASSIC; serverless also uses PRO
    "enable_serverless_compute": False,  # True = serverless (requires warehouse_type PRO)
    "spot_instance_policy": "COST_OPTIMIZED",
    "tags": {
        "custom_tags": [{"key": "team", "value": "analytics"}]
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/sql/warehouses",
    headers=headers,
    json=config,
)
response.raise_for_status()  # surface 4xx/5xx errors instead of failing silently
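Before posting a payload like the one above, it can help to sanity-check it locally. A minimal sketch -- the size names follow the API's t-shirt sizing, but the validation function itself is invented here, not part of any Databricks client:

```python
# Hypothetical client-side sanity checks on a warehouse config before POSTing.
VALID_SIZES = {
    "2X-Small", "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
}

def validate_warehouse_config(config: dict) -> list:
    """Return a list of problems; an empty list means the payload looks sane."""
    errors = []
    if config.get("cluster_size") not in VALID_SIZES:
        errors.append(f"unknown cluster_size: {config.get('cluster_size')}")
    if config.get("min_num_clusters", 1) > config.get("max_num_clusters", 1):
        errors.append("min_num_clusters exceeds max_num_clusters")
    if config.get("auto_stop_mins", 10) < 0:
        errors.append("auto_stop_mins must be >= 0 (0 disables auto-stop)")
    return errors
```

Catching a typo like `"cluster_size": "Mega"` locally is cheaper than a round-trip to the API.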
Dashboards & Alerts
Databricks SQL includes a full dashboarding and alerting system, allowing teams to build data products directly on top of their lakehouse without external BI tools.
-- Revenue Dashboard: Daily revenue trend
SELECT
date_trunc('day', order_date) AS day,
SUM(total_amount) AS daily_revenue,
COUNT(DISTINCT customer_id) AS unique_customers,
AVG(total_amount) AS avg_order_value
FROM prod_catalog.sales.orders
WHERE order_date >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY 1;
-- Alert Query: Detect revenue drops
SELECT
CASE
WHEN today_revenue < yesterday_revenue * 0.8
THEN 'ALERT: Revenue dropped 20%+'
ELSE 'OK'
END AS status,
today_revenue,
yesterday_revenue
FROM (
SELECT
SUM(CASE WHEN order_date = current_date() THEN total_amount END) AS today_revenue,
SUM(CASE WHEN order_date = current_date() - 1 THEN total_amount END) AS yesterday_revenue
FROM prod_catalog.sales.orders
);
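The alert's CASE expression boils down to a simple threshold comparison. A plain-Python mirror of that logic (a sketch for intuition; the warehouse evaluates this in SQL, not Python):

```python
def revenue_status(today_revenue, yesterday_revenue, threshold=0.8):
    """Mirror of the SQL CASE: alert when today falls below 80% of yesterday."""
    if today_revenue is None or yesterday_revenue is None:
        # SUM over zero matching rows yields NULL; the SQL CASE's NULL
        # comparison also falls through to the ELSE branch ('OK').
        return "OK"
    if today_revenue < yesterday_revenue * threshold:
        return "ALERT: Revenue dropped 20%+"
    return "OK"
```

Note that the SQL compares a partial day (today so far) against a full day of yesterday's revenue, so in practice this alert fires conservatively early in the day.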
-- Top Products with Window Functions
-- Window results can't be filtered directly in WHERE, so wrap the ranked query
SELECT *
FROM (
  SELECT
    product_name,
    category,
    total_sold,
    RANK() OVER (PARTITION BY category ORDER BY total_sold DESC) AS category_rank
  FROM (
    SELECT product_name, category, SUM(quantity) AS total_sold
    FROM prod_catalog.sales.order_items
    GROUP BY product_name, category
  )
)
WHERE category_rank <= 5;
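For intuition about what `RANK() OVER (PARTITION BY ...)` computes -- including its gap-on-ties behavior -- here is the same ranking in plain Python. A sketch only; this is not how the warehouse executes it:

```python
from collections import defaultdict

def top_n_per_category(rows, n=5):
    """rows: iterable of (product_name, category, total_sold) tuples.
    Mirrors RANK() OVER (PARTITION BY category ORDER BY total_sold DESC):
    ties share a rank, and the next distinct value skips ahead."""
    by_cat = defaultdict(list)
    for name, cat, sold in rows:
        by_cat[cat].append((name, sold))
    result = {}
    for cat, items in by_cat.items():
        items.sort(key=lambda item: -item[1])
        kept, rank, prev = [], 0, object()
        for pos, (name, sold) in enumerate(items, start=1):
            if sold != prev:
                rank, prev = pos, sold  # new rank only when the value changes
            if rank <= n:
                kept.append((name, sold, rank))
        result[cat] = kept
    return result
```

Because ties share a rank, `category_rank <= 5` can return more than five rows per category -- the SQL version behaves the same way.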
Query Optimization
SQL Warehouses include powerful query optimization features that can dramatically improve performance with minimal effort.
-- 1. Use Delta Lake statistics for pruning
ANALYZE TABLE prod_catalog.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;
-- 2. Optimize file layout with Z-ORDER
OPTIMIZE prod_catalog.sales.orders
ZORDER BY (customer_id, order_date);
-- 3. Use result caching -- repeated queries return instantly
-- Result cache is ON by default for SQL Warehouses
SET use_cached_result = true;
-- 4. Partition large tables by date
CREATE TABLE prod_catalog.sales.events_partitioned
USING delta
PARTITIONED BY (event_date)
AS SELECT * FROM prod_catalog.sales.events;
-- 5. Use EXPLAIN to analyze query plans
EXPLAIN FORMATTED
SELECT customer_id, SUM(amount)
FROM prod_catalog.sales.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;
-- 6. Materialized Views for repeated complex queries
CREATE MATERIALIZED VIEW prod_catalog.sales.daily_summary
AS SELECT
date_trunc('day', order_date) AS day,
SUM(total_amount) AS revenue,
COUNT(*) AS order_count
FROM prod_catalog.sales.orders
GROUP BY 1;
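Why partitioning (step 4) pays off: when a query filters on the partition column, the engine skips every non-matching partition entirely. A back-of-envelope estimate with made-up numbers:

```python
def pruned_scan_gb(table_gb, total_partitions, partitions_hit):
    # Assumes data is spread evenly across partitions -- a simplification;
    # real tables are usually skewed toward recent dates.
    return table_gb * partitions_hit / total_partitions

# A 30-day dashboard filter over one year of daily partitions:
scanned = pruned_scan_gb(table_gb=500, total_partitions=365, partitions_hit=30)
# reads roughly 41 GB instead of the full 500 GB
```

The same skip-what-you-don't-need idea drives Z-ORDER and predictive I/O, just at file and page granularity instead of partition granularity.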
Query Performance Checklist
| Technique | Impact | When to Use |
|---|---|---|
| Result Caching | Instant repeated queries | Dashboard queries that run frequently |
| Z-ORDER | 10-100x faster point lookups | Tables frequently filtered on specific columns |
| COMPUTE STATISTICS | Better query plans | After large data loads or schema changes |
| Partitioning | Skip irrelevant data files | Large tables with date-based access patterns |
| Materialized Views | Pre-computed aggregations | Complex aggregations queried repeatedly |
Practice Problems
Problem 1: Warehouse Selection (Easy)
Your company has three use cases: (1) a real-time executive dashboard refreshed every 5 minutes, (2) ad-hoc analyst queries during business hours, and (3) a BI tool (Tableau) used by 50 concurrent users. Which warehouse type and configuration would you recommend for each?
Problem 2: Query Optimization (Easy)
A dashboard query scanning a 500GB orders table takes 45 seconds. The query filters by customer_id and order_date. What optimization steps would you take to get it under 5 seconds?
Problem 3: Cost vs Performance Trade-off (Medium)
Your team spends $8,000/month on a Serverless SQL Warehouse that runs 24/7. Usage analysis shows queries only happen during business hours (8am-6pm, Mon-Fri). Redesign the setup to cut costs by 50% without sacrificing user experience.
Quick Reference
| Concept | Description | Key Point |
|---|---|---|
| SQL Warehouse | SQL-optimized compute endpoint | Uses Photon engine, JDBC/ODBC compatible |
| Serverless | Databricks-managed infrastructure | Instant start, zero idle cost, highest DBU rate |
| Pro | Customer cloud with advanced features | Best balance of features and cost |
| Classic | Basic warehouse, being deprecated | Lowest DBU rate; lacks predictive I/O and federation |
| Result Caching | Caches query results for reuse | Enabled by default on SQL Warehouses |
| Query Federation | Query external databases from SQL Warehouse | Supports PostgreSQL, MySQL, SQL Server |