SQL Warehouses vs Clusters
Why SQL Warehouses Exist
The Problem: Data analysts need fast, reliable SQL access to lakehouse data, but all-purpose Spark clusters are overkill for SQL queries -- they are expensive, slow to start, and designed for general-purpose computing.
The Solution: SQL Warehouses are purpose-built compute endpoints optimized exclusively for SQL workloads, offering faster startup, better query performance, and a familiar SQL IDE experience.
Key Insight: SQL Warehouses use the Photon engine by default; Databricks cites up to 12x better price/performance compared to standard Spark SQL on all-purpose clusters.
Real-World Analogy
Think of it like kitchen appliances:
- All-Purpose Cluster = A full kitchen with oven, stove, grill, and prep area -- can do anything but takes time to set up
- SQL Warehouse = A specialized espresso machine -- does one thing (SQL) but does it faster and more efficiently
| Feature | SQL Warehouse | All-Purpose Cluster |
|---|---|---|
| Purpose | SQL queries, dashboards, BI tools | General compute (Python, Scala, ML) |
| Engine | Photon (native C++) | Spark (JVM-based) |
| Startup Time | Seconds (serverless) to minutes | 5-10 minutes typical |
| Concurrency | High (auto-scales query slots) | Limited by cluster size |
| Pricing | SQL Compute DBU rate | All-Purpose Compute DBU rate |
| Interface | SQL Editor, JDBC/ODBC, BI tools | Notebooks, REPL, APIs |
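The routing decision the table implies can be captured as a toy rule. This is purely illustrative -- the function name, inputs, and thresholds are invented for this sketch, not an official Databricks heuristic:

```python
def recommend_compute(languages: set, workload: str) -> str:
    """Toy routing rule distilled from the comparison table above (illustrative only)."""
    # SQL-only workloads aimed at BI/dashboards fit a warehouse; anything needing
    # Python/Scala/ML capabilities falls back to an all-purpose cluster.
    if languages <= {"sql"} and workload in {"dashboard", "bi", "ad-hoc-sql"}:
        return "SQL Warehouse"
    return "All-Purpose Cluster"

recommend_compute({"sql"}, "dashboard")      # -> "SQL Warehouse"
recommend_compute({"python", "sql"}, "ml")   # -> "All-Purpose Cluster"
```

The point of the sketch: the choice hinges on *what* runs (SQL vs general code), not on data size.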
Serverless, Pro, and Classic Warehouses
Databricks offers three tiers of SQL Warehouses, each with different trade-offs between cost, performance, and management overhead.
Serverless SQL Warehouse
Fully managed by Databricks. Starts in seconds, scales instantly, zero infrastructure management. Highest DBU rate but eliminates idle costs completely.
Pro SQL Warehouse
Runs in your cloud account. Includes the Photon engine and advanced features like query federation and predictive I/O.
Classic SQL Warehouse
Basic warehouse in your cloud account. Lower cost per DBU but fewer optimization features. Being phased out in favor of Pro and Serverless.
| Feature | Serverless | Pro | Classic |
|---|---|---|---|
| Startup Time | 5-10 seconds | 2-5 minutes | 5-10 minutes |
| Infrastructure | Databricks-managed | Your cloud account | Your cloud account |
| Photon Engine | Yes | Yes | Yes |
| Intelligent Workload Mgmt | Yes | No | No |
| Query Federation | Yes | Yes | No |
| Predictive I/O | Yes | Yes | No |
| DBU Cost | Highest | Medium | Lowest |
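The "DBU Cost" row only tells half the story: a higher rate that bills zero idle time can beat a lower rate that bills an idle tail until auto-stop fires. A back-of-envelope sketch -- the DBU rates, sizes, and hours below are placeholders, not published Databricks prices:

```python
# Back-of-envelope tier cost model. Rates here are PLACEHOLDERS --
# substitute the actual numbers from your contract and region.
def monthly_cost(dbu_rate_usd, dbus_per_hour, billable_hours):
    """Spend = DBU rate x DBUs burned per hour x hours billed."""
    return dbu_rate_usd * dbus_per_hour * billable_hours

ACTIVE_HOURS = 160        # ~8 h/day x 20 business days of actual queries
IDLE_BILLED_HOURS = 60    # pro/classic keep billing until auto-stop kicks in

# Serverless bills only while queries run; classic also bills the idle tail.
serverless = monthly_cost(0.70, 12, ACTIVE_HOURS)
classic = monthly_cost(0.22, 12, ACTIVE_HOURS + IDLE_BILLED_HOURS)
```

With these made-up rates the cheaper tier flips depending on how long the warehouse sits idle -- which is exactly the trade-off the table is hinting at.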
# Create a SQL Warehouse using the SQL Warehouses API
# POST /api/2.0/sql/warehouses
import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}  # personal access token

config = {
    "name": "analytics-warehouse",
    "cluster_size": "Medium",            # 2X-Small to 4X-Large
    "min_num_clusters": 1,
    "max_num_clusters": 3,               # Auto-scale clusters for concurrency
    "auto_stop_mins": 15,                # Stop after 15 idle minutes (0 disables auto-stop)
    "warehouse_type": "PRO",             # PRO or CLASSIC; serverless also uses PRO
    "enable_serverless_compute": False,  # True = serverless (requires warehouse_type PRO)
    "spot_instance_policy": "COST_OPTIMIZED",
    "tags": {
        "custom_tags": [{"key": "team", "value": "analytics"}]
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/sql/warehouses",
    headers=headers,
    json=config,
)
response.raise_for_status()  # surface 4xx/5xx errors instead of failing silently
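Before posting a payload like the one above, it can help to sanity-check it locally. A minimal sketch -- the size names follow the API's t-shirt sizing, but the validation function itself is invented here, not part of any Databricks client:

```python
# Hypothetical client-side sanity checks on a warehouse config before POSTing.
VALID_SIZES = {
    "2X-Small", "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
}

def validate_warehouse_config(config: dict) -> list:
    """Return a list of problems; an empty list means the payload looks sane."""
    errors = []
    if config.get("cluster_size") not in VALID_SIZES:
        errors.append(f"unknown cluster_size: {config.get('cluster_size')}")
    if config.get("min_num_clusters", 1) > config.get("max_num_clusters", 1):
        errors.append("min_num_clusters exceeds max_num_clusters")
    if config.get("auto_stop_mins", 10) < 0:
        errors.append("auto_stop_mins must be >= 0 (0 disables auto-stop)")
    return errors
```

Catching a typo like `"cluster_size": "Mega"` locally is cheaper than a round-trip to the API.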
Dashboards & Alerts
Databricks SQL includes a full dashboarding and alerting system, allowing teams to build data products directly on top of their lakehouse without external BI tools.
-- Revenue Dashboard: Daily revenue trend
SELECT
date_trunc('day', order_date) AS day,
SUM(total_amount) AS daily_revenue,
COUNT(DISTINCT customer_id) AS unique_customers,
AVG(total_amount) AS avg_order_value
FROM prod_catalog.sales.orders
WHERE order_date >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY 1;
-- Alert Query: Detect revenue drops
SELECT
CASE
WHEN today_revenue < yesterday_revenue * 0.8
THEN 'ALERT: Revenue dropped 20%+'
ELSE 'OK'
END AS status,
today_revenue,
yesterday_revenue
FROM (
SELECT
SUM(CASE WHEN order_date = current_date() THEN total_amount END) AS today_revenue,
SUM(CASE WHEN order_date = current_date() - 1 THEN total_amount END) AS yesterday_revenue
FROM prod_catalog.sales.orders
);
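The alert's CASE expression boils down to a simple threshold comparison. A plain-Python mirror of that logic (a sketch for intuition; the warehouse evaluates this in SQL, not Python):

```python
def revenue_status(today_revenue, yesterday_revenue, threshold=0.8):
    """Mirror of the SQL CASE: alert when today falls below 80% of yesterday."""
    if today_revenue is None or yesterday_revenue is None:
        # SUM over zero matching rows yields NULL; the SQL CASE's NULL
        # comparison also falls through to the ELSE branch ('OK').
        return "OK"
    if today_revenue < yesterday_revenue * threshold:
        return "ALERT: Revenue dropped 20%+"
    return "OK"
```

Note that the SQL compares a partial day (today so far) against a full day of yesterday's revenue, so in practice this alert fires conservatively early in the day.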
-- Top Products with Window Functions
-- Window results can't be filtered directly in WHERE, so wrap the ranked query
SELECT *
FROM (
  SELECT
    product_name,
    category,
    total_sold,
    RANK() OVER (PARTITION BY category ORDER BY total_sold DESC) AS category_rank
  FROM (
    SELECT product_name, category, SUM(quantity) AS total_sold
    FROM prod_catalog.sales.order_items
    GROUP BY product_name, category
  )
)
WHERE category_rank <= 5;
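For intuition about what `RANK() OVER (PARTITION BY ...)` computes -- including its gap-on-ties behavior -- here is the same ranking in plain Python. A sketch only; this is not how the warehouse executes it:

```python
from collections import defaultdict

def top_n_per_category(rows, n=5):
    """rows: iterable of (product_name, category, total_sold) tuples.
    Mirrors RANK() OVER (PARTITION BY category ORDER BY total_sold DESC):
    ties share a rank, and the next distinct value skips ahead."""
    by_cat = defaultdict(list)
    for name, cat, sold in rows:
        by_cat[cat].append((name, sold))
    result = {}
    for cat, items in by_cat.items():
        items.sort(key=lambda item: -item[1])
        kept, rank, prev = [], 0, object()
        for pos, (name, sold) in enumerate(items, start=1):
            if sold != prev:
                rank, prev = pos, sold  # new rank only when the value changes
            if rank <= n:
                kept.append((name, sold, rank))
        result[cat] = kept
    return result
```

Because ties share a rank, `category_rank <= 5` can return more than five rows per category -- the SQL version behaves the same way.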
Query Optimization
SQL Warehouses include powerful query optimization features that can dramatically improve performance with minimal effort.
-- 1. Use Delta Lake statistics for pruning
ANALYZE TABLE prod_catalog.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;
-- 2. Optimize file layout with Z-ORDER
OPTIMIZE prod_catalog.sales.orders
ZORDER BY (customer_id, order_date);
-- 3. Use result caching -- repeated queries return instantly
-- Result cache is ON by default for SQL Warehouses
SET use_cached_result = true;
-- 4. Partition large tables by date
CREATE TABLE prod_catalog.sales.events_partitioned
USING delta
PARTITIONED BY (event_date)
AS SELECT * FROM prod_catalog.sales.events;
-- 5. Use EXPLAIN to analyze query plans
EXPLAIN FORMATTED
SELECT customer_id, SUM(amount)
FROM prod_catalog.sales.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;
-- 6. Materialized Views for repeated complex queries
CREATE MATERIALIZED VIEW prod_catalog.sales.daily_summary
AS SELECT
date_trunc('day', order_date) AS day,
SUM(total_amount) AS revenue,
COUNT(*) AS order_count
FROM prod_catalog.sales.orders
GROUP BY 1;
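Why partitioning (step 4) pays off: when a query filters on the partition column, the engine skips every non-matching partition entirely. A back-of-envelope estimate with made-up numbers:

```python
def pruned_scan_gb(table_gb, total_partitions, partitions_hit):
    # Assumes data is spread evenly across partitions -- a simplification;
    # real tables are usually skewed toward recent dates.
    return table_gb * partitions_hit / total_partitions

# A 30-day dashboard filter over one year of daily partitions:
scanned = pruned_scan_gb(table_gb=500, total_partitions=365, partitions_hit=30)
# reads roughly 41 GB instead of the full 500 GB
```

The same skip-what-you-don't-need idea drives Z-ORDER and predictive I/O, just at file and page granularity instead of partition granularity.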
Query Performance Checklist
| Technique | Impact | When to Use |
|---|---|---|
| Result Caching | Instant repeated queries | Dashboard queries that run frequently |
| Z-ORDER | 10-100x faster point lookups | Tables frequently filtered on specific columns |
| COMPUTE STATISTICS | Better query plans | After large data loads or schema changes |
| Partitioning | Skip irrelevant data files | Large tables with date-based access patterns |
| Materialized Views | Pre-computed aggregations | Complex aggregations queried repeatedly |
Practice Problems
Problem 1: Warehouse Selection (Easy)
Your company has three use cases: (1) a real-time executive dashboard refreshed every 5 minutes, (2) ad-hoc analyst queries during business hours, and (3) a BI tool (Tableau) used by 50 concurrent users. Which warehouse type and configuration would you recommend for each?
Problem 2: Query Optimization (Easy)
A dashboard query scanning a 500GB orders table takes 45 seconds. The query filters by customer_id and order_date. What optimization steps would you take to get it under 5 seconds?
Problem 3: Cost vs Performance Trade-off (Medium)
Your team spends $8,000/month on a Serverless SQL Warehouse that runs 24/7. Usage analysis shows queries only happen during business hours (8am-6pm, Mon-Fri). Redesign the setup to cut costs by 50% without sacrificing user experience.
Quick Reference
| Concept | Description | Key Point |
|---|---|---|
| SQL Warehouse | SQL-optimized compute endpoint | Uses Photon engine, JDBC/ODBC compatible |
| Serverless | Databricks-managed infrastructure | Instant start, zero idle cost, highest DBU rate |
| Pro | Customer cloud with advanced features | Best balance of features and cost |
| Classic | Basic warehouse, being deprecated | Lowest DBU rate; lacks predictive I/O and federation |
| Result Caching | Caches query results for reuse | Enabled by default on SQL Warehouses |
| Query Federation | Query external databases from SQL Warehouse | Supports PostgreSQL, MySQL, SQL Server |