Databricks SQL & SQL Warehouses

Easy · 25 min read

SQL Warehouses vs Clusters

Why SQL Warehouses Exist

The Problem: Data analysts need fast, reliable SQL access to lakehouse data, but all-purpose Spark clusters are overkill for SQL queries -- they are expensive, slow to start, and designed for general-purpose computing.

The Solution: SQL Warehouses are purpose-built compute endpoints optimized exclusively for SQL workloads, offering faster startup, better query performance, and a familiar SQL IDE experience.

Key Insight: SQL Warehouses use the Photon engine by default, delivering up to 12x faster queries compared to standard Spark SQL on all-purpose clusters.

Real-World Analogy

Think of it like kitchen appliances:

  • All-Purpose Cluster = A full kitchen with oven, stove, grill, and prep area -- can do anything but takes time to set up
  • SQL Warehouse = A specialized espresso machine -- does one thing (SQL) but does it faster and more efficiently

| Feature | SQL Warehouse | All-Purpose Cluster |
| --- | --- | --- |
| Purpose | SQL queries, dashboards, BI tools | General compute (Python, Scala, ML) |
| Engine | Photon (native C++) | Spark (JVM-based) |
| Startup Time | Seconds (serverless) to minutes | 5-10 minutes typical |
| Concurrency | High (auto-scales query slots) | Limited by cluster size |
| Pricing | SQL Compute DBU rate | All-Purpose Compute DBU rate |
| Interface | SQL Editor, JDBC/ODBC, BI tools | Notebooks, REPL, APIs |
SQL Warehouse Architecture
Architecture overview (diagram): clients (SQL Editor, BI tools, JDBC/ODBC, REST API, dashboards) connect through a SQL gateway that handles query routing, queue management, and result caching. The warehouse itself runs the Photon engine and query optimizer across one or more auto-scaled compute clusters, which read Delta tables from cloud storage governed by Unity Catalog.

Serverless, Pro, and Classic Warehouses

Databricks offers three tiers of SQL Warehouses, each with different trade-offs between cost, performance, and management overhead.

Serverless SQL Warehouse

Fully managed by Databricks. Starts in seconds, scales instantly, zero infrastructure management. Highest DBU rate but eliminates idle costs completely.

Pro SQL Warehouse

Runs in your cloud account. Includes Photon engine and advanced features like query federation, intelligent workload management, and predictive I/O.

Classic SQL Warehouse

Basic warehouse in your cloud account. Lower cost per DBU but fewer optimization features. Being phased out in favor of Pro and Serverless.

| Feature | Serverless | Pro | Classic |
| --- | --- | --- | --- |
| Startup Time | 5-10 seconds | 2-5 minutes | 5-10 minutes |
| Infrastructure | Databricks-managed | Your cloud account | Your cloud account |
| Photon Engine | Yes | Yes | No |
| Intelligent Workload Mgmt | Yes | Yes | No |
| Query Federation | Yes | Yes | No |
| Predictive I/O | Yes | Yes | No |
| DBU Cost | Highest | Medium | Lowest |
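
The right tier depends heavily on utilization, not just the per-DBU rate. A back-of-envelope sketch in Python (the DBU rates and sizing figures below are hypothetical placeholders, not real Databricks pricing):

```python
# Back-of-envelope warehouse cost comparison.
# Rates are HYPOTHETICAL -- check the Databricks pricing page for real
# per-tier, per-cloud, per-region DBU rates.
SERVERLESS_RATE = 0.70   # $/DBU, hypothetical
PRO_RATE = 0.55          # $/DBU, hypothetical

def monthly_cost(rate, dbus_per_hour, active_hours, idle_hours=0.0):
    """Serverless bills only while queries run; a Pro warehouse also
    burns DBUs while sitting idle before auto-stop kicks in."""
    return rate * dbus_per_hour * (active_hours + idle_hours)

# 10 active hours/day, 22 working days, a Medium warehouse ~ 8 DBU/hour
active = 10 * 22
serverless = monthly_cost(SERVERLESS_RATE, 8, active)              # no idle cost
pro = monthly_cost(PRO_RATE, 8, active, idle_hours=0.25 * active)  # 25% idle

print(f"Serverless: ${serverless:,.2f}/month")
print(f"Pro:        ${pro:,.2f}/month")
```

With even a modest amount of idle time, the higher serverless rate can come out close to (or cheaper than) Pro, which is why utilization analysis should come before tier selection.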
Python - Creating a SQL Warehouse via API
# Create a SQL Warehouse using the SQL Warehouses API:
# POST /api/2.0/sql/warehouses
import os

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

config = {
    "name": "analytics-warehouse",
    "cluster_size": "Medium",            # 2X-Small to 4X-Large
    "min_num_clusters": 1,
    "max_num_clusters": 3,               # Auto-scale clusters for concurrency
    "auto_stop_mins": 15,                # Stop after 15 idle minutes
    "warehouse_type": "PRO",             # PRO or CLASSIC
    "enable_serverless_compute": False,  # Set True (with PRO) for serverless
    "spot_instance_policy": "COST_OPTIMIZED",
    "tags": {
        "custom_tags": [{"key": "team", "value": "analytics"}]
    }
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/sql/warehouses",
    headers=headers,
    json=config,
)
response.raise_for_status()
print(response.json()["id"])             # ID of the new warehouse

Dashboards & Alerts

Databricks SQL includes a full dashboarding and alerting system, allowing teams to build data products directly on top of their lakehouse without external BI tools.

SQL - Dashboard Query Examples
-- Revenue Dashboard: Daily revenue trend
SELECT
    date_trunc('day', order_date) AS day,
    SUM(total_amount) AS daily_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    AVG(total_amount) AS avg_order_value
FROM prod_catalog.sales.orders
WHERE order_date >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY 1;

-- Alert Query: Detect revenue drops
SELECT
    CASE
        WHEN today_revenue < yesterday_revenue * 0.8
        THEN 'ALERT: Revenue dropped 20%+'
        ELSE 'OK'
    END AS status,
    today_revenue,
    yesterday_revenue
FROM (
    SELECT
        SUM(CASE WHEN order_date = current_date() THEN total_amount END) AS today_revenue,
        SUM(CASE WHEN order_date = current_date() - 1 THEN total_amount END) AS yesterday_revenue
    FROM prod_catalog.sales.orders
);

-- Top Products with Window Functions
-- Note: a window-function alias cannot be filtered in WHERE (WHERE is
-- evaluated before window functions); use QUALIFY or an outer query.
SELECT
    product_name,
    category,
    total_sold,
    RANK() OVER (PARTITION BY category ORDER BY total_sold DESC) AS category_rank
FROM (
    SELECT product_name, category, SUM(quantity) AS total_sold
    FROM prod_catalog.sales.order_items
    GROUP BY product_name, category
)
QUALIFY category_rank <= 5;
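
The threshold logic in the revenue alert query above is easy to sanity-check in plain Python (illustrative only; in production the warehouse evaluates the SQL version on a schedule):

```python
def revenue_status(today_revenue, yesterday_revenue, drop_threshold=0.8):
    """Mirror of the SQL alert: flag when today's revenue falls more
    than 20% below yesterday's (i.e., below 0.8 * yesterday)."""
    if today_revenue < yesterday_revenue * drop_threshold:
        return "ALERT: Revenue dropped 20%+"
    return "OK"

print(revenue_status(7_500, 10_000))   # 25% drop -> alert fires
print(revenue_status(9_500, 10_000))   # 5% drop -> OK
```

Keeping the threshold as a parameter makes it trivial to tune the alert sensitivity without rewriting the comparison.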

Query Optimization

SQL Warehouses include powerful query optimization features that can dramatically improve performance with minimal effort.

SQL - Optimization Techniques
-- 1. Use Delta Lake statistics for pruning
ANALYZE TABLE prod_catalog.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- 2. Optimize file layout with Z-ORDER
OPTIMIZE prod_catalog.sales.orders
ZORDER BY (customer_id, order_date);

-- 3. Use result caching -- repeated queries return instantly
-- Result cache is ON by default for SQL Warehouses
SET use_cached_result = true;

-- 4. Partition large tables by date
CREATE TABLE prod_catalog.sales.events_partitioned
USING delta
PARTITIONED BY (event_date)
AS SELECT * FROM prod_catalog.sales.events;

-- 5. Use EXPLAIN to analyze query plans
EXPLAIN FORMATTED
SELECT customer_id, SUM(amount)
FROM prod_catalog.sales.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

-- 6. Materialized Views for repeated complex queries
CREATE MATERIALIZED VIEW prod_catalog.sales.daily_summary
AS SELECT
    date_trunc('day', order_date) AS day,
    SUM(total_amount) AS revenue,
    COUNT(*) AS order_count
FROM prod_catalog.sales.orders
GROUP BY 1;
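
Conceptually, the result cache behaves like a memo table keyed by the query text: an identical query returns its stored result instead of re-executing. This toy sketch illustrates the idea only; it is not how the warehouse implements caching internally, and the real cache also invalidates when the underlying Delta data changes:

```python
# Toy model of a query result cache: identical query text returns the
# cached result instantly instead of re-executing.
_result_cache = {}

def run_query(sql_text, execute):
    """Return (result, cache_hit). Executes and caches on a miss."""
    key = sql_text.strip().lower()       # crude normalization
    if key in _result_cache:
        return _result_cache[key], True  # cache hit
    result = execute(sql_text)
    _result_cache[key] = result
    return result, False                 # cache miss

# Simulated "execution" counts how often real work happens.
calls = {"n": 0}
def fake_execute(sql_text):
    calls["n"] += 1
    return [("2024-01-01", 1250.0)]

run_query("SELECT day, revenue FROM daily_summary", fake_execute)
result, hit = run_query("select day, revenue from daily_summary", fake_execute)
print(hit, calls["n"])   # True 1 -- second run served from cache
```

This is why frequently refreshed dashboards are cheap: each widget's query usually hits the cache rather than the compute cluster.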

Query Performance Checklist

| Technique | Impact | When to Use |
| --- | --- | --- |
| Result Caching | Instant repeated queries | Dashboard queries that run frequently |
| Z-ORDER | 10-100x faster point lookups | Tables frequently filtered on specific columns |
| COMPUTE STATISTICS | Better query plans | After large data loads or schema changes |
| Partitioning | Skip irrelevant data files | Large tables with date-based access patterns |
| Materialized Views | Pre-computed aggregations | Complex aggregations queried repeatedly |
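
Why partitioning and Z-ORDER help: the engine keeps min/max statistics per data file and skips any file whose value range cannot match the filter. A simplified Python model of that pruning decision (file names and date ranges are made up for illustration):

```python
# Each data file carries min/max stats for order_date. A filter like
# order_date >= '2024-06-01' only needs files whose max >= the bound.
# ISO date strings compare correctly as plain strings.
files = [
    {"path": "part-001", "min_date": "2024-01-01", "max_date": "2024-03-31"},
    {"path": "part-002", "min_date": "2024-04-01", "max_date": "2024-06-30"},
    {"path": "part-003", "min_date": "2024-07-01", "max_date": "2024-09-30"},
]

def files_to_scan(files, lower_bound):
    """Keep only files whose max value can satisfy `col >= lower_bound`."""
    return [f["path"] for f in files if f["max_date"] >= lower_bound]

scanned = files_to_scan(files, "2024-06-01")
print(scanned)                      # part-001 is skipped entirely
print(f"skipped {len(files) - len(scanned)} of {len(files)} files")
```

Partitioning makes these ranges align with directory boundaries, and Z-ORDER tightens the min/max ranges within each file, so more files can be skipped for the same filter.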

Practice Problems

Problem 1: Warehouse Selection

Easy

Your company has three use cases: (1) a real-time executive dashboard refreshed every 5 minutes, (2) ad-hoc analyst queries during business hours, and (3) a BI tool (Tableau) used by 50 concurrent users. Which warehouse type and configuration would you recommend for each?

Problem 2: Query Optimization

Easy

A dashboard query scanning a 500GB orders table takes 45 seconds. The query filters by customer_id and order_date. What optimization steps would you take to get it under 5 seconds?

Problem 3: Cost vs Performance Trade-off

Medium

Your team spends $8,000/month on a Serverless SQL Warehouse that runs 24/7. Usage analysis shows queries only happen during business hours (8am-6pm, Mon-Fri). Redesign the setup to cut costs by 50% without sacrificing user experience.

Quick Reference

| Concept | Description | Key Point |
| --- | --- | --- |
| SQL Warehouse | SQL-optimized compute endpoint | Uses Photon engine, JDBC/ODBC compatible |
| Serverless | Databricks-managed infrastructure | Instant start, zero idle cost, highest DBU rate |
| Pro | Customer cloud with advanced features | Best balance of features and cost |
| Classic | Basic warehouse, being deprecated | Lowest DBU rate, no Photon |
| Result Caching | Caches query results for reuse | Enabled by default on SQL Warehouses |
| Query Federation | Query external databases from SQL Warehouse | Supports PostgreSQL, MySQL, SQL Server |