Databricks Best Practices & Cost Optimization



Why Cost Optimization Matters

The Problem: Databricks costs can escalate quickly when clusters are over-provisioned, left running idle, or when workloads are not optimized. A single misconfigured cluster can waste thousands of dollars per month.

The Solution: Systematic cost optimization through cluster right-sizing, spot instances, auto-scaling, cluster policies, and workload-aware compute selection can reduce Databricks spend by 30-60%.

Real Impact: Organizations that implement these practices typically save $50K-$500K annually depending on their Databricks footprint.

Cost Optimization Decision Tree
What is the workload?

ETL / Batch Jobs:
  • Job clusters (auto-terminate)
  • Spot instances for workers
  • Auto-scaling 2-20 workers
  • Photon for SQL-heavy ETL

Interactive / Dev:
  • All-purpose clusters
  • Auto-terminate: 30 min
  • Single node for small data
  • Shared clusters for teams

SQL Analytics:
  • SQL Warehouses (serverless)
  • Scale-to-zero when idle
  • Auto-scaling by query load
  • T-shirt sizing (2XS to 4XL)

DBU cost comparison (approximate):
  • Jobs Compute: $0.15/DBU + cloud VM cost
  • All-Purpose: $0.40/DBU + cloud VM cost
  • SQL Serverless: $0.22/DBU (VM cost included)

The most impactful cost optimization is choosing the right compute type for each workload. Job clusters cost 60% less than all-purpose clusters per DBU. SQL Warehouses with serverless eliminate idle costs entirely.

Workload Type     | Recommended Compute          | Auto-Terminate        | Spot Workers
Scheduled ETL     | Job clusters                 | Immediate (job scope) | Yes (50-100%)
Ad-hoc Analysis   | All-purpose or SQL Warehouse | 30 min                | Optional
Streaming         | Job clusters (always-on)     | Never (monitored)     | Workers only
ML Training       | Job clusters with GPU        | Immediate             | Workers only
Dashboard Queries | SQL Warehouse (serverless)   | Scale-to-zero         | N/A (managed)
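To make the rate difference concrete, here is a rough monthly comparison using the approximate DBU rates quoted above. VM charges are excluded, and the workload size (500 DBUs/day) is an illustrative assumption, not a figure from this article:

```python
# Approximate list rates from the DBU cost comparison above
JOBS_RATE = 0.15         # $/DBU, jobs compute
ALL_PURPOSE_RATE = 0.40  # $/DBU, all-purpose compute

def monthly_dbu_cost(dbus_per_day: float, rate: float, days: int = 30) -> float:
    """Platform (DBU) cost only, before cloud VM charges."""
    return dbus_per_day * rate * days

all_purpose = monthly_dbu_cost(500, ALL_PURPOSE_RATE)   # ~$6,000/month
jobs = monthly_dbu_cost(500, JOBS_RATE)                 # ~$2,250/month
savings_pct = (all_purpose - jobs) / all_purpose * 100  # ~62.5%, the "60% less" above
```

Moving the same 500 DBUs/day of scheduled ETL from an all-purpose cluster to a job cluster cuts the platform fee by roughly 60% with no code changes.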

Spot Instances

Spot instances (AWS) or Spot VMs (Azure) provide up to 90% discount on compute costs. The tradeoff is that the cloud provider can reclaim them with short notice. Databricks handles spot interruptions gracefully by reassigning tasks to remaining nodes.

Spot Instance Best Practices

  • Driver node: Always use on-demand (spot interruption kills the entire job)
  • Worker nodes: Use 50-100% spot for batch jobs (Spark recomputes lost tasks)
  • Streaming: Use on-demand driver + spot workers with fallback to on-demand
  • Spot fallback: Enable "fall back to on-demand" to prevent job failures during spot shortages
  • Instance diversity: Allow multiple instance types to improve spot availability
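The practices above map directly onto a handful of fields in the Databricks clusters API. A hypothetical job-cluster spec (AWS field names; the instance type and autoscale bounds are illustrative) might look like:

```python
# Hypothetical job-cluster spec applying the spot best practices:
# on-demand driver, spot workers with fallback to on-demand.
job_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "aws_attributes": {
        "first_on_demand": 1,                  # first node (the driver) stays on-demand
        "availability": "SPOT_WITH_FALLBACK",  # revert to on-demand during spot shortages
        "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
        "zone_id": "auto",                     # let Databricks pick the best AZ
    },
}
```

`first_on_demand: 1` is what keeps the driver off spot capacity, and `SPOT_WITH_FALLBACK` is the "fall back to on-demand" behavior described above.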

DBU Cost Breakdown

Where Your Databricks Dollars Go (typical breakdown):

  • Cloud VM compute (EC2/Azure VMs): ~40%
  • Databricks DBU fees (platform markup): ~30%
  • Cloud storage (S3/ADLS/GCS): ~15%
  • Data transfer and networking: ~10%
  • Other (logging, monitoring, secrets): ~5%

Biggest savings: right-size clusters (compute) and use job clusters (reduce the DBU rate by roughly 60%).

Cluster Policies

Cluster policies enforce guardrails on cluster configurations, preventing users from creating expensive clusters. Policies can restrict instance types, maximum workers, auto-termination settings, and Spark configurations.

JSON - Cluster Policy Definition
{
  "spark_version": {
    "type": "fixed",
    "value": "14.3.x-scala2.12",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      {"label": "Small (4 cores, 16 GB)", "value": "Standard_D4s_v3"},
      {"label": "Medium (8 cores, 32 GB)", "value": "Standard_D8s_v3"},
      {"label": "Large (16 cores, 64 GB)", "value": "Standard_D16s_v3"}
    ],
    "defaultValue": "Standard_D4s_v3"
  },
  "aws_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "custom_tags.CostCenter": {
    "type": "fixed",
    "value": "data-engineering"
  }
}
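The enforcement semantics of these rule types are simple to reason about. The sketch below mirrors the range / fixed / allowlist checks in plain Python purely for illustration; real enforcement happens server-side when a cluster is created or edited:

```python
# Minimal sketch of how a policy constrains a cluster request.
def policy_violations(request: dict, policy: dict) -> list:
    violations = []
    for attr, rule in policy.items():
        value = request.get(attr)
        if value is None:
            continue  # omitted attributes fall back to the policy's defaults
        if rule["type"] == "fixed" and value != rule["value"]:
            violations.append(attr)
        elif rule["type"] == "range" and not (
            rule["minValue"] <= value <= rule["maxValue"]
        ):
            violations.append(attr)
        elif rule["type"] == "allowlist" and value not in {
            v["value"] for v in rule["values"]
        }:
            violations.append(attr)
    return violations

# Subset of the policy above, for demonstration
policy = {
    "num_workers": {"type": "range", "minValue": 1, "maxValue": 10},
    "node_type_id": {
        "type": "allowlist",
        "values": [{"value": "Standard_D4s_v3"}, {"value": "Standard_D8s_v3"}],
    },
}
ok = policy_violations({"num_workers": 4, "node_type_id": "Standard_D4s_v3"}, policy)
bad = policy_violations({"num_workers": 50, "node_type_id": "Standard_E64s_v3"}, policy)
```

A compliant request passes cleanly, while a 50-worker request on a non-allowlisted instance type is rejected on both counts.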

Storage Optimization

Storage costs accumulate from Delta Lake tables, old versions retained by time travel, and small files. Regular maintenance keeps storage costs in check.

SQL - Storage Optimization Commands
-- Check table size and file count
DESCRIBE DETAIL catalog.schema.large_table;

-- VACUUM: Remove old files beyond retention period
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.large_table RETAIN 168 HOURS;

-- OPTIMIZE: Compact small files into larger ones
OPTIMIZE catalog.schema.large_table;

-- Set shorter retention for non-critical tables
ALTER TABLE catalog.schema.staging_table
SET TBLPROPERTIES (
    delta.deletedFileRetentionDuration = 'interval 24 hours',
    delta.logRetentionDuration = 'interval 7 days'
);

-- Find candidate tables in a schema
-- (information_schema.tables does not expose table sizes; run
--  DESCRIBE DETAIL on each table and compare its sizeInBytes field)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'schema_name'
    AND table_type = 'MANAGED';
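Shortening `delta.deletedFileRetentionDuration` pays off in proportion to how much data a table rewrites. A back-of-envelope sketch, assuming a uniform daily churn of rewritten/deleted files (the 100 GB/day figure is an illustrative assumption, not from this article):

```python
def stale_gb_retained(daily_churn_gb: float, retention_hours: float) -> float:
    """GB of deleted-but-not-yet-vacuumed files held at steady state."""
    return daily_churn_gb * retention_hours / 24

default_retention = stale_gb_retained(100, 168)  # 7-day default retention -> 700 GB
staging_retention = stale_gb_retained(100, 24)   # 24-hour staging policy  -> 100 GB
```

For a high-churn staging table, dropping from the 168-hour default to 24 hours cuts the stale-file footprint by roughly six-sevenths, at the cost of a shorter time-travel window.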

Query Optimization Tips

Use Partition Pruning

Always include partition columns in WHERE clauses. A query filtering on a partitioned date column can skip 99% of the data.

Avoid SELECT *

Select only the columns you need. Columnar formats like Parquet skip reading unneeded columns, saving I/O and compute.

Prefer SQL over UDFs

Built-in SQL functions run in Photon/JVM and are 10-100x faster than Python UDFs. Rewrite UDFs as SQL expressions where possible.

Use Delta Lake Caching

Delta cache on SSD-equipped instances caches frequently accessed data automatically. Use i3 or L-series instances for cache-heavy workloads.
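On instance families where the disk cache is not enabled by default, it can be turned on via Spark configuration. The conf keys below are the documented toggles; the size limits are illustrative and should be tuned to the instance's local SSD capacity:

```python
# Cluster Spark confs to force-enable the disk (Delta) cache;
# set these in the cluster's Spark config, not at runtime.
delta_cache_confs = {
    "spark.databricks.io.cache.enabled": "true",   # turn the cache on
    "spark.databricks.io.cache.maxDiskUsage": "50g",       # per-node data cap (illustrative)
    "spark.databricks.io.cache.maxMetaDataCache": "1g",    # per-node metadata cap (illustrative)
}
```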

Cost Monitoring

SQL - Cost Monitoring Queries
-- Query the system billing usage table
-- Available in system.billing.usage (Unity Catalog)
-- NOTE: $0.22/DBU below is a single placeholder rate; join
-- system.billing.list_prices for actual per-SKU prices
SELECT
    usage_date,
    workspace_id,
    sku_name,
    usage_unit,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost_usd
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY estimated_cost_usd DESC;

-- Find most expensive clusters
-- (cluster_id lives in the usage_metadata struct; custom_tags is a map)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    custom_tags['Team'] AS team,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
    AND sku_name LIKE '%ALL_PURPOSE%'
GROUP BY usage_metadata.cluster_id, custom_tags['Team']
ORDER BY total_dbus DESC
LIMIT 10;

-- Detect idle clusters (low DBU usage)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    SUM(usage_quantity) AS total_dbus,
    COUNT(DISTINCT usage_date) AS active_days,
    ROUND(SUM(usage_quantity) / COUNT(DISTINCT usage_date), 1) AS dbus_per_day
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id
HAVING SUM(usage_quantity) / COUNT(DISTINCT usage_date) < 5
ORDER BY dbus_per_day ASC;
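When wiring a query like the idle-cluster check into an alerting job, it helps to unit-test the threshold logic separately. A pure-Python mirror of that query (the 5 DBUs/day cutoff matches the SQL; the sample rows are made up):

```python
from collections import defaultdict

def idle_clusters(usage_rows, threshold=5.0):
    """Flag clusters averaging under `threshold` DBUs per active day.

    usage_rows: iterable of (cluster_id, usage_date, dbus) tuples.
    """
    totals = defaultdict(float)
    days = defaultdict(set)
    for cluster_id, usage_date, dbus in usage_rows:
        totals[cluster_id] += dbus
        days[cluster_id].add(usage_date)
    return sorted(
        cid for cid, total in totals.items()
        if total / len(days[cid]) < threshold
    )

rows = [
    ("cluster-a", "2024-06-01", 2.0),
    ("cluster-a", "2024-06-02", 1.5),
    ("cluster-b", "2024-06-01", 40.0),
]
# cluster-a averages 1.75 DBUs/day and gets flagged; cluster-b does not
```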

Practice Problems

Problem 1: Cost Reduction Plan

Medium

Your team's Databricks bill is $50K/month. Analysis shows: 60% is all-purpose clusters used for scheduled ETL, 25% is interactive clusters left running overnight, 15% is SQL Warehouse usage. Design a plan to reduce costs by 40%.

Problem 2: Cluster Policy Design

Medium

Design cluster policies for three teams: Data Engineering (needs up to 32 workers for large ETL), Data Science (needs GPU instances), and Business Analytics (only needs small clusters for queries). Include cost controls.

Problem 3: Storage Cost Investigation

Medium

Your cloud storage bill for Databricks data has tripled in 6 months despite data volume only doubling. What are the likely causes and how would you investigate?

Quick Reference

Optimization         | Expected Savings               | Effort
Job clusters for ETL | 60% DBU cost reduction         | Low (config change)
Spot instances       | 30-90% compute savings         | Low (enable in policy)
Auto-termination     | Eliminates idle costs          | Low (cluster policy)
SQL Serverless       | Scale-to-zero, no idle         | Low (switch warehouse)
VACUUM + OPTIMIZE    | 30-50% storage reduction       | Medium (schedule jobs)
Photon               | 2-8x faster (lower total cost) | Low (cluster type)
Cluster policies     | Prevents over-provisioning     | Medium (define + assign)