Databricks Best Practices & Cost Optimization



Why Cost Optimization Matters

The Problem: Databricks costs can escalate quickly when clusters are over-provisioned, left running idle, or when workloads are not optimized. A single misconfigured cluster can waste thousands of dollars per month.

The Solution: Systematic cost optimization through cluster right-sizing, spot instances, auto-scaling, cluster policies, and workload-aware compute selection can reduce Databricks spend by 30-60%.

Real Impact: Organizations that implement these practices typically save $50K-$500K annually depending on their Databricks footprint.

Cost Optimization Decision Tree
What is the workload?

ETL / Batch Jobs:
  • Job clusters (auto-terminate)
  • Spot instances for workers
  • Auto-scaling 2-20 workers
  • Photon for SQL-heavy ETL

Interactive / Dev:
  • All-purpose clusters
  • Auto-terminate: 30 min
  • Single node for small data
  • Shared clusters for teams

SQL Analytics:
  • SQL Warehouses (serverless)
  • Scale-to-zero when idle
  • Auto-scaling by query load
  • T-shirt sizing (2XS to 4XL)

DBU cost comparison (approximate):
  • Jobs Compute: $0.15/DBU + cloud VM cost
  • All-Purpose: $0.40/DBU + cloud VM cost
  • SQL Serverless: $0.22/DBU (VM cost included)

The most impactful cost optimization is choosing the right compute type for each workload. Job clusters cost 60% less than all-purpose clusters per DBU. SQL Warehouses with serverless eliminate idle costs entirely.

Workload Type     | Recommended Compute          | Auto-Terminate        | Spot Workers
Scheduled ETL     | Job clusters                 | Immediate (job scope) | Yes (50-100%)
Ad-hoc Analysis   | All-purpose or SQL Warehouse | 30 min                | Optional
Streaming         | Job clusters (always-on)     | Never (monitored)     | Workers only
ML Training       | Job clusters with GPU        | Immediate             | Workers only
Dashboard Queries | SQL Warehouse (serverless)   | Scale-to-zero         | N/A (managed)
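To make the rate difference concrete, here is a rough monthly comparison using the approximate DBU rates quoted above. VM charges are excluded, and the workload size (500 DBUs/day) is an illustrative assumption, not a figure from this article:

```python
# Approximate list rates from the DBU cost comparison above
JOBS_RATE = 0.15         # $/DBU, jobs compute
ALL_PURPOSE_RATE = 0.40  # $/DBU, all-purpose compute

def monthly_dbu_cost(dbus_per_day: float, rate: float, days: int = 30) -> float:
    """Platform (DBU) cost only, before cloud VM charges."""
    return dbus_per_day * rate * days

all_purpose = monthly_dbu_cost(500, ALL_PURPOSE_RATE)   # ~$6,000/month
jobs = monthly_dbu_cost(500, JOBS_RATE)                 # ~$2,250/month
savings_pct = (all_purpose - jobs) / all_purpose * 100  # ~62.5%, the "60% less" above
```

Moving the same 500 DBUs/day of scheduled ETL from an all-purpose cluster to a job cluster cuts the platform fee by roughly 60% with no code changes.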

Spot Instances

Spot instances (AWS) or Spot VMs (Azure) provide up to 90% discount on compute costs. The tradeoff is that the cloud provider can reclaim them with short notice. Databricks handles spot interruptions gracefully by reassigning tasks to remaining nodes.

Spot Instance Best Practices

  • Driver node: Always use on-demand (spot interruption kills the entire job)
  • Worker nodes: Use 50-100% spot for batch jobs (Spark recomputes lost tasks)
  • Streaming: Use on-demand driver + spot workers with fallback to on-demand
  • Spot fallback: Enable "fall back to on-demand" to prevent job failures during spot shortages
  • Instance diversity: Allow multiple instance types to improve spot availability
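The practices above map directly onto a handful of fields in the Databricks clusters API. A hypothetical job-cluster spec (AWS field names; the instance type and autoscale bounds are illustrative) might look like:

```python
# Hypothetical job-cluster spec applying the spot best practices:
# on-demand driver, spot workers with fallback to on-demand.
job_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "aws_attributes": {
        "first_on_demand": 1,                  # first node (the driver) stays on-demand
        "availability": "SPOT_WITH_FALLBACK",  # revert to on-demand during spot shortages
        "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
        "zone_id": "auto",                     # let Databricks pick the best AZ
    },
}
```

`first_on_demand: 1` is what keeps the driver off spot capacity, and `SPOT_WITH_FALLBACK` is the "fall back to on-demand" behavior described above.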

DBU Cost Breakdown

Where Your Databricks Dollars Go (typical breakdown):

  • Cloud VM compute (EC2/Azure VMs): ~40%
  • Databricks DBU fees (platform markup): ~30%
  • Cloud storage (S3/ADLS/GCS): ~15%
  • Data transfer and networking: ~10%
  • Other (logging, monitoring, secrets): ~5%

Biggest savings: right-size clusters (compute) and use job clusters (reduce the DBU rate by roughly 60%).

Cluster Policies

Cluster policies enforce guardrails on cluster configurations, preventing users from creating expensive clusters. Policies can restrict instance types, maximum workers, auto-termination settings, and Spark configurations.

JSON - Cluster Policy Definition
{
  "spark_version": {
    "type": "fixed",
    "value": "14.3.x-scala2.12",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      {"label": "Small (4 cores, 16 GB)", "value": "Standard_D4s_v3"},
      {"label": "Medium (8 cores, 32 GB)", "value": "Standard_D8s_v3"},
      {"label": "Large (16 cores, 64 GB)", "value": "Standard_D16s_v3"}
    ],
    "defaultValue": "Standard_D4s_v3"
  },
  "aws_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "custom_tags.CostCenter": {
    "type": "fixed",
    "value": "data-engineering"
  }
}
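The enforcement semantics of these rule types are simple to reason about. The sketch below mirrors the range / fixed / allowlist checks in plain Python purely for illustration; real enforcement happens server-side when a cluster is created or edited:

```python
# Minimal sketch of how a policy constrains a cluster request.
def policy_violations(request: dict, policy: dict) -> list:
    violations = []
    for attr, rule in policy.items():
        value = request.get(attr)
        if value is None:
            continue  # omitted attributes fall back to the policy's defaults
        if rule["type"] == "fixed" and value != rule["value"]:
            violations.append(attr)
        elif rule["type"] == "range" and not (
            rule["minValue"] <= value <= rule["maxValue"]
        ):
            violations.append(attr)
        elif rule["type"] == "allowlist" and value not in {
            v["value"] for v in rule["values"]
        }:
            violations.append(attr)
    return violations

# Subset of the policy above, for demonstration
policy = {
    "num_workers": {"type": "range", "minValue": 1, "maxValue": 10},
    "node_type_id": {
        "type": "allowlist",
        "values": [{"value": "Standard_D4s_v3"}, {"value": "Standard_D8s_v3"}],
    },
}
ok = policy_violations({"num_workers": 4, "node_type_id": "Standard_D4s_v3"}, policy)
bad = policy_violations({"num_workers": 50, "node_type_id": "Standard_E64s_v3"}, policy)
```

A compliant request passes cleanly, while a 50-worker request on a non-allowlisted instance type is rejected on both counts.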

Storage Optimization

Storage costs accumulate from Delta Lake tables, old versions retained by time travel, and small files. Regular maintenance keeps storage costs in check.

SQL - Storage Optimization Commands
-- Check table size and file count
DESCRIBE DETAIL catalog.schema.large_table;

-- VACUUM: Remove old files beyond retention period
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.large_table RETAIN 168 HOURS;

-- OPTIMIZE: Compact small files into larger ones
OPTIMIZE catalog.schema.large_table;

-- Set shorter retention for non-critical tables
ALTER TABLE catalog.schema.staging_table
SET TBLPROPERTIES (
    delta.deletedFileRetentionDuration = 'interval 24 hours',
    delta.logRetentionDuration = 'interval 7 days'
);

-- Find candidate tables in a schema
-- (information_schema.tables does not expose table sizes; run
--  DESCRIBE DETAIL on each table and compare its sizeInBytes field)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'schema_name'
    AND table_type = 'MANAGED';
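Shortening `delta.deletedFileRetentionDuration` pays off in proportion to how much data a table rewrites. A back-of-envelope sketch, assuming a uniform daily churn of rewritten/deleted files (the 100 GB/day figure is an illustrative assumption, not from this article):

```python
def stale_gb_retained(daily_churn_gb: float, retention_hours: float) -> float:
    """GB of deleted-but-not-yet-vacuumed files held at steady state."""
    return daily_churn_gb * retention_hours / 24

default_retention = stale_gb_retained(100, 168)  # 7-day default retention -> 700 GB
staging_retention = stale_gb_retained(100, 24)   # 24-hour staging policy  -> 100 GB
```

For a high-churn staging table, dropping from the 168-hour default to 24 hours cuts the stale-file footprint by roughly six-sevenths, at the cost of a shorter time-travel window.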

Query Optimization Tips

Use Partition Pruning

Always include partition columns in WHERE clauses. A query filtering on a partitioned date column can skip 99% of the data.

Avoid SELECT *

Select only the columns you need. Columnar formats like Parquet skip reading unneeded columns, saving I/O and compute.

Prefer SQL over UDFs

Built-in SQL functions run in Photon/JVM and are 10-100x faster than Python UDFs. Rewrite UDFs as SQL expressions where possible.

Use Delta Lake Caching

Delta cache on SSD-equipped instances caches frequently accessed data automatically. Use i3 or L-series instances for cache-heavy workloads.
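On instance families where the disk cache is not enabled by default, it can be turned on via Spark configuration. The conf keys below are the documented toggles; the size limits are illustrative and should be tuned to the instance's local SSD capacity:

```python
# Cluster Spark confs to force-enable the disk (Delta) cache;
# set these in the cluster's Spark config, not at runtime.
delta_cache_confs = {
    "spark.databricks.io.cache.enabled": "true",   # turn the cache on
    "spark.databricks.io.cache.maxDiskUsage": "50g",       # per-node data cap (illustrative)
    "spark.databricks.io.cache.maxMetaDataCache": "1g",    # per-node metadata cap (illustrative)
}
```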

Cost Monitoring

SQL - Cost Monitoring Queries
-- Query the system billing usage table
-- Available in system.billing.usage (Unity Catalog)
-- NOTE: $0.22/DBU below is a single placeholder rate; join
-- system.billing.list_prices for actual per-SKU prices
SELECT
    usage_date,
    workspace_id,
    sku_name,
    usage_unit,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost_usd
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY estimated_cost_usd DESC;

-- Find most expensive clusters
-- (cluster_id lives in the usage_metadata struct; custom_tags is a map)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    custom_tags['Team'] AS team,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
    AND sku_name LIKE '%ALL_PURPOSE%'
GROUP BY usage_metadata.cluster_id, custom_tags['Team']
ORDER BY total_dbus DESC
LIMIT 10;

-- Detect idle clusters (low DBU usage)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    SUM(usage_quantity) AS total_dbus,
    COUNT(DISTINCT usage_date) AS active_days,
    ROUND(SUM(usage_quantity) / COUNT(DISTINCT usage_date), 1) AS dbus_per_day
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id
HAVING SUM(usage_quantity) / COUNT(DISTINCT usage_date) < 5
ORDER BY dbus_per_day ASC;
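When wiring a query like the idle-cluster check into an alerting job, it helps to unit-test the threshold logic separately. A pure-Python mirror of that query (the 5 DBUs/day cutoff matches the SQL; the sample rows are made up):

```python
from collections import defaultdict

def idle_clusters(usage_rows, threshold=5.0):
    """Flag clusters averaging under `threshold` DBUs per active day.

    usage_rows: iterable of (cluster_id, usage_date, dbus) tuples.
    """
    totals = defaultdict(float)
    days = defaultdict(set)
    for cluster_id, usage_date, dbus in usage_rows:
        totals[cluster_id] += dbus
        days[cluster_id].add(usage_date)
    return sorted(
        cid for cid, total in totals.items()
        if total / len(days[cid]) < threshold
    )

rows = [
    ("cluster-a", "2024-06-01", 2.0),
    ("cluster-a", "2024-06-02", 1.5),
    ("cluster-b", "2024-06-01", 40.0),
]
# cluster-a averages 1.75 DBUs/day and gets flagged; cluster-b does not
```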

Practice Problems

Problem 1: Cost Reduction Plan

Medium

Your team's Databricks bill is $50K/month. Analysis shows: 60% is all-purpose clusters used for scheduled ETL, 25% is interactive clusters left running overnight, 15% is SQL Warehouse usage. Design a plan to reduce costs by 40%.

Problem 2: Cluster Policy Design

Medium

Design cluster policies for three teams: Data Engineering (needs up to 32 workers for large ETL), Data Science (needs GPU instances), and Business Analytics (only needs small clusters for queries). Include cost controls.

Problem 3: Storage Cost Investigation

Medium

Your cloud storage bill for Databricks data has tripled in 6 months despite data volume only doubling. What are the likely causes and how would you investigate?

Quick Reference

Optimization         | Expected Savings               | Effort
Job clusters for ETL | 60% DBU cost reduction         | Low (config change)
Spot instances       | 30-90% compute savings         | Low (enable in policy)
Auto-termination     | Eliminates idle costs          | Low (cluster policy)
SQL Serverless       | Scale-to-zero, no idle         | Low (switch warehouse)
VACUUM + OPTIMIZE    | 30-50% storage reduction       | Medium (schedule jobs)
Photon               | 2-8x faster (lower total cost) | Low (cluster type)
Cluster policies     | Prevents over-provisioning     | Medium (define + assign)