Databricks Best Practices & Cost Optimization


Cluster Right-Sizing

Why Cost Optimization Matters

The Problem: Databricks costs can escalate quickly when clusters are over-provisioned, left running idle, or when workloads are not optimized. A single misconfigured cluster can waste thousands of dollars per month.

The Solution: Systematic cost optimization through cluster right-sizing, spot instances, auto-scaling, cluster policies, and workload-aware compute selection can reduce Databricks spend by 30-60%.

Real Impact: Organizations that implement these practices typically save $50K-$500K annually depending on their Databricks footprint.

Cost Optimization Decision Tree
What is the workload?

  • ETL / Batch Jobs: job clusters (auto-terminate), spot instances for workers, auto-scaling 2-20 workers, Photon for SQL-heavy ETL
  • Interactive / Dev: all-purpose clusters, 30-minute auto-terminate, single node for small data, shared clusters for teams
  • SQL Analytics: SQL Warehouses (serverless), scale-to-zero when idle, auto-scaling by query load, T-shirt sizing (2XS to 4XL)

DBU cost comparison (approximate):

  • Jobs Compute: $0.15/DBU + cloud VM cost
  • All-Purpose: $0.40/DBU + cloud VM cost
  • SQL Serverless: $0.22/DBU, VM cost included

The most impactful cost optimization is choosing the right compute type for each workload. Job clusters cost 60% less than all-purpose clusters per DBU. SQL Warehouses with serverless eliminate idle costs entirely.

Workload Type     | Recommended Compute          | Auto-Terminate        | Spot Workers
Scheduled ETL     | Job clusters                 | Immediate (job scope) | Yes (50-100%)
Ad-hoc Analysis   | All-purpose or SQL Warehouse | 30 min                | Optional
Streaming         | Job clusters (always-on)     | Never (monitored)     | Workers only
ML Training       | Job clusters with GPU        | Immediate             | Workers only
Dashboard Queries | SQL Warehouse (serverless)   | Scale-to-zero         | N/A (managed)
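The rate gap between compute types compounds quickly over a month. A back-of-the-envelope sketch in Python, using the approximate DBU rates quoted above (the 20 DBUs/hour and 100 hours/month workload figures are illustrative assumptions, not a benchmark):

```python
# Approximate DBU rates from the comparison above (platform fee only;
# cloud VM charges are the same for both compute types).
JOBS_RATE = 0.15         # $/DBU, Jobs Compute
ALL_PURPOSE_RATE = 0.40  # $/DBU, All-Purpose Compute

def monthly_dbu_cost(dbus_per_hour: float, hours_per_month: float, rate: float) -> float:
    """DBU cost for a workload, excluding cloud VM charges."""
    return dbus_per_hour * hours_per_month * rate

# Hypothetical ETL workload: 20 DBUs/hour, 100 cluster-hours/month.
jobs_cost = monthly_dbu_cost(20, 100, JOBS_RATE)
all_purpose_cost = monthly_dbu_cost(20, 100, ALL_PURPOSE_RATE)
savings = 1 - jobs_cost / all_purpose_cost

print(f"job clusters: ${jobs_cost:,.0f}/month")
print(f"all-purpose:  ${all_purpose_cost:,.0f}/month")
print(f"DBU savings:  {savings:.1%}")
```

Moving the same scheduled ETL from an all-purpose cluster to a job cluster cuts the platform fee by roughly 60% before any other optimization is applied.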

Key Takeaway: Always match compute type to workload -- job clusters for ETL (60% cheaper), SQL Warehouses for analytics (scale-to-zero), and all-purpose clusters only for interactive development.

Spot Instances

Spot instances (AWS) or Spot VMs (Azure) provide up to 90% discount on compute costs. The tradeoff is that the cloud provider can reclaim them with short notice. Databricks handles spot interruptions gracefully by reassigning tasks to remaining nodes.

Spot Instance Best Practices

  • Driver node: Always use on-demand (spot interruption kills the entire job)
  • Worker nodes: Use 50-100% spot for batch jobs (Spark recomputes lost tasks)
  • Streaming: Use on-demand driver + spot workers with fallback to on-demand
  • Spot fallback: Enable "fall back to on-demand" to prevent job failures during spot shortages
  • Instance diversity: Allow multiple instance types to improve spot availability
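These practices map onto a handful of fields in the cluster spec. Below is a sketch of a Jobs API `new_cluster` payload for AWS; the Spark version, instance type, and autoscale bounds are illustrative, not prescriptive:

```python
# Sketch of a job-cluster spec applying the spot best practices above.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",  # illustrative instance type
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "aws_attributes": {
        # first_on_demand: 1 keeps the driver (first node) on-demand,
        # so a spot reclaim never kills the whole job.
        "first_on_demand": 1,
        # SPOT_WITH_FALLBACK retries on-demand when spot capacity
        # is unavailable, preventing job failures.
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    },
}

# Sanity check before submitting: the driver must be on-demand.
assert new_cluster["aws_attributes"]["first_on_demand"] >= 1
```

On Azure the equivalent settings live under `azure_attributes` with the Azure-specific availability values.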

DBU Cost Breakdown

Where Your Databricks Dollars Go
  • Cloud VM compute (EC2/Azure VMs): 40%
  • Databricks DBU fees (platform markup): 30%
  • Cloud storage (S3/ADLS/GCS): 15%
  • Data transfer and networking: 10%
  • Other (logging, monitoring, secrets): 5%

Biggest savings: right-size clusters (compute) and use job clusters (reduce the DBU rate by 60%).

Key Takeaway: Spot instances save 30-90% on compute but should never be used for driver nodes. Always enable "fall back to on-demand" to prevent job failures during spot shortages.

Cluster Policies

Cluster policies enforce guardrails on cluster configurations, preventing users from creating expensive clusters. Policies can restrict instance types, maximum workers, auto-termination settings, and Spark configurations.

JSON - Cluster Policy Definition
{
  "spark_version": {
    "type": "fixed",
    "value": "14.3.x-scala2.12",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      {"label": "Small (4 cores, 16 GB)", "value": "Standard_D4s_v3"},
      {"label": "Medium (8 cores, 32 GB)", "value": "Standard_D8s_v3"},
      {"label": "Large (16 cores, 64 GB)", "value": "Standard_D16s_v3"}
    ],
    "defaultValue": "Standard_D4s_v3"
  },
  "azure_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "custom_tags.CostCenter": {
    "type": "fixed",
    "value": "data-engineering"
  }
}

Common Mistake

Wrong: Setting autotermination_minutes to 0 (disabled) on all-purpose clusters

Why it fails: Clusters run 24/7 even when idle, costing $500-$2,000/month per cluster in wasted compute.

Instead: Always set autotermination_minutes between 10 and 120. Use cluster policies to enforce this.

Storage Optimization

Storage costs accumulate from Delta Lake tables, old versions retained by time travel, and small files. Regular maintenance keeps storage costs in check.

SQL - Storage Optimization Commands
-- Check table size and file count
DESCRIBE DETAIL catalog.schema.large_table;

-- VACUUM: Remove old files beyond retention period
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.large_table RETAIN 168 HOURS;

-- OPTIMIZE: Compact small files into larger ones
OPTIMIZE catalog.schema.large_table;

-- Set shorter retention for non-critical tables
ALTER TABLE catalog.schema.staging_table
SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 24 hours',
    'delta.logRetentionDuration' = 'interval 7 days'
);

-- Find largest tables in a schema (the size column below is not
-- available on every workspace; fall back to DESCRIBE DETAIL per table)
SELECT
    table_name,
    ROUND(data_source_format_size / 1073741824, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'schema_name'
ORDER BY data_source_format_size DESC
LIMIT 20;

Output
-- DESCRIBE DETAIL output:
format: delta | numFiles: 12,847 | sizeInBytes: 48,392,017,920
-- After OPTIMIZE:
format: delta | numFiles: 384 | sizeInBytes: 42,156,892,160 (13% smaller)

Key Takeaway: Schedule VACUUM (removes old versions) and OPTIMIZE (compacts small files) weekly on large Delta tables. This alone can reduce storage costs by 30-50%.

Query Optimization Tips

Use Partition Pruning

Always include partition columns in WHERE clauses. A query filtering on a partitioned date column can skip 99% of the data.

Avoid SELECT *

Select only the columns you need. Columnar formats like Parquet skip reading unneeded columns, saving I/O and compute.

Prefer SQL over UDFs

Built-in SQL functions run in Photon/JVM and are 10-100x faster than Python UDFs. Rewrite UDFs as SQL expressions where possible.

Use Delta Lake Caching

Delta cache on SSD-equipped instances caches frequently accessed data automatically. Use i3 or L-series instances for cache-heavy workloads.

Cost Monitoring

SQL - Cost Monitoring Queries
-- Query the system billing usage table
-- Available in system.billing.usage (Unity Catalog)
-- $0.22/DBU below is an illustrative blended rate; join
-- system.billing.list_prices for actual per-SKU pricing
SELECT
    usage_date,
    workspace_id,
    sku_name,
    usage_unit,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost_usd
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY estimated_cost_usd DESC;

-- Find most expensive clusters
SELECT
    usage_metadata.cluster_id AS cluster_id,
    custom_tags['Team'] AS team,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
    AND sku_name LIKE '%ALL_PURPOSE%'
GROUP BY usage_metadata.cluster_id, custom_tags['Team']
ORDER BY total_dbus DESC
LIMIT 10;

-- Detect idle clusters (low DBU usage)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    SUM(usage_quantity) AS total_dbus,
    COUNT(DISTINCT usage_date) AS active_days,
    ROUND(SUM(usage_quantity) / COUNT(DISTINCT usage_date), 1) AS dbus_per_day
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id
HAVING SUM(usage_quantity) / COUNT(DISTINCT usage_date) < 5
ORDER BY dbus_per_day ASC;
Output
usage_date   | sku_name              | total_dbus | estimated_cost_usd
2024-03-15   | JOBS_COMPUTE          | 1,245.3    | $273.97
2024-03-15   | ALL_PURPOSE_COMPUTE   | 892.7      | $196.39
2024-03-15   | SQL_COMPUTE           | 445.1      | $97.92

Deep Dive: Photon Engine Cost-Benefit Analysis

Photon costs ~2x more per DBU but runs SQL/DataFrame operations 2-8x faster. The net effect is often cheaper total cost for SQL-heavy ETL workloads. Example: A 4-hour ETL job on standard compute (400 DBUs * $0.15 = $60) finishes in 1 hour on Photon (100 DBUs * $0.30 = $30) -- 50% savings despite the higher per-DBU rate. Photon is most effective for aggregations, joins, and window functions. It provides less benefit for Python UDFs or ML training.
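The break-even condition in this example generalizes: Photon is cheaper in total whenever its speedup exceeds its per-DBU rate multiplier. A small sketch of the arithmetic, using the rates from the worked example above:

```python
STANDARD_RATE = 0.15     # $/DBU, standard Jobs Compute (from the example)
PHOTON_MULTIPLIER = 2.0  # Photon's DBU rate is roughly 2x standard

def photon_saves_money(standard_dbus: float, speedup: float) -> bool:
    """True when the Photon run costs less than the standard run."""
    standard_cost = standard_dbus * STANDARD_RATE
    # Photon consumes fewer DBUs (shorter runtime) at a higher rate.
    photon_cost = (standard_dbus / speedup) * STANDARD_RATE * PHOTON_MULTIPLIER
    return photon_cost < standard_cost

# The 4-hour ETL example: 400 DBUs on standard compute, 4x Photon speedup.
print(photon_saves_money(400, 4.0))  # $30 vs $60 -> True
# Below the ~2x break-even point, Photon costs more in total.
print(photon_saves_money(400, 1.5))  # $80 vs $60 -> False
```

Measure the actual speedup on a representative job before committing a workload to Photon; UDF-heavy pipelines often land below the break-even point.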

Practice Problems

Problem 1: Cost Reduction Plan

Medium

Your team's Databricks bill is $50K/month. Analysis shows: 60% is all-purpose clusters used for scheduled ETL, 25% is interactive clusters left running overnight, 15% is SQL Warehouse usage. Design a plan to reduce costs by 40%.

Problem 2: Cluster Policy Design

Medium

Design cluster policies for three teams: Data Engineering (needs up to 32 workers for large ETL), Data Science (needs GPU instances), and Business Analytics (only needs small clusters for queries). Include cost controls.

Problem 3: Storage Cost Investigation

Medium

Your cloud storage bill for Databricks data has tripled in 6 months despite data volume only doubling. What are the likely causes and how would you investigate?

Common Mistake

Wrong: VACUUM catalog.schema.table RETAIN 0 HOURS

Why it fails: Deleting all old file versions breaks concurrent readers and makes time travel impossible. Active queries may fail with FileNotFoundException.

Instead: VACUUM catalog.schema.table RETAIN 168 HOURS (7-day default minimum)

Quick Reference

Optimization         | Expected Savings                | Effort
Job clusters for ETL | 60% DBU cost reduction          | Low (config change)
Spot instances       | 30-90% compute savings          | Low (enable in policy)
Auto-termination     | Eliminates idle costs           | Low (cluster policy)
SQL Serverless       | Scale-to-zero, no idle          | Low (switch warehouse)
VACUUM + OPTIMIZE    | 30-50% storage reduction        | Medium (schedule jobs)
Photon               | 2-8x faster (lower total cost)  | Low (cluster type)
Cluster policies     | Prevents over-provisioning      | Medium (define + assign)