Databricks Best Practices & Cost Optimization


Cluster Right-Sizing

Why Cost Optimization Matters

The Problem: Databricks costs can escalate quickly when clusters are over-provisioned, left running idle, or when workloads are not optimized. A single misconfigured cluster can waste thousands of dollars per month.

The Solution: Systematic cost optimization through cluster right-sizing, spot instances, auto-scaling, cluster policies, and workload-aware compute selection can reduce Databricks spend by 30-60%.

Real Impact: Organizations that implement these practices typically save $50K-$500K annually depending on their Databricks footprint.

Cost Optimization Decision Tree
What is the workload?

  • ETL / Batch Jobs: job clusters (auto-terminate), spot instances for workers, auto-scaling 2-20 workers, Photon for SQL-heavy ETL
  • Interactive / Dev: all-purpose clusters, 30-minute auto-terminate, single node for small data, shared clusters for teams
  • SQL Analytics: SQL Warehouses (serverless), scale-to-zero when idle, auto-scaling by query load, T-shirt sizing (2XS to 4XL)

DBU cost comparison (approximate):

  • Jobs Compute: $0.15/DBU + cloud VM cost
  • All-Purpose: $0.40/DBU + cloud VM cost
  • SQL Serverless: $0.22/DBU, VM cost included

The most impactful cost optimization is choosing the right compute type for each workload. Job clusters cost 60% less than all-purpose clusters per DBU. SQL Warehouses with serverless eliminate idle costs entirely.

Workload Type     | Recommended Compute          | Auto-Terminate        | Spot Workers
Scheduled ETL     | Job clusters                 | Immediate (job scope) | Yes (50-100%)
Ad-hoc Analysis   | All-purpose or SQL Warehouse | 30 min                | Optional
Streaming         | Job clusters (always-on)     | Never (monitored)     | Workers only
ML Training       | Job clusters with GPU        | Immediate             | Workers only
Dashboard Queries | SQL Warehouse (serverless)   | Scale-to-zero         | N/A (managed)
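The rate gap between compute types compounds quickly over a month. A back-of-the-envelope sketch in Python, using the approximate DBU rates quoted above (the 20 DBUs/hour and 100 hours/month workload figures are illustrative assumptions, not a benchmark):

```python
# Approximate DBU rates from the comparison above (platform fee only;
# cloud VM charges are the same for both compute types).
JOBS_RATE = 0.15         # $/DBU, Jobs Compute
ALL_PURPOSE_RATE = 0.40  # $/DBU, All-Purpose Compute

def monthly_dbu_cost(dbus_per_hour: float, hours_per_month: float, rate: float) -> float:
    """DBU cost for a workload, excluding cloud VM charges."""
    return dbus_per_hour * hours_per_month * rate

# Hypothetical ETL workload: 20 DBUs/hour, 100 cluster-hours/month.
jobs_cost = monthly_dbu_cost(20, 100, JOBS_RATE)
all_purpose_cost = monthly_dbu_cost(20, 100, ALL_PURPOSE_RATE)
savings = 1 - jobs_cost / all_purpose_cost

print(f"job clusters: ${jobs_cost:,.0f}/month")
print(f"all-purpose:  ${all_purpose_cost:,.0f}/month")
print(f"DBU savings:  {savings:.1%}")
```

Moving the same scheduled ETL from an all-purpose cluster to a job cluster cuts the platform fee by roughly 60% before any other optimization is applied.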

Key Takeaway: Always match compute type to workload -- job clusters for ETL (60% cheaper), SQL Warehouses for analytics (scale-to-zero), and all-purpose clusters only for interactive development.

Spot Instances

Spot instances (AWS) or Spot VMs (Azure) provide up to 90% discount on compute costs. The tradeoff is that the cloud provider can reclaim them with short notice. Databricks handles spot interruptions gracefully by reassigning tasks to remaining nodes.

Spot Instance Best Practices

  • Driver node: Always use on-demand (spot interruption kills the entire job)
  • Worker nodes: Use 50-100% spot for batch jobs (Spark recomputes lost tasks)
  • Streaming: Use on-demand driver + spot workers with fallback to on-demand
  • Spot fallback: Enable "fall back to on-demand" to prevent job failures during spot shortages
  • Instance diversity: Allow multiple instance types to improve spot availability
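These practices map onto a handful of fields in the cluster spec. Below is a sketch of a Jobs API `new_cluster` payload for AWS; the Spark version, instance type, and autoscale bounds are illustrative, not prescriptive:

```python
# Sketch of a job-cluster spec applying the spot best practices above.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",  # illustrative instance type
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "aws_attributes": {
        # first_on_demand: 1 keeps the driver (first node) on-demand,
        # so a spot reclaim never kills the whole job.
        "first_on_demand": 1,
        # SPOT_WITH_FALLBACK retries on-demand when spot capacity
        # is unavailable, preventing job failures.
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    },
}

# Sanity check before submitting: the driver must be on-demand.
assert new_cluster["aws_attributes"]["first_on_demand"] >= 1
```

On Azure the equivalent settings live under `azure_attributes` with the Azure-specific availability values.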

DBU Cost Breakdown

Where Your Databricks Dollars Go
  • Cloud VM compute (EC2/Azure VMs): 40%
  • Databricks DBU fees (platform markup): 30%
  • Cloud storage (S3/ADLS/GCS): 15%
  • Data transfer and networking: 10%
  • Other (logging, monitoring, secrets): 5%

Biggest savings: right-size clusters (compute) and use job clusters (reduce the DBU rate by 60%).

Key Takeaway: Spot instances save 30-90% on compute but should never be used for driver nodes. Always enable "fall back to on-demand" to prevent job failures during spot shortages.

Cluster Policies

Cluster policies enforce guardrails on cluster configurations, preventing users from creating expensive clusters. Policies can restrict instance types, maximum workers, auto-termination settings, and Spark configurations.

JSON - Cluster Policy Definition
{
  "spark_version": {
    "type": "fixed",
    "value": "14.3.x-scala2.12",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      {"label": "Small (4 cores, 16 GB)", "value": "Standard_D4s_v3"},
      {"label": "Medium (8 cores, 32 GB)", "value": "Standard_D8s_v3"},
      {"label": "Large (16 cores, 64 GB)", "value": "Standard_D16s_v3"}
    ],
    "defaultValue": "Standard_D4s_v3"
  },
  "azure_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "custom_tags.CostCenter": {
    "type": "fixed",
    "value": "data-engineering"
  }
}

Common Mistake

Wrong: Setting autotermination_minutes to 0 (disabled) on all-purpose clusters

Why it fails: Clusters run 24/7 even when idle, costing $500-$2,000/month per cluster in wasted compute.

Instead: Always set autotermination_minutes between 10 and 120. Use cluster policies to enforce this.

Storage Optimization

Storage costs accumulate from Delta Lake tables, old versions retained by time travel, and small files. Regular maintenance keeps storage costs in check.

SQL - Storage Optimization Commands
-- Check table size and file count
DESCRIBE DETAIL catalog.schema.large_table;

-- VACUUM: Remove old files beyond retention period
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.large_table RETAIN 168 HOURS;

-- OPTIMIZE: Compact small files into larger ones
OPTIMIZE catalog.schema.large_table;

-- Set shorter retention for non-critical tables
ALTER TABLE catalog.schema.staging_table
SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 24 hours',
    'delta.logRetentionDuration' = 'interval 7 days'
);

-- Find largest tables in a schema (the size column below is not
-- available on every workspace; fall back to DESCRIBE DETAIL per table)
SELECT
    table_name,
    ROUND(data_source_format_size / 1073741824, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'schema_name'
ORDER BY data_source_format_size DESC
LIMIT 20;

Output
-- DESCRIBE DETAIL output:
format: delta | numFiles: 12,847 | sizeInBytes: 48,392,017,920
-- After OPTIMIZE:
format: delta | numFiles: 384 | sizeInBytes: 42,156,892,160 (13% smaller)

Key Takeaway: Schedule VACUUM (removes old versions) and OPTIMIZE (compacts small files) weekly on large Delta tables. This alone can reduce storage costs by 30-50%.

Query Optimization Tips

Use Partition Pruning

Always include partition columns in WHERE clauses. A query filtering on a partitioned date column can skip 99% of the data.

Avoid SELECT *

Select only the columns you need. Columnar formats like Parquet skip reading unneeded columns, saving I/O and compute.

Prefer SQL over UDFs

Built-in SQL functions run in Photon/JVM and are 10-100x faster than Python UDFs. Rewrite UDFs as SQL expressions where possible.

Use Delta Lake Caching

Delta cache on SSD-equipped instances caches frequently accessed data automatically. Use i3 or L-series instances for cache-heavy workloads.

Cost Monitoring

SQL - Cost Monitoring Queries
-- Query the system billing usage table
-- Available in system.billing.usage (Unity Catalog)
-- $0.22/DBU below is an illustrative blended rate; join
-- system.billing.list_prices for actual per-SKU pricing
SELECT
    usage_date,
    workspace_id,
    sku_name,
    usage_unit,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost_usd
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY estimated_cost_usd DESC;

-- Find most expensive clusters
SELECT
    usage_metadata.cluster_id AS cluster_id,
    custom_tags['Team'] AS team,
    SUM(usage_quantity) AS total_dbus,
    ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
    AND sku_name LIKE '%ALL_PURPOSE%'
GROUP BY usage_metadata.cluster_id, custom_tags['Team']
ORDER BY total_dbus DESC
LIMIT 10;

-- Detect idle clusters (low DBU usage)
SELECT
    usage_metadata.cluster_id AS cluster_id,
    SUM(usage_quantity) AS total_dbus,
    COUNT(DISTINCT usage_date) AS active_days,
    ROUND(SUM(usage_quantity) / COUNT(DISTINCT usage_date), 1) AS dbus_per_day
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id
HAVING SUM(usage_quantity) / COUNT(DISTINCT usage_date) < 5
ORDER BY dbus_per_day ASC;
Output
usage_date   | sku_name              | total_dbus | estimated_cost_usd
2024-03-15   | JOBS_COMPUTE          | 1,245.3    | $273.97
2024-03-15   | ALL_PURPOSE_COMPUTE   | 892.7      | $196.39
2024-03-15   | SQL_COMPUTE           | 445.1      | $97.92

Deep Dive: Photon Engine Cost-Benefit Analysis

Photon costs ~2x more per DBU but runs SQL/DataFrame operations 2-8x faster. The net effect is often cheaper total cost for SQL-heavy ETL workloads. Example: A 4-hour ETL job on standard compute (400 DBUs * $0.15 = $60) finishes in 1 hour on Photon (100 DBUs * $0.30 = $30) -- 50% savings despite the higher per-DBU rate. Photon is most effective for aggregations, joins, and window functions. It provides less benefit for Python UDFs or ML training.
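The break-even condition in this example generalizes: Photon is cheaper in total whenever its speedup exceeds its per-DBU rate multiplier. A small sketch of the arithmetic, using the rates from the worked example above:

```python
STANDARD_RATE = 0.15     # $/DBU, standard Jobs Compute (from the example)
PHOTON_MULTIPLIER = 2.0  # Photon's DBU rate is roughly 2x standard

def photon_saves_money(standard_dbus: float, speedup: float) -> bool:
    """True when the Photon run costs less than the standard run."""
    standard_cost = standard_dbus * STANDARD_RATE
    # Photon consumes fewer DBUs (shorter runtime) at a higher rate.
    photon_cost = (standard_dbus / speedup) * STANDARD_RATE * PHOTON_MULTIPLIER
    return photon_cost < standard_cost

# The 4-hour ETL example: 400 DBUs on standard compute, 4x Photon speedup.
print(photon_saves_money(400, 4.0))  # $30 vs $60 -> True
# Below the ~2x break-even point, Photon costs more in total.
print(photon_saves_money(400, 1.5))  # $80 vs $60 -> False
```

Measure the actual speedup on a representative job before committing a workload to Photon; UDF-heavy pipelines often land below the break-even point.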

Practice Problems

Problem 1: Cost Reduction Plan

Medium

Your team's Databricks bill is $50K/month. Analysis shows: 60% is all-purpose clusters used for scheduled ETL, 25% is interactive clusters left running overnight, 15% is SQL Warehouse usage. Design a plan to reduce costs by 40%.

Problem 2: Cluster Policy Design

Medium

Design cluster policies for three teams: Data Engineering (needs up to 32 workers for large ETL), Data Science (needs GPU instances), and Business Analytics (only needs small clusters for queries). Include cost controls.

Problem 3: Storage Cost Investigation

Medium

Your cloud storage bill for Databricks data has tripled in 6 months despite data volume only doubling. What are the likely causes and how would you investigate?

Common Mistake

Wrong: VACUUM catalog.schema.table RETAIN 0 HOURS

Why it fails: Deleting all old file versions breaks concurrent readers and makes time travel impossible. Active queries may fail with FileNotFoundException.

Instead: VACUUM catalog.schema.table RETAIN 168 HOURS (7-day default minimum)

Quick Reference

Optimization         | Expected Savings                | Effort
Job clusters for ETL | 60% DBU cost reduction          | Low (config change)
Spot instances       | 30-90% compute savings          | Low (enable in policy)
Auto-termination     | Eliminates idle costs           | Low (cluster policy)
SQL Serverless       | Scale-to-zero, no idle          | Low (switch warehouse)
VACUUM + OPTIMIZE    | 30-50% storage reduction        | Medium (schedule jobs)
Photon               | 2-8x faster (lower total cost)  | Low (cluster type)
Cluster policies     | Prevents over-provisioning      | Medium (define + assign)