Cluster Right-Sizing
Why Cost Optimization Matters
The Problem: Databricks costs can escalate quickly when clusters are over-provisioned, left running idle, or when workloads are not optimized. A single misconfigured cluster can waste thousands of dollars per month.
The Solution: Systematic cost optimization through cluster right-sizing, spot instances, auto-scaling, cluster policies, and workload-aware compute selection can reduce Databricks spend by 30-60%.
Real Impact: Organizations that implement these practices typically save $50K-$500K annually depending on their Databricks footprint.
The most impactful cost optimization is matching compute type to workload. Job clusters cost roughly 60% less per DBU than all-purpose clusters (exact rates vary by cloud and pricing tier), and serverless SQL Warehouses scale to zero, eliminating idle costs entirely.
| Workload Type | Recommended Compute | Auto-Terminate | Spot Workers |
|---|---|---|---|
| Scheduled ETL | Job clusters | Immediate (job scope) | Yes (50-100%) |
| Ad-hoc Analysis | All-purpose or SQL Warehouse | 30 min | Optional |
| Streaming | Job clusters (always-on) | Never (monitored) | Workers only |
| ML Training | Job clusters with GPU | Immediate | Workers only |
| Dashboard Queries | SQL Warehouse (serverless) | Scale-to-zero | N/A (managed) |
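As a back-of-the-envelope illustration of the compute-type choice above, the sketch below compares a nightly ETL job on all-purpose versus job compute. The DBU rates and per-node DBU emission are assumptions for illustration, not published prices:

```python
# Rough monthly cost comparison for a nightly 2-hour ETL job on 8 nodes.
# DBU rates below are illustrative assumptions; check your cloud/tier price list.
ALL_PURPOSE_RATE = 0.55   # assumed $/DBU for all-purpose compute
JOBS_RATE = 0.15          # assumed $/DBU for job compute

def monthly_cost(rate_per_dbu, dbus_per_hour, hours_per_run, runs_per_month):
    """Total DBU spend for a recurring job at a given $/DBU rate."""
    return rate_per_dbu * dbus_per_hour * hours_per_run * runs_per_month

dbus_per_hour = 8 * 2.0   # 8 nodes emitting ~2 DBU/hour each (assumed)
all_purpose = monthly_cost(ALL_PURPOSE_RATE, dbus_per_hour, 2, 30)
jobs = monthly_cost(JOBS_RATE, dbus_per_hour, 2, 30)
print(f"all-purpose: ${all_purpose:.2f}/mo, job cluster: ${jobs:.2f}/mo")
print(f"savings from switching: {1 - jobs / all_purpose:.0%}")
```

With these assumed rates the same workload is several times cheaper on job compute, which is why moving scheduled ETL off all-purpose clusters is usually the first change to make.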
Spot Instances
Spot instances (AWS) or Spot VMs (Azure) can discount compute costs by up to 90%. The tradeoff is that the cloud provider can reclaim them on short notice. Databricks handles spot interruptions gracefully: Spark reschedules the lost tasks on the remaining nodes.
Spot Instance Best Practices
- Driver node: Always use on-demand (spot interruption kills the entire job)
- Worker nodes: Use 50-100% spot for batch jobs (Spark recomputes lost tasks)
- Streaming: Use on-demand driver + spot workers with fallback to on-demand
- Spot fallback: Enable "fall back to on-demand" to prevent job failures during spot shortages
- Instance diversity: Allow multiple instance types to improve spot availability
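The driver/worker split above can be reasoned about with a simple expected-cost model: keep one on-demand driver and price the workers as a spot/on-demand mix. The prices and discount below are illustrative assumptions, not real quotes:

```python
# Hourly compute cost for a cluster with an on-demand driver and a
# configurable fraction of spot workers. Prices are assumed for illustration.
def cluster_hourly_cost(n_workers, spot_fraction, on_demand_price, spot_discount):
    """On-demand driver plus a spot/on-demand worker mix."""
    spot_workers = round(n_workers * spot_fraction)
    od_workers = n_workers - spot_workers
    spot_price = on_demand_price * (1 - spot_discount)
    driver = on_demand_price                      # driver stays on-demand
    return driver + od_workers * on_demand_price + spot_workers * spot_price

full_od = cluster_hourly_cost(10, 0.0, 0.40, 0.70)      # no spot
mostly_spot = cluster_hourly_cost(10, 0.8, 0.40, 0.70)  # 80% spot workers
print(f"all on-demand: ${full_od:.2f}/h, 80% spot workers: ${mostly_spot:.2f}/h")
```

At an assumed 70% spot discount, moving 80% of workers to spot roughly halves the hourly bill while the on-demand driver keeps the job alive through interruptions.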
Cluster Policies
Cluster policies enforce guardrails on cluster configurations, preventing users from creating expensive clusters. Policies can restrict instance types, maximum workers, auto-termination settings, and Spark configurations.
```json
{
  "spark_version": {
    "type": "fixed",
    "value": "14.3.x-scala2.12",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 10,
    "defaultValue": 2
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_D4s_v3", "Standard_D8s_v3", "Standard_D16s_v3"],
    "defaultValue": "Standard_D4s_v3"
  },
  "azure_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "custom_tags.CostCenter": {
    "type": "fixed",
    "value": "data-engineering"
  }
}
```

The allowlist limits node types to small (D4s_v3: 4 cores, 16 GB), medium (D8s_v3: 8 cores, 32 GB), and large (D16s_v3: 16 cores, 64 GB) sizes, and `azure_attributes.first_on_demand` pins the first node (the driver) to on-demand capacity.
Storage Optimization
Storage costs accumulate from Delta Lake tables, old versions retained by time travel, and small files. Regular maintenance keeps storage costs in check.
```sql
-- Check table size and file count
DESCRIBE DETAIL catalog.schema.large_table;

-- VACUUM: remove old files beyond the retention period
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.large_table RETAIN 168 HOURS;

-- OPTIMIZE: compact small files into larger ones
OPTIMIZE catalog.schema.large_table;

-- Set shorter retention for non-critical tables
ALTER TABLE catalog.schema.staging_table
SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 24 hours',
  'delta.logRetentionDuration' = 'interval 7 days'
);

-- Find large tables: information_schema does not expose table sizes,
-- so list candidates and check each one's sizeInBytes and numFiles
SHOW TABLES IN catalog.schema;
DESCRIBE DETAIL catalog.schema.candidate_table;
```
Query Optimization Tips
Use Partition Pruning
Always include partition columns in WHERE clauses. A query filtering on a partitioned date column can skip 99% of the data.
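A toy model of why this works, with made-up partition and file counts: when the filter is on the partition column, the engine can discard whole partition directories without opening a single file.

```python
# Toy model of partition pruning: a table partitioned by date has one
# directory per day. A filter on the partition column lets the engine
# skip entire directories; only matching partitions are ever read.
partitions = {f"dt=2024-01-{day:02d}": [f"part-{i}.parquet" for i in range(100)]
              for day in range(1, 31)}  # 30 daily partitions, 100 files each

predicate = lambda part: part == "dt=2024-01-15"   # WHERE dt = '2024-01-15'
scanned = [f for part, files in partitions.items() if predicate(part)
           for f in files]
total = sum(len(files) for files in partitions.values())
print(f"scanned {len(scanned)} of {total} files")
```

Here the filter touches 1 of 30 partitions, so about 97% of the files are never read; with finer-grained partitioning the skip rate can exceed 99%.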
Avoid SELECT *
Select only the columns you need. Columnar formats like Parquet skip reading unneeded columns, saving I/O and compute.
Prefer SQL over UDFs
Built-in SQL functions run in Photon/JVM and are 10-100x faster than Python UDFs. Rewrite UDFs as SQL expressions where possible.
Use Delta Lake Caching
Delta cache on SSD-equipped instances caches frequently accessed data automatically. Use i3 or L-series instances for cache-heavy workloads.
Cost Monitoring
```sql
-- Query the system billing usage table (system.billing.usage, Unity Catalog)
-- The $0.22/DBU figure is a placeholder; join system.billing.list_prices
-- for your actual rates
SELECT
  usage_date,
  workspace_id,
  sku_name,
  usage_unit,
  SUM(usage_quantity) AS total_dbus,
  ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost_usd
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY estimated_cost_usd DESC;

-- Find the most expensive all-purpose clusters
SELECT
  usage_metadata.cluster_id AS cluster_id,
  custom_tags['Team'] AS team,
  SUM(usage_quantity) AS total_dbus,
  ROUND(SUM(usage_quantity) * 0.22, 2) AS estimated_cost
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
  AND sku_name LIKE '%ALL_PURPOSE%'
GROUP BY usage_metadata.cluster_id, custom_tags['Team']
ORDER BY total_dbus DESC
LIMIT 10;

-- Detect idle clusters (low DBU usage)
SELECT
  usage_metadata.cluster_id AS cluster_id,
  SUM(usage_quantity) AS total_dbus,
  COUNT(DISTINCT usage_date) AS active_days,
  ROUND(SUM(usage_quantity) / COUNT(DISTINCT usage_date), 1) AS dbus_per_day
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id
HAVING SUM(usage_quantity) / COUNT(DISTINCT usage_date) < 5
ORDER BY dbus_per_day ASC;
```
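The idle-cluster logic in the last query can be prototyped in plain Python before scheduling it, for example to tune the DBUs-per-active-day threshold. The sample rows below are fabricated for illustration:

```python
# Pure-Python version of the idle-cluster check: average DBUs per active
# day per cluster, flagging anything under a threshold. Rows are made up.
from collections import defaultdict

rows = [  # (cluster_id, usage_date, dbus)
    ("c-etl", "2024-05-01", 40.0), ("c-etl", "2024-05-02", 38.0),
    ("c-idle", "2024-05-01", 1.2), ("c-idle", "2024-05-03", 0.8),
]

THRESHOLD = 5.0                       # DBUs per active day
dbus = defaultdict(float)             # total DBUs per cluster
days = defaultdict(set)               # distinct active dates per cluster
for cluster, date, quantity in rows:
    dbus[cluster] += quantity
    days[cluster].add(date)

idle = {c: dbus[c] / len(days[c]) for c in dbus
        if dbus[c] / len(days[c]) < THRESHOLD}
print(idle)   # clusters averaging under the threshold per active day
```

A cluster that appears here but keeps accruing any DBUs at all is usually one left running with no work scheduled, which auto-termination would eliminate.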
Practice Problems
Problem 1: Cost Reduction Plan
Medium: Your team's Databricks bill is $50K/month. Analysis shows: 60% is all-purpose clusters used for scheduled ETL, 25% is interactive clusters left running overnight, 15% is SQL Warehouse usage. Design a plan to reduce costs by 40%.
Problem 2: Cluster Policy Design
Medium: Design cluster policies for three teams: Data Engineering (needs up to 32 workers for large ETL), Data Science (needs GPU instances), and Business Analytics (only needs small clusters for queries). Include cost controls.
Problem 3: Storage Cost Investigation
Medium: Your cloud storage bill for Databricks data has tripled in 6 months despite data volume only doubling. What are the likely causes and how would you investigate?
Quick Reference
| Optimization | Expected Savings | Effort |
|---|---|---|
| Job clusters for ETL | 60% DBU cost reduction | Low (config change) |
| Spot instances | 30-90% compute savings | Low (enable in policy) |
| Auto-termination | Eliminates idle costs | Low (cluster policy) |
| SQL Serverless | Scale-to-zero, no idle | Low (switch warehouse) |
| VACUUM + OPTIMIZE | 30-50% storage reduction | Medium (schedule jobs) |
| Photon | 2-8x faster (lower total cost) | Low (cluster type) |
| Cluster policies | Prevents over-provisioning | Medium (define + assign) |