## All-Purpose vs Job Clusters

### Why Cluster Types Matter

**The Problem:** Running all workloads on a single cluster type wastes money and creates resource contention between interactive development and production jobs.

**The Solution:** Databricks provides two cluster types optimized for different use cases -- all-purpose clusters for interactive work and job clusters for automated production workloads.

**Cost Impact:** Using job clusters instead of all-purpose clusters for production workloads can reduce compute costs by 30-60% through automatic termination and right-sizing.
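The savings come from two compounding effects: the cluster only exists while the job runs, and Jobs Compute bills at a lower DBU rate. A back-of-envelope sketch under assumed rates (real DBU prices vary by cloud, tier, and instance type; the $0.40 and $0.15 figures below are illustrative only):

```python
# Back-of-envelope comparison (illustrative DBU rates, not published pricing)
ALL_PURPOSE_RATE = 0.40  # USD per DBU -- assumption for illustration
JOBS_RATE = 0.15         # USD per DBU -- assumption for illustration

def monthly_cost(dbu_per_hour: float, hours: float, rate: float) -> float:
    """Cost of a cluster consuming dbu_per_hour for `hours` in a month."""
    return dbu_per_hour * hours * rate

# Same ~3 DBU/hour workload: all-purpose left running 24/7 (~730 h/month)
# vs a job cluster that runs 4 h/night and terminates itself
always_on = monthly_cost(3.0, 730, ALL_PURPOSE_RATE)
job_only = monthly_cost(3.0, 4 * 30, JOBS_RATE)
print(f"All-purpose, 24/7:      ${always_on:,.2f}/month")
print(f"Job cluster, 4 h/night: ${job_only:,.2f}/month")
print(f"Savings: {100 * (1 - job_only / always_on):.0f}%")
```

The gap here is larger than the headline 30-60% because the example combines both effects; a job that runs most of the day would land closer to the quoted range.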
### Real-World Analogy
Think of clusters like vehicles:
- All-Purpose Cluster = Your personal car -- always available, you control when it starts and stops, shared with passengers (collaborators)
- Job Cluster = A rental car -- created for a specific trip (job), returned (terminated) automatically when the trip is done
| Feature | All-Purpose Cluster | Job Cluster |
|---|---|---|
| Use Case | Interactive development, ad-hoc analysis | Scheduled production jobs |
| Creation | Manually via UI/API | Automatically by job scheduler |
| Lifecycle | Persists until manually terminated | Created at job start, terminated at job end |
| Multi-User | Multiple users can attach notebooks | Dedicated to a single job run |
| Cost | Higher (runs even when idle) | Lower (pay only for job duration) |
| DBU Rate | All-Purpose Compute rate | Jobs Compute rate (typically 2-3x cheaper) |
```python
import requests

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create an all-purpose cluster
all_purpose_config = {
    "cluster_name": "dev-interactive",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers=headers,
    json=all_purpose_config,
)
response.raise_for_status()
cluster_id = response.json()["cluster_id"]
print(f"Created cluster: {cluster_id}")

# Job clusters are not created directly; they are defined inside a job
# configuration as a new_cluster block and provisioned for each run
job_cluster_config = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    }
}
```
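To make the "created by the job scheduler" point concrete, here is a hedged sketch of submitting a job whose task carries a `new_cluster` block via the Jobs API 2.1 (`/api/2.1/jobs/create`). The workspace URL, token, and notebook path are placeholders:

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"  # placeholder
headers = {"Authorization": "Bearer dapi_your_token"}              # placeholder

# The job cluster lives inside the job definition; the scheduler
# provisions it at run start and tears it down at run end.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},  # placeholder
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 4,
            },
        }
    ],
}

def create_job(spec: dict) -> int:
    """Submit the job definition and return the new job_id."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/create", headers=headers, json=spec
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

# job_id = create_job(job_spec)  # requires a real workspace and token
```

Note that the task uses `new_cluster` rather than `existing_cluster_id` -- that single field is what makes each run get its own ephemeral job cluster.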
## Cluster Configuration

Configuring a cluster correctly is critical for performance and cost optimization. Here are the key parameters you need to understand.

### Databricks Runtime Versions

Every cluster runs a Databricks Runtime (DBR), which bundles Apache Spark, Delta Lake, and other libraries. Choose the right runtime for your workload:
- **Standard Runtime**: General-purpose runtime with Spark, Delta Lake, and common libraries. Best for data engineering and general analytics workloads.
- **ML Runtime**: Ships with popular ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) pre-installed, saving time setting up ML environments.
- **Photon Runtime**: Native C++ query engine that accelerates SQL and DataFrame operations by up to 12x. Ideal for SQL-heavy and ETL workloads.
- **GPU Runtime**: Includes GPU drivers and the CUDA toolkit for deep learning training and inference. Required for GPU-accelerated clusters.
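The runtime flavor is encoded in the version string you pass as `spark_version` (the `clusters/spark-versions` endpoint lists the keys available in your workspace). A small sketch -- the sample keys below follow a commonly seen naming pattern, but treat it as an assumption and verify against your workspace's actual list:

```python
# versions = requests.get(f"{DATABRICKS_HOST}/api/2.0/clusters/spark-versions",
#                         headers=headers).json()["versions"]

def classify_runtime(key: str) -> str:
    """Guess the runtime flavor from markers in the version key (assumed pattern)."""
    if "-gpu-" in key:       # check before "-ml-": GPU keys also contain "-ml-"
        return "GPU ML"
    if "-ml-" in key:
        return "ML"
    if "-photon-" in key:
        return "Photon"
    return "Standard"

# Sample of the shape the spark-versions endpoint returns (illustrative values)
versions = [
    {"key": "14.3.x-scala2.12", "name": "14.3 LTS"},
    {"key": "14.3.x-cpu-ml-scala2.12", "name": "14.3 LTS ML"},
    {"key": "14.3.x-gpu-ml-scala2.12", "name": "14.3 LTS ML (GPU)"},
    {"key": "14.3.x-photon-scala2.12", "name": "14.3 LTS Photon"},
]
for v in versions:
    print(f"{v['key']}: {classify_runtime(v['key'])}")
```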
### Node Types and Sizing
```python
# List the node types available in your cloud and region
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list-node-types",
    headers=headers,
)
node_types = response.json()["node_types"]
for nt in node_types[:5]:
    print(f"{nt['node_type_id']}: {nt['memory_mb']} MB, {nt['num_cores']} cores")
```

Common sizing guidelines:

- **Memory-optimized**: large shuffles, caching, ML training
- **Compute-optimized**: CPU-intensive transformations
- **Storage-optimized**: large Delta Lake reads/writes
- **GPU-accelerated**: deep learning, NLP models
### Spark Configuration
```python
# Key Spark configurations for cluster setup
spark_conf = {
    # Adaptive Query Execution (enabled by default in recent DBR releases)
    "spark.sql.adaptive.enabled": "true",
    # Dynamic partition pruning
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    # Shuffle partitions -- "auto" (Databricks-specific) lets AQE pick the
    # count; on open-source Spark, set an explicit number instead
    "spark.sql.shuffle.partitions": "auto",
    # Delta Lake write optimizations
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}

# Set configs at runtime in a notebook (where `spark` is predefined)
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Check the current value
print(spark.conf.get("spark.sql.shuffle.partitions"))
```
## Autoscaling

Autoscaling dynamically adjusts the number of worker nodes based on workload demand. This helps optimize cost while maintaining performance during peak loads.

### How Autoscaling Works

Databricks monitors cluster utilization and adjusts the number of workers between your configured min and max values:
- Scale Up: When pending tasks exceed available slots, new nodes are added within 2-5 minutes
- Scale Down: When nodes are idle for a configurable period, they are removed gracefully
- Optimized Autoscaling: Databricks-enhanced version that scales more aggressively than standard Spark autoscaling
```python
# Create a cluster with autoscaling enabled
autoscaling_cluster = {
    "cluster_name": "autoscaling-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10,
    },
    "autotermination_minutes": 30,
}

# Monitor the current size and state of a cluster
cluster_info = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/get",
    headers=headers,
    params={"cluster_id": cluster_id},
).json()
print(f"Current workers: {cluster_info.get('num_workers', 'N/A')}")
print(f"State: {cluster_info['state']}")
```
### Autoscaling Best Practices
| Scenario | Min Workers | Max Workers | Rationale |
|---|---|---|---|
| Interactive Development | 1 | 4 | Keep min low to save cost during idle time |
| ETL Pipeline | 2 | 10 | Ensure baseline capacity, scale for large loads |
| ML Training | 4 | 4 | Fixed size -- ML often needs consistent resources |
| Streaming | 2 | 8 | Handle traffic spikes while maintaining low latency |
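The table above can be encoded as presets. One API wrinkle worth capturing: a fixed-size cluster is specified with `num_workers`, while a variable one uses an `autoscale` range -- the two are mutually exclusive in a cluster spec. A sketch (the preset values mirror the table; the scenario names are invented for illustration):

```python
# Autoscaling presets mirroring the best-practice table (illustrative names)
SCALING_PRESETS = {
    "interactive": {"min_workers": 1, "max_workers": 4},
    "etl": {"min_workers": 2, "max_workers": 10},
    "ml_training": {"min_workers": 4, "max_workers": 4},
    "streaming": {"min_workers": 2, "max_workers": 8},
}

def scaling_block(scenario: str) -> dict:
    """Return the sizing fragment to merge into a cluster spec."""
    preset = SCALING_PRESETS[scenario]
    if preset["min_workers"] == preset["max_workers"]:
        # Fixed size: use num_workers, not an autoscale range
        return {"num_workers": preset["min_workers"]}
    return {"autoscale": dict(preset)}

print(scaling_block("ml_training"))  # -> {'num_workers': 4}
print(scaling_block("etl"))          # -> {'autoscale': {'min_workers': 2, 'max_workers': 10}}
```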
## Spot vs On-Demand Instances

Cloud providers sell unused compute capacity at steep discounts (often 60-90% off), called spot instances (AWS), spot VMs (Azure, formerly low-priority VMs), or preemptible/spot VMs (GCP). The trade-off is that this capacity can be reclaimed with short notice.

### Cost Savings Strategy

A common production pattern: use an on-demand instance for the driver node (critical -- losing the driver kills the job) and spot instances for worker nodes (Spark can recover from worker loss through task retries).
```python
# Cluster with spot instances for workers.
# Note: include only the attributes block matching your cloud;
# aws_attributes and azure_attributes are shown together for comparison.
spot_cluster = {
    "cluster_name": "cost-optimized-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,
    # AWS: spot instances for workers
    "aws_attributes": {
        "first_on_demand": 1,           # first node (the driver) is on-demand
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,  # bid up to 100% of the on-demand price
        "zone_id": "auto",
    },
    # Azure: spot VMs for workers
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,       # -1 = pay up to the on-demand price
    },
}

# AWS availability options:
#   SPOT               -- spot only (cheapest, risk of interruption)
#   SPOT_WITH_FALLBACK -- try spot, fall back to on-demand
#   ON_DEMAND          -- on-demand only (most reliable)
```
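The availability strings differ per cloud: Azure values carry an `_AZURE` suffix (`SPOT_AZURE`, `SPOT_WITH_FALLBACK_AZURE`, `ON_DEMAND_AZURE`). A tiny helper can make the reliability-vs-cost choice explicit; the environment names here are invented for illustration:

```python
def availability_for(env: str, cloud: str = "aws") -> str:
    """Pick an availability mode: reliability for critical prod, spot savings otherwise."""
    suffix = "_AZURE" if cloud == "azure" else ""
    if env == "prod_critical":
        return "ON_DEMAND" + suffix        # never interrupted
    if env == "batch_cheap":
        return "SPOT" + suffix             # cheapest, may be reclaimed
    return "SPOT_WITH_FALLBACK" + suffix   # default: spot with a safety net

print(availability_for("etl"))                     # -> SPOT_WITH_FALLBACK
print(availability_for("prod_critical", "azure"))  # -> ON_DEMAND_AZURE
```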
## Cluster Lifecycle

Understanding cluster states and transitions is essential for managing costs and debugging issues. Every cluster moves through a well-defined set of states (e.g. PENDING, RUNNING, RESTARTING, RESIZING, TERMINATING, TERMINATED).

### Managing Cluster Lifecycle
```python
import time

# Start a terminated cluster
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/start",
    headers=headers,
    json={"cluster_id": cluster_id},
)

# Terminate a running cluster (graceful shutdown; the config is kept)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
    headers=headers,
    json={"cluster_id": cluster_id},
)

# Restart a cluster (useful after config or library changes)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/restart",
    headers=headers,
    json={"cluster_id": cluster_id},
)

# Poll the cluster state until it reaches the target
def wait_for_cluster(cluster_id, target="RUNNING", timeout=600):
    start = time.time()
    while time.time() - start < timeout:
        info = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/clusters/get",
            headers=headers,
            params={"cluster_id": cluster_id},
        ).json()
        state = info["state"]
        print(f"State: {state}")
        if state == target:
            return info
        if state in ("TERMINATED", "ERROR"):
            raise RuntimeError(f"Cluster entered {state}")
        time.sleep(15)
    raise TimeoutError(f"Cluster did not reach {target} within {timeout}s")
```
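When debugging why a cluster resized or terminated, the cluster event log is usually more informative than the current state alone. A hedged sketch against the `clusters/events` endpoint (a POST that returns an `events` list; the host, token, and cluster ID below are placeholders):

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"  # placeholder
headers = {"Authorization": "Bearer dapi_your_token"}              # placeholder

def recent_events(cluster_id: str, limit: int = 25) -> list:
    """Fetch the most recent lifecycle events (resizes, terminations, ...)."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/events",
        headers=headers,
        json={"cluster_id": cluster_id, "limit": limit, "order": "DESC"},
    )
    resp.raise_for_status()
    return resp.json().get("events", [])

def format_event(event: dict) -> str:
    """One-line summary, e.g. for an autoscaling audit trail."""
    return f"{event.get('timestamp')}: {event.get('type')}"

# for e in recent_events("0401-123456-abcd123"):  # placeholder cluster_id
#     print(format_event(e))
```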
## Cluster Policies
Cluster policies let admins control what users can configure, enforcing cost guardrails and standardization across teams.
```json
{
  "name": "Cost-Controlled Development",
  "definition": {
    "autotermination_minutes": {
      "type": "range",
      "maxValue": 60,
      "defaultValue": 30
    },
    "num_workers": {
      "type": "range",
      "maxValue": 8,
      "defaultValue": 2
    },
    "node_type_id": {
      "type": "allowlist",
      "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
    }
  }
}
```
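One sharp edge when creating such a policy programmatically: the Policies API (`/api/2.0/policies/clusters/create`) expects `definition` as a JSON-encoded string, not a nested object. A sketch:

```python
import json
# import requests  # needed only for the actual API call

policy_definition = {
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "num_workers": {"type": "range", "maxValue": 8, "defaultValue": 2},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
}

payload = {
    "name": "Cost-Controlled Development",
    # Serialize: the API rejects a raw nested object in this field
    "definition": json.dumps(policy_definition),
}

# requests.post(f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
#               headers=headers, json=payload)
```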
## Practice Problems

### Problem 1: Cluster Type Selection

**Easy:** Your team has three workloads: (1) data scientists running ad-hoc queries throughout the day, (2) a nightly ETL pipeline that processes 500 GB of data, and (3) a weekly ML training job. What cluster configuration would you recommend for each?

### Problem 2: Cost Optimization

**Easy:** Your Databricks bill is $15,000/month. All-purpose clusters run 24/7 and account for 70% of the cost. What changes would you make to reduce the bill by at least 40%?

### Problem 3: Cluster Policy Design

**Medium:** Write a cluster policy that limits max workers to 10, requires auto-termination between 15-60 minutes, restricts users to only two node types, and forces the Photon runtime. How would you apply this to your data engineering team?
## Quick Reference
| Concept | Description | Key Point |
|---|---|---|
| All-Purpose Cluster | Interactive, multi-user compute | Higher DBU rate, persists until terminated |
| Job Cluster | Ephemeral, single-job compute | Lower DBU rate, auto-terminates after job |
| Autoscaling | Dynamic worker count adjustment | Set min/max workers for cost and performance |
| Spot Instances | Discounted cloud compute (60-90% off) | Use for workers; keep driver on-demand |
| Cluster Policies | Admin-defined configuration guardrails | Enforce cost limits and standardization |
| Auto-Termination | Automatic shutdown after an idle period | e.g. 30 min for dev; 0 disables it (for streaming) |
| Photon | Native C++ query engine | Up to 12x faster for SQL/DataFrame workloads |