Clusters & Compute Management

Easy -- 25 min read

All-Purpose vs Job Clusters

Why Cluster Types Matter

The Problem: Running all workloads on a single cluster type wastes money and creates resource contention between interactive development and production jobs.

The Solution: Databricks provides two cluster types optimized for different use cases -- all-purpose clusters for interactive work and job clusters for automated production workloads.

Cost Impact: Using job clusters instead of all-purpose clusters for production workloads can reduce compute costs by 30-60% through automatic termination and right-sizing.
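To make the savings concrete, here is a back-of-the-envelope comparison. The DBU rates, burn rate, and hours below are illustrative assumptions for the arithmetic, not actual Databricks pricing:

```python
# Illustrative comparison: the same 60 hours of monthly job runtime
# billed at an assumed all-purpose rate vs an assumed jobs rate.
ALL_PURPOSE_RATE = 0.40   # assumed $/DBU, all-purpose compute
JOBS_RATE = 0.20          # assumed $/DBU, jobs compute (2x cheaper)
DBU_PER_HOUR = 4          # assumed DBU burn rate of the cluster
MONTHLY_RUN_HOURS = 60    # time the workload actually needs

all_purpose_cost = MONTHLY_RUN_HOURS * DBU_PER_HOUR * ALL_PURPOSE_RATE
job_cost = MONTHLY_RUN_HOURS * DBU_PER_HOUR * JOBS_RATE
savings = 1 - job_cost / all_purpose_cost

print(f"All-purpose: ${all_purpose_cost:.2f}/month")
print(f"Job cluster: ${job_cost:.2f}/month ({savings:.0%} savings)")
```

The rate difference alone yields 50% here; an all-purpose cluster left running between jobs also bills for idle hours, which is where the upper end of the savings range comes from.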

Real-World Analogy

Think of clusters like vehicles:

  • All-Purpose Cluster = Your personal car -- always available, you control when it starts and stops, shared with passengers (collaborators)
  • Job Cluster = A rental car -- created for a specific trip (job), returned (terminated) automatically when the trip is done
Feature | All-Purpose Cluster | Job Cluster
Use Case | Interactive development, ad-hoc analysis | Scheduled production jobs
Creation | Manually via UI/API | Automatically by job scheduler
Lifecycle | Persists until manually terminated | Created at job start, terminated at job end
Multi-User | Multiple users can attach notebooks | Dedicated to a single job run
Cost | Higher (runs even when idle) | Lower (pay only for job duration)
DBU Rate | All-Purpose Compute rate | Jobs Compute rate (typically 2-3x cheaper)
Python - Creating Clusters via API
import requests
import json

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create an All-Purpose Cluster
all_purpose_config = {
    "cluster_name": "dev-interactive",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "autotermination_minutes": 30  # shut down after 30 idle minutes
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers=headers,
    json=all_purpose_config
)
response.raise_for_status()  # fail fast on auth or validation errors
cluster_id = response.json()["cluster_id"]
print(f"Created cluster: {cluster_id}")

# Job clusters are defined within job configurations
job_cluster_config = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    }
}
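The new_cluster block above only takes effect when embedded in a job or run definition. A minimal sketch of building a one-time run payload for POST /api/2.1/jobs/runs/submit; the run name, task key, and notebook path are hypothetical, and sending the request is left out:

```python
import json

def build_run_payload(notebook_path: str, num_workers: int = 4) -> dict:
    """Build a one-time run whose job cluster exists only for the run."""
    return {
        "run_name": "nightly-etl",  # hypothetical run name
        "tasks": [{
            "task_key": "etl",
            "notebook_task": {"notebook_path": notebook_path},
            # The cluster below is created when the run starts and
            # terminated automatically when the run finishes.
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": num_workers,
            },
        }],
    }

payload = build_run_payload("/Repos/etl/main")  # hypothetical path
print(json.dumps(payload, indent=2))
```

POST this payload to {DATABRICKS_HOST}/api/2.1/jobs/runs/submit with the same headers as above; the response contains a run_id you can poll.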

Cluster Configuration

Configuring a cluster correctly is critical for performance and cost optimization. Here are the key parameters you need to understand.

Databricks Runtime Versions

Every cluster runs a Databricks Runtime (DBR) which bundles Apache Spark, Delta Lake, and other libraries. Choose the right runtime for your workload:

Standard Runtime

General-purpose runtime with Spark, Delta Lake, and common libraries. Best for data engineering and general analytics workloads.

ML Runtime

Includes popular ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) pre-installed. Saves time setting up ML environments.

Photon Runtime

Native C++ query engine that accelerates SQL and DataFrame operations by up to 12x. Ideal for SQL-heavy and ETL workloads.

GPU Runtime

Includes GPU drivers and CUDA toolkit for deep learning training and inference. Required for GPU-accelerated clusters.
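Runtime variants can be recognized by their version keys (for example, 14.3.x-ml-scala2.12), which is how you would group the results of GET /api/2.0/clusters/spark-versions. A small sketch; the sample keys below are illustrative, not a live API response:

```python
def classify_runtime(key: str) -> str:
    """Classify a Databricks Runtime version key by its name markers."""
    if "photon" in key:
        return "Photon"
    if "gpu" in key:      # check before "ml": GPU keys contain "gpu-ml"
        return "GPU"
    if "ml" in key:
        return "ML"
    return "Standard"

sample_keys = [          # illustrative keys, not a real API response
    "14.3.x-scala2.12",
    "14.3.x-ml-scala2.12",
    "14.3.x-gpu-ml-scala2.12",
    "14.3.x-photon-scala2.12",
]
for key in sample_keys:
    print(f"{key}: {classify_runtime(key)}")
```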

Node Types and Sizing

Python - List Available Node Types
# List available node types for your cloud
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list-node-types",
    headers=headers
)

node_types = response.json()["node_types"]
for nt in node_types[:5]:
    print(f"{nt['node_type_id']}: {nt['memory_mb']}MB, {nt['num_cores']} cores")

# Common sizing guidelines:
# Memory-optimized: Large shuffles, caching, ML training
# Compute-optimized: CPU-intensive transformations
# Storage-optimized: Large Delta Lake reads/writes
# GPU-accelerated: Deep learning, NLP models

Spark Configuration

Python - Common Spark Configurations
# Key Spark configurations for cluster setup
spark_conf = {
    # Adaptive Query Execution (on by default since Spark 3.2 / recent DBRs)
    "spark.sql.adaptive.enabled": "true",

    # Dynamic partition pruning
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",

    # Shuffle partitions -- tune based on data size
    "spark.sql.shuffle.partitions": "auto",

    # Delta Lake optimizations
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}

# Set configs at runtime in a notebook
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Check current configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))

Autoscaling

Autoscaling dynamically adjusts the number of worker nodes based on workload demand. This helps optimize cost while maintaining performance during peak loads.

How Autoscaling Works

Databricks monitors cluster utilization and adjusts the number of workers between your configured min and max values:

  • Scale Up: When pending tasks exceed available slots, new nodes are added within 2-5 minutes
  • Scale Down: When nodes are idle for a configurable period, they are removed gracefully
  • Optimized Autoscaling: a Databricks-enhanced version that reacts to load faster and scales down more aggressively than standard Spark dynamic allocation
Python - Autoscaling Configuration
# Create cluster with autoscaling enabled
autoscaling_cluster = {
    "cluster_name": "autoscaling-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10
    },
    "autotermination_minutes": 30
}

# Monitor autoscaling activity
cluster_info = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/get",
    headers=headers,
    params={"cluster_id": cluster_id}
).json()

print(f"Current workers: {cluster_info.get('num_workers', 'N/A')}")
print(f"State: {cluster_info['state']}")

Autoscaling Best Practices

Scenario | Min Workers | Max Workers | Rationale
Interactive Development | 1 | 4 | Keep min low to save cost during idle time
ETL Pipeline | 2 | 10 | Ensure baseline capacity, scale for large loads
ML Training | 4 | 4 | Fixed size -- ML often needs consistent resources
Streaming | 2 | 8 | Handle traffic spikes while maintaining low latency

Spot vs On-Demand Instances

Cloud providers sell unused compute capacity at steep discounts (often 60-90% off): spot instances on AWS, spot VMs on Azure, and Spot (formerly preemptible) VMs on GCP. The trade-off is that this capacity can be reclaimed with little notice.

Cost Savings Strategy

A common production pattern: use on-demand instances for the driver node (critical -- losing the driver kills the job) and spot instances for worker nodes (Spark can recover from worker loss through task retry).

Python - Spot Instance Configuration
# Cluster with spot instances for workers
spot_cluster = {
    "cluster_name": "cost-optimized-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,

    # AWS: use spot instances
    "aws_attributes": {
        "first_on_demand": 1,  # Driver on-demand
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
        "zone_id": "auto"
    },

    # Azure: use spot VMs
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1  # -1 = pay up to the on-demand price
    }
}

# Availability options:
# SPOT             - Only spot (cheapest, risk of interruption)
# SPOT_WITH_FALLBACK - Try spot, fall back to on-demand
# ON_DEMAND        - Only on-demand (most reliable)

Cluster Lifecycle

Understanding cluster states and transitions is essential for managing costs and debugging issues. Every cluster moves through a well-defined set of states.

Cluster Lifecycle States
  • PENDING → RUNNING once VMs are ready
  • RUNNING ↔ RESIZING during autoscale events
  • RUNNING → TERMINATING → TERMINATED on stop or idle timeout
  • TERMINATED → RESTARTING → RUNNING on a restart command
  • PENDING → ERROR on provisioning failure

Auto-termination: a cluster moves to TERMINATING after being idle for the configured number of minutes (default: 120).

Managing Cluster Lifecycle

Python - Cluster Lifecycle Management
# Start a terminated cluster
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/start",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Terminate a running cluster (graceful shutdown)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Restart a cluster (useful after config changes)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/restart",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Poll cluster state until running
import time

def wait_for_cluster(cluster_id, target="RUNNING", timeout=600):
    start = time.time()
    while time.time() - start < timeout:
        info = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/clusters/get",
            headers=headers,
            params={"cluster_id": cluster_id}
        ).json()
        state = info["state"]
        print(f"State: {state}")
        if state == target:
            return info
        if state in ("TERMINATED", "ERROR"):
            raise RuntimeError(f"Cluster entered {state}")
        time.sleep(15)
    raise TimeoutError(f"Cluster not {target} after {timeout}s")

Cluster Policies

Cluster policies let admins control what users can configure, enforcing cost guardrails and standardization across teams.

JSON - Cluster Policy Example
{
    "name": "Cost-Controlled Development",
    "definition": {
        "autotermination_minutes": {
            "type": "range",
            "maxValue": 60,
            "defaultValue": 30
        },
        "num_workers": {
            "type": "range",
            "maxValue": 8,
            "defaultValue": 2
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
        }
    }
}
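When creating this policy programmatically, note that the Cluster Policies API expects the definition as a serialized JSON string rather than a nested object. A sketch of building the request payload for POST /api/2.0/policies/clusters/create (sending the request is left out):

```python
import json

# The same policy as above, expressed as a Python dict.
definition = {
    "autotermination_minutes": {"type": "range", "maxValue": 60,
                                "defaultValue": 30},
    "num_workers": {"type": "range", "maxValue": 8, "defaultValue": 2},
    "node_type_id": {"type": "allowlist",
                     "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
}

payload = {
    "name": "Cost-Controlled Development",
    "definition": json.dumps(definition),  # serialized to a JSON string
}
print(payload["name"])
```

POST this payload to {DATABRICKS_HOST}/api/2.0/policies/clusters/create with the same headers as earlier examples; the response contains a policy_id that users reference when creating clusters.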

Practice Problems

Problem 1: Cluster Type Selection

Easy

Your team has three workloads: (1) data scientists running ad-hoc queries throughout the day, (2) a nightly ETL pipeline that processes 500GB of data, and (3) a weekly ML training job. What cluster configuration would you recommend for each?

Problem 2: Cost Optimization

Easy

Your Databricks bill is $15,000/month. All-purpose clusters run 24/7 and account for 70% of the cost. What changes would you make to reduce the bill by at least 40%?

Problem 3: Cluster Policy Design

Medium

Write a cluster policy that limits max workers to 10, requires auto-termination between 15-60 minutes, restricts to only two node types, and forces Photon runtime. How would you apply this to your data engineering team?

Quick Reference

Concept | Description | Key Point
All-Purpose Cluster | Interactive, multi-user compute | Higher DBU rate, persists until terminated
Job Cluster | Ephemeral, single-job compute | Lower DBU rate, auto-terminates after job
Autoscaling | Dynamic worker count adjustment | Set min/max workers for cost and performance
Spot Instances | Discounted cloud compute (60-90% off) | Use for workers; keep driver on-demand
Cluster Policies | Admin-defined configuration guardrails | Enforce cost limits and standardization
Auto-Termination | Automatic shutdown after idle period | Set to 30 min for dev; disable (0) for streaming
Photon | Native C++ query engine | Up to 12x faster for SQL/DataFrame workloads

Useful Resources