Clusters & Compute Management

Easy -- 25 min read

All-Purpose vs Job Clusters

Why Cluster Types Matter

The Problem: Running all workloads on a single cluster type wastes money and creates resource contention between interactive development and production jobs.

The Solution: Databricks provides two cluster types optimized for different use cases -- all-purpose clusters for interactive work and job clusters for automated production workloads.

Cost Impact: Using job clusters instead of all-purpose clusters for production workloads can reduce compute costs by 30-60% through automatic termination and right-sizing.
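To make the savings concrete, here is a back-of-the-envelope comparison. The DBU rates, burn rate, and hours below are illustrative assumptions for the arithmetic, not actual Databricks pricing:

```python
# Illustrative comparison: the same 60 hours of monthly job runtime
# billed at an assumed all-purpose rate vs an assumed jobs rate.
ALL_PURPOSE_RATE = 0.40   # assumed $/DBU, all-purpose compute
JOBS_RATE = 0.20          # assumed $/DBU, jobs compute (2x cheaper)
DBU_PER_HOUR = 4          # assumed DBU burn rate of the cluster
MONTHLY_RUN_HOURS = 60    # time the workload actually needs

all_purpose_cost = MONTHLY_RUN_HOURS * DBU_PER_HOUR * ALL_PURPOSE_RATE
job_cost = MONTHLY_RUN_HOURS * DBU_PER_HOUR * JOBS_RATE
savings = 1 - job_cost / all_purpose_cost

print(f"All-purpose: ${all_purpose_cost:.2f}/month")
print(f"Job cluster: ${job_cost:.2f}/month ({savings:.0%} savings)")
```

The rate difference alone yields 50% here; an all-purpose cluster left running between jobs also bills for idle hours, which is where the upper end of the savings range comes from.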

Real-World Analogy

Think of clusters like vehicles:

  • All-Purpose Cluster = Your personal car -- always available, you control when it starts and stops, shared with passengers (collaborators)
  • Job Cluster = A rental car -- created for a specific trip (job), returned (terminated) automatically when the trip is done
Feature | All-Purpose Cluster | Job Cluster
Use Case | Interactive development, ad-hoc analysis | Scheduled production jobs
Creation | Manually via UI/API | Automatically by job scheduler
Lifecycle | Persists until manually terminated | Created at job start, terminated at job end
Multi-User | Multiple users can attach notebooks | Dedicated to a single job run
Cost | Higher (runs even when idle) | Lower (pay only for job duration)
DBU Rate | All-Purpose Compute rate | Jobs Compute rate (typically 2-3x cheaper)
Python - Creating Clusters via API
import requests
import json

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create an All-Purpose Cluster
all_purpose_config = {
    "cluster_name": "dev-interactive",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "autotermination_minutes": 30  # shut down after 30 idle minutes
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers=headers,
    json=all_purpose_config
)
response.raise_for_status()  # fail fast on auth or validation errors
cluster_id = response.json()["cluster_id"]
print(f"Created cluster: {cluster_id}")

# Job clusters are defined within job configurations
job_cluster_config = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    }
}
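The new_cluster block above only takes effect when embedded in a job or run definition. A minimal sketch of building a one-time run payload for POST /api/2.1/jobs/runs/submit; the run name, task key, and notebook path are hypothetical, and sending the request is left out:

```python
import json

def build_run_payload(notebook_path: str, num_workers: int = 4) -> dict:
    """Build a one-time run whose job cluster exists only for the run."""
    return {
        "run_name": "nightly-etl",  # hypothetical run name
        "tasks": [{
            "task_key": "etl",
            "notebook_task": {"notebook_path": notebook_path},
            # The cluster below is created when the run starts and
            # terminated automatically when the run finishes.
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": num_workers,
            },
        }],
    }

payload = build_run_payload("/Repos/etl/main")  # hypothetical path
print(json.dumps(payload, indent=2))
```

POST this payload to {DATABRICKS_HOST}/api/2.1/jobs/runs/submit with the same headers as above; the response contains a run_id you can poll.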

Cluster Configuration

Configuring a cluster correctly is critical for performance and cost optimization. Here are the key parameters you need to understand.

Databricks Runtime Versions

Every cluster runs a Databricks Runtime (DBR) which bundles Apache Spark, Delta Lake, and other libraries. Choose the right runtime for your workload:

Standard Runtime

General-purpose runtime with Spark, Delta Lake, and common libraries. Best for data engineering and general analytics workloads.

ML Runtime

Includes popular ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) pre-installed. Saves time setting up ML environments.

Photon Runtime

Native C++ query engine that accelerates SQL and DataFrame operations by up to 12x. Ideal for SQL-heavy and ETL workloads.

GPU Runtime

Includes GPU drivers and CUDA toolkit for deep learning training and inference. Required for GPU-accelerated clusters.
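Runtime variants can be recognized by their version keys (for example, 14.3.x-ml-scala2.12), which is how you would group the results of GET /api/2.0/clusters/spark-versions. A small sketch; the sample keys below are illustrative, not a live API response:

```python
def classify_runtime(key: str) -> str:
    """Classify a Databricks Runtime version key by its name markers."""
    if "photon" in key:
        return "Photon"
    if "gpu" in key:      # check before "ml": GPU keys contain "gpu-ml"
        return "GPU"
    if "ml" in key:
        return "ML"
    return "Standard"

sample_keys = [          # illustrative keys, not a real API response
    "14.3.x-scala2.12",
    "14.3.x-ml-scala2.12",
    "14.3.x-gpu-ml-scala2.12",
    "14.3.x-photon-scala2.12",
]
for key in sample_keys:
    print(f"{key}: {classify_runtime(key)}")
```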

Node Types and Sizing

Python - List Available Node Types
# List available node types for your cloud
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list-node-types",
    headers=headers
)

node_types = response.json()["node_types"]
for nt in node_types[:5]:
    print(f"{nt['node_type_id']}: {nt['memory_mb']}MB, {nt['num_cores']} cores")

# Common sizing guidelines:
# Memory-optimized: Large shuffles, caching, ML training
# Compute-optimized: CPU-intensive transformations
# Storage-optimized: Large Delta Lake reads/writes
# GPU-accelerated: Deep learning, NLP models

Spark Configuration

Python - Common Spark Configurations
# Key Spark configurations for cluster setup
spark_conf = {
    # Adaptive Query Execution (on by default since Spark 3.2 / recent DBRs)
    "spark.sql.adaptive.enabled": "true",

    # Dynamic partition pruning
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",

    # Shuffle partitions -- tune based on data size
    "spark.sql.shuffle.partitions": "auto",

    # Delta Lake optimizations
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}

# Set configs at runtime in a notebook
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Check current configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))

Autoscaling

Autoscaling dynamically adjusts the number of worker nodes based on workload demand. This helps optimize cost while maintaining performance during peak loads.

How Autoscaling Works

Databricks monitors cluster utilization and adjusts the number of workers between your configured min and max values:

  • Scale Up: When pending tasks exceed available slots, new nodes are added within 2-5 minutes
  • Scale Down: When nodes are idle for a configurable period, they are removed gracefully
  • Optimized Autoscaling: a Databricks-enhanced version that reacts to load faster and scales down more aggressively than standard Spark dynamic allocation
Python - Autoscaling Configuration
# Create cluster with autoscaling enabled
autoscaling_cluster = {
    "cluster_name": "autoscaling-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10
    },
    "autotermination_minutes": 30
}

# Monitor autoscaling activity
cluster_info = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/get",
    headers=headers,
    params={"cluster_id": cluster_id}
).json()

print(f"Current workers: {cluster_info.get('num_workers', 'N/A')}")
print(f"State: {cluster_info['state']}")

Autoscaling Best Practices

Scenario | Min Workers | Max Workers | Rationale
Interactive Development | 1 | 4 | Keep min low to save cost during idle time
ETL Pipeline | 2 | 10 | Ensure baseline capacity, scale for large loads
ML Training | 4 | 4 | Fixed size -- ML often needs consistent resources
Streaming | 2 | 8 | Handle traffic spikes while maintaining low latency

Spot vs On-Demand Instances

Cloud providers sell unused compute capacity at steep discounts (often 60-90% off): spot instances on AWS, spot VMs on Azure, and Spot (formerly preemptible) VMs on GCP. The trade-off is that this capacity can be reclaimed with little notice.

Cost Savings Strategy

A common production pattern: use on-demand instances for the driver node (critical -- losing the driver kills the job) and spot instances for worker nodes (Spark can recover from worker loss through task retry).

Python - Spot Instance Configuration
# Cluster with spot instances for workers
spot_cluster = {
    "cluster_name": "cost-optimized-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,

    # AWS: use spot instances
    "aws_attributes": {
        "first_on_demand": 1,  # Driver on-demand
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
        "zone_id": "auto"
    },

    # Azure: use spot VMs
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1  # -1 = pay up to the on-demand price
    }
}

# Availability options:
# SPOT             - Only spot (cheapest, risk of interruption)
# SPOT_WITH_FALLBACK - Try spot, fall back to on-demand
# ON_DEMAND        - Only on-demand (most reliable)

Cluster Lifecycle

Understanding cluster states and transitions is essential for managing costs and debugging issues. Every cluster moves through a well-defined set of states.

Cluster Lifecycle States
  • PENDING → RUNNING once VMs are ready
  • RUNNING ↔ RESIZING during autoscale events
  • RUNNING → TERMINATING → TERMINATED on stop or idle timeout
  • TERMINATED → RESTARTING → RUNNING on a restart command
  • PENDING → ERROR on provisioning failure

Auto-termination: a cluster moves to TERMINATING after being idle for the configured number of minutes (default: 120).

Managing Cluster Lifecycle

Python - Cluster Lifecycle Management
# Start a terminated cluster
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/start",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Terminate a running cluster (graceful shutdown)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Restart a cluster (useful after config changes)
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/restart",
    headers=headers,
    json={"cluster_id": cluster_id}
)

# Poll cluster state until running
import time

def wait_for_cluster(cluster_id, target="RUNNING", timeout=600):
    start = time.time()
    while time.time() - start < timeout:
        info = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/clusters/get",
            headers=headers,
            params={"cluster_id": cluster_id}
        ).json()
        state = info["state"]
        print(f"State: {state}")
        if state == target:
            return info
        if state in ("TERMINATED", "ERROR"):
            raise RuntimeError(f"Cluster entered {state}")
        time.sleep(15)
    raise TimeoutError(f"Cluster not {target} after {timeout}s")

Cluster Policies

Cluster policies let admins control what users can configure, enforcing cost guardrails and standardization across teams.

JSON - Cluster Policy Example
{
    "name": "Cost-Controlled Development",
    "definition": {
        "autotermination_minutes": {
            "type": "range",
            "maxValue": 60,
            "defaultValue": 30
        },
        "num_workers": {
            "type": "range",
            "maxValue": 8,
            "defaultValue": 2
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
        }
    }
}
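When creating this policy programmatically, note that the Cluster Policies API expects the definition as a serialized JSON string rather than a nested object. A sketch of building the request payload for POST /api/2.0/policies/clusters/create (sending the request is left out):

```python
import json

# The same policy as above, expressed as a Python dict.
definition = {
    "autotermination_minutes": {"type": "range", "maxValue": 60,
                                "defaultValue": 30},
    "num_workers": {"type": "range", "maxValue": 8, "defaultValue": 2},
    "node_type_id": {"type": "allowlist",
                     "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
}

payload = {
    "name": "Cost-Controlled Development",
    "definition": json.dumps(definition),  # serialized to a JSON string
}
print(payload["name"])
```

POST this payload to {DATABRICKS_HOST}/api/2.0/policies/clusters/create with the same headers as earlier examples; the response contains a policy_id that users reference when creating clusters.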

Practice Problems

Problem 1: Cluster Type Selection

Easy

Your team has three workloads: (1) data scientists running ad-hoc queries throughout the day, (2) a nightly ETL pipeline that processes 500GB of data, and (3) a weekly ML training job. What cluster configuration would you recommend for each?

Problem 2: Cost Optimization

Easy

Your Databricks bill is $15,000/month. All-purpose clusters run 24/7 and account for 70% of the cost. What changes would you make to reduce the bill by at least 40%?

Problem 3: Cluster Policy Design

Medium

Write a cluster policy that limits max workers to 10, requires auto-termination between 15-60 minutes, restricts to only two node types, and forces Photon runtime. How would you apply this to your data engineering team?

Quick Reference

Concept | Description | Key Point
All-Purpose Cluster | Interactive, multi-user compute | Higher DBU rate, persists until terminated
Job Cluster | Ephemeral, single-job compute | Lower DBU rate, auto-terminates after job
Autoscaling | Dynamic worker count adjustment | Set min/max workers for cost and performance
Spot Instances | Discounted cloud compute (60-90% off) | Use for workers; keep driver on-demand
Cluster Policies | Admin-defined configuration guardrails | Enforce cost limits and standardization
Auto-Termination | Automatic shutdown after idle period | Set to 30 min for dev; disable (0) for streaming
Photon | Native C++ query engine | Up to 12x faster for SQL/DataFrame workloads

Useful Resources