CI/CD & DevOps for Databricks

Hard · 30 min read

Why CI/CD for Databricks?

The Problem: Without CI/CD, teams copy notebooks between environments manually, lack version control, cannot test changes before production, and have no audit trail of what changed and when.

The Solution: Databricks integrates natively with Git providers and offers Databricks Asset Bundles (DABs) for defining infrastructure-as-code. Combined with GitHub Actions and Terraform, you get a fully automated deployment pipeline.

Real Impact: Mature data teams deploy to production multiple times per day with confidence, using the same CI/CD practices that software engineering teams have relied on for decades.

Databricks Repos

Databricks Repos provides native Git integration directly in the workspace. You can clone repositories, create branches, commit changes, and push -- all from the Databricks UI or CLI.

Databricks CLI - Working with Git Repos
# Clone a repository into your workspace
databricks repos create \
  --url https://github.com/myorg/data-pipelines.git \
  --provider gitHub \
  --path /Repos/production/data-pipelines

# Update a repo to the latest commit on a branch
databricks repos update \
  --repo-id 12345 \
  --branch main

# List all repos in the workspace
databricks repos list

# Check repo status
databricks repos get --repo-id 12345

Repo Structure Best Practices

  • Keep notebooks, Python modules, and config files in a single repo
  • Use /Repos/production/ for the main branch (read-only for most users)
  • Use /Repos/<username>/ for personal development branches
  • Store shared libraries as Python packages installable via %pip install
  • Never store secrets or credentials in Git -- use Databricks Secrets

Databricks Asset Bundles (DABs)

Databricks Asset Bundles are the recommended way to define, test, and deploy Databricks resources as code. A bundle is a collection of configuration files that describe your jobs, pipelines, notebooks, and infrastructure.

YAML - databricks.yml Bundle Configuration
# databricks.yml -- root bundle config
bundle:
  name: ecommerce-etl-pipeline

include:
  - resources/*.yml

# Variables referenced as ${var.*} in resource files must be declared
# at the root; targets below override these defaults per environment
variables:
  catalog:
    description: Unity Catalog name for this target
    default: dev_catalog
  warehouse_size:
    description: SQL warehouse size
    default: Small

workspace:
  host: https://myworkspace.cloud.databricks.com

targets:
  dev:
    mode: development
    default: true
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    mode: production
    workspace:
      host: https://staging-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/staging
    variables:
      catalog: staging_catalog
      warehouse_size: Small

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/prod
    variables:
      catalog: prod_catalog
      warehouse_size: Large
    run_as:
      service_principal_name: prod-deployer-sp
YAML - resources/etl_job.yml
resources:
  jobs:
    ecommerce_etl:
      name: "E-Commerce ETL Pipeline - ${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            num_workers: 4
            node_type_id: "i3.xlarge"
      tasks:
        - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/ingest_raw.py
            base_parameters:
              catalog: "${var.catalog}"
          job_cluster_key: etl_cluster
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/transform_silver.py
          job_cluster_key: etl_cluster
        - task_key: build_gold
          depends_on:
            - task_key: transform_silver
          notebook_task:
            notebook_path: ../src/build_gold.py
          job_cluster_key: etl_cluster
Databricks CLI - Bundle Commands
# Validate the bundle configuration
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to staging
databricks bundle deploy --target staging

# Run a specific job in the bundle
databricks bundle run ecommerce_etl --target staging

# Destroy resources (remove deployed jobs/pipelines)
databricks bundle destroy --target dev

CI/CD Pipeline Stages for Databricks

  • DEVELOP (git push): Git commit/push, PR created, code review
  • BUILD & TEST (PR trigger): lint & format, unit tests, bundle validate
  • STAGING (merge to main): integration tests, bundle deploy, run test jobs
  • VALIDATE (auto + manual): data validation, quality gates, manual approval, performance check
  • PRODUCTION (approved release): bundle deploy, smoke tests, monitor alerts

Testing Strategies

Testing Databricks code requires a layered approach. Pure logic can be unit tested locally, while integration tests run against a live workspace.

Python - Unit Testing PySpark Transforms (pytest)
# tests/test_transforms.py
import pytest
from pyspark.sql import SparkSession
from src.transforms import clean_orders, calculate_rfm_features

@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )

def test_clean_orders_removes_nulls(spark):
    """Test that clean_orders drops rows with null order_id."""
    data = [
        (1, "ORD-001", 99.99),
        (2, None, 49.99),
        (3, "ORD-003", 149.99),
    ]
    df = spark.createDataFrame(data, ["id", "order_id", "amount"])

    result = clean_orders(df)

    assert result.count() == 2
    assert result.filter("order_id IS NULL").count() == 0

def test_rfm_features_calculation(spark):
    """Test RFM feature computation logic."""
    orders = spark.createDataFrame([
        ("C1", "2024-03-01", 100.0),
        ("C1", "2024-03-15", 200.0),
        ("C2", "2024-02-01", 50.0),
    ], ["customer_id", "order_date", "amount"])

    result = calculate_rfm_features(orders).collect()

    c1 = [r for r in result if r.customer_id == "C1"][0]
    assert c1.total_orders == 2
    assert c1.avg_order_value == 150.0
Python - Integration Test Using Databricks Connect
# tests/integration/test_pipeline.py
from databricks.connect import DatabricksSession

def test_silver_table_freshness():
    """Verify silver table has data from today."""
    spark = DatabricksSession.builder.getOrCreate()

    result = spark.sql("""
        SELECT MAX(processed_date) as latest_date
        FROM staging_catalog.silver.orders
    """).collect()

    latest = result[0].latest_date
    assert latest is not None, "Silver table has no data"

def test_gold_table_row_count():
    """Gold table should have reasonable row counts."""
    spark = DatabricksSession.builder.getOrCreate()

    count = spark.table("staging_catalog.gold.daily_revenue").count()
    assert count > 0, "Gold table is empty"
    assert count < 1_000_000, f"Unexpected row count: {count}"
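Because these integration tests need a live workspace, teams commonly gate them behind an environment flag so that a plain `pytest tests/` still works on a laptop. One way to do this (a common pytest pattern, not something this article prescribes; the RUN_INTEGRATION flag name is an example):

```python
# tests/conftest.py -- skip integration tests unless explicitly enabled.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    # When RUN_INTEGRATION=1 (e.g. set by the CI staging job), run everything
    if os.environ.get("RUN_INTEGRATION") == "1":
        return
    skip = pytest.mark.skip(reason="integration tests need a live workspace")
    for item in items:
        # Anything collected from tests/integration/ is treated as integration
        if "integration" in str(item.fspath):
            item.add_marker(skip)
```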

GitHub Actions Pipeline

GitHub Actions automates the entire CI/CD flow. On pull request, it runs tests and validates bundles. On merge to main, it deploys to staging and production.

YAML - .github/workflows/databricks-cicd.yml
name: Databricks CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

permissions:
  contents: read
  id-token: write

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      # Bundle commands need the new Databricks CLI, not the legacy
      # pip-installed databricks-cli package
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main
      - name: Lint and format check
        run: |
          ruff check src/ tests/
          black --check src/ tests/
      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short
      - name: Validate bundle
        run: databricks bundle validate
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

  deploy-staging:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to staging
        run: databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}
      - name: Run integration tests
        run: databricks bundle run integration_test_job --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.PROD_TOKEN }}

Terraform for Infrastructure

While DABs handle job and pipeline definitions, Terraform manages the underlying workspace infrastructure: clusters, cluster policies, instance pools, secrets, and Unity Catalog resources.

HCL - Terraform Databricks Workspace Configuration
# main.tf
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Cluster policy for cost control
resource "databricks_cluster_policy" "etl_policy" {
  name = "ETL Cluster Policy"
  definition = jsonencode({
    "spark_version" : { "type" : "fixed", "value" : "14.3.x-scala2.12" },
    "num_workers" : { "type" : "range", "maxValue" : 10 },
    "autotermination_minutes" : { "type" : "fixed", "value" : 30 },
    "custom_tags.Environment" : { "type" : "fixed", "value" : var.environment }
  })
}

# Instance pool for faster cluster startup
resource "databricks_instance_pool" "etl_pool" {
  instance_pool_name = "etl-instance-pool-${var.environment}"
  min_idle_instances = 0
  max_capacity       = 20
  node_type_id       = "i3.xlarge"
  idle_instance_autotermination_minutes = 10
}

# Secret scope for credentials
resource "databricks_secret_scope" "app_secrets" {
  name = "app-secrets-${var.environment}"
}

resource "databricks_secret" "api_key" {
  scope        = databricks_secret_scope.app_secrets.name
  key          = "external-api-key"
  string_value = var.external_api_key
}

Environment Promotion Strategy

A mature Databricks deployment uses separate workspaces (or at minimum separate catalogs) for dev, staging, and production, with code flowing through each environment via the CI/CD pipeline.

Environment   Purpose                                Data                                Access
Dev           Feature development, experimentation   Sample data or dev catalog          All developers
Staging       Integration testing, validation        Production-like data (anonymized)   CI/CD pipeline + leads
Production    Live workloads serving the business    Real production data                Service principals only

Promotion Rules

  • Dev to Staging: Automatic on merge to main (all tests pass)
  • Staging to Production: Requires manual approval + all staging tests green
  • Rollback: Redeploy previous bundle version (Git tag-based)
  • Hotfix: Branch from production tag, deploy directly after emergency review
  • No human access to prod: All production deployments via service principals
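
The Git tag-based rollback above can be scripted. A hypothetical helper (the function name and tag scheme are examples) that checks out a previous release tag and redeploys the bundle as it existed at that tag, assuming git and the new Databricks CLI are on PATH and already authenticated:

```python
# Hypothetical rollback script: redeploy the bundle at a previous
# release tag. Assumes git and the Databricks CLI are installed and
# the CLI is authenticated for the target workspace.
import subprocess

def rollback(tag: str, target: str = "prod") -> None:
    # Check out the last known-good release tag...
    subprocess.run(["git", "checkout", tag], check=True)
    # ...and redeploy exactly what that tag contained
    subprocess.run(["databricks", "bundle", "deploy", "--target", target],
                   check=True)
```

Because every production deploy goes through the bundle, rolling back is just another deploy, not a special recovery procedure.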

Practice Problems

Problem 1: Design a CI/CD Pipeline

Medium

Your team of 8 data engineers works on a shared repository with 15 ETL jobs and 3 DLT pipelines. Design a CI/CD pipeline that prevents broken code from reaching production while keeping deployment velocity high (target: deploy to production within 30 minutes of PR merge).

Problem 2: Testing a Streaming Pipeline

Hard

You have a Structured Streaming job that reads from Kafka, transforms events, and writes to Delta. How do you write automated tests for this? Consider unit tests for transform logic, integration tests for the full pipeline, and how to handle the streaming nature of the data.

Problem 3: Multi-Workspace Governance

Hard

Your organization has 3 Databricks workspaces (dev, staging, prod) across AWS. Design the Terraform configuration and CI/CD pipeline that manages Unity Catalog, cluster policies, and secret scopes consistently across all environments. How do you handle environment-specific differences?

Quick Reference

Tool Purpose Key Command / Config
Databricks ReposGit integration in workspacedatabricks repos create/update
Asset BundlesInfrastructure-as-code for jobs/pipelinesdatabricks bundle deploy --target prod
pytest + PySparkUnit testing transformsLocal SparkSession with master("local[2]")
Databricks ConnectIntegration testing against live workspaceDatabricksSession.builder.getOrCreate()
GitHub ActionsCI/CD automationWorkflows triggered on PR and push
TerraformWorkspace infrastructure managementterraform apply -var-file=prod.tfvars
Service PrincipalsNon-human identity for prod deploysOAuth or PAT-based authentication

Useful Resources