# Git Repos Integration

## Why CI/CD for Databricks?

**The Problem:** Without CI/CD, teams copy notebooks between environments manually, lack version control, cannot test changes before production, and have no audit trail of what changed and when.

**The Solution:** Databricks integrates natively with Git providers and offers Databricks Asset Bundles (DABs) for defining infrastructure as code. Combined with GitHub Actions and Terraform, you get a fully automated deployment pipeline.

**Real Impact:** Mature data teams deploy to production multiple times per day with confidence, using the same CI/CD practices that software engineering teams have relied on for decades.
## Databricks Repos

Databricks Repos provides native Git integration directly in the workspace. You can clone repositories, create branches, commit changes, and push -- all from the Databricks UI or CLI.
```bash
# Clone a repository into your workspace
databricks repos create \
  --url https://github.com/myorg/data-pipelines.git \
  --provider gitHub \
  --path /Repos/production/data-pipelines

# Update a repo to the latest commit on a branch
databricks repos update \
  --repo-id 12345 \
  --branch main

# List all repos in the workspace
databricks repos list

# Check repo status
databricks repos get --repo-id 12345
```
### Repo Structure Best Practices

- Keep notebooks, Python modules, and config files in a single repo
- Use `/Repos/production/` for the main branch (read-only for most users)
- Use `/Repos/<username>/` for personal development branches
- Store shared libraries as Python packages installable via `%pip install`
- Never store secrets or credentials in Git -- use Databricks Secrets
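The last rule is worth a sketch. At runtime, code reads credentials from a secret scope rather than from anything checked into Git. The helper below is a hypothetical pattern (the scope, key, and environment-variable names are illustrative): it uses Databricks Secrets inside a workspace and falls back to environment variables so the same module can also run locally.

```python
import os


def get_secret(scope: str, key: str) -> str:
    """Read a credential from Databricks Secrets when running in a workspace,
    or from an environment variable (e.g. APP_SECRETS_EXTERNAL_API_KEY) when
    running locally. Scope and key names here are illustrative."""
    if os.environ.get("DATABRICKS_RUNTIME_VERSION"):
        # dbutils is only importable inside a Databricks runtime
        from databricks.sdk.runtime import dbutils
        return dbutils.secrets.get(scope=scope, key=key)
    # Local fallback: app-secrets/external-api-key -> APP_SECRETS_EXTERNAL_API_KEY
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    return os.environ[env_name]
```

Because the fallback path never touches Databricks APIs, unit tests can exercise code that depends on secrets without a workspace connection.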
## Databricks Asset Bundles (DABs)

Databricks Asset Bundles are the recommended way to define, test, and deploy Databricks resources as code. A bundle is a collection of configuration files that describe your jobs, pipelines, notebooks, and infrastructure.
```yaml
# databricks.yml -- root bundle config
bundle:
  name: ecommerce-etl-pipeline

include:
  - resources/*.yml

workspace:
  host: https://myworkspace.cloud.databricks.com

# Variables must be declared at the bundle root before targets can override them.
variables:
  catalog:
    default: dev_catalog  # illustrative default for the dev target
  warehouse_size:
    default: Small

targets:
  dev:
    mode: development
    default: true
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    mode: production
    workspace:
      host: https://staging-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/staging
    variables:
      catalog: staging_catalog
      warehouse_size: Small

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/prod
    variables:
      catalog: prod_catalog
      warehouse_size: Large
    run_as:
      service_principal_name: prod-deployer-sp
```

```yaml
# Job definition, picked up through the resources/*.yml include above
resources:
  jobs:
    ecommerce_etl:
      name: "E-Commerce ETL Pipeline - ${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            num_workers: 4
            node_type_id: "i3.xlarge"
      tasks:
        - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/ingest_raw.py
            base_parameters:
              catalog: "${var.catalog}"
          job_cluster_key: etl_cluster
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/transform_silver.py
          job_cluster_key: etl_cluster
        - task_key: build_gold
          depends_on:
            - task_key: transform_silver
          notebook_task:
            notebook_path: ../src/build_gold.py
          job_cluster_key: etl_cluster
```
```bash
# Validate the bundle configuration
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to staging
databricks bundle deploy --target staging

# Run a specific job in the bundle
databricks bundle run ecommerce_etl --target staging

# Destroy resources (remove deployed jobs/pipelines)
databricks bundle destroy --target dev
```
## Testing Strategies

Testing Databricks code requires a layered approach. Pure transformation logic can be unit tested locally, while integration tests run against a live workspace.
```python
# tests/test_transforms.py
import pytest
from pyspark.sql import SparkSession

from src.transforms import clean_orders, calculate_rfm_features


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_clean_orders_removes_nulls(spark):
    """Test that clean_orders drops rows with null order_id."""
    data = [
        (1, "ORD-001", 99.99),
        (2, None, 49.99),
        (3, "ORD-003", 149.99),
    ]
    df = spark.createDataFrame(data, ["id", "order_id", "amount"])
    result = clean_orders(df)
    assert result.count() == 2
    assert result.filter("order_id IS NULL").count() == 0


def test_rfm_features_calculation(spark):
    """Test RFM feature computation logic."""
    orders = spark.createDataFrame([
        ("C1", "2024-03-01", 100.0),
        ("C1", "2024-03-15", 200.0),
        ("C2", "2024-02-01", 50.0),
    ], ["customer_id", "order_date", "amount"])
    result = calculate_rfm_features(orders).collect()
    c1 = [r for r in result if r.customer_id == "C1"][0]
    assert c1.total_orders == 2
    assert c1.avg_order_value == 150.0
```
```python
# tests/integration/test_pipeline.py
from databricks.connect import DatabricksSession


def test_silver_table_freshness():
    """Verify the silver table has data (latest processed_date is populated)."""
    spark = DatabricksSession.builder.getOrCreate()
    result = spark.sql("""
        SELECT MAX(processed_date) AS latest_date
        FROM staging_catalog.silver.orders
    """).collect()
    latest = result[0].latest_date
    assert latest is not None, "Silver table has no data"


def test_gold_table_row_count():
    """Gold table should have reasonable row counts."""
    spark = DatabricksSession.builder.getOrCreate()
    count = spark.table("staging_catalog.gold.daily_revenue").count()
    assert count > 0, "Gold table is empty"
    assert count < 1_000_000, f"Unexpected row count: {count}"
```
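To keep the layers separable, `pytest tests/unit/` should never require workspace credentials. One way to enforce that (a sketch, assuming integration tests live under `tests/integration/` and a workspace is signalled by `DATABRICKS_HOST`) is a `conftest.py` hook that auto-skips the integration directory when no workspace is configured:

```python
# conftest.py -- skip integration tests when no workspace is configured.
# Sketch only: assumes integration tests live under tests/integration/.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.environ.get("DATABRICKS_HOST"):
        return  # workspace configured: run everything, including integration
    skip = pytest.mark.skip(reason="DATABRICKS_HOST not set")
    for item in items:
        if "integration" in str(item.fspath):
            item.add_marker(skip)
```

With this in place, CI can run the full suite against staging simply by exporting the workspace credentials, while local runs stay fast and offline.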
## GitHub Actions Pipeline

GitHub Actions automates the entire CI/CD flow. On pull request, it runs tests and validates the bundle; on merge to main, it deploys to staging and then production.
```yaml
name: Databricks CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

permissions:
  contents: read
  id-token: write

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The legacy databricks-cli pip package does not support bundles;
      # use the official setup action to install the new Databricks CLI.
      - uses: databricks/setup-cli@main
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Lint and format check
        run: |
          ruff check src/ tests/
          black --check src/ tests/
      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short
      - name: Validate bundle
        run: databricks bundle validate
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

  deploy-staging:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to staging
        run: databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}
      - name: Run integration tests
        run: databricks bundle run integration_test_job --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.PROD_TOKEN }}
```
## Terraform for Infrastructure

While DABs handle job and pipeline definitions, Terraform manages the underlying workspace infrastructure: clusters, cluster policies, instance pools, secrets, and Unity Catalog resources.
```hcl
# main.tf
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Cluster policy for cost control
resource "databricks_cluster_policy" "etl_policy" {
  name = "ETL Cluster Policy"
  definition = jsonencode({
    "spark_version" : { "type" : "fixed", "value" : "14.3.x-scala2.12" },
    "num_workers" : { "type" : "range", "maxValue" : 10 },
    "autotermination_minutes" : { "type" : "fixed", "value" : 30 },
    "custom_tags.Environment" : { "type" : "fixed", "value" : var.environment }
  })
}

# Instance pool for faster cluster startup
resource "databricks_instance_pool" "etl_pool" {
  instance_pool_name                    = "etl-instance-pool-${var.environment}"
  min_idle_instances                    = 0
  max_capacity                          = 20
  node_type_id                          = "i3.xlarge"
  idle_instance_autotermination_minutes = 10
}

# Secret scope for credentials
resource "databricks_secret_scope" "app_secrets" {
  name = "app-secrets-${var.environment}"
}

resource "databricks_secret" "api_key" {
  scope        = databricks_secret_scope.app_secrets.name
  key          = "external-api-key"
  string_value = var.external_api_key
}
```
## Environment Promotion Strategy

A mature Databricks deployment uses separate workspaces (or at minimum separate catalogs) for dev, staging, and production, with code flowing through each environment via the CI/CD pipeline.
| Environment | Purpose | Data | Access |
|---|---|---|---|
| Dev | Feature development, experimentation | Sample data or dev catalog | All developers |
| Staging | Integration testing, validation | Production-like data (anonymized) | CI/CD pipeline + leads |
| Production | Live workloads serving the business | Real production data | Service principals only |
### Promotion Rules

- **Dev to Staging:** Automatic on merge to main (all tests pass)
- **Staging to Production:** Requires manual approval + all staging tests green
- **Rollback:** Redeploy the previous bundle version (Git tag-based)
- **Hotfix:** Branch from the production tag, deploy directly after emergency review
- **No human access to prod:** All production deployments go through service principals
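Tag-based rollback amounts to "redeploy the bundle from the release tag before the current one." A minimal sketch of selecting that tag, assuming a `vMAJOR.MINOR.PATCH` tagging convention (the helper and tag names are hypothetical):

```python
def previous_release(tags: list[str], current: str) -> str:
    """Return the release tag immediately before `current`, assuming
    tags follow a vMAJOR.MINOR.PATCH convention."""
    def version_key(tag: str) -> tuple[int, ...]:
        return tuple(int(part) for part in tag.lstrip("v").split("."))

    ordered = sorted(tags, key=version_key)
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError("no earlier release to roll back to")
    return ordered[idx - 1]
```

A CI rollback job would then check out the returned tag and run `databricks bundle deploy --target prod` from that commit. Note the numeric sort: plain string ordering would put `v1.10.0` before `v1.2.0`.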
## Practice Problems

### Problem 1: Design a CI/CD Pipeline (Medium)

Your team of 8 data engineers works on a shared repository with 15 ETL jobs and 3 DLT pipelines. Design a CI/CD pipeline that prevents broken code from reaching production while keeping deployment velocity high (target: deploy to production within 30 minutes of PR merge).

### Problem 2: Testing a Streaming Pipeline (Hard)

You have a Structured Streaming job that reads from Kafka, transforms events, and writes to Delta. How do you write automated tests for this? Consider unit tests for transform logic, integration tests for the full pipeline, and how to handle the streaming nature of the data.

### Problem 3: Multi-Workspace Governance (Hard)

Your organization has 3 Databricks workspaces (dev, staging, prod) across AWS. Design the Terraform configuration and CI/CD pipeline that manages Unity Catalog, cluster policies, and secret scopes consistently across all environments. How do you handle environment-specific differences?
## Quick Reference

| Tool | Purpose | Key Command / Config |
|---|---|---|
| Databricks Repos | Git integration in workspace | `databricks repos create/update` |
| Asset Bundles | Infrastructure-as-code for jobs/pipelines | `databricks bundle deploy --target prod` |
| pytest + PySpark | Unit testing transforms | Local SparkSession with `master("local[2]")` |
| Databricks Connect | Integration testing against live workspace | `DatabricksSession.builder.getOrCreate()` |
| GitHub Actions | CI/CD automation | Workflows triggered on PR and push |
| Terraform | Workspace infrastructure management | `terraform apply -var-file=prod.tfvars` |
| Service Principals | Non-human identity for prod deploys | OAuth or PAT-based authentication |