# Git Repos Integration

## Why CI/CD for Databricks?

**The Problem:** Without CI/CD, teams copy notebooks between environments manually, lack version control, cannot test changes before production, and have no audit trail of what changed and when.

**The Solution:** Databricks integrates natively with Git providers and offers Databricks Asset Bundles (DABs) for defining infrastructure as code. Combined with GitHub Actions and Terraform, you get a fully automated deployment pipeline.

**Real Impact:** Mature data teams deploy to production multiple times per day with confidence, using the same CI/CD practices that software engineering teams have relied on for decades.
## Databricks Repos

Databricks Repos provides native Git integration directly in the workspace. You can clone repositories, create branches, commit changes, and push -- all from the Databricks UI or CLI.
```bash
# Clone a repository into your workspace
databricks repos create \
  --url https://github.com/myorg/data-pipelines.git \
  --provider gitHub \
  --path /Repos/production/data-pipelines

# Update a repo to the latest commit on a branch
databricks repos update \
  --repo-id 12345 \
  --branch main

# List all repos in the workspace
databricks repos list

# Check repo status
databricks repos get --repo-id 12345
```
### Repo Structure Best Practices

- Keep notebooks, Python modules, and config files in a single repo
- Use `/Repos/production/` for the main branch (read-only for most users)
- Use `/Repos/<username>/` for personal development branches
- Store shared libraries as Python packages installable via `%pip install`
- Never store secrets or credentials in Git -- use Databricks Secrets
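The last rule is worth a sketch. At runtime, code reads credentials from a secret scope rather than from anything checked into Git. The helper below is a hypothetical pattern (the scope, key, and environment-variable names are illustrative): it uses Databricks Secrets inside a workspace and falls back to environment variables so the same module can also run locally.

```python
import os


def get_secret(scope: str, key: str) -> str:
    """Read a credential from Databricks Secrets when running in a workspace,
    or from an environment variable (e.g. APP_SECRETS_EXTERNAL_API_KEY) when
    running locally. Scope and key names here are illustrative."""
    if os.environ.get("DATABRICKS_RUNTIME_VERSION"):
        # dbutils is only importable inside a Databricks runtime
        from databricks.sdk.runtime import dbutils
        return dbutils.secrets.get(scope=scope, key=key)
    # Local fallback: app-secrets/external-api-key -> APP_SECRETS_EXTERNAL_API_KEY
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    return os.environ[env_name]
```

Because the fallback path never touches Databricks APIs, unit tests can exercise code that depends on secrets without a workspace connection.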
## Databricks Asset Bundles (DABs)

Databricks Asset Bundles are the recommended way to define, test, and deploy Databricks resources as code. A bundle is a collection of configuration files that describe your jobs, pipelines, notebooks, and infrastructure.
```yaml
# databricks.yml -- root bundle config
bundle:
  name: ecommerce-etl-pipeline

include:
  - resources/*.yml

workspace:
  host: https://myworkspace.cloud.databricks.com

# Variables must be declared at the bundle root before targets can override them.
variables:
  catalog:
    default: dev_catalog  # illustrative default for the dev target
  warehouse_size:
    default: Small

targets:
  dev:
    mode: development
    default: true
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    mode: production
    workspace:
      host: https://staging-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/staging
    variables:
      catalog: staging_catalog
      warehouse_size: Small

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/prod
    variables:
      catalog: prod_catalog
      warehouse_size: Large
    run_as:
      service_principal_name: prod-deployer-sp
```

```yaml
# Job definition, picked up through the resources/*.yml include above
resources:
  jobs:
    ecommerce_etl:
      name: "E-Commerce ETL Pipeline - ${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            num_workers: 4
            node_type_id: "i3.xlarge"
      tasks:
        - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/ingest_raw.py
            base_parameters:
              catalog: "${var.catalog}"
          job_cluster_key: etl_cluster
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_raw
          notebook_task:
            notebook_path: ../src/transform_silver.py
          job_cluster_key: etl_cluster
        - task_key: build_gold
          depends_on:
            - task_key: transform_silver
          notebook_task:
            notebook_path: ../src/build_gold.py
          job_cluster_key: etl_cluster
```
```bash
# Validate the bundle configuration
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to staging
databricks bundle deploy --target staging

# Run a specific job in the bundle
databricks bundle run ecommerce_etl --target staging

# Destroy resources (remove deployed jobs/pipelines)
databricks bundle destroy --target dev
```
## Testing Strategies

Testing Databricks code requires a layered approach. Pure transformation logic can be unit tested locally, while integration tests run against a live workspace.
```python
# tests/test_transforms.py
import pytest
from pyspark.sql import SparkSession

from src.transforms import clean_orders, calculate_rfm_features


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_clean_orders_removes_nulls(spark):
    """Test that clean_orders drops rows with null order_id."""
    data = [
        (1, "ORD-001", 99.99),
        (2, None, 49.99),
        (3, "ORD-003", 149.99),
    ]
    df = spark.createDataFrame(data, ["id", "order_id", "amount"])
    result = clean_orders(df)
    assert result.count() == 2
    assert result.filter("order_id IS NULL").count() == 0


def test_rfm_features_calculation(spark):
    """Test RFM feature computation logic."""
    orders = spark.createDataFrame([
        ("C1", "2024-03-01", 100.0),
        ("C1", "2024-03-15", 200.0),
        ("C2", "2024-02-01", 50.0),
    ], ["customer_id", "order_date", "amount"])
    result = calculate_rfm_features(orders).collect()
    c1 = [r for r in result if r.customer_id == "C1"][0]
    assert c1.total_orders == 2
    assert c1.avg_order_value == 150.0
```
```python
# tests/integration/test_pipeline.py
from databricks.connect import DatabricksSession


def test_silver_table_freshness():
    """Verify the silver table has data (latest processed_date is populated)."""
    spark = DatabricksSession.builder.getOrCreate()
    result = spark.sql("""
        SELECT MAX(processed_date) AS latest_date
        FROM staging_catalog.silver.orders
    """).collect()
    latest = result[0].latest_date
    assert latest is not None, "Silver table has no data"


def test_gold_table_row_count():
    """Gold table should have reasonable row counts."""
    spark = DatabricksSession.builder.getOrCreate()
    count = spark.table("staging_catalog.gold.daily_revenue").count()
    assert count > 0, "Gold table is empty"
    assert count < 1_000_000, f"Unexpected row count: {count}"
```
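To keep the layers separable, `pytest tests/unit/` should never require workspace credentials. One way to enforce that (a sketch, assuming integration tests live under `tests/integration/` and a workspace is signalled by `DATABRICKS_HOST`) is a `conftest.py` hook that auto-skips the integration directory when no workspace is configured:

```python
# conftest.py -- skip integration tests when no workspace is configured.
# Sketch only: assumes integration tests live under tests/integration/.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.environ.get("DATABRICKS_HOST"):
        return  # workspace configured: run everything, including integration
    skip = pytest.mark.skip(reason="DATABRICKS_HOST not set")
    for item in items:
        if "integration" in str(item.fspath):
            item.add_marker(skip)
```

With this in place, CI can run the full suite against staging simply by exporting the workspace credentials, while local runs stay fast and offline.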
## GitHub Actions Pipeline

GitHub Actions automates the entire CI/CD flow. On pull request, it runs tests and validates the bundle; on merge to main, it deploys to staging and then production.
```yaml
name: Databricks CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

permissions:
  contents: read
  id-token: write

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The legacy databricks-cli pip package does not support bundles;
      # use the official setup action to install the new Databricks CLI.
      - uses: databricks/setup-cli@main
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Lint and format check
        run: |
          ruff check src/ tests/
          black --check src/ tests/
      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short
      - name: Validate bundle
        run: databricks bundle validate
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

  deploy-staging:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to staging
        run: databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}
      - name: Run integration tests
        run: databricks bundle run integration_test_job --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.STAGING_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.STAGING_TOKEN }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy to production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.PROD_TOKEN }}
```
## Terraform for Infrastructure

While DABs handle job and pipeline definitions, Terraform manages the underlying workspace infrastructure: clusters, cluster policies, instance pools, secrets, and Unity Catalog resources.
```hcl
# main.tf
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Cluster policy for cost control
resource "databricks_cluster_policy" "etl_policy" {
  name = "ETL Cluster Policy"
  definition = jsonencode({
    "spark_version" : { "type" : "fixed", "value" : "14.3.x-scala2.12" },
    "num_workers" : { "type" : "range", "maxValue" : 10 },
    "autotermination_minutes" : { "type" : "fixed", "value" : 30 },
    "custom_tags.Environment" : { "type" : "fixed", "value" : var.environment }
  })
}

# Instance pool for faster cluster startup
resource "databricks_instance_pool" "etl_pool" {
  instance_pool_name                    = "etl-instance-pool-${var.environment}"
  min_idle_instances                    = 0
  max_capacity                          = 20
  node_type_id                          = "i3.xlarge"
  idle_instance_autotermination_minutes = 10
}

# Secret scope for credentials
resource "databricks_secret_scope" "app_secrets" {
  name = "app-secrets-${var.environment}"
}

resource "databricks_secret" "api_key" {
  scope        = databricks_secret_scope.app_secrets.name
  key          = "external-api-key"
  string_value = var.external_api_key
}
```
## Environment Promotion Strategy

A mature Databricks deployment uses separate workspaces (or at minimum separate catalogs) for dev, staging, and production, with code flowing through each environment via the CI/CD pipeline.
| Environment | Purpose | Data | Access |
|---|---|---|---|
| Dev | Feature development, experimentation | Sample data or dev catalog | All developers |
| Staging | Integration testing, validation | Production-like data (anonymized) | CI/CD pipeline + leads |
| Production | Live workloads serving the business | Real production data | Service principals only |
### Promotion Rules

- **Dev to Staging:** Automatic on merge to main (all tests pass)
- **Staging to Production:** Requires manual approval + all staging tests green
- **Rollback:** Redeploy the previous bundle version (Git tag-based)
- **Hotfix:** Branch from the production tag, deploy directly after emergency review
- **No human access to prod:** All production deployments go through service principals
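Tag-based rollback amounts to "redeploy the bundle from the release tag before the current one." A minimal sketch of selecting that tag, assuming a `vMAJOR.MINOR.PATCH` tagging convention (the helper and tag names are hypothetical):

```python
def previous_release(tags: list[str], current: str) -> str:
    """Return the release tag immediately before `current`, assuming
    tags follow a vMAJOR.MINOR.PATCH convention."""
    def version_key(tag: str) -> tuple[int, ...]:
        return tuple(int(part) for part in tag.lstrip("v").split("."))

    ordered = sorted(tags, key=version_key)
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError("no earlier release to roll back to")
    return ordered[idx - 1]
```

A CI rollback job would then check out the returned tag and run `databricks bundle deploy --target prod` from that commit. Note the numeric sort: plain string ordering would put `v1.10.0` before `v1.2.0`.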
## Practice Problems

### Problem 1: Design a CI/CD Pipeline (Medium)

Your team of 8 data engineers works on a shared repository with 15 ETL jobs and 3 DLT pipelines. Design a CI/CD pipeline that prevents broken code from reaching production while keeping deployment velocity high (target: deploy to production within 30 minutes of PR merge).

### Problem 2: Testing a Streaming Pipeline (Hard)

You have a Structured Streaming job that reads from Kafka, transforms events, and writes to Delta. How do you write automated tests for this? Consider unit tests for transform logic, integration tests for the full pipeline, and how to handle the streaming nature of the data.

### Problem 3: Multi-Workspace Governance (Hard)

Your organization has 3 Databricks workspaces (dev, staging, prod) across AWS. Design the Terraform configuration and CI/CD pipeline that manages Unity Catalog, cluster policies, and secret scopes consistently across all environments. How do you handle environment-specific differences?
## Quick Reference

| Tool | Purpose | Key Command / Config |
|---|---|---|
| Databricks Repos | Git integration in workspace | `databricks repos create/update` |
| Asset Bundles | Infrastructure-as-code for jobs/pipelines | `databricks bundle deploy --target prod` |
| pytest + PySpark | Unit testing transforms | Local SparkSession with `master("local[2]")` |
| Databricks Connect | Integration testing against live workspace | `DatabricksSession.builder.getOrCreate()` |
| GitHub Actions | CI/CD automation | Workflows triggered on PR and push |
| Terraform | Workspace infrastructure management | `terraform apply -var-file=prod.tfvars` |
| Service Principals | Non-human identity for prod deploys | OAuth or PAT-based authentication |