Notebooks & Development Workflow


Magic Commands

Why Magic Commands Matter

The Problem: Data teams use multiple languages -- Python for ETL, SQL for queries, Scala for performance, R for statistics. Switching between separate tools wastes time and breaks workflow.

The Solution: Databricks magic commands let you mix languages within a single notebook and access special utilities, all without leaving your development environment.

Real-World Analogy

Think of magic commands like a universal translator. Your notebook has a default language, but you can speak any language in any cell by prefixing it with the right magic command -- like switching channels on a radio.

Language Magic Commands

Databricks Notebook - Language Switching
# Default language: Python
# Cell 1: Python (default)
df = spark.read.format("delta").load("/mnt/data/sales")
df.createOrReplaceTempView("sales")

# Cell 2: Switch to SQL
%sql
SELECT region, SUM(amount) as total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC

# Cell 3: Switch to Scala
%scala
val df = spark.table("sales")
df.groupBy("region").agg(sum("amount")).show()

# Cell 4: Switch to R
%r
library(SparkR)
df <- sql("SELECT * FROM sales")
head(df)

# Cell 5: Markdown for documentation
%md
## Sales Analysis Results
The above queries show regional sales breakdown.

Utility Magic Commands

Databricks Notebook - Utility Commands
# %run -- Execute another notebook inline; its functions and variables become available here
%run /Shared/Libraries/common_utils

# %fs -- File system operations (shorthand for dbutils.fs)
%fs ls /mnt/data/

# %sh -- Run shell commands on the driver node
%sh
pip list | grep pandas
whoami
cat /etc/os-release

# %pip -- Install notebook-scoped Python packages
%pip install great-expectations==0.18.0

# %conda -- Manage Conda environments (ML Runtime only)
%conda install -c conda-forge lightgbm

Notebook Execution Flow

How a cell executes: the user writes code in the notebook; the Notebook Service (control plane) parses magic commands, manages notebook state, and displays results; it hands execution to the Spark driver (data plane), which creates the execution plan and coordinates tasks across the workers (Worker 1 through Worker N); results are returned to the notebook.

Widgets & Parameters

Widgets add interactive controls to your notebooks, letting you parameterize queries and create reusable templates without modifying code.

Python - Notebook Widgets
# Text widget -- free-form input
dbutils.widgets.text("start_date", "2024-01-01", "Start Date")

# Dropdown widget -- select from options
dbutils.widgets.dropdown("environment", "dev",
    ["dev", "staging", "prod"], "Environment")

# Combobox widget -- dropdown with free-text option
dbutils.widgets.combobox("table_name", "users",
    ["users", "orders", "products"], "Table")

# Multiselect widget -- choose multiple values
dbutils.widgets.multiselect("regions", "US",
    ["US", "EU", "APAC", "LATAM"], "Regions")

# Retrieve widget values
start_date = dbutils.widgets.get("start_date")
env = dbutils.widgets.get("environment")
regions = dbutils.widgets.get("regions").split(",")

# Use in queries
df = spark.sql(f"""
    SELECT * FROM {env}_catalog.sales.transactions
    WHERE date >= '{start_date}'
    AND region IN ('{"','".join(regions)}')
""")

# Remove widgets when done
dbutils.widgets.removeAll()
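Interpolating widget values straight into SQL, as above, is fragile: a stray space or unexpected value silently breaks the query. A small hypothetical helper (not part of dbutils) can normalize and validate the comma-separated multiselect string before it reaches the query; this sketch is pure Python and runs outside Databricks too.

```python
def build_in_clause(csv_values, allowed=None):
    """Turn a widget's comma-separated string into a quoted SQL IN list.

    Optionally validates each value against an allow-list, which guards
    against malformed or unexpected widget input.
    """
    values = [v.strip() for v in csv_values.split(",") if v.strip()]
    if allowed is not None:
        bad = [v for v in values if v not in allowed]
        if bad:
            raise ValueError(f"Unexpected values: {bad}")
    return "(" + ", ".join(f"'{v}'" for v in values) + ")"

# Example: the regions multiselect returns "US,EU"
clause = build_in_clause("US,EU", allowed={"US", "EU", "APAC", "LATAM"})
# clause == "('US', 'EU')", ready for: f"... WHERE region IN {clause}"
```

The allow-list mirrors the choices you declared in the widget, so only values you offered can ever reach the SQL string.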

Passing Parameters to Notebooks

Python - Notebook Orchestration with Parameters
# Parent notebook: orchestrate child notebooks
# dbutils.notebook.run(path, timeout_seconds, arguments)

# Run a child notebook with parameters
result = dbutils.notebook.run(
    "/ETL/process_daily_sales",
    timeout_seconds=3600,
    arguments={
        "date": "2024-03-15",
        "environment": "prod",
        "overwrite": "true"
    }
)
print(f"Child notebook returned: {result}")

# In the child notebook, return a value
dbutils.notebook.exit("SUCCESS: Processed 1,234,567 records")

# Run multiple notebooks in parallel
from concurrent.futures import ThreadPoolExecutor

notebooks = [
    ("/ETL/process_users", {"date": "2024-03-15"}),
    ("/ETL/process_orders", {"date": "2024-03-15"}),
    ("/ETL/process_products", {"date": "2024-03-15"}),
]

def run_nb(args):
    path, params = args
    return dbutils.notebook.run(path, 3600, params)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_nb, notebooks))
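One caveat with pool.map: the first child failure raises and discards the other results. A hedged variation collects a per-notebook status instead; the runner callable is injected so this sketch runs anywhere, and inside Databricks you would pass a wrapper around dbutils.notebook.run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(notebooks, runner, max_workers=3):
    """Run notebooks in parallel, collecting a status per notebook
    instead of letting one failure cancel the whole batch.

    `runner(path, params)` is injected so this can be tested locally;
    in Databricks it would wrap dbutils.notebook.run(path, 3600, params).
    """
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, path, params): path
                   for path, params in notebooks}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                statuses[path] = ("SUCCESS", fut.result())
            except Exception as e:
                statuses[path] = ("FAILED", str(e))
    return statuses
```

In Databricks, call it with `runner=lambda p, a: dbutils.notebook.run(p, 3600, a)` and decide from the returned dict whether the batch as a whole should fail.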

Git Integration

Databricks Repos provides native Git integration, bringing version control best practices directly into the workspace. You can clone repositories, create branches, commit, and push without leaving Databricks.

Clone Repositories

Clone from GitHub, GitLab, Azure DevOps, or Bitbucket, authenticating over HTTPS with a personal access token.

Branch Management

Create feature branches, switch between branches, and merge changes. Full Git workflow without leaving the workspace.

Commit & Push

Stage changes, write commit messages, and push to remote from the Databricks UI or via the Repos API.

CI/CD Integration

Trigger deployment pipelines on push. Use the Repos API to programmatically update production branches.

Python - Repos API Operations
import requests

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Clone a repository
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/company/etl-pipelines.git",
        "provider": "gitHub",
        "path": "/Repos/team/etl-pipelines"
    }
)

# Update repo to latest (pull)
repo_id = 12345
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"}
)

# Switch to a different branch
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "feature/new-pipeline"}
)
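The raw requests.patch calls above silently ignore failures. For CI/CD use, a small hypothetical wrapper (not part of any Databricks SDK) can check the status code; the HTTP session is injected so the call can be stubbed in tests.

```python
def update_repo_branch(session, host, token, repo_id, branch):
    """Check out `branch` in a Databricks repo and pull its latest commit.

    Wraps PATCH /api/2.0/repos/{repo_id}. `session` is injected
    (e.g. requests.Session()) so the call can be stubbed in tests.
    Raises RuntimeError on a non-2xx response.
    """
    resp = session.patch(
        f"{host}/api/2.0/repos/{repo_id}",
        headers={"Authorization": f"Bearer {token}"},
        json={"branch": branch},
    )
    if not (200 <= resp.status_code < 300):
        raise RuntimeError(f"Repo update failed: {resp.status_code} {resp.text}")
    return resp.json()
```

A deployment pipeline would call this after tests pass, pointing the production repo at the newly merged branch.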

Collaboration

Databricks notebooks support real-time collaboration features that help teams work together effectively.

Collaboration Features

  • Real-Time Co-Editing: Multiple users can edit the same notebook simultaneously with live cursors and changes
  • Comments: Add comments to specific cells for code review and discussion
  • Revision History: Every notebook maintains a complete revision history with the ability to restore previous versions
  • Access Control: Set permissions at the notebook, folder, or workspace level (CAN_VIEW, CAN_RUN, CAN_EDIT, CAN_MANAGE)
  • Dashboards: Convert notebook output cells into interactive dashboards for non-technical stakeholders
Python - Permissions API
# Set notebook permissions (reuses DATABRICKS_HOST and headers from the Repos example)
notebook_id = "12345"
requests.put(
    f"{DATABRICKS_HOST}/api/2.0/permissions/notebooks/{notebook_id}",
    headers=headers,
    json={
        "access_control_list": [
            {
                "group_name": "data-engineers",
                "permission_level": "CAN_EDIT"
            },
            {
                "group_name": "data-analysts",
                "permission_level": "CAN_RUN"
            },
            {
                "user_name": "[email protected]",
                "permission_level": "CAN_VIEW"
            }
        ]
    }
)

# Permission levels:
# CAN_VIEW    - Read-only access to notebook
# CAN_RUN     - Can run cells but not edit
# CAN_EDIT    - Can edit and run
# CAN_MANAGE  - Full control including permissions

Notebook Best Practices

Practice | Description | Why It Matters
Use %md headers | Document each section with markdown cells | Self-documenting, auto-generates TOC
Parameterize with widgets | Never hard-code dates, paths, or environments | Reusable across environments and dates
Use %run for shared code | Extract common functions into utility notebooks | DRY principle, single source of truth
Git Repos for production | Keep production code in version control | Audit trail, rollback, code review
Exit with status | Use dbutils.notebook.exit() to return results | Enables orchestration and error handling
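The "Exit with status" practice works best when the exit value is machine-readable. A hedged sketch using JSON, with the dbutils calls shown as comments since they only exist inside a Databricks notebook:

```python
import json

# Child notebook: package results as JSON before exiting.
payload = json.dumps({"status": "SUCCESS", "records_processed": 1234567})
# In Databricks: dbutils.notebook.exit(payload)

# Parent notebook: parse the string that dbutils.notebook.run() returned.
# result = dbutils.notebook.run("/ETL/process_daily_sales", 3600, {})
result = payload  # stand-in for the returned string
summary = json.loads(result)
if summary["status"] != "SUCCESS":
    raise RuntimeError(f"Child notebook failed: {summary}")
print(f"Processed {summary['records_processed']:,} records")
```

A structured payload lets the orchestrator branch on status or record counts instead of string-matching free-form messages like "SUCCESS: Processed 1,234,567 records".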

Practice Problems

Problem 1: Multi-Language Notebook

Easy

Design a notebook that: (1) uses Python to load a CSV file into a DataFrame, (2) uses SQL to perform aggregations, and (3) uses Markdown to document the results. What magic commands would you use?

Problem 2: Parameterized ETL

Easy

Create a reusable ETL notebook that accepts a date parameter and environment parameter (dev/prod). The notebook should read from the appropriate catalog based on environment and process data for the given date.

Problem 3: Git Workflow Design

Medium

Design a Git workflow for a team of 5 data engineers using Databricks Repos. Include branching strategy, code review process, and deployment to production. How would you prevent untested code from running in production?

Quick Reference

Command | Purpose | Example
%sql | Run SQL in a Python/Scala notebook | %sql SELECT * FROM table
%python | Run Python in a SQL/Scala notebook | %python print("hello")
%scala | Run Scala code | %scala val x = 1
%r | Run R code | %r library(SparkR)
%md | Render Markdown | %md ## Heading
%run | Execute another notebook | %run /path/to/notebook
%pip | Install Python packages | %pip install pandas
%sh | Run shell commands | %sh ls /tmp
%fs | File system operations | %fs ls /mnt/