Notebooks & Development Workflow


Magic Commands

Why Magic Commands Matter

The Problem: Data teams use multiple languages -- Python for ETL, SQL for queries, Scala for performance, R for statistics. Switching between separate tools wastes time and breaks workflow.

The Solution: Databricks magic commands let you mix languages within a single notebook and access special utilities, all without leaving your development environment.

Real-World Analogy

Think of magic commands like a universal translator. Your notebook has a default language, but you can speak any language in any cell by prefixing it with the right magic command -- like switching channels on a radio.

Language Magic Commands

Databricks Notebook - Language Switching
# Default language: Python
# Cell 1: Python (default)
df = spark.read.format("delta").load("/mnt/data/sales")
df.createOrReplaceTempView("sales")

# Cell 2: Switch to SQL
%sql
SELECT region, SUM(amount) as total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC

# Cell 3: Switch to Scala
%scala
val df = spark.table("sales")
df.groupBy("region").agg(sum("amount")).show()

# Cell 4: Switch to R
%r
library(SparkR)
df <- sql("SELECT * FROM sales")
head(df)

# Cell 5: Markdown for documentation
%md
## Sales Analysis Results
The above queries show regional sales breakdown.

Utility Magic Commands

Databricks Notebook - Utility Commands
# %run -- Execute another notebook inline; its functions and variables become available here
%run /Shared/Libraries/common_utils

# %fs -- File system operations (shorthand for dbutils.fs)
%fs ls /mnt/data/

# %sh -- Run shell commands on the driver node
%sh
pip list | grep pandas
whoami
cat /etc/os-release

# %pip -- Install notebook-scoped Python packages
%pip install great-expectations==0.18.0

# %conda -- Manage Conda environments (ML Runtime only)
%conda install -c conda-forge lightgbm

Notebook Execution Flow

How a cell executes: the user writes code in the notebook; the Notebook Service (control plane) parses magic commands, manages notebook state, and displays results; it hands execution to the Spark driver (data plane), which creates the execution plan and coordinates tasks across the workers (Worker 1 through Worker N); results are returned to the notebook.

Widgets & Parameters

Widgets add interactive controls to your notebooks, letting you parameterize queries and create reusable templates without modifying code.

Python - Notebook Widgets
# Text widget -- free-form input
dbutils.widgets.text("start_date", "2024-01-01", "Start Date")

# Dropdown widget -- select from options
dbutils.widgets.dropdown("environment", "dev",
    ["dev", "staging", "prod"], "Environment")

# Combobox widget -- dropdown with free-text option
dbutils.widgets.combobox("table_name", "users",
    ["users", "orders", "products"], "Table")

# Multiselect widget -- choose multiple values
dbutils.widgets.multiselect("regions", "US",
    ["US", "EU", "APAC", "LATAM"], "Regions")

# Retrieve widget values
start_date = dbutils.widgets.get("start_date")
env = dbutils.widgets.get("environment")
regions = dbutils.widgets.get("regions").split(",")

# Use in queries
df = spark.sql(f"""
    SELECT * FROM {env}_catalog.sales.transactions
    WHERE date >= '{start_date}'
    AND region IN ('{"','".join(regions)}')
""")

# Remove widgets when done
dbutils.widgets.removeAll()
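Interpolating widget values straight into SQL, as above, is fragile: a stray space or unexpected value silently breaks the query. A small hypothetical helper (not part of dbutils) can normalize and validate the comma-separated multiselect string before it reaches the query; this sketch is pure Python and runs outside Databricks too.

```python
def build_in_clause(csv_values, allowed=None):
    """Turn a widget's comma-separated string into a quoted SQL IN list.

    Optionally validates each value against an allow-list, which guards
    against malformed or unexpected widget input.
    """
    values = [v.strip() for v in csv_values.split(",") if v.strip()]
    if allowed is not None:
        bad = [v for v in values if v not in allowed]
        if bad:
            raise ValueError(f"Unexpected values: {bad}")
    return "(" + ", ".join(f"'{v}'" for v in values) + ")"

# Example: the regions multiselect returns "US,EU"
clause = build_in_clause("US,EU", allowed={"US", "EU", "APAC", "LATAM"})
# clause == "('US', 'EU')", ready for: f"... WHERE region IN {clause}"
```

The allow-list mirrors the choices you declared in the widget, so only values you offered can ever reach the SQL string.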

Passing Parameters to Notebooks

Python - Notebook Orchestration with Parameters
# Parent notebook: orchestrate child notebooks
# dbutils.notebook.run(path, timeout_seconds, arguments)

# Run a child notebook with parameters
result = dbutils.notebook.run(
    "/ETL/process_daily_sales",
    timeout_seconds=3600,
    arguments={
        "date": "2024-03-15",
        "environment": "prod",
        "overwrite": "true"
    }
)
print(f"Child notebook returned: {result}")

# In the child notebook, return a value
dbutils.notebook.exit("SUCCESS: Processed 1,234,567 records")

# Run multiple notebooks in parallel
from concurrent.futures import ThreadPoolExecutor

notebooks = [
    ("/ETL/process_users", {"date": "2024-03-15"}),
    ("/ETL/process_orders", {"date": "2024-03-15"}),
    ("/ETL/process_products", {"date": "2024-03-15"}),
]

def run_nb(args):
    path, params = args
    return dbutils.notebook.run(path, 3600, params)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_nb, notebooks))
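One caveat with pool.map: the first child failure raises and discards the other results. A hedged variation collects a per-notebook status instead; the runner callable is injected so this sketch runs anywhere, and inside Databricks you would pass a wrapper around dbutils.notebook.run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(notebooks, runner, max_workers=3):
    """Run notebooks in parallel, collecting a status per notebook
    instead of letting one failure cancel the whole batch.

    `runner(path, params)` is injected so this can be tested locally;
    in Databricks it would wrap dbutils.notebook.run(path, 3600, params).
    """
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, path, params): path
                   for path, params in notebooks}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                statuses[path] = ("SUCCESS", fut.result())
            except Exception as e:
                statuses[path] = ("FAILED", str(e))
    return statuses
```

In Databricks, call it with `runner=lambda p, a: dbutils.notebook.run(p, 3600, a)` and decide from the returned dict whether the batch as a whole should fail.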

Git Integration

Databricks Repos provides native Git integration, bringing version control best practices directly into the workspace. You can clone repositories, create branches, commit, and push without leaving Databricks.

Clone Repositories

Clone from GitHub, GitLab, Azure DevOps, or Bitbucket, authenticating over HTTPS with a personal access token.

Branch Management

Create feature branches, switch between branches, and merge changes. Full Git workflow without leaving the workspace.

Commit & Push

Stage changes, write commit messages, and push to remote from the Databricks UI or via the Repos API.

CI/CD Integration

Trigger deployment pipelines on push. Use the Repos API to programmatically update production branches.

Python - Repos API Operations
import requests

DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Clone a repository
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/company/etl-pipelines.git",
        "provider": "gitHub",
        "path": "/Repos/team/etl-pipelines"
    }
)

# Update repo to latest (pull)
repo_id = 12345
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"}
)

# Switch to a different branch
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "feature/new-pipeline"}
)
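The raw requests.patch calls above silently ignore failures. For CI/CD use, a small hypothetical wrapper (not part of any Databricks SDK) can check the status code; the HTTP session is injected so the call can be stubbed in tests.

```python
def update_repo_branch(session, host, token, repo_id, branch):
    """Check out `branch` in a Databricks repo and pull its latest commit.

    Wraps PATCH /api/2.0/repos/{repo_id}. `session` is injected
    (e.g. requests.Session()) so the call can be stubbed in tests.
    Raises RuntimeError on a non-2xx response.
    """
    resp = session.patch(
        f"{host}/api/2.0/repos/{repo_id}",
        headers={"Authorization": f"Bearer {token}"},
        json={"branch": branch},
    )
    if not (200 <= resp.status_code < 300):
        raise RuntimeError(f"Repo update failed: {resp.status_code} {resp.text}")
    return resp.json()
```

A deployment pipeline would call this after tests pass, pointing the production repo at the newly merged branch.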

Collaboration

Databricks notebooks support real-time collaboration features that help teams work together effectively.

Collaboration Features

  • Real-Time Co-Editing: Multiple users can edit the same notebook simultaneously with live cursors and changes
  • Comments: Add comments to specific cells for code review and discussion
  • Revision History: Every notebook maintains a complete revision history with the ability to restore previous versions
  • Access Control: Set permissions at the notebook, folder, or workspace level (CAN_VIEW, CAN_RUN, CAN_EDIT, CAN_MANAGE)
  • Dashboards: Convert notebook output cells into interactive dashboards for non-technical stakeholders
Python - Permissions API
# Set notebook permissions (reuses DATABRICKS_HOST and headers from the Repos example)
notebook_id = "12345"
requests.put(
    f"{DATABRICKS_HOST}/api/2.0/permissions/notebooks/{notebook_id}",
    headers=headers,
    json={
        "access_control_list": [
            {
                "group_name": "data-engineers",
                "permission_level": "CAN_EDIT"
            },
            {
                "group_name": "data-analysts",
                "permission_level": "CAN_RUN"
            },
            {
                "user_name": "[email protected]",
                "permission_level": "CAN_VIEW"
            }
        ]
    }
)

# Permission levels:
# CAN_VIEW    - Read-only access to notebook
# CAN_RUN     - Can run cells but not edit
# CAN_EDIT    - Can edit and run
# CAN_MANAGE  - Full control including permissions

Notebook Best Practices

Practice | Description | Why It Matters
Use %md headers | Document each section with markdown cells | Self-documenting, auto-generates TOC
Parameterize with widgets | Never hard-code dates, paths, or environments | Reusable across environments and dates
Use %run for shared code | Extract common functions into utility notebooks | DRY principle, single source of truth
Git Repos for production | Keep production code in version control | Audit trail, rollback, code review
Exit with status | Use dbutils.notebook.exit() to return results | Enables orchestration and error handling
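The "Exit with status" practice works best when the exit value is machine-readable. A hedged sketch using JSON, with the dbutils calls shown as comments since they only exist inside a Databricks notebook:

```python
import json

# Child notebook: package results as JSON before exiting.
payload = json.dumps({"status": "SUCCESS", "records_processed": 1234567})
# In Databricks: dbutils.notebook.exit(payload)

# Parent notebook: parse the string that dbutils.notebook.run() returned.
# result = dbutils.notebook.run("/ETL/process_daily_sales", 3600, {})
result = payload  # stand-in for the returned string
summary = json.loads(result)
if summary["status"] != "SUCCESS":
    raise RuntimeError(f"Child notebook failed: {summary}")
print(f"Processed {summary['records_processed']:,} records")
```

A structured payload lets the orchestrator branch on status or record counts instead of string-matching free-form messages like "SUCCESS: Processed 1,234,567 records".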

Practice Problems

Problem 1: Multi-Language Notebook

Easy

Design a notebook that: (1) uses Python to load a CSV file into a DataFrame, (2) uses SQL to perform aggregations, and (3) uses Markdown to document the results. What magic commands would you use?

Problem 2: Parameterized ETL

Easy

Create a reusable ETL notebook that accepts a date parameter and environment parameter (dev/prod). The notebook should read from the appropriate catalog based on environment and process data for the given date.

Problem 3: Git Workflow Design

Medium

Design a Git workflow for a team of 5 data engineers using Databricks Repos. Include branching strategy, code review process, and deployment to production. How would you prevent untested code from running in production?

Quick Reference

Command | Purpose | Example
%sql | Run SQL in a Python/Scala notebook | %sql SELECT * FROM table
%python | Run Python in a SQL/Scala notebook | %python print("hello")
%scala | Run Scala code | %scala val x = 1
%r | Run R code | %r library(SparkR)
%md | Render Markdown | %md ## Heading
%run | Execute another notebook | %run /path/to/notebook
%pip | Install Python packages | %pip install pandas
%sh | Run shell commands | %sh ls /tmp
%fs | File system operations | %fs ls /mnt/