Magic Commands
Why Magic Commands Matter
The Problem: Data teams use multiple languages -- Python for ETL, SQL for queries, Scala for performance, R for statistics. Switching between separate tools wastes time and breaks workflow.
The Solution: Databricks magic commands let you mix languages within a single notebook and access special utilities, all without leaving your development environment.
Real-World Analogy
Think of magic commands like a universal translator. Your notebook has a default language, but you can speak any language in any cell by prefixing it with the right magic command -- like switching channels on a radio.
Language Magic Commands
# Default language: Python
# Cell 1: Python (default)
df = spark.read.format("delta").load("/mnt/data/sales")
df.createOrReplaceTempView("sales")
# Cell 2: Switch to SQL
%sql
SELECT region, SUM(amount) as total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC
# Cell 3: Switch to Scala
%scala
val df = spark.table("sales")
df.groupBy("region").agg(sum("amount")).show()
# Cell 4: Switch to R
%r
library(SparkR)
df <- sql("SELECT * FROM sales")
head(df)
# Cell 5: Markdown for documentation
%md
## Sales Analysis Results
The above queries show regional sales breakdown.
Utility Magic Commands
# %run -- Execute another notebook (like an import)
%run /Shared/Libraries/common_utils
# %fs -- File system operations (shorthand for dbutils.fs)
%fs ls /mnt/data/
# %sh -- Run shell commands on the driver node
%sh
pip list | grep pandas
whoami
cat /etc/os-release
# %pip -- Install Python packages on the cluster
%pip install great-expectations==0.18.0
# %conda -- Manage Conda environments (ML Runtime only)
%conda install -c conda-forge lightgbm
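Since %fs is just shorthand for dbutils.fs, anything the magic can do is also available programmatically. As an illustrative sketch (the translation table and helper below are this page's own invention, not a Databricks API), here is how the common %fs shorthands line up with their dbutils.fs equivalents:

```python
# Illustrative mapping of %fs shorthands to dbutils.fs calls.
FS_EQUIVALENTS = {
    "ls": "dbutils.fs.ls",
    "cp": "dbutils.fs.cp",
    "rm": "dbutils.fs.rm",
    "mkdirs": "dbutils.fs.mkdirs",
    "head": "dbutils.fs.head",
}

def fs_magic_to_dbutils(line: str) -> str:
    """Rewrite e.g. '%fs ls /mnt/data/' as "dbutils.fs.ls('/mnt/data/')"."""
    parts = line.removeprefix("%fs").split()
    cmd, args = parts[0], parts[1:]
    return f"{FS_EQUIVALENTS[cmd]}({', '.join(repr(a) for a in args)})"

print(fs_magic_to_dbutils("%fs ls /mnt/data/"))  # dbutils.fs.ls('/mnt/data/')
```

The programmatic form is what you need inside functions and loops, where magic commands are not available.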
Widgets & Parameters
Widgets add interactive controls to your notebooks, letting you parameterize queries and create reusable templates without modifying code.
# Text widget -- free-form input
dbutils.widgets.text("start_date", "2024-01-01", "Start Date")
# Dropdown widget -- select from options
dbutils.widgets.dropdown("environment", "dev",
                         ["dev", "staging", "prod"], "Environment")
# Combobox widget -- dropdown with free-text option
dbutils.widgets.combobox("table_name", "users",
                         ["users", "orders", "products"], "Table")
# Multiselect widget -- choose multiple values
dbutils.widgets.multiselect("regions", "US",
                            ["US", "EU", "APAC", "LATAM"], "Regions")
# Retrieve widget values
start_date = dbutils.widgets.get("start_date")
env = dbutils.widgets.get("environment")
regions = dbutils.widgets.get("regions").split(",")
# Use in queries
df = spark.sql(f"""
    SELECT * FROM {env}_catalog.sales.transactions
    WHERE date >= '{start_date}'
      AND region IN ('{"','".join(regions)}')
""")
# Remove widgets when done
dbutils.widgets.removeAll()
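One caveat with the f-string query above: widget values are free-form strings, so interpolating them directly into SQL is injection-prone. A minimal sketch of a safer IN-clause builder (a hypothetical helper, not a Databricks API; prefer parameterized queries where your runtime supports them):

```python
def regions_in_clause(raw: str) -> str:
    """Turn a multiselect widget value like 'US,EU,APAC' into a quoted
    SQL IN list, escaping embedded single quotes along the way."""
    values = [v.strip() for v in raw.split(",") if v.strip()]
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"({quoted})"

print(regions_in_clause("US,EU,APAC"))  # ('US', 'EU', 'APAC')
```

You would then write `AND region IN {regions_in_clause(dbutils.widgets.get("regions"))}` instead of joining the raw string into the query.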
Passing Parameters to Notebooks
# Parent notebook: orchestrate child notebooks
# dbutils.notebook.run(path, timeout_seconds, arguments)
# Run a child notebook with parameters (all values are passed as strings)
result = dbutils.notebook.run(
    "/ETL/process_daily_sales",  # notebook path
    3600,                        # timeout_seconds
    {
        "date": "2024-03-15",
        "environment": "prod",
        "overwrite": "true",
    },
)
print(f"Child notebook returned: {result}")
# In the child notebook, return a value
dbutils.notebook.exit("SUCCESS: Processed 1,234,567 records")
# Run multiple notebooks in parallel
from concurrent.futures import ThreadPoolExecutor

notebooks = [
    ("/ETL/process_users", {"date": "2024-03-15"}),
    ("/ETL/process_orders", {"date": "2024-03-15"}),
    ("/ETL/process_products", {"date": "2024-03-15"}),
]

def run_nb(args):
    path, params = args
    return dbutils.notebook.run(path, 3600, params)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_nb, notebooks))
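Note that pool.map raises as soon as one notebook fails, cancelling nothing but hiding the other results. A hedged sketch of a variant that captures per-notebook errors instead (run_fn stands in for a dbutils.notebook.run call; the stub below is purely for local illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(notebooks, run_fn, max_workers=3):
    """Run each (path, params) pair via run_fn, returning
    (path, result, error) triples so one failure doesn't mask the rest."""
    def safe_run(item):
        path, params = item
        try:
            return path, run_fn(path, params), None
        except Exception as exc:
            return path, None, str(exc)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_run, notebooks))

# Stubbed runner for illustration -- on Databricks you would pass
# lambda p, a: dbutils.notebook.run(p, 3600, a) instead.
def fake_run(path, params):
    if "orders" in path:
        raise RuntimeError("table not found")
    return "SUCCESS"

results = run_all([("/ETL/users", {}), ("/ETL/orders", {})], fake_run)
```

Each triple either carries a result or an error message, so the orchestrator can retry or report selectively.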
Git Integration
Databricks Repos provides native Git integration, bringing version control best practices directly into the workspace. You can clone repositories, create branches, commit, and push without leaving Databricks.
Clone Repositories
Clone from GitHub, GitLab, Azure DevOps, or Bitbucket. Supports HTTPS and SSH authentication methods.
Branch Management
Create feature branches, switch between branches, and merge changes. Full Git workflow without leaving the workspace.
Commit & Push
Stage changes, write commit messages, and push to remote from the Databricks UI or via the Repos API.
CI/CD Integration
Trigger deployment pipelines on push. Use the Repos API to programmatically update production branches.
import requests
DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_token"
headers = {"Authorization": f"Bearer {TOKEN}"}
# Clone a repository
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/company/etl-pipelines.git",
        "provider": "gitHub",
        "path": "/Repos/team/etl-pipelines",
    },
)
# Update repo to latest (pull)
repo_id = 12345
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"},
)
# Switch to a different branch
requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "feature/new-pipeline"},
)
Collaboration
Databricks notebooks support real-time collaboration features that help teams work together effectively.
Collaboration Features
- Real-Time Co-Editing: Multiple users can edit the same notebook simultaneously with live cursors and changes
- Comments: Add comments to specific cells for code review and discussion
- Revision History: Every notebook maintains a complete revision history with the ability to restore previous versions
- Access Control: Set permissions at the notebook, folder, or workspace level (CAN_VIEW, CAN_RUN, CAN_EDIT, CAN_MANAGE)
- Dashboards: Convert notebook output cells into interactive dashboards for non-technical stakeholders
# Set notebook permissions
notebook_id = "12345"
requests.put(
    f"{DATABRICKS_HOST}/api/2.0/permissions/notebooks/{notebook_id}",
    headers=headers,
    json={
        "access_control_list": [
            {"group_name": "data-engineers", "permission_level": "CAN_EDIT"},
            {"group_name": "data-analysts", "permission_level": "CAN_RUN"},
            {"user_name": "[email protected]", "permission_level": "CAN_VIEW"},
        ]
    },
)
# Permission levels:
# CAN_VIEW - Read-only access to notebook
# CAN_RUN - Can run cells but not edit
# CAN_EDIT - Can edit and run
# CAN_MANAGE - Full control including permissions
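When building ACL payloads in automation, a typo in a permission level only surfaces as an API error. A small sketch that validates levels up front (the helper is hypothetical; the entry shape matches the access_control_list format shown above):

```python
# Notebook permission levels, as listed above.
PERMISSION_LEVELS = {"CAN_VIEW", "CAN_RUN", "CAN_EDIT", "CAN_MANAGE"}

def acl_entry(principal: str, level: str, *, is_group: bool = False) -> dict:
    """Build one access_control_list entry, rejecting unknown levels."""
    if level not in PERMISSION_LEVELS:
        raise ValueError(f"unknown permission level: {level}")
    key = "group_name" if is_group else "user_name"
    return {key: principal, "permission_level": level}

payload = {"access_control_list": [
    acl_entry("data-engineers", "CAN_EDIT", is_group=True),
    acl_entry("[email protected]", "CAN_VIEW"),
]}
```

The payload can then go straight into the json= argument of the requests.put call above.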
Notebook Best Practices
| Practice | Description | Why It Matters |
|---|---|---|
| Use %md headers | Document each section with markdown cells | Self-documenting, auto-generates TOC |
| Parameterize with widgets | Never hard-code dates, paths, or environments | Reusable across environments and dates |
| Use %run for shared code | Extract common functions into utility notebooks | DRY principle, single source of truth |
| Git Repos for production | Keep production code in version control | Audit trail, rollback, code review |
| Exit with status | Use dbutils.notebook.exit() to return results | Enables orchestration and error handling |
Practice Problems
Problem 1: Multi-Language Notebook
Easy -- Design a notebook that: (1) uses Python to load a CSV file into a DataFrame, (2) uses SQL to perform aggregations, and (3) uses Markdown to document the results. What magic commands would you use?
Problem 2: Parameterized ETL
Easy -- Create a reusable ETL notebook that accepts a date parameter and an environment parameter (dev/prod). The notebook should read from the appropriate catalog based on environment and process data for the given date.
Problem 3: Git Workflow Design
Medium -- Design a Git workflow for a team of 5 data engineers using Databricks Repos. Include branching strategy, code review process, and deployment to production. How would you prevent untested code from running in production?
Quick Reference
| Command | Purpose | Example |
|---|---|---|
| %sql | Run SQL in a Python/Scala notebook | %sql SELECT * FROM table |
| %python | Run Python in a SQL/Scala notebook | %python print("hello") |
| %scala | Run Scala code | %scala val x = 1 |
| %r | Run R code | %r library(SparkR) |
| %md | Render Markdown | %md ## Heading |
| %run | Execute another notebook | %run /path/to/notebook |
| %pip | Install Python packages | %pip install pandas |
| %sh | Run shell commands | %sh ls /tmp |
| %fs | File system operations | %fs ls /mnt/ |