What is Databricks?
Why Databricks Matters
The Problem: Organizations struggle with fragmented data tools -- separate systems for data engineering, data science, machine learning, and analytics that do not communicate well with each other.
The Solution: Databricks provides a unified analytics platform built on Apache Spark that brings together data engineering, data science, and business analytics on a single collaborative platform.
Real Impact: Over 10,000 organizations worldwide use Databricks to process exabytes of data daily, from Fortune 500 companies to fast-growing startups.
Real-World Analogy
Think of Databricks as a modern office building:
- Workspace = The building itself, housing all your teams
- Notebooks = Individual offices where people do their work
- Clusters = The power grid and utilities that run everything
- Unity Catalog = The building directory that controls who can access what
- Delta Lake = The secure vault where all important documents are stored
The Databricks Lakehouse Platform
Databricks pioneered the lakehouse architecture, which combines the best features of data warehouses (reliability, performance, governance) with data lakes (low-cost storage, flexibility, open formats). This means you get one platform for all your data workloads.
Data Engineering
Build reliable ETL/ELT pipelines with Delta Lake, Spark, and Delta Live Tables for production-quality data workflows.
Data Science & ML
Train, track, and deploy machine learning models with MLflow, collaborative notebooks, and GPU-accelerated clusters.
SQL Analytics
Run high-performance SQL queries on your lakehouse data with SQL Warehouses and build interactive dashboards.
Real-Time Analytics
Process streaming data with Structured Streaming and Delta Lake for real-time insights and decision-making.
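To make the streaming workload concrete, here is a minimal Structured Streaming sketch. It runs only inside a Databricks notebook, where `spark` is predefined; the storage path, schema/checkpoint locations, and table name are illustrative placeholders.

```python
# Incrementally ingest JSON files from cloud storage with Auto Loader
events = (
    spark.readStream
    .format("cloudFiles")                    # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/data/_schemas/events")
    .load("/mnt/data/raw/events")
)

# Continuously append arriving events to a Delta table
(
    events.writeStream
    .option("checkpointLocation", "/mnt/data/_checkpoints/events")
    .toTable("bronze_events")
)
```

The checkpoint location is what lets the stream resume exactly where it left off after a cluster restart.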
Control Plane vs Data Plane Architecture
Databricks splits the platform into two distinct layers: a control plane managed by Databricks and a data plane that runs in your cloud account. Understanding this split is essential for security, networking, and cost optimization.
Control Plane Deep Dive
The control plane is fully managed by Databricks and runs in Databricks' own cloud account. It handles all the management, orchestration, and UI components:
| Component | Responsibility | Key Detail |
|---|---|---|
| Web Application | UI for notebooks, dashboards, cluster management | Accessible via browser at your workspace URL |
| Cluster Manager | Provisions and manages compute resources | Sends API calls to your cloud to spin up VMs |
| Jobs Scheduler | Orchestrates scheduled and triggered workflows | Cron-based or event-driven scheduling |
| Unity Catalog | Centralized governance and metadata | Data lineage, access control, auditing |
| REST API | Programmatic access to all platform features | Used by CLI, SDKs, and CI/CD pipelines |
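The Jobs Scheduler and REST API rows above come together when you create a scheduled job programmatically. Below is a hedged sketch of a Jobs API 2.1 `jobs/create` payload; the job name, notebook path, and cluster ID are illustrative placeholders, not values from this document.

```python
import json

# Hypothetical payload for POST /api/2.1/jobs/create
job_payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
            "existing_cluster_id": "1234-567890-abcde123",
        }
    ],
    # Quartz cron syntax: run every day at 02:00 in the given timezone
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

# Serialized, this is the JSON body you would send to the REST API
body = json.dumps(job_payload)
```

Note that Databricks job schedules use Quartz cron expressions (six or seven fields, including seconds), not standard five-field Unix cron.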
Data Plane Deep Dive
The data plane is where your data lives and where compute actually runs. It sits in your cloud account, giving you full control over data residency and network security:
# Your data stays in YOUR cloud account
# Databricks control plane only sends instructions
# Example: What happens when you run a query
# 1. You write SQL in the notebook UI (control plane)
# 2. Control plane sends query to your cluster (data plane)
# 3. Cluster reads data from YOUR S3/ADLS/GCS bucket
# 4. Results are computed on YOUR VMs
# 5. Only the result set is sent back to the UI
# This architecture means:
# - Data residency compliance (GDPR, HIPAA)
# - Network isolation via VPC/VNet
# - You control encryption keys
# - Full audit trail in your cloud
Workspace Hierarchy
A Databricks workspace is organized in a hierarchical structure that mirrors how teams collaborate on data projects. Understanding this hierarchy is key to organizing your work effectively.
Users, Groups, and Permissions
Identity Model
Databricks supports multiple identity providers and access control levels:
- Account-level: Manage users, groups, and service principals across all workspaces
- Workspace-level: Control who can access specific workspace resources
- SCIM Integration: Sync users and groups from Azure AD, Okta, or other IdPs
- Service Principals: Non-human identities for CI/CD and automation
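Service principals typically authenticate with OAuth client credentials rather than personal access tokens. The sketch below follows Databricks' machine-to-machine OAuth flow; the workspace URL, client ID, and secret are placeholders, and in real automation the secret would come from a secret store, never source code.

```python
from urllib.parse import urlencode

# Placeholder service principal credentials (assumptions, never hardcode real ones)
CLIENT_ID = "my-service-principal-id"
CLIENT_SECRET = "my-service-principal-secret"

# Databricks' machine-to-machine OAuth token endpoint lives under the workspace URL
token_url = "https://adb-1234567890.12.azuredatabricks.net/oidc/v1/token"

# Client-credentials grant: the id/secret go in HTTP basic auth,
# and this URL-encoded form is the POST body
form = {"grant_type": "client_credentials", "scope": "all-apis"}
body = urlencode(form)
auth = (CLIENT_ID, CLIENT_SECRET)
```

A POST to `token_url` with `auth` and `body` returns a short-lived access token that is then used as the Bearer token in API calls, just like a personal access token.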
Workspace Objects
Notebooks
Interactive documents with code cells (Python, SQL, Scala, R), visualizations, and markdown. The primary development interface.
Repos (Git Integration)
Full Git integration for version control. Clone repos, create branches, commit, push, and pull directly from the workspace.
Folders
Organize notebooks, libraries, and files. Supports shared folders for team collaboration and user-private folders.
Libraries
Custom JARs, Python wheels, and PyPI/Maven packages that can be installed on clusters for your code to use.
Workspace API Examples
import os
import requests

# Configure your workspace connection
DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
# Read the personal access token from the environment instead of hardcoding it
TOKEN = os.environ["DATABRICKS_TOKEN"]
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}
# List workspace contents
def list_workspace(path="/"):
response = requests.get(
f"{DATABRICKS_HOST}/api/2.0/workspace/list",
headers=headers,
params={"path": path}
)
return response.json()
# List all items in /Users directory
result = list_workspace("/Users")
for obj in result.get("objects", []):
print(f"{obj['object_type']}: {obj['path']}")
# Export a notebook
def export_notebook(path, format="SOURCE"):
response = requests.get(
f"{DATABRICKS_HOST}/api/2.0/workspace/export",
headers=headers,
params={"path": path, "format": format}
)
return response.json()
# Import a notebook
def import_notebook(path, content, language="PYTHON"):
payload = {
"path": path,
"language": language,
"content": content, # base64 encoded
"overwrite": True
}
response = requests.post(
f"{DATABRICKS_HOST}/api/2.0/workspace/import",
headers=headers,
json=payload
)
return response.json()
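The `import_notebook` helper above expects the `content` field to be base64 encoded. A small stdlib-only sketch of preparing that field (the notebook source here is an illustrative placeholder):

```python
import base64

# Notebook source to upload; the workspace import API expects base64 text
notebook_source = "# Databricks notebook source\nprint('hello from the API')\n"

# Encode to base64 for the "content" field of the import payload
encoded = base64.b64encode(notebook_source.encode("utf-8")).decode("ascii")

# Round-trip check: decoding recovers the original source
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == notebook_source
```

The same encoding applies in reverse when exporting: the `content` field returned by the export endpoint is base64 and must be decoded before use.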
Cloud Provider Integration
Databricks runs on all three major cloud providers. While the core platform experience is identical, each cloud has unique integration points for networking, storage, and identity.
| Feature | AWS | Azure | GCP |
|---|---|---|---|
| Object Storage | S3 | ADLS Gen2 | GCS |
| Networking | VPC + PrivateLink | VNet + Private Endpoints | VPC + Private Service Connect |
| Identity | IAM Roles | Azure AD + Managed Identity | IAM + Service Accounts |
| Key Management | AWS KMS | Azure Key Vault | Cloud KMS |
| Compute | EC2 instances | Azure VMs | GCE instances |
| Workspace URL | *.cloud.databricks.com | *.azuredatabricks.net | *.gcp.databricks.com |
Cloud Storage Mounting
# AWS S3 Mount (using instance profile)
dbutils.fs.mount(
source="s3a://my-data-bucket",
mount_point="/mnt/data"
)
# Azure ADLS Gen2 Mount (using service principal)
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<app-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(
scope="keyvault", key="sp-secret"
),
"fs.azure.account.oauth2.client.endpoint":
"https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
mount_point="/mnt/data",
extra_configs=configs
)
# List mounted data
display(dbutils.fs.ls("/mnt/data"))
# Read data using the mount
df = spark.read.format("delta").load("/mnt/data/bronze/events")
df.show(5)
Getting Started with Your Workspace
Whether you are setting up a new workspace or joining an existing one, here is a step-by-step guide to getting productive quickly.
First Steps Checklist
New Workspace Setup
- Step 1: Create your Databricks account through your cloud provider's portal or marketplace
- Step 2: Configure identity provider integration (Azure AD, Okta, etc.)
- Step 3: Set up networking -- deploy workspace into your VPC/VNet for security
- Step 4: Create your first cluster (start with a small interactive cluster)
- Step 5: Mount your cloud storage or configure external locations
- Step 6: Create your first notebook and connect it to your cluster
- Step 7: Set up Unity Catalog for data governance
Your First Notebook
# Cell 1: Verify your Spark session
print(f"Spark version: {spark.version}")
print(f"Cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")
# Cell 2: Explore available databases
display(spark.sql("SHOW DATABASES"))
# Cell 3: Create a simple DataFrame
data = [
(1, "Alice", "Engineering", 95000),
(2, "Bob", "Data Science", 105000),
(3, "Carol", "Engineering", 98000),
(4, "Dave", "Analytics", 87000),
]
columns = ["id", "name", "department", "salary"]
df = spark.createDataFrame(data, columns)
display(df)
# Cell 4: Use dbutils to explore the file system
display(dbutils.fs.ls("/"))
# Cell 5: Check available secrets scopes
print(dbutils.secrets.listScopes())
Practice Problems
Problem 1: Architecture Knowledge Check
Easy: Your security team asks: "Where does our data physically reside when we use Databricks?" How do you explain the control plane vs data plane architecture, and what are the security implications?
Problem 2: Workspace Organization
Easy: You are setting up a Databricks workspace for a team of 20 data engineers and 10 data scientists. Design a folder structure and access control strategy that balances collaboration with security.
Problem 3: Multi-Cloud Strategy
Medium: Your company uses AWS for production workloads but is evaluating Azure for European operations due to data residency requirements. How would you design a multi-cloud Databricks deployment?
Quick Reference
Databricks Architecture Cheat Sheet
| Concept | Description | Key Point |
|---|---|---|
| Control Plane | Managed by Databricks -- UI, API, scheduler | Stores metadata and notebook code, not your datasets |
| Data Plane | Runs in your cloud -- compute, storage | All data stays in your account |
| Workspace | Top-level container for all Databricks resources | One workspace per team/environment is common |
| Unity Catalog | Centralized governance layer | Spans multiple workspaces |
| DBFS | Databricks File System abstraction | Backed by cloud object storage |
| Repos | Git integration for notebooks and files | Supports GitHub, GitLab, Azure DevOps, Bitbucket |
| Secrets | Secure credential storage | Backed by Databricks or Azure Key Vault |