What is Databricks?
Why Databricks Matters
The Problem: Organizations struggle with fragmented data tools -- separate systems for data engineering, data science, machine learning, and analytics that do not communicate well with each other.
The Solution: Databricks provides a unified analytics platform built on Apache Spark that brings together data engineering, data science, and business analytics on a single collaborative platform.
Real Impact: Over 10,000 organizations worldwide use Databricks to process exabytes of data daily, from Fortune 500 companies to fast-growing startups.
Real-World Analogy
Think of Databricks as a modern office building:
- Workspace = The building itself, housing all your teams
- Notebooks = Individual offices where people do their work
- Clusters = The power grid and utilities that run everything
- Unity Catalog = The building directory that controls who can access what
- Delta Lake = The secure vault where all important documents are stored
The Databricks Lakehouse Platform
Databricks pioneered the lakehouse architecture, which combines the best features of data warehouses (reliability, performance, governance) with data lakes (low-cost storage, flexibility, open formats). This means you get one platform for all your data workloads.
Data Engineering
Build reliable ETL/ELT pipelines with Delta Lake, Spark, and Delta Live Tables for production-quality data workflows.
Data Science & ML
Train, track, and deploy machine learning models with MLflow, collaborative notebooks, and GPU-accelerated clusters.
SQL Analytics
Run high-performance SQL queries on your lakehouse data with SQL Warehouses and build interactive dashboards.
Real-Time Analytics
Process streaming data with Structured Streaming and Delta Lake for real-time insights and decision-making.
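To make the streaming workload concrete, here is a minimal Structured Streaming sketch. It runs only inside a Databricks notebook, where `spark` is predefined; the storage path, schema/checkpoint locations, and table name are illustrative placeholders.

```python
# Incrementally ingest JSON files from cloud storage with Auto Loader
events = (
    spark.readStream
    .format("cloudFiles")                    # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/data/_schemas/events")
    .load("/mnt/data/raw/events")
)

# Continuously append arriving events to a Delta table
(
    events.writeStream
    .option("checkpointLocation", "/mnt/data/_checkpoints/events")
    .toTable("bronze_events")
)
```

The checkpoint location is what lets the stream resume exactly where it left off after a cluster restart.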
Control Plane vs Data Plane Architecture
Databricks splits the platform into two distinct layers: a control plane managed by Databricks and a data plane that runs in your cloud account. Understanding this split is essential for security, networking, and cost optimization.
Control Plane Deep Dive
The control plane is fully managed by Databricks and runs in Databricks' own cloud account. It handles all the management, orchestration, and UI components:
| Component | Responsibility | Key Detail |
|---|---|---|
| Web Application | UI for notebooks, dashboards, cluster management | Accessible via browser at your workspace URL |
| Cluster Manager | Provisions and manages compute resources | Sends API calls to your cloud to spin up VMs |
| Jobs Scheduler | Orchestrates scheduled and triggered workflows | Cron-based or event-driven scheduling |
| Unity Catalog | Centralized governance and metadata | Data lineage, access control, auditing |
| REST API | Programmatic access to all platform features | Used by CLI, SDKs, and CI/CD pipelines |
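The Jobs Scheduler and REST API rows above come together when you create a scheduled job programmatically. Below is a hedged sketch of a Jobs API 2.1 `jobs/create` payload; the job name, notebook path, and cluster ID are illustrative placeholders, not values from this document.

```python
import json

# Hypothetical payload for POST /api/2.1/jobs/create
job_payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
            "existing_cluster_id": "1234-567890-abcde123",
        }
    ],
    # Quartz cron syntax: run every day at 02:00 in the given timezone
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

# Serialized, this is the JSON body you would send to the REST API
body = json.dumps(job_payload)
```

Note that Databricks job schedules use Quartz cron expressions (six or seven fields, including seconds), not standard five-field Unix cron.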
Data Plane Deep Dive
The data plane is where your data lives and where compute actually runs. It sits in your cloud account, giving you full control over data residency and network security:
# Your data stays in YOUR cloud account
# Databricks control plane only sends instructions
# Example: What happens when you run a query
# 1. You write SQL in the notebook UI (control plane)
# 2. Control plane sends query to your cluster (data plane)
# 3. Cluster reads data from YOUR S3/ADLS/GCS bucket
# 4. Results are computed on YOUR VMs
# 5. Only the result set is sent back to the UI
# This architecture means:
# - Data residency compliance (GDPR, HIPAA)
# - Network isolation via VPC/VNet
# - You control encryption keys
# - Full audit trail in your cloud
Workspace Hierarchy
A Databricks workspace is organized in a hierarchical structure that mirrors how teams collaborate on data projects. Understanding this hierarchy is key to organizing your work effectively.
Users, Groups, and Permissions
Identity Model
Databricks supports multiple identity providers and access control levels:
- Account-level: Manage users, groups, and service principals across all workspaces
- Workspace-level: Control who can access specific workspace resources
- SCIM Integration: Sync users and groups from Azure AD, Okta, or other IdPs
- Service Principals: Non-human identities for CI/CD and automation
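Service principals typically authenticate with OAuth client credentials rather than personal access tokens. The sketch below follows Databricks' machine-to-machine OAuth flow; the workspace URL, client ID, and secret are placeholders, and in real automation the secret would come from a secret store, never source code.

```python
from urllib.parse import urlencode

# Placeholder service principal credentials (assumptions, never hardcode real ones)
CLIENT_ID = "my-service-principal-id"
CLIENT_SECRET = "my-service-principal-secret"

# Databricks' machine-to-machine OAuth token endpoint lives under the workspace URL
token_url = "https://adb-1234567890.12.azuredatabricks.net/oidc/v1/token"

# Client-credentials grant: the id/secret go in HTTP basic auth,
# and this URL-encoded form is the POST body
form = {"grant_type": "client_credentials", "scope": "all-apis"}
body = urlencode(form)
auth = (CLIENT_ID, CLIENT_SECRET)
```

A POST to `token_url` with `auth` and `body` returns a short-lived access token that is then used as the Bearer token in API calls, just like a personal access token.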
Workspace Objects
Notebooks
Interactive documents with code cells (Python, SQL, Scala, R), visualizations, and markdown. The primary development interface.
Repos (Git Integration)
Full Git integration for version control. Clone repos, create branches, commit, push, and pull directly from the workspace.
Folders
Organize notebooks, libraries, and files. Supports shared folders for team collaboration and user-private folders.
Libraries
Custom JARs, Python wheels, and PyPI/Maven packages that can be installed on clusters for your code to use.
Workspace API Examples
import os
import requests

# Configure your workspace connection
DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
# Read the personal access token from the environment instead of hardcoding it
TOKEN = os.environ["DATABRICKS_TOKEN"]
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}
# List workspace contents
def list_workspace(path="/"):
response = requests.get(
f"{DATABRICKS_HOST}/api/2.0/workspace/list",
headers=headers,
params={"path": path}
)
return response.json()
# List all items in /Users directory
result = list_workspace("/Users")
for obj in result.get("objects", []):
print(f"{obj['object_type']}: {obj['path']}")
# Export a notebook
def export_notebook(path, format="SOURCE"):
response = requests.get(
f"{DATABRICKS_HOST}/api/2.0/workspace/export",
headers=headers,
params={"path": path, "format": format}
)
return response.json()
# Import a notebook
def import_notebook(path, content, language="PYTHON"):
payload = {
"path": path,
"language": language,
"content": content, # base64 encoded
"overwrite": True
}
response = requests.post(
f"{DATABRICKS_HOST}/api/2.0/workspace/import",
headers=headers,
json=payload
)
return response.json()
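The `import_notebook` helper above expects the `content` field to be base64 encoded. A small stdlib-only sketch of preparing that field (the notebook source here is an illustrative placeholder):

```python
import base64

# Notebook source to upload; the workspace import API expects base64 text
notebook_source = "# Databricks notebook source\nprint('hello from the API')\n"

# Encode to base64 for the "content" field of the import payload
encoded = base64.b64encode(notebook_source.encode("utf-8")).decode("ascii")

# Round-trip check: decoding recovers the original source
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == notebook_source
```

The same encoding applies in reverse when exporting: the `content` field returned by the export endpoint is base64 and must be decoded before use.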
Cloud Provider Integration
Databricks runs on all three major cloud providers. While the core platform experience is identical, each cloud has unique integration points for networking, storage, and identity.
| Feature | AWS | Azure | GCP |
|---|---|---|---|
| Object Storage | S3 | ADLS Gen2 | GCS |
| Networking | VPC + PrivateLink | VNet + Private Endpoints | VPC + Private Service Connect |
| Identity | IAM Roles | Azure AD + Managed Identity | IAM + Service Accounts |
| Key Management | AWS KMS | Azure Key Vault | Cloud KMS |
| Compute | EC2 instances | Azure VMs | GCE instances |
| Workspace URL | *.cloud.databricks.com | *.azuredatabricks.net | *.gcp.databricks.com |
Cloud Storage Mounting
# AWS S3 Mount (using instance profile)
dbutils.fs.mount(
source="s3a://my-data-bucket",
mount_point="/mnt/data"
)
# Azure ADLS Gen2 Mount (using service principal)
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<app-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(
scope="keyvault", key="sp-secret"
),
"fs.azure.account.oauth2.client.endpoint":
"https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
mount_point="/mnt/data",
extra_configs=configs
)
# List mounted data
display(dbutils.fs.ls("/mnt/data"))
# Read data using the mount
df = spark.read.format("delta").load("/mnt/data/bronze/events")
df.show(5)
Getting Started with Your Workspace
Whether you are setting up a new workspace or joining an existing one, here is a step-by-step guide to getting productive quickly.
First Steps Checklist
New Workspace Setup
- Step 1: Create your Databricks account through your cloud provider's portal or marketplace
- Step 2: Configure identity provider integration (Azure AD, Okta, etc.)
- Step 3: Set up networking -- deploy workspace into your VPC/VNet for security
- Step 4: Create your first cluster (start with a small interactive cluster)
- Step 5: Mount your cloud storage or configure external locations
- Step 6: Create your first notebook and connect it to your cluster
- Step 7: Set up Unity Catalog for data governance
Your First Notebook
# Cell 1: Verify your Spark session
print(f"Spark version: {spark.version}")
print(f"Cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")
# Cell 2: Explore available databases
display(spark.sql("SHOW DATABASES"))
# Cell 3: Create a simple DataFrame
data = [
(1, "Alice", "Engineering", 95000),
(2, "Bob", "Data Science", 105000),
(3, "Carol", "Engineering", 98000),
(4, "Dave", "Analytics", 87000),
]
columns = ["id", "name", "department", "salary"]
df = spark.createDataFrame(data, columns)
display(df)
# Cell 4: Use dbutils to explore the file system
display(dbutils.fs.ls("/"))
# Cell 5: Check available secrets scopes
print(dbutils.secrets.listScopes())
Practice Problems
Problem 1: Architecture Knowledge Check
Easy: Your security team asks: "Where does our data physically reside when we use Databricks?" How do you explain the control plane vs data plane architecture, and what are the security implications?
Problem 2: Workspace Organization
Easy: You are setting up a Databricks workspace for a team of 20 data engineers and 10 data scientists. Design a folder structure and access control strategy that balances collaboration with security.
Problem 3: Multi-Cloud Strategy
Medium: Your company uses AWS for production workloads but is evaluating Azure for European operations due to data residency requirements. How would you design a multi-cloud Databricks deployment?
Quick Reference
Databricks Architecture Cheat Sheet
| Concept | Description | Key Point |
|---|---|---|
| Control Plane | Managed by Databricks -- UI, API, scheduler | Stores metadata and notebook code, not your datasets |
| Data Plane | Runs in your cloud -- compute, storage | All data stays in your account |
| Workspace | Top-level container for all Databricks resources | One workspace per team/environment is common |
| Unity Catalog | Centralized governance layer | Spans multiple workspaces |
| DBFS | Databricks File System abstraction | Backed by cloud object storage |
| Repos | Git integration for notebooks and files | Supports GitHub, GitLab, Azure DevOps, Bitbucket |
| Secrets | Secure credential storage | Backed by Databricks or Azure Key Vault |