Databricks Workspace & Architecture

Easy 20 min read

What is Databricks?

Why Databricks Matters

The Problem: Organizations struggle with fragmented data tools -- separate systems for data engineering, data science, machine learning, and analytics that do not communicate well with each other.

The Solution: Databricks provides a unified analytics platform built on Apache Spark that brings together data engineering, data science, and business analytics on a single collaborative platform.

Real Impact: Over 10,000 organizations worldwide use Databricks to process exabytes of data daily, from Fortune 500 companies to fast-growing startups.

Real-World Analogy

Think of Databricks as a modern office building:

  • Workspace = The building itself, housing all your teams
  • Notebooks = Individual offices where people do their work
  • Clusters = The power grid and utilities that run everything
  • Unity Catalog = The building directory that controls who can access what
  • Delta Lake = The secure vault where all important documents are stored

The Databricks Lakehouse Platform

Databricks pioneered the lakehouse architecture, which combines the best features of data warehouses (reliability, performance, governance) with data lakes (low-cost storage, flexibility, open formats). This means you get one platform for all your data workloads.

Data Engineering

Build reliable ETL/ELT pipelines with Delta Lake, Spark, and Delta Live Tables for production-quality data workflows.

Data Science & ML

Train, track, and deploy machine learning models with MLflow, collaborative notebooks, and GPU-accelerated clusters.

SQL Analytics

Run high-performance SQL queries on your lakehouse data with SQL Warehouses and build interactive dashboards.

Real-Time Analytics

Process streaming data with Structured Streaming and Delta Lake for real-time insights and decision-making.

Control Plane vs Data Plane Architecture

Databricks uses a separation-of-concerns architecture that splits the platform into two distinct layers. Understanding this split is essential for security, networking, and cost optimization.

Databricks Architecture: Control Plane vs Data Plane

  • Control Plane (managed by Databricks): Web Application, Cluster Manager, Notebook Service, Jobs Scheduler, REST API, Unity Catalog, Identity & Access Management, Encryption Key Management
  • Data Plane (your cloud account): Spark Clusters, SQL Warehouses, DBFS / Storage, Delta Tables, Cloud Object Storage (S3 / ADLS / GCS), VPC / VNet Networking, Compute Instances (VMs)
  • The two planes communicate over a secure channel

Control Plane Deep Dive

The control plane is fully managed by Databricks and runs in Databricks' own cloud account. It handles all the management, orchestration, and UI components:

Control plane components and their responsibilities:

  • Web Application -- UI for notebooks, dashboards, and cluster management; accessible via browser at your workspace URL
  • Cluster Manager -- provisions and manages compute resources; sends API calls to your cloud to spin up VMs
  • Jobs Scheduler -- orchestrates scheduled and triggered workflows; cron-based or event-driven scheduling
  • Unity Catalog -- centralized governance and metadata; data lineage, access control, auditing
  • REST API -- programmatic access to all platform features; used by CLI, SDKs, and CI/CD pipelines

Data Plane Deep Dive

The data plane is where your data lives and where compute actually runs. This runs in your cloud account, giving you full control over data residency and network security:

Key Insight: Data Never Leaves Your Account
# Your data stays in YOUR cloud account
# Databricks control plane only sends instructions

# Example: What happens when you run a query
# 1. You write SQL in the notebook UI (control plane)
# 2. Control plane sends query to your cluster (data plane)
# 3. Cluster reads data from YOUR S3/ADLS/GCS bucket
# 4. Results are computed on YOUR VMs
# 5. Only the result set is sent back to the UI

# This architecture means:
# - Data residency compliance (GDPR, HIPAA)
# - Network isolation via VPC/VNet
# - You control encryption keys
# - Full audit trail in your cloud

Workspace Hierarchy

A Databricks workspace is organized in a hierarchical structure that mirrors how teams collaborate on data projects. Understanding this hierarchy is key to organizing your work effectively.

Users, Groups, and Permissions

Identity Model

Databricks supports multiple identity providers and access control levels:

  • Account-level: Manage users, groups, and service principals across all workspaces
  • Workspace-level: Control who can access specific workspace resources
  • SCIM Integration: Sync users and groups from Azure AD, Okta, or other IdPs
  • Service Principals: Non-human identities for CI/CD and automation

Workspace Objects

Notebooks

Interactive documents with code cells (Python, SQL, Scala, R), visualizations, and markdown. The primary development interface.

Repos (Git Integration)

Full Git integration for version control. Clone repos, create branches, commit, push, and pull directly from the workspace.

Folders

Organize notebooks, libraries, and files. Supports shared folders for team collaboration and user-private folders.

Libraries

Custom JARs, Python wheels, and PyPI/Maven packages that can be installed on clusters for your code to use.
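Library installation can also be scripted against the Libraries API (`POST /api/2.0/libraries/install`). A minimal sketch of building the request body -- the cluster ID and package versions below are hypothetical:

```python
def build_install_payload(cluster_id, pypi_packages=(), maven_coords=()):
    """Build the request body for the Libraries API install endpoint
    (POST /api/2.0/libraries/install)."""
    libraries = [{"pypi": {"package": p}} for p in pypi_packages]
    libraries += [{"maven": {"coordinates": c}} for c in maven_coords]
    return {"cluster_id": cluster_id, "libraries": libraries}

payload = build_install_payload(
    "0101-120000-abcd123",  # hypothetical cluster ID
    pypi_packages=["great-expectations==0.18.0"],
    maven_coords=["com.databricks:spark-xml_2.12:0.16.0"],
)
```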

Workspace API Examples

Python - Workspace API Calls
import requests
import json

# Configure your workspace connection
DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi_your_personal_access_token"

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

# List workspace contents
def list_workspace(path="/"):
    response = requests.get(
        f"{DATABRICKS_HOST}/api/2.0/workspace/list",
        headers=headers,
        params={"path": path}
    )
    return response.json()

# List all items in /Users directory
result = list_workspace("/Users")
for obj in result.get("objects", []):
    print(f"{obj['object_type']}: {obj['path']}")

# Export a notebook (the response carries base64-encoded source)
def export_notebook(path, export_format="SOURCE"):
    response = requests.get(
        f"{DATABRICKS_HOST}/api/2.0/workspace/export",
        headers=headers,
        params={"path": path, "format": export_format}
    )
    return response.json()  # {"content": "<base64>"}

# Import a notebook
def import_notebook(path, content, language="PYTHON"):
    payload = {
        "path": path,
        "language": language,
        "content": content,  # base64 encoded
        "overwrite": True
    }
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers=headers,
        json=payload
    )
    return response.json()
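Note that import_notebook expects the content field to be base64-encoded, and export returns source in the same form. A small self-contained round-trip helper:

```python
import base64

def encode_source(source: str) -> str:
    """Encode notebook source text for the workspace import API."""
    return base64.b64encode(source.encode("utf-8")).decode("ascii")

def decode_source(content_b64: str) -> str:
    """Decode the base64 content returned by the workspace export API."""
    return base64.b64decode(content_b64).decode("utf-8")

source = "# Databricks notebook source\nprint('hello lakehouse')\n"
content = encode_source(source)
assert decode_source(content) == source
# Pass `content` as the "content" field of the import payload.
```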

Cloud Provider Integration

Databricks runs on all three major cloud providers. While the core platform experience is identical, each cloud has unique integration points for networking, storage, and identity.

Feature comparison across clouds:

  • Object Storage -- AWS: S3; Azure: ADLS Gen2; GCP: GCS
  • Networking -- AWS: VPC + PrivateLink; Azure: VNet + Private Endpoints; GCP: VPC + Private Service Connect
  • Identity -- AWS: IAM Roles; Azure: Azure AD + Managed Identity; GCP: IAM + Service Accounts
  • Key Management -- AWS: AWS KMS; Azure: Azure Key Vault; GCP: Cloud KMS
  • Compute -- AWS: EC2 instances; Azure: Azure VMs; GCP: GCE instances
  • Workspace URL -- AWS: *.cloud.databricks.com; Azure: *.azuredatabricks.net; GCP: *.gcp.databricks.com
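Because the workspace URL suffix differs per cloud, a small helper can infer the provider from a hostname. This is a convenience sketch based on the URL patterns above, not an official API:

```python
def detect_cloud(workspace_url: str) -> str:
    """Infer the cloud provider from a Databricks workspace URL
    using the per-cloud hostname suffixes."""
    host = workspace_url.split("//")[-1].split("/")[0]
    if host.endswith(".azuredatabricks.net"):
        return "azure"
    if host.endswith(".gcp.databricks.com"):
        return "gcp"
    if host.endswith(".cloud.databricks.com"):
        return "aws"
    return "unknown"

print(detect_cloud("https://adb-1234567890.12.azuredatabricks.net"))  # azure
```

Helpers like this are handy in multi-cloud CI/CD scripts that must pick the right credentials per workspace.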

Cloud Storage Mounting

Python - Mounting Cloud Storage
# AWS S3 mount (using an instance profile attached to the cluster)
# Note: mounts are a legacy pattern; Unity Catalog external locations
# and volumes are recommended for new workspaces.
dbutils.fs.mount(
    source="s3a://my-data-bucket",
    mount_point="/mnt/data"
)

# Azure ADLS Gen2 Mount (using service principal)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<app-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(
        scope="keyvault", key="sp-secret"
    ),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://[email protected]/",
    mount_point="/mnt/data",
    extra_configs=configs
)

# List mounted data
display(dbutils.fs.ls("/mnt/data"))

# Read data using the mount
df = spark.read.format("delta").load("/mnt/data/bronze/events")
df.show(5)

Getting Started with Your Workspace

Whether you are setting up a new workspace or joining an existing one, here is a step-by-step guide to getting productive quickly.

First Steps Checklist

New Workspace Setup

  • Step 1: Create your Databricks account at the appropriate cloud portal
  • Step 2: Configure identity provider integration (Azure AD, Okta, etc.)
  • Step 3: Set up networking -- deploy workspace into your VPC/VNet for security
  • Step 4: Create your first cluster (start with a small interactive cluster)
  • Step 5: Mount your cloud storage or configure external locations
  • Step 6: Create your first notebook and connect it to your cluster
  • Step 7: Set up Unity Catalog for data governance

Your First Notebook

Python - First Databricks Notebook
# Cell 1: Verify your Spark session
print(f"Spark version: {spark.version}")
print(f"Cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")

# Cell 2: Explore available databases
display(spark.sql("SHOW DATABASES"))

# Cell 3: Create a simple DataFrame
data = [
    (1, "Alice", "Engineering", 95000),
    (2, "Bob", "Data Science", 105000),
    (3, "Carol", "Engineering", 98000),
    (4, "Dave", "Analytics", 87000),
]
columns = ["id", "name", "department", "salary"]

df = spark.createDataFrame(data, columns)
display(df)

# Cell 4: Use dbutils to explore the file system
display(dbutils.fs.ls("/"))

# Cell 5: Check available secrets scopes
print(dbutils.secrets.listScopes())

Practice Problems

Problem 1: Architecture Knowledge Check

Easy

Your security team asks: "Where does our data physically reside when we use Databricks?" How do you explain the control plane vs data plane architecture, and what are the security implications?

Problem 2: Workspace Organization

Easy

You are setting up a Databricks workspace for a team of 20 data engineers and 10 data scientists. Design a folder structure and access control strategy that balances collaboration with security.

Problem 3: Multi-Cloud Strategy

Medium

Your company uses AWS for production workloads but is evaluating Azure for European operations due to data residency requirements. How would you design a multi-cloud Databricks deployment?

Quick Reference

Databricks Architecture Cheat Sheet

  • Control Plane -- managed by Databricks (UI, API, scheduler); no customer data stored here
  • Data Plane -- runs in your cloud (compute, storage); all data stays in your account
  • Workspace -- top-level container for all Databricks resources; one workspace per team/environment is common
  • Unity Catalog -- centralized governance layer; spans multiple workspaces
  • DBFS -- Databricks File System abstraction; backed by cloud object storage
  • Repos -- Git integration for notebooks and files; supports GitHub, GitLab, Azure DevOps, and Bitbucket
  • Secrets -- secure credential storage; backed by Databricks or Azure Key Vault

Useful Resources