Security, IAM & Network Controls

Hard · 25 min read

Identity & Access Management

Why Security Matters in Databricks

The Problem: Data platforms handle sensitive business data, PII, and financial records. A misconfigured access control can lead to data breaches, compliance violations, and significant financial penalties.

The Solution: Databricks provides a layered security model with identity management, fine-grained access controls via Unity Catalog, network isolation, encryption, and comprehensive audit logging.

Real Impact: Organizations in regulated industries (healthcare, finance, government) use Databricks security features to achieve HIPAA, SOC 2, FedRAMP, and GDPR compliance.

Databricks identity management operates at two levels: account-level (across all workspaces) and workspace-level (within a single workspace). Understanding this hierarchy is essential for designing secure multi-team environments.

Databricks IAM Role Mapping Architecture (diagram): an identity provider (Azure AD / Okta / OneLogin) syncs users, groups, and service principals to the account level via SCIM, giving centralized identity management across all workspaces. Account-level identities are then assigned to individual workspaces -- e.g., Production (WS admin, data engineer, data analyst, read-only), Staging (WS admin, data engineer, CI/CD service principal), and Development (WS admin, all engineers, data scientists).
Role | Scope | Capabilities
Account Admin | All workspaces | Manage users, groups, workspaces, Unity Catalog metastore
Workspace Admin | Single workspace | Manage workspace settings, clusters, permissions
Metastore Admin | Unity Catalog | Manage catalogs, schemas, grants, data lineage
Users | Workspace | Access granted resources, run notebooks, submit jobs

Service Principals

Service principals are non-human identities used for automation, CI/CD pipelines, and programmatic access. They are the recommended authentication method for production workloads -- never use personal access tokens from human users in automated systems.

Python - Service Principal Authentication
from databricks.sdk import WorkspaceClient

# Authenticate using service principal (OAuth M2M)
w = WorkspaceClient(
    host="https://adb-1234567890.12.azuredatabricks.net",
    client_id="your-service-principal-client-id",
    client_secret="your-service-principal-secret"
)

# List clusters using the service principal identity
for cluster in w.clusters.list():
    print(f"{cluster.cluster_name}: {cluster.state}")

# Create a service principal via the Account API
from databricks.sdk import AccountClient
from databricks.sdk.service.iam import WorkspacePermission

account = AccountClient(
    host="https://accounts.azuredatabricks.net",
    account_id="your-account-id",
    client_id="admin-sp-client-id",
    client_secret="admin-sp-secret"
)

# Create a new service principal
sp = account.service_principals.create(
    display_name="etl-pipeline-sp",
    active=True
)
print(f"Created SP: {sp.application_id}")

# Grant workspace access to the service principal
# (principal IDs are returned as strings; permissions take the enum)
account.workspace_assignment.update(
    workspace_id=1234567890,
    principal_id=int(sp.id),
    permissions=[WorkspacePermission.USER]
)

Token Authentication

Databricks supports multiple authentication methods. Personal access tokens (PATs) are the simplest but least secure. OAuth tokens with service principals are recommended for production.

Personal Access Tokens (PATs)

Generated per user, scoped to a workspace. Good for development and testing. Set short TTLs and rotate regularly. Never commit to source control.
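As a rough sketch of the "short TTL" advice, the helper below (hypothetical names, not part of the Databricks SDK) flags tokens that have outlived a rotation policy, given their creation timestamps:

```python
from datetime import datetime, timedelta, timezone

MAX_TOKEN_AGE = timedelta(days=30)  # example short-TTL policy for PATs

def tokens_due_for_rotation(tokens, now=None):
    """Return the names of tokens older than the policy's max age.

    `tokens` maps token name -> creation time (UTC datetime).
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, created in tokens.items()
        if now - created > MAX_TOKEN_AGE
    )

tokens = {
    "ci-token": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "dev-token": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
print(tokens_due_for_rotation(tokens, now=datetime(2024, 3, 15, tzinfo=timezone.utc)))
# → ['ci-token']
```

In practice the creation timestamps would come from the workspace's token listing API, and the flagged tokens would be revoked and reissued by an automation job rather than by hand.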

OAuth (M2M)

Service principal authentication using OAuth client credentials flow. Recommended for CI/CD and production automation. Tokens auto-rotate.

Azure AD Tokens

For Azure Databricks, use Azure AD tokens with managed identities. Integrates with Azure RBAC for unified access control.

AWS IAM Credentials

For AWS Databricks, use instance profiles and IAM roles. Provides temporary credentials that auto-rotate without manual management.

Network Security (VPC/Private Link)

Network security in Databricks involves isolating the data plane within your cloud VPC/VNet and controlling all network traffic paths. For enterprise deployments, Private Link eliminates data traversal over the public internet.

Databricks Network Security Architecture (diagram): traffic from the public internet passes through IP access lists / WAF before reaching the Databricks control plane (web app/API, cluster manager, Unity Catalog, jobs scheduler). The data plane runs privately in your VPC/VNet: Spark workers, driver nodes, and SQL warehouses sit in a private subnet, and storage (S3/ADLS, Delta tables) is reached through service endpoints in a storage subnet. Network security groups/NACLs, Key Vault/KMS, and a private DNS zone complete the perimeter.

IP Access Lists

IP access lists restrict which IP addresses can access the Databricks workspace API and UI. This provides an additional security layer on top of network controls, ensuring only traffic from approved corporate networks can reach your workspace.

Python - Manage IP Access Lists
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import ListType

w = WorkspaceClient()

# Create an IP access list (allow list)
ip_list = w.ip_access_lists.create(
    label="Corporate VPN",
    list_type=ListType.ALLOW,
    ip_addresses=[
        "10.0.0.0/8",         # Internal network
        "203.0.113.0/24",     # Office IP range
        "198.51.100.50/32",   # VPN gateway
    ]
)

# Enable IP access list enforcement
w.workspace_conf.set_status({
    "enableIpAccessLists": "true"
})

# List all configured IP access lists
for acl in w.ip_access_lists.list():
    print(f"{acl.label}: {acl.ip_addresses}")
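The allow-list semantics can be reproduced locally with the standard library's `ipaddress` module; this sketch shows how a client IP is matched against the configured CIDR blocks:

```python
import ipaddress

# Same CIDR blocks as the allow list above
ALLOWED_CIDRS = ["10.0.0.0/8", "203.0.113.0/24", "198.51.100.50/32"]

def is_allowed(client_ip: str, cidrs=ALLOWED_CIDRS) -> bool:
    """True if client_ip falls inside any allowed CIDR block."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)

print(is_allowed("10.42.7.1"))       # → True  (inside 10.0.0.0/8)
print(is_allowed("198.51.100.50"))   # → True  (exact /32 match)
print(is_allowed("8.8.8.8"))         # → False (not on the allow list)
```

This is why office and VPN egress ranges must be kept current: a request from any address outside every listed block is rejected before it reaches the workspace.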

Secrets Management

Databricks secrets provide a secure way to store and access sensitive information like API keys, database passwords, and connection strings. Secrets are stored in scopes and can be backed by Databricks-managed storage or external vaults like Azure Key Vault or AWS Secrets Manager.

Python - Secrets API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a secret scope
w.secrets.create_scope(scope="production-secrets")

# Store a secret
w.secrets.put_secret(
    scope="production-secrets",
    key="database-password",
    string_value="super-secret-password-123"
)

# In a notebook, retrieve the secret
# Secrets are REDACTED in notebook output
password = dbutils.secrets.get(
    scope="production-secrets",
    key="database-password"
)

# Use the secret in a JDBC connection
df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://db-host:5432/mydb",
    dbtable="public.users",
    user="etl_user",
    password=password
).load()

# Grant access to a group
# Grant read access to a group (permission takes the enum)
from databricks.sdk.service.workspace import AclPermission

w.secrets.put_acl(
    scope="production-secrets",
    principal="data-engineers",
    permission=AclPermission.READ
)

# List secrets (keys only, values are never exposed)
for secret in w.secrets.list_secrets(scope="production-secrets"):
    print(f"Key: {secret.key}, Last Updated: {secret.last_updated_timestamp}")
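The redaction mentioned in the comments above means a secret's value never appears in notebook output, even if printed. A toy illustration of that behavior (not the actual Databricks implementation):

```python
def redact(text: str, secret_values) -> str:
    """Replace any known secret value appearing in output with [REDACTED]."""
    for value in secret_values:
        text = text.replace(value, "[REDACTED]")
    return text

password = "super-secret-password-123"
print(redact(f"connecting with password={password}", [password]))
# → connecting with password=[REDACTED]
```

Redaction protects against accidental disclosure in logs and cell output, but it is not a substitute for ACLs: anyone with READ on the scope can still use the secret's value programmatically.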

Audit Logging

Databricks audit logs capture detailed records of all actions performed in your workspace, including who accessed what data, when clusters were created, and which notebooks were executed. These logs are essential for compliance, security investigations, and operational monitoring.

Audit Log Categories

  • Workspace events: Login, notebook access, cluster operations
  • Account events: User management, workspace provisioning
  • Unity Catalog events: Data access, grant changes, lineage queries
  • DBSQL events: SQL warehouse queries, dashboard access
  • Secrets events: Secret scope access, secret retrieval

SQL - Query Audit Logs
-- Audit logs delivered to your cloud storage (S3/ADLS/GCS)
-- Create an external table over the audit log files

CREATE TABLE IF NOT EXISTS audit.logs
USING JSON
LOCATION 's3://your-audit-bucket/databricks/audit-logs/';

-- Find all failed login attempts in the last 24 hours
SELECT
    timestamp,
    userIdentity.email AS user_email,
    sourceIPAddress,
    requestParams.user AS target_user,
    response.statusCode
FROM audit.logs
WHERE actionName = 'login'
    AND response.statusCode != 200
    AND timestamp >= current_timestamp() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;

-- Track who accessed sensitive tables
SELECT
    timestamp,
    userIdentity.email,
    actionName,
    requestParams.full_name_arg AS table_name
FROM audit.logs
WHERE serviceName = 'unityCatalog'
    AND actionName IN ('getTable', 'readVolume')
    AND requestParams.full_name_arg LIKE '%pii%'
ORDER BY timestamp DESC;
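Audit logs are delivered as JSON, so the same failed-login check can also be sketched outside of Spark with plain Python; the field names below follow the schema used in the SQL queries above, and the record values are illustrative:

```python
import json
from collections import Counter

# Sample audit records in JSON-lines form (illustrative values)
raw_logs = """
{"timestamp": "2024-03-15T08:01:00Z", "actionName": "login", "userIdentity": {"email": "alice@corp.com"}, "response": {"statusCode": 200}}
{"timestamp": "2024-03-15T08:02:00Z", "actionName": "login", "userIdentity": {"email": "mallory@corp.com"}, "response": {"statusCode": 401}}
{"timestamp": "2024-03-15T08:03:00Z", "actionName": "login", "userIdentity": {"email": "mallory@corp.com"}, "response": {"statusCode": 401}}
""".strip().splitlines()

# Count failed login attempts per user (non-200 status on a login action)
failed_logins = Counter(
    rec["userIdentity"]["email"]
    for rec in map(json.loads, raw_logs)
    if rec["actionName"] == "login" and rec["response"]["statusCode"] != 200
)
print(failed_logins.most_common())
# → [('mallory@corp.com', 2)]
```

A security team might wire a check like this into an alerting job so that a burst of failed logins from one identity or source IP pages someone, rather than waiting for an ad hoc SQL query.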

Practice Problems

Problem 1: Secure Multi-Team Access Design

Hard

Your organization has three teams: Data Engineering, Data Science, and Business Analytics. Data Engineers need full access to all bronze/silver/gold tables. Data Scientists need read access to silver/gold plus write access to a sandbox catalog. Business Analysts need read-only access to gold tables. Design the Unity Catalog grants and workspace configuration.

Problem 2: Network Security Architecture

Hard

Your company requires that no data traffic traverses the public internet, all API access comes from the corporate VPN, and all encryption keys are customer-managed. Design the network and security architecture for a Databricks deployment on AWS.

Problem 3: Secret Rotation Strategy

Medium

Your team stores database credentials and API keys in Databricks secret scopes. Currently, secrets are rotated manually every 90 days, which sometimes gets forgotten. Design an automated secret rotation strategy.

Quick Reference

Security Cheat Sheet

Feature | Purpose | Best Practice
Service Principals | Non-human authentication | Use for all CI/CD and automation
Unity Catalog | Fine-grained data access control | Grant at schema level, not table level
Private Link | Eliminate public internet traffic | Required for regulated industries
IP Access Lists | Restrict API/UI access by IP | Allow only VPN/office CIDRs
Secrets | Secure credential storage | Use vault-backed scopes with auto-rotation
Audit Logs | Compliance and investigation | Deliver to cloud storage, query with SQL
Customer-Managed Keys | Encryption key control | Use for HIPAA/PCI workloads