Identity & Access Management
Why Security Matters in Databricks
The Problem: Data platforms handle sensitive business data, PII, and financial records. A single access-control misconfiguration can lead to data breaches, compliance violations, and significant financial penalties.
The Solution: Databricks provides a layered security model with identity management, fine-grained access controls via Unity Catalog, network isolation, encryption, and comprehensive audit logging.
Real Impact: Organizations in regulated industries (healthcare, finance, government) use Databricks security features to achieve HIPAA, SOC 2, FedRAMP, and GDPR compliance.
Databricks identity management operates at two levels: account-level (across all workspaces) and workspace-level (within a single workspace). Understanding this hierarchy is essential for designing secure multi-team environments.
| Role | Scope | Capabilities |
|---|---|---|
| Account Admin | All workspaces | Manage users, groups, workspaces, Unity Catalog metastore |
| Workspace Admin | Single workspace | Manage workspace settings, clusters, permissions |
| Metastore Admin | Unity Catalog | Manage catalogs, schemas, grants, data lineage |
| Users | Workspace | Access granted resources, run notebooks, submit jobs |
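In practice, this hierarchy is applied through account-level groups: create one group per team at the account, then grant those groups workspace access and Unity Catalog privileges rather than granting to individual users. A minimal sketch with the Databricks Python SDK (the team names are illustrative, and `AccountClient()` reads its credentials from the standard `DATABRICKS_*` environment variables; the API call is wrapped in a function so nothing runs without account-admin configuration):

```python
# Team names are examples; normalize them into group display names once,
# then reuse the same names for workspace assignment and UC grants.
TEAM_GROUPS = ["Data Engineering", "Data Science", "Business Analytics"]

def group_display_names(teams):
    """Normalize human team names into account-level group display names."""
    return [t.strip().lower().replace(" ", "-") for t in teams]

def provision_account_groups(teams):
    """Create one account-level group per team (requires account-admin auth)."""
    from databricks.sdk import AccountClient
    account = AccountClient()  # credentials from DATABRICKS_* env vars
    for name in group_display_names(teams):
        account.groups.create(display_name=name)

if __name__ == "__main__":
    print(group_display_names(TEAM_GROUPS))
```

Granting to groups instead of users keeps access reviews tractable: onboarding a new engineer becomes a group-membership change, not a pile of new grants.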
Service Principals
Service principals are non-human identities used for automation, CI/CD pipelines, and programmatic access. They are the recommended authentication method for production workloads; never use personal access tokens from human users in automated systems.
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.iam import WorkspacePermission

# Authenticate using a service principal (OAuth M2M)
w = WorkspaceClient(
    host="https://adb-1234567890.12.azuredatabricks.net",
    client_id="your-service-principal-client-id",
    client_secret="your-service-principal-secret"
)

# List clusters using the service principal identity
for cluster in w.clusters.list():
    print(f"{cluster.cluster_name}: {cluster.state}")

# Create a service principal via the Account API
from databricks.sdk import AccountClient

account = AccountClient(
    host="https://accounts.azuredatabricks.net",
    account_id="your-account-id",
    client_id="admin-sp-client-id",
    client_secret="admin-sp-secret"
)

# Create a new service principal
sp = account.service_principals.create(
    display_name="etl-pipeline-sp",
    active=True
)
print(f"Created SP: {sp.application_id}")

# Grant workspace access to the service principal
account.workspace_assignment.update(
    workspace_id=1234567890,
    principal_id=sp.id,
    permissions=[WorkspacePermission.USER]
)
```
Token Authentication
Databricks supports multiple authentication methods. Personal access tokens (PATs) are the simplest but least secure. OAuth tokens with service principals are recommended for production.
Personal Access Tokens (PATs)
Generated per user, scoped to a workspace. Good for development and testing. Set short TTLs and rotate regularly. Never commit to source control.
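When a PAT is unavoidable, create it with an explicit lifetime instead of the non-expiring default, so a forgotten token dies on its own. A small sketch with the Databricks Python SDK (the comment string is illustrative; the API call only runs when workspace credentials are configured in the environment):

```python
import os

def ttl_seconds(hours: int) -> int:
    """Convert a human-friendly TTL in hours to the seconds the tokens API expects."""
    return hours * 3600

def create_short_lived_pat(comment: str, hours: int = 24):
    """Create a PAT that expires automatically instead of living forever."""
    from databricks.sdk import WorkspaceClient
    w = WorkspaceClient()  # credentials from DATABRICKS_* env vars
    return w.tokens.create(comment=comment, lifetime_seconds=ttl_seconds(hours))

if __name__ == "__main__":
    print(ttl_seconds(24))  # 86400
    if os.environ.get("DATABRICKS_HOST"):
        resp = create_short_lived_pat("dev laptop, expires in 24h")
        print(resp.token_info.expiry_time)
```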
OAuth (M2M)
Service principal authentication using OAuth client credentials flow. Recommended for CI/CD and production automation. Tokens auto-rotate.
Azure AD Tokens
For Azure Databricks, use Azure AD tokens with managed identities. Integrates with Azure RBAC for unified access control.
AWS IAM Credentials
For AWS Databricks, use instance profiles and IAM roles. Provides temporary credentials that auto-rotate without manual management.
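All of these methods can be driven through the SDK's unified authentication: construct the client with no arguments and supply credentials via environment variables (`DATABRICKS_HOST` plus `DATABRICKS_TOKEN` for a PAT, or `DATABRICKS_CLIENT_ID`/`DATABRICKS_CLIENT_SECRET` for OAuth M2M). A sketch; the readiness check is a local helper for illustration, not part of the SDK:

```python
import os

# Environment variables the SDK's unified auth uses for OAuth M2M
REQUIRED_OAUTH_VARS = ("DATABRICKS_HOST",
                       "DATABRICKS_CLIENT_ID",
                       "DATABRICKS_CLIENT_SECRET")

def oauth_env_ready(env) -> bool:
    """Return True when every variable needed for OAuth M2M auth is set."""
    return all(env.get(v) for v in REQUIRED_OAUTH_VARS)

if __name__ == "__main__":
    if oauth_env_ready(os.environ):
        from databricks.sdk import WorkspaceClient
        w = WorkspaceClient()  # picks up credentials from the environment
        print(w.current_user.me().user_name)
    else:
        print("OAuth M2M environment variables not set")
```

Keeping credentials out of constructor arguments means the same script runs unchanged in CI/CD, on a laptop, and on a cluster.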
Network Security (VPC/Private Link)
Network security in Databricks involves isolating the data plane within your cloud VPC/VNet and controlling all network traffic paths. For enterprise deployments, Private Link eliminates data traversal over the public internet.
IP Access Lists
IP access lists restrict which IP addresses can access the Databricks workspace API and UI. This provides an additional security layer on top of network controls, ensuring only traffic from approved corporate networks can reach your workspace.
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import ListType

w = WorkspaceClient()

# Create an IP access list (allow list)
ip_list = w.ip_access_lists.create(
    label="Corporate VPN",
    list_type=ListType.ALLOW,
    ip_addresses=[
        "10.0.0.0/8",        # Internal network
        "203.0.113.0/24",    # Office IP range
        "198.51.100.50/32",  # VPN gateway
    ]
)

# Enable IP access list enforcement
w.workspace_conf.set_status({
    "enableIpAccessLists": "true"
})

# List all configured IP access lists
for acl in w.ip_access_lists.list():
    print(f"{acl.label}: {acl.ip_addresses}")
```
Secrets Management
Databricks secrets provide a secure way to store and access sensitive information like API keys, database passwords, and connection strings. Secrets are stored in scopes and can be backed by Databricks-managed storage or external vaults like Azure Key Vault or AWS Secrets Manager.
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import AclPermission

w = WorkspaceClient()

# Create a secret scope
w.secrets.create_scope(scope="production-secrets")

# Store a secret (in practice, read the value from a secure source
# rather than hard-coding it)
w.secrets.put_secret(
    scope="production-secrets",
    key="database-password",
    string_value="super-secret-password-123"
)

# In a notebook, retrieve the secret
# Secrets are REDACTED in notebook output
password = dbutils.secrets.get(
    scope="production-secrets",
    key="database-password"
)

# Use the secret in a JDBC connection
df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://db-host:5432/mydb",
    dbtable="public.users",
    user="etl_user",
    password=password
).load()

# Grant access to a group
w.secrets.put_acl(
    scope="production-secrets",
    principal="data-engineers",
    permission=AclPermission.READ
)

# List secrets (keys only, values are never exposed)
for secret in w.secrets.list_secrets(scope="production-secrets"):
    print(f"Key: {secret.key}, Last Updated: {secret.last_updated_timestamp}")
```
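For the vault-backed scopes mentioned above, an Azure Key Vault scope is created by passing the vault's resource ID and DNS name; Databricks then reads secrets from the vault instead of its own store. A sketch assuming an Azure workspace authenticated with Azure AD (the vault identifiers are placeholders); the local name check is a rough approximation of the documented scope-name rules, not an SDK feature:

```python
import os
import re

def valid_scope_name(name: str) -> bool:
    """Rough check: alphanumerics, dashes, underscores, periods, max 128 chars."""
    return bool(re.fullmatch(r"[A-Za-z0-9._-]{1,128}", name))

def create_keyvault_scope(scope: str, resource_id: str, dns_name: str) -> None:
    """Create a secret scope backed by Azure Key Vault (Azure AD auth required)."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.workspace import (
        AzureKeyVaultSecretScopeMetadata,
        ScopeBackendType,
    )
    if not valid_scope_name(scope):
        raise ValueError(f"invalid scope name: {scope}")
    w = WorkspaceClient()
    w.secrets.create_scope(
        scope=scope,
        scope_backend_type=ScopeBackendType.AZURE_KEYVAULT,
        backend_azure_keyvault=AzureKeyVaultSecretScopeMetadata(
            resource_id=resource_id,  # placeholder ARM resource ID
            dns_name=dns_name,        # placeholder vault DNS name
        ),
    )

if __name__ == "__main__":
    print(valid_scope_name("production-secrets"))  # True
    print(valid_scope_name("bad scope!"))          # False
    if os.environ.get("DATABRICKS_HOST"):
        create_keyvault_scope(
            "kv-backed-secrets",
            resource_id="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>",
            dns_name="https://<vault>.vault.azure.net/",
        )
```

With a vault-backed scope, rotation happens in Key Vault and Databricks always serves the current version, which addresses the auto-rotation best practice in the cheat sheet.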
Audit Logging
Databricks audit logs capture detailed records of all actions performed in your workspace, including who accessed what data, when clusters were created, and which notebooks were executed. These logs are essential for compliance, security investigations, and operational monitoring.
Audit Log Categories
- Workspace events: Login, notebook access, cluster operations
- Account events: User management, workspace provisioning
- Unity Catalog events: Data access, grant changes, lineage queries
- DBSQL events: SQL warehouse queries, dashboard access
- Secrets events: Secret scope access, secret retrieval
```sql
-- Audit logs delivered to your cloud storage (S3/ADLS/GCS)
-- Create an external table over the audit log files
CREATE TABLE IF NOT EXISTS audit.logs
USING JSON
LOCATION 's3://your-audit-bucket/databricks/audit-logs/';

-- Find all failed login attempts in the last 24 hours
SELECT
  timestamp,
  userIdentity.email AS user_email,
  sourceIPAddress,
  requestParams.user AS target_user,
  response.statusCode
FROM audit.logs
WHERE actionName = 'login'
  AND response.statusCode != 200
  AND timestamp >= current_timestamp() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;

-- Track who accessed sensitive tables
SELECT
  timestamp,
  userIdentity.email,
  actionName,
  requestParams.full_name_arg AS table_name
FROM audit.logs
WHERE serviceName = 'unityCatalog'
  AND actionName IN ('getTable', 'readVolume')
  AND requestParams.full_name_arg LIKE '%pii%'
ORDER BY timestamp DESC;
```
Practice Problems
Problem 1: Secure Multi-Team Access Design
Difficulty: Hard
Your organization has three teams: Data Engineering, Data Science, and Business Analytics. Data Engineers need full access to all bronze/silver/gold tables. Data Scientists need read access to silver/gold plus write access to a sandbox catalog. Business Analysts need read-only access to gold tables. Design the Unity Catalog grants and workspace configuration.
Problem 2: Network Security Architecture
Difficulty: Hard
Your company requires that no data traffic traverses the public internet, all API access comes from the corporate VPN, and all encryption keys are customer-managed. Design the network and security architecture for a Databricks deployment on AWS.
Problem 3: Secret Rotation Strategy
Difficulty: Medium
Your team stores database credentials and API keys in Databricks secret scopes. Currently, secrets are rotated manually every 90 days, which sometimes gets forgotten. Design an automated secret rotation strategy.
Quick Reference
Security Cheat Sheet
| Feature | Purpose | Best Practice |
|---|---|---|
| Service Principals | Non-human authentication | Use for all CI/CD and automation |
| Unity Catalog | Fine-grained data access control | Grant at schema level, not table level |
| Private Link | Eliminate public internet traffic | Required for regulated industries |
| IP Access Lists | Restrict API/UI access by IP | Allow only VPN/office CIDRs |
| Secrets | Secure credential storage | Use vault-backed scopes with auto-rotation |
| Audit Logs | Compliance and investigation | Deliver to cloud storage, query with SQL |
| Customer-Managed Keys | Encryption key control | Use for HIPAA/PCI workloads |