Overview
Deployment Options
DataHub can be deployed via Docker Compose (development/testing), Kubernetes Helm Chart (production), or Acryl Cloud (managed SaaS). This tutorial covers all three approaches.
Core Concepts
Docker Compose
Single-machine deployment for development and testing. All services run in containers on one host. Not recommended for production.
Kubernetes / Helm
Production deployment with auto-scaling, high availability, and managed services integration. Official Helm chart provided.
Acryl Cloud
Fully managed DataHub SaaS by Acryl Data (founded by DataHub creators). Zero infrastructure management.
Prerequisites
Docker 20.10+, 8 GB RAM minimum, Python 3.8+ for CLI. For K8s: Helm 3, kubectl, ingress controller.
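The prerequisites above can be checked up front with a small pre-flight script. This is a sketch: `check_tool` is a hypothetical helper, and you should still confirm the versions it prints against the minimums listed.

```shell
# Pre-flight check for the prerequisites above (a sketch; extend as needed).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$1" "$("$1" --version 2>&1 | head -n 1)"
  else
    printf '%s: MISSING\n' "$1"
  fi
}

for t in docker python3 kubectl helm; do
  check_tool "$t"
done
```

`kubectl` and `helm` only matter for the Kubernetes path; `MISSING` for those is fine if you are staying on Docker Compose.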
How It Works
Docker Compose Quickstart
# Install the DataHub CLI
pip install acryl-datahub
# Start DataHub (pulls images and starts all services)
datahub docker quickstart
# Verify all services are running
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Access the UI at http://localhost:9002
# Login: datahub / datahub
# Stop DataHub
datahub docker quickstart --stop
# Reset (delete all data)
datahub docker nuke
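As a quick sanity check after `datahub docker quickstart`, a small helper can flag services that failed to start. This is a sketch: the container names below are typical of the quickstart compose file and may differ between DataHub releases.

```shell
# Sketch: verify the core quickstart containers are up. Container names are
# typical of the quickstart compose file and may vary between versions.
check_containers() {
  running="$1"
  for name in datahub-gms datahub-frontend broker elasticsearch mysql; do
    if printf '%s\n' "$running" | grep -q "$name"; then
      echo "$name: running"
    else
      echo "$name: NOT RUNNING"
    fi
  done
}

# Usage: check_containers "$(docker ps --format '{{.Names}}')"
```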
Kubernetes Helm Deployment
# Add the DataHub Helm repo
helm repo add datahub https://helm.datahubproject.io/
helm repo update
# Install prerequisites (if not using managed services)
helm install prerequisites datahub/datahub-prerequisites
# Install DataHub
helm install datahub datahub/datahub \
  --set datahub-gms.resources.requests.memory="2Gi" \
  --set datahub-frontend.resources.requests.memory="1Gi" \
  --set global.elasticsearch.host="elasticsearch-master" \
  --set global.kafka.bootstrap.server="kafka:9092" \
  --set global.sql.datasource.host="mysql:3306"
# Verify pods are running
kubectl get pods -l app.kubernetes.io/instance=datahub
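The long `--set` chain above is easier to review and version-control as a values file. A minimal sketch, with keys mirroring the flags above (the service hostnames match the prerequisites chart defaults and may differ in your cluster):

```yaml
# values.yaml (sketch; mirrors the --set flags above)
datahub-gms:
  resources:
    requests:
      memory: "2Gi"
datahub-frontend:
  resources:
    requests:
      memory: "1Gi"
global:
  elasticsearch:
    host: "elasticsearch-master"
  kafka:
    bootstrap:
      server: "kafka:9092"
  sql:
    datasource:
      host: "mysql:3306"
```

Install with `helm install datahub datahub/datahub -f values.yaml`.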
Hands-On Tutorial
Post-Installation Setup
# In datahub-frontend config, add OIDC for SSO:
auth:
  oidc:
    enabled: true
    clientId: "your-client-id"
    clientSecret: "${OIDC_SECRET}"
    discoveryUri: "https://login.company.com/.well-known/openid-configuration"
    userNameClaim: "email"
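In containerized deployments, the same settings are typically supplied as environment variables on the datahub-frontend container. A sketch using DataHub's `AUTH_OIDC_*` variables (the base URL is an assumed example; it must point at your frontend):

```
AUTH_OIDC_ENABLED=true
AUTH_OIDC_CLIENT_ID=your-client-id
AUTH_OIDC_CLIENT_SECRET=${OIDC_SECRET}
AUTH_OIDC_DISCOVERY_URI=https://login.company.com/.well-known/openid-configuration
AUTH_OIDC_USER_NAME_CLAIM=email
AUTH_OIDC_BASE_URL=https://datahub.company.com
```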
Security: Change Default Credentials
The default login (datahub/datahub) must be changed immediately in any non-local deployment. Enable OIDC/SAML SSO and disable native authentication for production.
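For native authentication, datahub-frontend reads credentials from a `user.props` file (one `username:password` per line). A sketch of rotating the default account, using the path the standard images expect (generate your own strong password):

```
# user.props — mount at /datahub-frontend/conf/user.props in datahub-frontend
datahub:a-long-randomly-generated-password
```

Restart the frontend after changing the file so the new credentials take effect.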
Best Practices
- Use managed services for MySQL, Elasticsearch, and Kafka in production
- Set resource limits — GMS needs 2-4 GB RAM, Elasticsearch needs 4-8 GB
- Enable persistence — Map MySQL and Elasticsearch data to persistent volumes
- Configure backups — Daily MySQL dumps and Elasticsearch snapshots
- Use Ingress with TLS for external access to the frontend
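The backup practice above can be sketched as a cron schedule. Hostnames, credentials, paths, and the snapshot repository name (`datahub_backups`, which must be registered in Elasticsearch beforehand) are all assumptions to adapt:

```
# crontab fragment (sketch): nightly MySQL dump and Elasticsearch snapshot
0 2 * * * mysqldump -h mysql -u datahub -p"$MYSQL_PASSWORD" datahub | gzip > /backups/datahub-$(date +\%F).sql.gz
30 2 * * * curl -s -X PUT "http://elasticsearch-master:9200/_snapshot/datahub_backups/snap-$(date +\%F)"
```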
Practice Problems
Practice 1
Design a production DataHub deployment on AWS using EKS, RDS (MySQL), OpenSearch, and MSK (Kafka). What instance types would you choose for a 500-user deployment?
Practice 2
Your DataHub Elasticsearch index is using 80% of disk. How do you scale it? What retention policies would you set?
Quick Reference
| Deployment | Best For | Complexity |
|---|---|---|
| Docker Compose | Development, demos, testing | Low |
| Kubernetes (Helm) | Production, scalability | Medium-High |
| Acryl Cloud | Zero-ops, enterprise features | Low (managed) |