Quickstart & Deployment


Overview

Deployment Options

DataHub can be deployed via Docker Compose (development/testing), Kubernetes Helm chart (production), or Acryl Cloud (managed SaaS). This tutorial walks through the first two hands-on and summarizes when Acryl Cloud is the right fit.

Core Concepts

Docker Compose

Single-machine deployment for development and testing. All services run in containers on one host. Not recommended for production.

Kubernetes / Helm

Production deployment with auto-scaling, high availability, and managed services integration. Official Helm chart provided.

Acryl Cloud

Fully managed DataHub SaaS by Acryl Data (founded by DataHub creators). Zero infrastructure management.

Prerequisites

Docker 20.10+, 8 GB RAM minimum, and Python 3.8+ for the CLI. For Kubernetes: Helm 3, kubectl, and an ingress controller.
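
The prerequisites above can be checked from a shell before installing anything (the RAM check assumes a Linux host; macOS users can check About This Mac instead):

```shell
# Confirm tool versions against the requirements above
docker --version     # needs 20.10+
python3 --version    # needs 3.8+

# Confirm at least 8 GB of total RAM (Linux only)
free -g | awk '/^Mem:/ {print $2 " GB total RAM"}'
```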

How It Works

Docker Compose Quickstart

Local Development Setup
# Install the DataHub CLI
pip install acryl-datahub

# Start DataHub (pulls images and starts all services)
datahub docker quickstart

# Verify all services are running
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Access the UI at http://localhost:9002
# Login: datahub / datahub

# Stop DataHub
datahub docker quickstart --stop

# Reset (delete all data)
datahub docker nuke
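
Once the quickstart is up, the services can be probed directly. A sketch assuming the default quickstart layout (frontend on port 9002, as noted above; the GMS backend port and health path are assumptions to verify against your deployment):

```shell
# Frontend should answer on port 9002 (expect HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9002

# GMS backend health check (default quickstart port assumed)
curl -s http://localhost:8080/health
```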

Kubernetes Helm Deployment

Production Kubernetes Setup
# Add the DataHub Helm repo
helm repo add datahub https://helm.datahubproject.io/
helm repo update

# Install prerequisites (if not using managed services)
helm install prerequisites datahub/datahub-prerequisites

# Install DataHub
helm install datahub datahub/datahub \
  --set datahub-gms.resources.requests.memory="2Gi" \
  --set datahub-frontend.resources.requests.memory="1Gi" \
  --set global.elasticsearch.host="elasticsearch-master" \
  --set global.kafka.bootstrap.server="kafka:9092" \
  --set global.sql.datasource.host="mysql:3306"

# Verify pods are running
kubectl get pods -l app.kubernetes.io/instance=datahub
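
For anything beyond a first experiment, the repeated `--set` flags above are easier to review and version-control as a values file. The keys below mirror the flags exactly; a sketch, not a complete production values file:

```yaml
# values.yaml -- same settings as the --set flags above
datahub-gms:
  resources:
    requests:
      memory: "2Gi"
datahub-frontend:
  resources:
    requests:
      memory: "1Gi"
global:
  elasticsearch:
    host: "elasticsearch-master"
  kafka:
    bootstrap:
      server: "kafka:9092"
  sql:
    datasource:
      host: "mysql:3306"
```

Then install with `helm install datahub datahub/datahub -f values.yaml`.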

Hands-On Tutorial

Post-Installation Setup

Configure Authentication (OIDC)
# In datahub-frontend config, add OIDC for SSO:
auth:
  oidc:
    enabled: true
    clientId: "your-client-id"
    clientSecret: "${OIDC_SECRET}"
    discoveryUri: "https://login.company.com/.well-known/openid-configuration"
    userNameClaim: "email"
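
The `${OIDC_SECRET}` reference above should resolve to a secret stored outside the chart, not a value committed alongside it. On Kubernetes, one standard way to supply it is a `Secret` manifest (the secret and key names here are illustrative; wire them to your chart values):

```yaml
# Secret backing the ${OIDC_SECRET} reference above
apiVersion: v1
kind: Secret
metadata:
  name: datahub-oidc
type: Opaque
stringData:
  clientSecret: "replace-with-your-oidc-client-secret"
```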

Security: Change Default Credentials

The default login (datahub/datahub) must be changed immediately in any deployment reachable beyond localhost. For production, enable OIDC or SAML SSO and disable native authentication entirely.
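
For the Docker quickstart, native credentials are read from a `user.props` file. A sketch assuming the quickstart's default plugin path; verify the path and container name against your own deployment:

```shell
# Create/overwrite the native-auth user file (quickstart default path;
# each line is username:password)
mkdir -p ~/.datahub/plugins/frontend/auth
printf 'admin:a-strong-unique-password\n' > ~/.datahub/plugins/frontend/auth/user.props

# Then restart the frontend container so the change takes effect
# (check `docker ps` for the container's name)
```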

Best Practices

Practice Problems

Practice 1

Design a production DataHub deployment on AWS using EKS, RDS (MySQL), OpenSearch, and MSK (Kafka). What instance types would you choose for a 500-user deployment?

Practice 2

Your DataHub Elasticsearch index is using 80% of disk. How do you scale it? What retention policies would you set?

Quick Reference

| Deployment        | Best For                      | Complexity    |
| ----------------- | ----------------------------- | ------------- |
| Docker Compose    | Development, demos, testing   | Low           |
| Kubernetes (Helm) | Production, scalability       | Medium-High   |
| Acryl Cloud       | Zero-ops, enterprise features | Low (managed) |