Overview
DataHub Architecture at a Glance
DataHub follows a metadata-as-a-platform architecture. It separates metadata storage, search, event streaming, and API serving into independent components, with Kafka carrying metadata change events between them. This enables real-time metadata updates, extensibility, and horizontal scalability.
Core Concepts
Metadata Service (GMS)
The core backend service that handles all metadata CRUD operations. Exposes GraphQL and REST APIs. Stores metadata in MySQL/PostgreSQL and syncs to Elasticsearch.
Kafka (MAE/MCE)
Metadata Change Events (MCE) flow through Kafka when metadata is written. Metadata Audit Events (MAE) are emitted after changes are committed. Powers real-time sync.
Elasticsearch
Search index that powers DataHub's full-text search and discovery. Automatically synced from the metadata store via Kafka consumers.
Frontend (React)
Single-page React application that communicates with GMS via GraphQL. Provides search, browse, lineage visualization, and governance UIs.
Metadata Model: Entities & Aspects
# DataHub's metadata model is built on Entities and Aspects
# Entity: A uniquely identifiable thing (Dataset, Dashboard, Pipeline)
# Aspect: A property bag attached to an entity
# Example: A "Dataset" entity has these aspects:
# - SchemaMetadata (column names, types)
# - Ownership (who owns this dataset)
# - GlobalTags (classification tags)
# - UpstreamLineage (which datasets feed into this one)
# - DatasetProperties (description, custom properties)
# - Status (active/deprecated)
# URN (Uniform Resource Name) uniquely identifies entities:
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)
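To make the URN structure concrete, here is a minimal stdlib-only Python sketch that splits a dataset URN into its platform, name, and environment parts. The `parse_dataset_urn` helper is illustrative only; the real DataHub SDK ships its own URN classes.

```python
def parse_dataset_urn(urn: str) -> dict:
    """Split a dataset URN into platform, name, and environment.

    Illustrative helper only, not part of the DataHub SDK.
    """
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn}")
    platform_urn, name, env = urn[len(prefix):-1].split(",")
    return {
        "platform": platform_urn.rsplit(":", 1)[-1],  # e.g. "snowflake"
        "name": name,                                  # e.g. "mydb.schema.table"
        "env": env,                                    # e.g. "PROD"
    }

parsed = parse_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)"
)
print(parsed)
```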
How It Works
Write Path
When metadata is ingested or modified:
- Client sends metadata via GraphQL/REST API or ingestion framework
- GMS validates the metadata against the entity schema
- Metadata is persisted to MySQL/PostgreSQL
- A Metadata Audit Event (MAE) is emitted to Kafka after the commit
- Consumers update Elasticsearch search index and graph index
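The write path above can be sketched as a toy in-memory simulation. All names here are illustrative; real GMS uses MySQL/PostgreSQL, Kafka, and Elasticsearch, not Python dicts and lists.

```python
# Toy simulation of the DataHub write path: validate -> persist -> emit -> index.
metadata_store = {}   # stands in for MySQL/PostgreSQL
search_index = {}     # stands in for Elasticsearch
kafka_topic = []      # stands in for the MAE Kafka topic

def write_metadata(urn: str, aspect: str, value: dict) -> None:
    # 1. GMS validates the payload against the entity schema (simplified check)
    if not urn.startswith("urn:li:"):
        raise ValueError("invalid URN")
    # 2. Persist to the metadata store
    metadata_store.setdefault(urn, {})[aspect] = value
    # 3. Emit a Metadata Audit Event after the commit
    kafka_topic.append({"urn": urn, "aspect": aspect, "value": value})

def index_consumer() -> None:
    # 4. A Kafka consumer updates the search index from audit events
    while kafka_topic:
        event = kafka_topic.pop(0)
        search_index[event["urn"]] = event["value"].get("description", "")

write_metadata(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)",
    "datasetProperties",
    {"description": "Monthly revenue facts"},
)
index_consumer()
print(search_index)
```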
Read Path
When a user searches or browses:
- Search queries go to Elasticsearch (full-text search)
- Entity detail pages query GMS via GraphQL
- Lineage queries traverse the graph index (Neo4j, or the built-in Elasticsearch-backed graph service)
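On the search side, queries ultimately become Elasticsearch full-text requests. Here is a sketch of the kind of request body involved; the field names are assumptions for illustration, not DataHub's actual internal index mapping.

```python
import json

# Sketch of a full-text search body such as GMS might send to Elasticsearch.
# Field names ("name", "description", "fieldPaths") are illustrative only.
query_body = {
    "query": {
        "multi_match": {
            "query": "revenue",
            "fields": ["name", "description", "fieldPaths"],
        }
    },
    "from": 0,   # pagination offset, mirroring `start` in the GraphQL API
    "size": 10,  # page size, mirroring `count`
}
print(json.dumps(query_body, indent=2))
```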
Hands-On Tutorial
# Query datasets via GraphQL
curl -X POST http://localhost:8080/api/graphql -H "Content-Type: application/json" -H "Authorization: Bearer <token>" -d '{
  "query": "{ search(input: { type: DATASET, query: \"revenue\", start: 0, count: 10 }) { searchResults { entity { urn type } } } }"
}'
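Hand-escaping GraphQL strings inside a JSON body is error-prone; a stdlib-only Python sketch that builds the same request with `json.dumps` handling the quoting (the endpoint mirrors the curl example and the token is a placeholder):

```python
import json
from urllib import request

graphql_query = """
{ search(input: { type: DATASET, query: "revenue", start: 0, count: 10 })
  { searchResults { entity { urn type } } } }
"""

# json.dumps escapes the inner quotes around "revenue" for us
payload = json.dumps({"query": graphql_query}).encode("utf-8")
req = request.Request(
    "http://localhost:8080/api/graphql",
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",  # placeholder token
    },
)
# request.urlopen(req) would send this to a running GMS; omitted here.
print(payload.decode("utf-8"))
```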
# DataHub Docker Compose services:
services:
  datahub-gms:          # Metadata Service (backend)
    image: linkedin/datahub-gms
    ports: ["8080:8080"]
    depends_on: [mysql, elasticsearch, kafka]
  datahub-frontend:     # React UI
    image: linkedin/datahub-frontend-react
    ports: ["9002:9002"]
  mysql:                # Metadata store
    image: mysql:8
  elasticsearch:        # Search index
    image: elasticsearch:7.17
  kafka:                # Event bus
    image: confluentinc/cp-kafka
  zookeeper:            # Kafka dependency
    image: confluentinc/cp-zookeeper
  schema-registry:      # Avro schema registry
    image: confluentinc/cp-schema-registry
Best Practices
- Production deployment: Use Kubernetes (Helm chart) instead of Docker Compose
- Database: Use managed MySQL/PostgreSQL (RDS, Cloud SQL) for reliability
- Search: Use managed Elasticsearch (OpenSearch Service) for scalability
- Kafka: Use managed Kafka (MSK, Confluent Cloud) to reduce operational burden
- Authentication: Integrate with your SSO (OIDC/SAML) from day one
Practice Problems
Practice 1
Draw the data flow for what happens when a user searches for "revenue" in DataHub. Which services are involved and in what order?
Practice 2
Your DataHub instance has 1M metadata entities. Search is getting slow. What would you investigate and optimize first?
Quick Reference
| Service | Port | Purpose |
|---|---|---|
| datahub-gms | 8080 | Metadata Service (GraphQL/REST) |
| datahub-frontend | 9002 | React Web UI |
| mysql | 3306 | Metadata persistence |
| elasticsearch | 9200 | Search index |
| kafka | 9092 | Event streaming |
| schema-registry | 8081 | Avro schema management |