Architecture & Components


Overview

DataHub Architecture at a Glance

DataHub follows a metadata-as-a-platform architecture. It separates metadata storage, search, event streaming, and API serving into independent components that communicate via Kafka. This enables real-time metadata updates, extensibility, and horizontal scalability.

Core Concepts

Metadata Service (GMS)

The core backend service that handles all metadata CRUD operations. Exposes GraphQL and REST APIs. Stores metadata in MySQL/PostgreSQL and syncs to Elasticsearch.

Kafka (MAE/MCE)

Metadata Change Events (MCE) flow through Kafka when metadata is written. Metadata Audit Events (MAE) are emitted after changes are committed. Powers real-time sync.
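The sync pattern behind the real-time updates can be pictured as a consumer applying each committed audit event to a search document. A toy sketch, not DataHub's actual consumer code: the "index" here is a plain dict keyed by URN, where real DataHub writes to Elasticsearch via its MAE consumer job, and the event field names are illustrative.

```python
def apply_audit_event(index: dict, event: dict) -> None:
    """Apply one Metadata Audit Event (MAE) to a toy search index.

    Field names ("urn", "aspectName", "aspect") are illustrative,
    not DataHub's exact wire schema.
    """
    doc = index.setdefault(event["urn"], {})
    # Merge the changed aspect into the indexed document for that entity.
    doc[event["aspectName"]] = event["aspect"]

index = {}
apply_audit_event(index, {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)",
    "aspectName": "ownership",
    "aspect": {"owners": ["urn:li:corpuser:jdoe"]},
})
```

Because every change flows through the event bus, any number of consumers (search, graph, usage analytics) can stay in sync without the write path knowing about them.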

Elasticsearch

Search index that powers DataHub's full-text search and discovery. Automatically synced from the metadata store via Kafka consumers.

Frontend (React)

Single-page React application that communicates with GMS via GraphQL. Provides search, browse, lineage visualization, and governance UIs.

Metadata Model: Entities & Aspects

Entity-Aspect Model
# DataHub's metadata model is built on Entities and Aspects

# Entity: A uniquely identifiable thing (Dataset, Dashboard, Pipeline)
# Aspect: A property bag attached to an entity

# Example: A "Dataset" entity has these aspects:
# - SchemaMetadata (column names, types)
# - Ownership (who owns this dataset)
# - GlobalTags (classification tags)
# - UpstreamLineage (which datasets feed into this one)
# - DatasetProperties (description, custom properties)
# - Status (active/deprecated)

# URN (Uniform Resource Name) uniquely identifies entities:
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)
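In code, a dataset URN is assembled from three parts: the platform, the fully qualified name, and the environment. A minimal helper sketch (the DataHub Python SDK ships a similar builder; this standalone version just shows the string format):

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN from platform, name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

print(make_dataset_urn("snowflake", "mydb.schema.table"))
# urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)
```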

How It Works

Write Path

When metadata is ingested or modified:

  1. Client sends metadata via GraphQL/REST API or ingestion framework
  2. GMS validates the metadata against the entity schema
  3. Metadata is persisted to MySQL/PostgreSQL
  4. A Metadata Audit Event (MAE) is emitted to Kafka to record the committed change
  5. Consumers update Elasticsearch search index and graph index
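The event emitted in step 4 can be pictured as a small JSON document describing one aspect change on one entity. A minimal sketch with illustrative field names (DataHub's real events follow Avro schemas registered in the schema registry):

```python
import json

def build_audit_event(urn: str, aspect_name: str, aspect: dict) -> str:
    """Serialize a committed metadata change for the event bus (step 4).

    Field names are illustrative, not DataHub's exact Avro schema.
    """
    return json.dumps({
        "urn": urn,
        "aspectName": aspect_name,
        "aspect": aspect,
    })

event = build_audit_event(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)",
    "ownership",
    {"owners": ["urn:li:corpuser:jdoe"]},
)
```

Because the aspect is carried in full, downstream consumers can update their indexes from the event alone, without calling back into GMS.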

Read Path

When a user searches or browses:

  1. Search queries go to Elasticsearch (full-text search)
  2. Entity detail pages query GMS via GraphQL
  3. Lineage queries traverse the graph index (Neo4j or built-in)
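The lineage traversal in step 3 is essentially a breadth-first walk over upstream edges. A toy sketch with a hypothetical hardcoded graph (in DataHub this runs against the graph index, not an in-memory dict):

```python
from collections import deque

# Toy upstream-lineage graph: each dataset maps to the datasets it reads
# from. Dataset names here are made up for illustration.
UPSTREAMS = {
    "revenue_report": ["orders_clean"],
    "orders_clean": ["orders_raw", "customers_raw"],
}

def upstream_lineage(dataset: str) -> list:
    """Breadth-first walk collecting all transitive upstream datasets."""
    seen, order, queue = set(), [], deque([dataset])
    while queue:
        node = queue.popleft()
        for up in UPSTREAMS.get(node, []):
            if up not in seen:
                seen.add(up)
                order.append(up)
                queue.append(up)
    return order

print(upstream_lineage("revenue_report"))
# ['orders_clean', 'orders_raw', 'customers_raw']
```

The `seen` set matters in practice: real lineage graphs contain diamonds (two paths to the same upstream), and without deduplication the walk would revisit nodes.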

Hands-On Tutorial

Explore the GraphQL API
# Query datasets via GraphQL
curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "query": "{ search(input: { type: DATASET, query: \"revenue\", start: 0, count: 10 }) { searchResults { entity { urn type } } } }"
  }'
Docker Compose Architecture
# DataHub Docker Compose services:
services:
  datahub-gms:        # Metadata Service (backend)
    image: linkedin/datahub-gms
    ports: ["8080:8080"]
    depends_on: [mysql, elasticsearch, kafka]

  datahub-frontend:   # React UI
    image: linkedin/datahub-frontend-react
    ports: ["9002:9002"]

  mysql:              # Metadata store
    image: mysql:8
  elasticsearch:      # Search index
    image: elasticsearch:7.17
  kafka:              # Event bus
    image: confluentinc/cp-kafka
  zookeeper:          # Kafka dependency
    image: confluentinc/cp-zookeeper
  schema-registry:    # Avro schema registry
    image: confluentinc/cp-schema-registry

Practice Problems

Practice 1

Draw the data flow for what happens when a user searches for "revenue" in DataHub. Which services are involved and in what order?

Practice 2

Your DataHub instance has 1M metadata entities. Search is getting slow. What would you investigate and optimize first?

Quick Reference

Service            Port   Purpose
datahub-gms        8080   Metadata Service (GraphQL/REST)
datahub-frontend   9002   React Web UI
mysql              3306   Metadata persistence
elasticsearch      9200   Search index
kafka              9092   Event streaming
schema-registry    8081   Avro schema management