Overview
DataHub Architecture at a Glance
DataHub follows a metadata-as-a-platform architecture. It separates metadata storage, search, event streaming, and API serving into independent components, with Kafka carrying metadata change events between them. This enables real-time metadata updates, extensibility, and horizontal scalability.
Core Concepts
Metadata Service (GMS)
The core backend service that handles all metadata CRUD operations. Exposes GraphQL and REST APIs. Stores metadata in MySQL/PostgreSQL and syncs to Elasticsearch.
Kafka (MAE/MCE)
Metadata Change Events (MCE) flow through Kafka when metadata is written. Metadata Audit Events (MAE) are emitted after changes are committed. Powers real-time sync.
Elasticsearch
Search index that powers DataHub's full-text search and discovery. Automatically synced from the metadata store via Kafka consumers.
Frontend (React)
Single-page React application that communicates with GMS via GraphQL. Provides search, browse, lineage visualization, and governance UIs.
Metadata Model: Entities & Aspects
# DataHub's metadata model is built on Entities and Aspects
# Entity: A uniquely identifiable thing (Dataset, Dashboard, Pipeline)
# Aspect: A property bag attached to an entity
# Example: A "Dataset" entity has these aspects:
# - SchemaMetadata (column names, types)
# - Ownership (who owns this dataset)
# - GlobalTags (classification tags)
# - UpstreamLineage (which datasets feed into this one)
# - DatasetProperties (description, custom properties)
# - Status (active/deprecated)
# URN (Uniform Resource Name) uniquely identifies entities:
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)
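To make the URN structure concrete, here is a minimal stdlib-only Python sketch that splits a dataset URN into its platform, name, and environment parts. The `parse_dataset_urn` helper is illustrative only; the real DataHub SDK ships its own URN classes.

```python
def parse_dataset_urn(urn: str) -> dict:
    """Split a dataset URN into platform, name, and environment.

    Illustrative helper only, not part of the DataHub SDK.
    """
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn}")
    platform_urn, name, env = urn[len(prefix):-1].split(",")
    return {
        "platform": platform_urn.rsplit(":", 1)[-1],  # e.g. "snowflake"
        "name": name,                                  # e.g. "mydb.schema.table"
        "env": env,                                    # e.g. "PROD"
    }

parsed = parse_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)"
)
print(parsed)
```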
How It Works
Write Path
When metadata is ingested or modified:
- Client sends metadata via GraphQL/REST API or ingestion framework
- GMS validates the metadata against the entity schema
- Metadata is persisted to MySQL/PostgreSQL
- A Metadata Audit Event (MAE) is emitted to Kafka after the commit
- Consumers update Elasticsearch search index and graph index
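The write path above can be sketched as a toy in-memory simulation. All names here are illustrative; real GMS uses MySQL/PostgreSQL, Kafka, and Elasticsearch, not Python dicts and lists.

```python
# Toy simulation of the DataHub write path: validate -> persist -> emit -> index.
metadata_store = {}   # stands in for MySQL/PostgreSQL
search_index = {}     # stands in for Elasticsearch
kafka_topic = []      # stands in for the MAE Kafka topic

def write_metadata(urn: str, aspect: str, value: dict) -> None:
    # 1. GMS validates the payload against the entity schema (simplified check)
    if not urn.startswith("urn:li:"):
        raise ValueError("invalid URN")
    # 2. Persist to the metadata store
    metadata_store.setdefault(urn, {})[aspect] = value
    # 3. Emit a Metadata Audit Event after the commit
    kafka_topic.append({"urn": urn, "aspect": aspect, "value": value})

def index_consumer() -> None:
    # 4. A Kafka consumer updates the search index from audit events
    while kafka_topic:
        event = kafka_topic.pop(0)
        search_index[event["urn"]] = event["value"].get("description", "")

write_metadata(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table,PROD)",
    "datasetProperties",
    {"description": "Monthly revenue facts"},
)
index_consumer()
print(search_index)
```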
Read Path
When a user searches or browses:
- Search queries go to Elasticsearch (full-text search)
- Entity detail pages query GMS via GraphQL
- Lineage queries traverse the graph index (Neo4j, or the built-in Elasticsearch-backed graph service)
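On the search side, queries ultimately become Elasticsearch full-text requests. Here is a sketch of the kind of request body involved; the field names are assumptions for illustration, not DataHub's actual internal index mapping.

```python
import json

# Sketch of a full-text search body such as GMS might send to Elasticsearch.
# Field names ("name", "description", "fieldPaths") are illustrative only.
query_body = {
    "query": {
        "multi_match": {
            "query": "revenue",
            "fields": ["name", "description", "fieldPaths"],
        }
    },
    "from": 0,   # pagination offset, mirroring `start` in the GraphQL API
    "size": 10,  # page size, mirroring `count`
}
print(json.dumps(query_body, indent=2))
```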
Hands-On Tutorial
# Query datasets via GraphQL
curl -X POST http://localhost:8080/api/graphql -H "Content-Type: application/json" -H "Authorization: Bearer <token>" -d '{
  "query": "{ search(input: { type: DATASET, query: \"revenue\", start: 0, count: 10 }) { searchResults { entity { urn type } } } }"
}'
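Hand-escaping GraphQL strings inside a JSON body is error-prone; a stdlib-only Python sketch that builds the same request with `json.dumps` handling the quoting (the endpoint mirrors the curl example and the token is a placeholder):

```python
import json
from urllib import request

graphql_query = """
{ search(input: { type: DATASET, query: "revenue", start: 0, count: 10 })
  { searchResults { entity { urn type } } } }
"""

# json.dumps escapes the inner quotes around "revenue" for us
payload = json.dumps({"query": graphql_query}).encode("utf-8")
req = request.Request(
    "http://localhost:8080/api/graphql",
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",  # placeholder token
    },
)
# request.urlopen(req) would send this to a running GMS; omitted here.
print(payload.decode("utf-8"))
```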
# DataHub Docker Compose services:
services:
  datahub-gms:          # Metadata Service (backend)
    image: linkedin/datahub-gms
    ports: ["8080:8080"]
    depends_on: [mysql, elasticsearch, kafka]
  datahub-frontend:     # React UI
    image: linkedin/datahub-frontend-react
    ports: ["9002:9002"]
  mysql:                # Metadata store
    image: mysql:8
  elasticsearch:        # Search index
    image: elasticsearch:7.17
  kafka:                # Event bus
    image: confluentinc/cp-kafka
  zookeeper:            # Kafka dependency
    image: confluentinc/cp-zookeeper
  schema-registry:      # Avro schema registry
    image: confluentinc/cp-schema-registry
Best Practices
- Production deployment: Use Kubernetes (Helm chart) instead of Docker Compose
- Database: Use managed MySQL/PostgreSQL (RDS, Cloud SQL) for reliability
- Search: Use managed Elasticsearch (OpenSearch Service) for scalability
- Kafka: Use managed Kafka (MSK, Confluent Cloud) to reduce operational burden
- Authentication: Integrate with your SSO (OIDC/SAML) from day one
Practice Problems
Practice 1
Draw the data flow for what happens when a user searches for "revenue" in DataHub. Which services are involved and in what order?
Practice 2
Your DataHub instance has 1M metadata entities. Search is getting slow. What would you investigate and optimize first?
Quick Reference
| Service | Port | Purpose |
|---|---|---|
| datahub-gms | 8080 | Metadata Service (GraphQL/REST) |
| datahub-frontend | 9002 | React Web UI |
| mysql | 3306 | Metadata persistence |
| elasticsearch | 9200 | Search index |
| kafka | 9092 | Event streaming |
| schema-registry | 8081 | Avro schema management |