Overview
Why DataHub Matters
DataHub is an open-source metadata platform, originally developed at LinkedIn, that supports data discovery, data governance, and data observability. With 9,000+ GitHub stars and adoption at companies like LinkedIn, Acryl Data, Notion, and Saxo Bank, it has become one of the most widely used open-source data catalogs.
As data ecosystems grow to hundreds of databases, warehouses, pipelines, and dashboards, teams struggle to answer basic questions: "Where is this data?", "Who owns it?", "Is it trustworthy?", "What happens if I change this table?" DataHub is built to answer these questions from a single place.
Think of DataHub like a search engine for your data. Just as Google indexes web pages and helps you find information, DataHub indexes your datasets, dashboards, pipelines, and ML models — then helps your team find, understand, and trust their data.
Core Concepts
Metadata
Information about your data: schemas, ownership, lineage, tags, glossary terms, quality metrics. DataHub captures and organizes all of it.
Entities
The things DataHub tracks: Datasets, Dashboards, Data Pipelines, ML Models, Users, Groups, Domains, Glossary Terms.
Aspects
Properties of an entity: a Dataset has aspects like Schema, Ownership, Tags, Lineage, Statistics, Data Quality.
Ingestion
The process of pulling metadata from source systems (Snowflake, dbt, Airflow, Kafka, etc.) into DataHub.
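To make the entity and aspect vocabulary concrete, here is a minimal sketch in plain Python. It does not use the DataHub SDK: the `Entity` class and `dataset_urn` helper are hypothetical stand-ins, and the user and tag names are made up. The URN layout and the aspect names `ownership` and `globalTags` follow DataHub's conventions.

```python
from dataclasses import dataclass, field


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset identifier in DataHub's urn:li:dataset:(...) layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Entity:
    """Toy stand-in for an entity: one URN plus a set of named aspects."""
    urn: str
    aspects: dict = field(default_factory=dict)

    def set_aspect(self, name: str, value: dict) -> None:
        # Each aspect is written as a unit, independently of the other aspects.
        self.aspects[name] = value


# Hypothetical dataset with two aspects attached.
users = Entity(dataset_urn("postgres", "my_database.public.users"))
users.set_aspect("ownership", {"owners": ["urn:li:corpuser:jdoe"]})
users.set_aspect("globalTags", {"tags": ["urn:li:tag:pii"]})

print(users.urn)
print(sorted(users.aspects))
```

The point of the split is that ownership can be updated without touching tags or schema, which is why ingestion sources and UI edits can write to the same entity without overwriting each other's aspects.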
What Problems Does DataHub Solve?
- Data discovery: Find the right dataset in seconds, not hours
- Data governance: Know who owns data, what policies apply, who has access
- Data lineage: See how data flows from source to dashboard, track impact of changes
- Data quality: Monitor freshness, completeness, and accuracy of datasets
- Collaboration: Document data with descriptions, tags, glossary terms
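The lineage bullet above ("track impact of changes") comes down to a graph traversal. The sketch below illustrates the idea with a breadth-first walk over a small, made-up lineage graph; the asset names and the `downstream_impact` helper are hypothetical, and this is not how DataHub stores or queries lineage internally.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to its direct downstream assets.
LINEAGE = {
    "postgres.orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_revenue", "dbt.fct_orders"],
    "dbt.fct_revenue": ["dashboard.revenue_weekly"],
    "dbt.fct_orders": [],
    "dashboard.revenue_weekly": [],
}


def downstream_impact(graph, start):
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream_impact(LINEAGE, "postgres.orders"))
```

Changing a column in `postgres.orders` would, in this toy graph, potentially affect all four downstream assets, including the weekly revenue dashboard.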
How It Works
DataHub vs Other Data Catalogs
DataHub vs Amundsen: DataHub has a richer metadata model, real-time ingestion, and stronger governance features. Amundsen is simpler but less extensible.
DataHub vs OpenMetadata: The feature sets are similar, but DataHub has a larger community and more production deployments. OpenMetadata has a more polished UI out of the box.
DataHub vs Commercial (Atlan, Collibra, Alation): DataHub is free, open-source, and highly customizable. Commercial tools offer more polished UX and enterprise support.
Key Architecture Components
- Metadata Store: MySQL/PostgreSQL for structured metadata + Elasticsearch for search
- Kafka: Event-driven metadata changes (real-time updates)
- GraphQL API: Primary interface for reading/writing metadata from the UI and client applications; a REST API on the metadata service receives ingestion writes
- React Frontend: Web UI for browsing, searching, and managing metadata
- Ingestion Framework: Python-based connectors for 50+ data sources
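As an illustration of the GraphQL component, the sketch below builds a dataset search request using only the standard library. Treat the details as assumptions: the endpoint assumes a local quickstart metadata service on port 8080, the helper names are hypothetical, and a deployment with authentication enabled needs a personal access token.

```python
import json
import urllib.request

# Assumed endpoint for a local quickstart; adjust host and port for your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

SEARCH_QUERY = """
query search($text: String!) {
  search(input: {type: DATASET, query: $text, start: 0, count: 5}) {
    total
    searchResults { entity { urn } }
  }
}
"""


def build_search_request(text, token=None):
    """Build (but do not send) a POST request that searches datasets by keyword."""
    payload = json.dumps({"query": SEARCH_QUERY, "variables": {"text": text}})
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when token-based authentication is enabled.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        GRAPHQL_URL, data=payload.encode("utf-8"), headers=headers, method="POST"
    )


def search_datasets(text, token=None):
    """Send the search request and return the URNs of matching datasets."""
    with urllib.request.urlopen(build_search_request(text, token)) as resp:
        body = json.load(resp)
    return [hit["entity"]["urn"] for hit in body["data"]["search"]["searchResults"]]
```

With a running instance, `search_datasets("orders")` would return the URNs of up to five matching datasets. Building the request separately from sending it keeps the query easy to inspect before it touches a live server.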
Hands-On Tutorial
# Clone and start DataHub
git clone https://github.com/datahub-project/datahub.git
cd datahub/docker/quickstart
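./quickstart.sh
# Alternative: with the CLI installed (see below), `datahub docker quickstart` also starts a local instance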
# DataHub UI available at http://localhost:9002
# Default credentials: datahub / datahub
# Install the Python CLI together with the Postgres source plugin used below
pip install 'acryl-datahub[postgres]'
# Verify installation
datahub version
# Create an ingestion recipe (YAML)
# postgres_recipe.yml
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: "my_database"
    username: "datahub"
    password: "${POSTGRES_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
# Run ingestion
datahub ingest -c postgres_recipe.yml
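The same recipe can also be run from Python, which is handy when ingestion is triggered from an orchestrator. The sketch below mirrors the YAML recipe as a dictionary and hands it to the `Pipeline` class from the `acryl-datahub` package. The `build_recipe` and `run_ingestion` helpers are hypothetical names, and the connection values are the same placeholders as in the recipe.

```python
def build_recipe(password):
    """Return the Postgres recipe above as a plain dictionary."""
    return {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",
                "database": "my_database",
                "username": "datahub",
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_ingestion(recipe):
    """Run one ingestion pass and raise if the pipeline reports failures."""
    # Imported lazily so the recipe can be built and inspected
    # on a machine that does not have acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

A typical call would be `run_ingestion(build_recipe(os.environ["POSTGRES_PASSWORD"]))`, which keeps the password out of source control in the same way the `${POSTGRES_PASSWORD}` placeholder does in the YAML file.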
Best Practices
Getting Started Tips
- Start with automated ingestion from your most critical data sources
- Assign data owners early — governance works best with clear accountability
- Use domains to organize metadata by business area (Marketing, Finance, Engineering)
- Set up business glossary terms for commonly misunderstood concepts
- Enable lineage from your orchestrator (Airflow/dbt) to show data flow
Practice Problems
Practice 1: Catalog Assessment
Your company has 500 datasets across 8 databases, 50 Airflow DAGs, and 200 dashboards. No one knows who owns what. Design a rollout plan for DataHub adoption: what do you ingest first, how do you assign ownership, and how do you drive adoption?
Practice 2: Tool Comparison
Your CTO asks you to compare DataHub vs Atlan vs OpenMetadata for a 200-person data org. What criteria would you evaluate? What would you recommend and why?
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Metadata Store | MySQL / PostgreSQL | Persistent metadata storage |
| Search Index | Elasticsearch | Full-text search over metadata |
| Event Bus | Kafka | Real-time metadata change events |
| API | GraphQL + REST | Read/write metadata programmatically |
| Frontend | React | Web UI for browsing and managing |
| Ingestion | Python SDK | 50+ source connectors |
| CLI | datahub CLI | Command-line management |