What is DataHub?

Overview

Why DataHub Matters

DataHub is an open-source metadata platform, originally developed at LinkedIn, that enables data discovery, data governance, and data observability. With more than 9,000 GitHub stars and adoption at companies such as LinkedIn, Acryl Data, Notion, and Saxo Bank, it has become one of the leading open-source data catalogs.

As data ecosystems grow (hundreds of databases, warehouses, pipelines, dashboards), teams struggle to answer basic questions: "Where is this data?", "Who owns it?", "Is it trustworthy?", "What happens if I change this table?" DataHub answers all of these.

Think of DataHub as a search engine for your data. Just as Google indexes web pages and helps you find information, DataHub indexes your datasets, dashboards, pipelines, and ML models, then helps your team find, understand, and trust their data.

Core Concepts

Metadata

Information about your data: schemas, ownership, lineage, tags, glossary terms, quality metrics. DataHub captures and organizes all of it.

Entities

The things DataHub tracks: Datasets, Dashboards, Data Pipelines, ML Models, Users, Groups, Domains, Glossary Terms.

Aspects

Properties of an entity: a Dataset has aspects like Schema, Ownership, Tags, Lineage, Statistics, Data Quality.
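Entities and their aspects are addressed by a stable identifier called a URN. As a minimal sketch of how a dataset URN is composed (the standalone helper below is illustrative; the `acryl-datahub` SDK ships an equivalent builder):

```python
# DataHub addresses every entity with a URN. A dataset URN embeds the
# platform, the fully qualified dataset name, and the environment.
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Illustrative helper mirroring the SDK's URN format.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

urn = make_dataset_urn("postgres", "my_database.public.users")
print(urn)
# → urn:li:dataset:(urn:li:dataPlatform:postgres,my_database.public.users,PROD)
```

Aspects such as Ownership or Tags are written against that URN, so every system that produces metadata about the same table ends up enriching a single entity.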

Ingestion

The process of pulling metadata from source systems (Snowflake, dbt, Airflow, Kafka, etc.) into DataHub.

DataHub vs Other Data Catalogs

  • DataHub vs Amundsen: DataHub has a richer metadata model, real-time ingestion, and stronger governance features; Amundsen is simpler to run but less extensible.
  • DataHub vs OpenMetadata: similar feature sets, but DataHub has the larger community and more production deployments, while OpenMetadata ships a more polished UI out of the box.
  • DataHub vs commercial tools (Atlan, Collibra, Alation): DataHub is free, open-source, and highly customizable; commercial tools offer a more polished UX and enterprise support.

Hands-On Tutorial

Quick Start with Docker
# Install the DataHub CLI (requires Python 3.8+)
pip install acryl-datahub

# Verify the installation
datahub version

# Pull and start all DataHub containers (requires Docker)
datahub docker quickstart

# DataHub UI available at http://localhost:9002
# Default credentials: datahub / datahub
Ingest Metadata from a PostgreSQL Database
# Create an ingestion recipe (YAML)
# postgres_recipe.yml
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: "my_database"
    username: "datahub"
    password: "${POSTGRES_PASSWORD}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

# Install the Postgres source plugin, then run the ingestion
pip install 'acryl-datahub[postgres]'
datahub ingest -c postgres_recipe.yml

Best Practices

Getting Started Tips

  • Start with automated ingestion from your most critical data sources first
  • Assign data owners early — governance works best with clear accountability
  • Use domains to organize metadata by business area (Marketing, Finance, Engineering)
  • Set up business glossary terms for commonly misunderstood concepts
  • Enable lineage from your orchestrator (Airflow/dbt) to show data flow
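The last tip can be put into practice with another ingestion recipe. A sketch for a dbt source (the file paths are placeholders for wherever your dbt project writes its `target/` artifacts):

```yaml
# dbt_recipe.yml — pulls models, lineage, and docs from dbt artifacts
source:
  type: dbt
  config:
    manifest_path: "./target/manifest.json"   # produced by `dbt run` / `dbt compile`
    catalog_path: "./target/catalog.json"     # produced by `dbt docs generate`
    target_platform: "postgres"               # platform the dbt models run on

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Because dbt already knows which models read from which sources, this recipe gives you table-level lineage with no extra instrumentation.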

Practice Problems

Practice 1: Catalog Assessment

Your company has 500 datasets across 8 databases, 50 Airflow DAGs, and 200 dashboards. No one knows who owns what. Design a rollout plan for DataHub adoption: what do you ingest first, how do you assign ownership, and how do you drive adoption?

Practice 2: Tool Comparison

Your CTO asks you to compare DataHub vs Atlan vs OpenMetadata for a 200-person data org. What criteria would you evaluate? What would you recommend and why?

Quick Reference

Component      | Technology         | Purpose
---------------|--------------------|--------------------------------------
Metadata Store | MySQL / PostgreSQL | Persistent metadata storage
Search Index   | Elasticsearch      | Full-text search over metadata
Event Bus      | Kafka              | Real-time metadata change events
API            | GraphQL + REST     | Read/write metadata programmatically
Frontend       | React              | Web UI for browsing and managing
Ingestion      | Python SDK         | 50+ source connectors
CLI            | datahub CLI        | Command-line management
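The API row can be exercised with a plain HTTP POST. A minimal sketch, assuming the default quickstart ports (GMS serves GraphQL at /api/graphql; the request is built but not sent, since that needs a running instance):

```python
import json
from urllib import request

# GraphQL search query: find datasets matching "users".
QUERY = """
query {
  search(input: {type: DATASET, query: "users", start: 0, count: 5}) {
    total
    searchResults { entity { urn } }
  }
}
"""

def build_search_request(server: str = "http://localhost:8080") -> request.Request:
    # The default quickstart serves GraphQL from the GMS port at /api/graphql.
    body = json.dumps({"query": QUERY}).encode("utf-8")
    return request.Request(
        f"{server}/api/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_search_request()
# request.urlopen(req) would return the search results as JSON;
# it is not called here because it requires a running DataHub instance.
```

The same endpoint powers the web UI's search bar, so anything you can find by browsing you can also retrieve programmatically.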