What is DataHub?

Overview

Why DataHub Matters

DataHub is an open-source metadata platform, originally developed at LinkedIn, that enables data discovery, data governance, and data observability. With more than 9,000 GitHub stars and adoption at companies such as LinkedIn, Acryl Data, Notion, and Saxo Bank, it has become one of the leading open-source data catalogs.

As data ecosystems grow (hundreds of databases, warehouses, pipelines, dashboards), teams struggle to answer basic questions: "Where is this data?", "Who owns it?", "Is it trustworthy?", "What happens if I change this table?" DataHub answers all of these.

Think of DataHub as a search engine for your data. Just as Google indexes web pages and helps you find information, DataHub indexes your datasets, dashboards, pipelines, and ML models, then helps your team find, understand, and trust their data.

Core Concepts

Metadata

Information about your data: schemas, ownership, lineage, tags, glossary terms, quality metrics. DataHub captures and organizes all of it.

Entities

The things DataHub tracks: Datasets, Dashboards, Data Pipelines, ML Models, Users, Groups, Domains, Glossary Terms.

Aspects

Properties of an entity: a Dataset has aspects like Schema, Ownership, Tags, Lineage, Statistics, Data Quality.
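Entities and their aspects are addressed by a stable identifier called a URN. As a minimal sketch of how a dataset URN is composed (the standalone helper below is illustrative; the `acryl-datahub` SDK ships an equivalent builder):

```python
# DataHub addresses every entity with a URN. A dataset URN embeds the
# platform, the fully qualified dataset name, and the environment.
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Illustrative helper mirroring the SDK's URN format.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

urn = make_dataset_urn("postgres", "my_database.public.users")
print(urn)
# → urn:li:dataset:(urn:li:dataPlatform:postgres,my_database.public.users,PROD)
```

Aspects such as Ownership or Tags are written against that URN, so every system that produces metadata about the same table ends up enriching a single entity.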

Ingestion

The process of pulling metadata from source systems (Snowflake, dbt, Airflow, Kafka, etc.) into DataHub.

DataHub vs Other Data Catalogs

  • DataHub vs Amundsen: DataHub has a richer metadata model, real-time ingestion, and stronger governance features; Amundsen is simpler to run but less extensible.
  • DataHub vs OpenMetadata: similar feature sets, but DataHub has the larger community and more production deployments, while OpenMetadata ships a more polished UI out of the box.
  • DataHub vs commercial tools (Atlan, Collibra, Alation): DataHub is free, open-source, and highly customizable; commercial tools offer a more polished UX and enterprise support.

Hands-On Tutorial

Quick Start with Docker
# Install the DataHub CLI (requires Python 3.8+)
pip install acryl-datahub

# Verify the installation
datahub version

# Pull and start all DataHub containers (requires Docker)
datahub docker quickstart

# DataHub UI available at http://localhost:9002
# Default credentials: datahub / datahub
Ingest Metadata from a PostgreSQL Database
# Create an ingestion recipe (YAML)
# postgres_recipe.yml
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: "my_database"
    username: "datahub"
    password: "${POSTGRES_PASSWORD}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

# Install the Postgres source plugin, then run the ingestion
pip install 'acryl-datahub[postgres]'
datahub ingest -c postgres_recipe.yml

Best Practices

Getting Started Tips

  • Start with automated ingestion from your most critical data sources first
  • Assign data owners early — governance works best with clear accountability
  • Use domains to organize metadata by business area (Marketing, Finance, Engineering)
  • Set up business glossary terms for commonly misunderstood concepts
  • Enable lineage from your orchestrator (Airflow/dbt) to show data flow
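The last tip can be put into practice with another ingestion recipe. A sketch for a dbt source (the file paths are placeholders for wherever your dbt project writes its `target/` artifacts):

```yaml
# dbt_recipe.yml — pulls models, lineage, and docs from dbt artifacts
source:
  type: dbt
  config:
    manifest_path: "./target/manifest.json"   # produced by `dbt run` / `dbt compile`
    catalog_path: "./target/catalog.json"     # produced by `dbt docs generate`
    target_platform: "postgres"               # platform the dbt models run on

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Because dbt already knows which models read from which sources, this recipe gives you table-level lineage with no extra instrumentation.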

Practice Problems

Practice 1: Catalog Assessment

Your company has 500 datasets across 8 databases, 50 Airflow DAGs, and 200 dashboards. No one knows who owns what. Design a rollout plan for DataHub adoption: what do you ingest first, how do you assign ownership, and how do you drive adoption?

Practice 2: Tool Comparison

Your CTO asks you to compare DataHub vs Atlan vs OpenMetadata for a 200-person data org. What criteria would you evaluate? What would you recommend and why?

Quick Reference

Component      | Technology         | Purpose
---------------|--------------------|--------------------------------------
Metadata Store | MySQL / PostgreSQL | Persistent metadata storage
Search Index   | Elasticsearch      | Full-text search over metadata
Event Bus      | Kafka              | Real-time metadata change events
API            | GraphQL + REST     | Read/write metadata programmatically
Frontend       | React              | Web UI for browsing and managing
Ingestion      | Python SDK         | 50+ source connectors
CLI            | datahub CLI        | Command-line management
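The API row can be exercised with a plain HTTP POST. A minimal sketch, assuming the default quickstart ports (GMS serves GraphQL at /api/graphql; the request is built but not sent, since that needs a running instance):

```python
import json
from urllib import request

# GraphQL search query: find datasets matching "users".
QUERY = """
query {
  search(input: {type: DATASET, query: "users", start: 0, count: 5}) {
    total
    searchResults { entity { urn } }
  }
}
"""

def build_search_request(server: str = "http://localhost:8080") -> request.Request:
    # The default quickstart serves GraphQL from the GMS port at /api/graphql.
    body = json.dumps({"query": QUERY}).encode("utf-8")
    return request.Request(
        f"{server}/api/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_search_request()
# request.urlopen(req) would return the search results as JSON;
# it is not called here because it requires a running DataHub instance.
```

The same endpoint powers the web UI's search bar, so anything you can find by browsing you can also retrieve programmatically.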