DataHub Metadata Model & Entities

Overview

The Heart of DataHub

DataHub's metadata model follows the Entity-Aspect pattern developed at LinkedIn. Every piece of metadata is either an Entity (a thing you want to track, identified by a URN) or an Aspect (a group of related properties attached to that thing). Understanding this model is essential for using and customizing DataHub effectively.

Core Concepts

Entity Types

Entity	URN Prefix	Description
Dataset	urn:li:dataset	Tables, views, topics, files
Dashboard	urn:li:dashboard	BI dashboards (Looker, Tableau)
Chart	urn:li:chart	Individual visualizations
DataFlow	urn:li:dataFlow	Pipelines (Airflow DAGs)
DataJob	urn:li:dataJob	Pipeline tasks (Airflow tasks)
MLModel	urn:li:mlModel	ML models
GlossaryTerm	urn:li:glossaryTerm	Business vocabulary
Domain	urn:li:domain	Business domains
CorpUser	urn:li:corpuser	Users
CorpGroup	urn:li:corpGroup	Teams/groups

URN Format

URN Examples

# Dataset URN format:
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<environment>)
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)
urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)
urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)

# Dashboard URN:
urn:li:dashboard:(looker,dashboards.42)

# Pipeline URNs (a DataJob URN embeds the URN of its parent DataFlow):
urn:li:dataFlow:(airflow,revenue_pipeline,PROD)
urn:li:dataJob:(urn:li:dataFlow:(airflow,revenue_pipeline,PROD),transform_task)
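
The dataset URN format above can be assembled and taken apart with plain string handling. The sketch below is illustrative only: `dataset_urn` and `parse_dataset_urn` are hypothetical helpers written for this example, not part of DataHub. The DataHub Python SDK ships its own builder (`make_dataset_urn` in `datahub.emitter.mce_builder`), which is the better choice in real ingestion code.

```python
import re
from typing import Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN from platform, name, and environment (hypothetical helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The platform and environment cannot contain commas; the name is matched greedily.
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,()]+)\)$"
)


def parse_dataset_urn(urn: str) -> Tuple[str, str, str]:
    """Split a dataset URN back into (platform, name, environment)."""
    match = _DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return match.group(1), match.group(2), match.group(3)


urn = dataset_urn("snowflake", "mydb.analytics.revenue")
print(urn)
print(parse_dataset_urn(urn))
```

Keeping URN construction in one helper makes it harder for two teams to spell the same table differently, which is the usual source of duplicate entities.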

How It Works

Aspects

Each entity has multiple aspects. Aspects are independently versioned and can be updated without affecting other aspects of the same entity.
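
To make "independently versioned" concrete, here is a toy in-memory model in which every (URN, aspect name) pair keeps its own history. `ToyAspectStore` is a hypothetical class invented for this illustration; it is not how DataHub stores aspects and it makes no claim about DataHub's version numbering.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class ToyAspectStore:
    """Toy store: each (urn, aspect name) pair has its own version history."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Dict[str, Any]]] = defaultdict(list)

    def upsert(self, urn: str, aspect: str, value: Dict[str, Any]) -> int:
        """Append a new version of one aspect and return its version count."""
        versions = self._history[(urn, aspect)]
        versions.append(value)
        return len(versions)

    def latest(self, urn: str, aspect: str) -> Optional[Dict[str, Any]]:
        """Return the most recent value of an aspect, or None if never written."""
        versions = self._history.get((urn, aspect))
        return versions[-1] if versions else None

    def version_count(self, urn: str, aspect: str) -> int:
        """Return how many versions of this aspect have been written."""
        return len(self._history.get((urn, aspect), []))


store = ToyAspectStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)"
store.upsert(urn, "ownership", {"owners": ["urn:li:corpuser:jane"]})
store.upsert(urn, "datasetProperties", {"description": "Click events"})
store.upsert(urn, "datasetProperties", {"description": "User click events"})

# Updating datasetProperties twice leaves the ownership history untouched.
print(store.version_count(urn, "ownership"))
print(store.version_count(urn, "datasetProperties"))
```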

Common Dataset Aspects

# SchemaMetadata: column definitions (simplified; the real aspect also records each field's native data type)
{
  "fields": [
    { "fieldPath": "user_id", "type": "NUMBER", "description": "Primary key" },
    { "fieldPath": "email", "type": "STRING", "description": "User email" }
  ]
}

# Ownership — who owns this dataset
{
  "owners": [
    { "owner": "urn:li:corpuser:jane", "type": "DATAOWNER" }
  ]
}

# UpstreamLineage — what feeds into this dataset
{
  "upstreams": [
    { "dataset": "urn:li:dataset:(...raw_events...)", "type": "TRANSFORMED" }
  ]
}
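
Because each UpstreamLineage aspect lists only direct parents, finding everything a dataset depends on means following the edges repeatedly. The sketch below does that over a small hand-written dictionary shaped like the aspect above. The short names (`revenue`, `orders`, `raw_events`) are hypothetical stand-ins for full dataset URNs, and `all_upstreams` is a helper written for this example, not a DataHub API.

```python
from typing import Dict, Set

# Hypothetical data: dataset name -> its UpstreamLineage aspect.
upstream_lineage: Dict[str, dict] = {
    "revenue": {"upstreams": [{"dataset": "orders", "type": "TRANSFORMED"}]},
    "orders": {"upstreams": [{"dataset": "raw_events", "type": "TRANSFORMED"}]},
    "raw_events": {"upstreams": []},
}


def all_upstreams(urn: str, lineage: Dict[str, dict]) -> Set[str]:
    """Collect every direct and indirect upstream of `urn` with a depth-first walk."""
    seen: Set[str] = set()
    stack = [urn]
    while stack:
        current = stack.pop()
        for edge in lineage.get(current, {}).get("upstreams", []):
            parent = edge["dataset"]
            if parent not in seen:  # the seen set also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(all_upstreams("revenue", upstream_lineage)))  # ['orders', 'raw_events']
```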

Hands-On Tutorial

Emit Metadata via Python SDK

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

# Connect to the DataHub metadata service (GMS)
emitter = DatahubRestEmitter("http://localhost:8080")

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"

# Set dataset properties
props = DatasetPropertiesClass(
    description="Core users table with PII",
    customProperties={"team": "platform", "sla": "99.9%"},
)
# emit_mcp takes a single proposal object; in recent SDK versions the wrapper
# derives the aspect name from the aspect instance
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=props))

# Set ownership
ownership = OwnershipClass(owners=[
    OwnerClass(owner="urn:li:corpuser:jane", type=OwnershipTypeClass.DATAOWNER)
])
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))

Best Practices

Use consistent URN naming conventions across your organization
Map your environments (DEV, STAGING, PROD) to DataHub fabric types (for example, a staging environment maps to the STG fabric type)
Define required aspects for each entity type (e.g., every dataset must have an owner)
Use custom properties for org-specific metadata before creating custom aspects
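
A "required aspects" rule such as "every dataset must have an owner" can be checked mechanically before or after ingestion. The following is a minimal sketch under stated assumptions: the `REQUIRED_ASPECTS` policy and the `missing_aspects` helper are hypothetical, and in practice the set of aspects present on an entity would come from DataHub rather than a hard-coded list.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical organization policy: entity type -> aspect names that must be present.
REQUIRED_ASPECTS: Dict[str, Set[str]] = {
    "dataset": {"ownership", "schemaMetadata"},
    "dashboard": {"ownership"},
}


def missing_aspects(entity_type: str, present: Iterable[str]) -> List[str]:
    """Return the required aspects that an entity of this type does not yet have."""
    required = REQUIRED_ASPECTS.get(entity_type, set())
    return sorted(required - set(present))


# A dataset with a schema and properties but no owner fails the policy.
print(missing_aspects("dataset", ["schemaMetadata", "datasetProperties"]))  # ['ownership']
```

Entity types without an entry in the policy are treated as having no requirements, so the check can be rolled out one entity type at a time.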

Practice Problems

Practice 1

Design a URN scheme for a company that uses Snowflake (3 environments), Kafka (2 clusters), and Looker (1 instance). How do you ensure uniqueness?

Practice 2

A dataset has 500 columns. Only 10 are frequently queried. How would you add column-level metadata (popularity, descriptions) efficiently using the aspect model?

Quick Reference

Aspect	Entity Types	Contains
SchemaMetadata	Dataset	Columns, types, descriptions
Ownership	All	Owners and their roles
GlobalTags	All	Classification tags
GlossaryTerms	All	Business glossary associations
UpstreamLineage	Dataset	Data source dependencies
Status	All	Soft-delete flag (removed or active); deprecation is a separate aspect
DatasetProperties	Dataset	Description, custom properties