Overview
The Heart of DataHub
DataHub's metadata model follows the entity-aspect pattern developed at LinkedIn. An Entity is a thing you want to track, identified by a URN; an Aspect is a named group of related properties attached to that entity. Understanding this model is essential for using and customizing DataHub effectively.
Core Concepts
Entity Types
| Entity | URN Prefix | Description |
|---|---|---|
| Dataset | urn:li:dataset | Tables, views, topics, files |
| Dashboard | urn:li:dashboard | BI dashboards (Looker, Tableau) |
| Chart | urn:li:chart | Individual visualizations |
| DataFlow | urn:li:dataFlow | Pipelines (Airflow DAGs) |
| DataJob | urn:li:dataJob | Pipeline tasks (Airflow tasks) |
| MLModel | urn:li:mlModel | ML models |
| GlossaryTerm | urn:li:glossaryTerm | Business vocabulary |
| Domain | urn:li:domain | Business domains |
| CorpUser | urn:li:corpuser | Users |
| CorpGroup | urn:li:corpGroup | Teams/groups |
URN Format
URN Examples
# Dataset URN format:
# urn:li:dataset:(<data platform URN>,<name>,<environment>)
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)
urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)
urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)
# Dashboard URN:
urn:li:dashboard:(looker,dashboards.42)
# Pipeline URNs (a DataJob URN embeds the URN of its parent DataFlow):
urn:li:dataFlow:(airflow,revenue_pipeline,PROD)
urn:li:dataJob:(urn:li:dataFlow:(airflow,revenue_pipeline,PROD),transform_task)
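The URN layouts above can be assembled and taken apart with plain string handling. The following is a minimal sketch: the helper names `dataset_urn`, `data_flow_urn`, `data_job_urn`, and `parse_dataset_urn` are hypothetical and written for illustration only. The DataHub Python SDK ships its own builders (for example `make_dataset_urn` in `datahub.emitter.mce_builder`), which are usually the better choice in real code.

```python
from typing import Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN from a platform name, dataset name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "PROD") -> str:
    """Build a pipeline (DataFlow) URN."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def data_job_urn(flow_urn: str, job_id: str) -> str:
    """Build a task (DataJob) URN; it nests the parent DataFlow URN."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def parse_dataset_urn(urn: str) -> Tuple[str, str, str]:
    """Split a dataset URN into (platform URN, name, environment)."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn}")
    body = urn[len(prefix):-1]
    # The platform URN contains no commas, so the first comma ends it;
    # the environment is whatever follows the last comma.
    platform_urn, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    return platform_urn, name, env


flow = data_flow_urn("airflow", "revenue_pipeline")
print(dataset_urn("snowflake", "mydb.analytics.revenue"))
print(data_job_urn(flow, "transform_task"))
```

Keeping URN construction in one place like this makes it harder for two ingestion jobs to spell the same dataset differently.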
How It Works
Aspects
Each entity has multiple aspects. Aspects are independently versioned and can be updated without affecting other aspects of the same entity.
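To make the independent-versioning idea concrete, here is a toy in-memory sketch. The `AspectStore` class and its methods are hypothetical and do not mirror DataHub's actual storage layout or version numbering; the only point is that each (entity URN, aspect name) pair keeps its own history.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class AspectStore:
    """Toy store: one version history per (entity URN, aspect name) pair."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Dict[str, Any]]] = defaultdict(list)

    def upsert(self, urn: str, aspect_name: str, value: Dict[str, Any]) -> int:
        """Append a new version of one aspect and return how many versions exist."""
        versions = self._history[(urn, aspect_name)]
        versions.append(dict(value))  # copy so later caller mutations do not leak in
        return len(versions)

    def latest(self, urn: str, aspect_name: str) -> Optional[Dict[str, Any]]:
        """Return the most recent version of an aspect, or None if never written."""
        versions = self._history.get((urn, aspect_name))
        return versions[-1] if versions else None

    def version_count(self, urn: str, aspect_name: str) -> int:
        """Return how many versions of this aspect have been written."""
        return len(self._history.get((urn, aspect_name), []))


store = AspectStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
store.upsert(urn, "ownership", {"owners": ["urn:li:corpuser:jane"]})
store.upsert(urn, "datasetProperties", {"description": "Users table"})
store.upsert(urn, "datasetProperties", {"description": "Core users table with PII"})

# Updating datasetProperties twice leaves the ownership history untouched.
print(store.version_count(urn, "ownership"))
print(store.version_count(urn, "datasetProperties"))
```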
Common Dataset Aspects
# SchemaMetadata: column definitions (simplified)
{
  "fields": [
    { "fieldPath": "user_id", "type": "NUMBER", "description": "Primary key" },
    { "fieldPath": "email", "type": "STRING", "description": "User email" }
  ]
}
# Ownership: who owns this dataset (simplified)
{
  "owners": [
    { "owner": "urn:li:corpuser:jane", "type": "DATAOWNER" }
  ]
}
# UpstreamLineage: what feeds into this dataset (simplified)
{
  "upstreams": [
    { "dataset": "urn:li:dataset:(...raw_events...)", "type": "TRANSFORMED" }
  ]
}
Hands-On Tutorial
Emit Metadata via Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

# Connect to the DataHub metadata service
emitter = DatahubRestEmitter("http://localhost:8080")

# Set dataset properties
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
props = DatasetPropertiesClass(
    description="Core users table with PII",
    customProperties={"team": "platform", "sla": "99.9%"},
)
# emit_mcp takes a metadata change proposal, so wrap each aspect first;
# recent SDK versions infer the entity type and aspect name from the arguments
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=props))

# Set ownership
ownership = OwnershipClass(owners=[
    OwnerClass(owner="urn:li:corpuser:jane", type=OwnershipTypeClass.DATAOWNER),
])
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))
Best Practices
- Use consistent URN naming conventions across your organization
- Map your environments (DEV, STAGING, PROD) to DataHub fabric types (for example, a staging environment maps to STG)
- Define required aspects for each entity type (e.g., every dataset must have an owner)
- Use custom properties for org-specific metadata before creating custom aspects
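The "required aspects" practice above can be enforced with a small check run before or after ingestion. This is a minimal sketch under stated assumptions: the `REQUIRED_ASPECTS` policy, `entity_type_of`, and `missing_aspects` are hypothetical names invented here, and the set of aspects present on an entity is passed in by the caller rather than fetched from DataHub.

```python
from typing import Dict, List, Set

# Hypothetical organization policy: aspect names each entity type must carry.
REQUIRED_ASPECTS: Dict[str, Set[str]] = {
    "dataset": {"ownership", "schemaMetadata"},
    "dashboard": {"ownership"},
}


def entity_type_of(urn: str) -> str:
    """Extract the entity type from a URN, e.g. 'urn:li:dataset:(...)' -> 'dataset'."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a DataHub-style URN: {urn}")
    return parts[2]


def missing_aspects(urn: str, present: Set[str]) -> List[str]:
    """Return the required aspect names that the entity does not have yet."""
    required = REQUIRED_ASPECTS.get(entity_type_of(urn), set())
    return sorted(required - present)


users = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"
print(missing_aspects(users, {"datasetProperties"}))
```

Entity types without an entry in the policy are treated as having no requirements, which keeps the check safe to roll out one entity type at a time.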
Practice Problems
Practice 1
Design a URN scheme for a company that uses Snowflake (3 environments), Kafka (2 clusters), and Looker (1 instance). How do you ensure uniqueness?
Practice 2
A dataset has 500 columns. Only 10 are frequently queried. How would you add column-level metadata (popularity, descriptions) efficiently using the aspect model?
Quick Reference
| Aspect | Entity Types | Contains |
|---|---|---|
| SchemaMetadata | Dataset | Columns, types, descriptions |
| Ownership | All | Owners and their roles |
| GlobalTags | All | Classification tags |
| GlossaryTerms | All | Business glossary associations |
| UpstreamLineage | Dataset | Data source dependencies |
| Status | All | Soft-delete flag (removed or active) |
| DatasetProperties | Dataset | Description, custom properties |