Metadata Model & Entities

Medium 25 min read

Overview

The Heart of DataHub

DataHub's metadata model is based on the Entity-Aspect pattern developed at LinkedIn. Every piece of metadata is either an Entity (a thing you want to track, identified by a URN) or an Aspect (a named group of related properties attached to that thing). Understanding this model is essential for using and customizing DataHub effectively.

Core Concepts

Entity Types

Entity | URN Prefix | Description
Dataset | urn:li:dataset | Tables, views, topics, files
Dashboard | urn:li:dashboard | BI dashboards (Looker, Tableau)
Chart | urn:li:chart | Individual visualizations
DataFlow | urn:li:dataFlow | Pipelines (Airflow DAGs)
DataJob | urn:li:dataJob | Pipeline tasks (Airflow tasks)
MLModel | urn:li:mlModel | ML models
GlossaryTerm | urn:li:glossaryTerm | Business vocabulary
Domain | urn:li:domain | Business domains
CorpUser | urn:li:corpuser | Users
CorpGroup | urn:li:corpGroup | Teams/groups

URN Format

URN Examples
# Dataset URN format (the platform is itself a dataPlatform URN):
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<environment>)
urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.analytics.revenue,PROD)
urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)
urn:li:dataset:(urn:li:dataPlatform:kafka,events.user_clicks,PROD)

# Dashboard URN:
urn:li:dashboard:(looker,dashboards.42)

# Pipeline URNs (a DataJob URN embeds the URN of its parent DataFlow):
urn:li:dataFlow:(airflow,revenue_pipeline,PROD)
urn:li:dataJob:(urn:li:dataFlow:(airflow,revenue_pipeline,PROD),transform_task)
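
The URN shapes above can be assembled with plain string formatting. The following is a minimal sketch, not the SDK's implementation: the helper names `dataset_urn`, `data_flow_urn`, `data_job_urn`, and `entity_type` are made up for illustration. In real code, the builders that ship with the Python SDK in `datahub.emitter.mce_builder` (for example `make_dataset_urn`) are likely the better choice.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN; the platform is wrapped in its own dataPlatform URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "PROD") -> str:
    """Build a pipeline (DataFlow) URN."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def data_job_urn(orchestrator: str, flow_id: str, job_id: str, cluster: str = "PROD") -> str:
    """Build a task (DataJob) URN, which nests the parent DataFlow URN."""
    return f"urn:li:dataJob:({data_flow_urn(orchestrator, flow_id, cluster)},{job_id})"


def entity_type(urn: str) -> str:
    """Return the entity type segment, e.g. 'dataset' for 'urn:li:dataset:(...)'."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a DataHub-style URN: {urn!r}")
    return parts[2]


if __name__ == "__main__":
    print(dataset_urn("snowflake", "mydb.analytics.revenue"))
    print(data_job_urn("airflow", "revenue_pipeline", "transform_task"))
    print(entity_type("urn:li:corpuser:jane"))
```

Because the environment (or cluster) is part of the key, the same table name in two environments yields two distinct URNs.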

How It Works

Aspects

Each entity has multiple aspects. Aspects are independently versioned and can be updated without affecting other aspects of the same entity.
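
To make "independently versioned" concrete, here is a toy in-memory model. It is only a sketch of the idea: the `AspectStore` class and its methods are hypothetical, and DataHub's actual storage layer and version numbering are not reproduced here. Each (entity URN, aspect name) pair keeps its own history, so writing one aspect never touches another.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class AspectStore:
    """Toy store: every (entity URN, aspect name) pair has its own version history."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Any]] = defaultdict(list)

    def upsert(self, urn: str, aspect_name: str, value: Any) -> int:
        """Append a new version of one aspect; return how many versions it now has."""
        versions = self._history[(urn, aspect_name)]
        versions.append(value)
        return len(versions)

    def latest(self, urn: str, aspect_name: str) -> Any:
        """Return the most recent value of one aspect."""
        versions = self._history.get((urn, aspect_name))
        if not versions:
            raise KeyError((urn, aspect_name))
        return versions[-1]

    def version_count(self, urn: str, aspect_name: str) -> int:
        """Return the number of stored versions for one aspect (0 if never written)."""
        return len(self._history.get((urn, aspect_name), []))


store = AspectStore()
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"

store.upsert(urn, "schemaMetadata", {"fields": [{"fieldPath": "user_id"}]})
store.upsert(urn, "ownership", {"owners": ["urn:li:corpuser:jane"]})
# Updating ownership again leaves the schemaMetadata history untouched.
store.upsert(urn, "ownership", {"owners": ["urn:li:corpuser:jane", "urn:li:corpuser:raj"]})
```

After these three writes, `ownership` has two versions while `schemaMetadata` still has one.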

Common Dataset Aspects
# SchemaMetadata: column definitions (simplified; the real aspect nests each field's type and also records nativeDataType)
{
  "fields": [
    { "fieldPath": "user_id", "type": "NUMBER", "description": "Primary key" },
    { "fieldPath": "email", "type": "STRING", "description": "User email" }
  ]
}

# Ownership — who owns this dataset
{
  "owners": [
    { "owner": "urn:li:corpuser:jane", "type": "DATAOWNER" }
  ]
}

# UpstreamLineage — what feeds into this dataset
{
  "upstreams": [
    { "dataset": "urn:li:dataset:(...raw_events...)", "type": "TRANSFORMED" }
  ]
}
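
Since each UpstreamLineage aspect lists only direct upstreams, finding everything that feeds a dataset means following those edges transitively. The sketch below walks dictionaries shaped like the simplified aspect above. The dataset names, the `ds` helper, and `all_upstreams` are illustrative assumptions, and a real deployment would read lineage from DataHub rather than from a hard-coded dict.

```python
from collections import deque
from typing import Dict, List


def ds(name: str) -> str:
    """Hypothetical helper: build a Snowflake dataset URN for the examples below."""
    return f"urn:li:dataset:(urn:li:dataPlatform:snowflake,{name},PROD)"


# Made-up UpstreamLineage aspect values, keyed by the downstream dataset's URN.
LINEAGE: Dict[str, dict] = {
    ds("revenue"): {"upstreams": [
        {"dataset": ds("orders"), "type": "TRANSFORMED"},
        {"dataset": ds("customers"), "type": "TRANSFORMED"},
    ]},
    ds("orders"): {"upstreams": [{"dataset": ds("raw_events"), "type": "TRANSFORMED"}]},
    ds("customers"): {"upstreams": [{"dataset": ds("raw_events"), "type": "TRANSFORMED"}]},
}


def all_upstreams(urn: str, lineage: Dict[str, dict]) -> List[str]:
    """Breadth-first walk over 'upstreams' edges; each transitive upstream appears once."""
    seen = {urn}
    order: List[str] = []
    queue = deque([urn])
    while queue:
        current = queue.popleft()
        for edge in lineage.get(current, {}).get("upstreams", []):
            upstream = edge["dataset"]
            if upstream not in seen:
                seen.add(upstream)
                order.append(upstream)
                queue.append(upstream)
    return order


result = all_upstreams(ds("revenue"), LINEAGE)
```

Here `result` lists `orders`, `customers`, then `raw_events`; the shared `raw_events` source is reported only once even though two paths reach it.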

Hands-On Tutorial

Emit Metadata via Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass, OwnerClass, OwnershipClass, OwnershipTypeClass
)

# Connect to the DataHub metadata service (GMS) REST endpoint
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.users,PROD)"

# Set dataset properties
props = DatasetPropertiesClass(
    description="Core users table with PII",
    customProperties={"team": "platform", "sla": "99.9%"}
)
# emit_mcp expects a MetadataChangeProposalWrapper; the aspect name
# ("datasetProperties") is inferred from the aspect class
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=props))

# Set ownership
ownership = OwnershipClass(owners=[
    OwnerClass(owner="urn:li:corpuser:jane", type=OwnershipTypeClass.DATAOWNER)
])
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership))

Best Practices

Practice Problems

Practice 1

Design a URN scheme for a company that uses Snowflake (3 environments), Kafka (2 clusters), and Looker (1 instance). How do you ensure uniqueness?

Practice 2

A dataset has 500 columns. Only 10 are frequently queried. How would you add column-level metadata (popularity, descriptions) efficiently using the aspect model?

Quick Reference

Aspect | Entity Types | Contains
SchemaMetadata | Dataset | Columns, types, descriptions
Ownership | All | Owners and their roles
GlobalTags | All | Classification tags
GlossaryTerms | All | Business glossary associations
UpstreamLineage | Dataset | Upstream dataset dependencies
Status | All | Soft-delete flag (removed or active)
DatasetProperties | Dataset | Description, custom properties