## Overview

### Automated Metadata Collection

DataHub's ingestion framework pulls metadata from 50+ sources using a recipe pattern: a YAML config that declares a source, optional transformers, and a sink.
## Core Concepts

- **Recipe**: YAML config defining the source, transformers, and sink.
- **Source**: connector for a metadata system such as Snowflake, BigQuery, Kafka, or dbt.
- **Transformer**: middleware that modifies metadata in-flight.
- **Sink**: destination for the metadata, either `datahub-rest` or `datahub-kafka`.
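Put together, every recipe follows the same three-part shape. A skeleton with placeholder values (not a runnable config):

```yaml
# Skeleton of the recipe pattern -- placeholder values only.
source:
  type: <connector-name>        # e.g. snowflake, bigquery, kafka
  config: {}                    # connector-specific settings
transformers: []                # optional: modify metadata in-flight
sink:
  type: datahub-rest            # or datahub-kafka
  config:
    server: "http://localhost:8080"
```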
## How It Works

### Ingestion Recipe
```yaml
# snowflake_recipe.yml
source:
  type: snowflake
  config:
    account_id: "mycompany.us-east-1"
    username: "DATAHUB_USER"
    password: "${SNOWFLAKE_PASSWORD}"
    include_table_lineage: true
transformers:
  - type: simple_add_dataset_ownership
    config:
      owner_urns: ["urn:li:corpuser:data-team"]
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```

## Hands-On Tutorial
### Run Ingestion

```shell
datahub ingest -c snowflake_recipe.yml
datahub ingest -c snowflake_recipe.yml --dry-run   # preview without writing
```

## Best Practices
- Schedule via Airflow or cron
- Use stateful ingestion for incremental updates
- Enable table + column lineage
- Auto-tag PII columns via transformers
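Stateful ingestion and auto-tagging are both configured in the recipe itself. A hedged sketch of the relevant fragments (option names follow DataHub's Snowflake source and the `simple_add_dataset_tags` transformer; verify them against the current source docs):

```yaml
source:
  type: snowflake
  config:
    # ... connection options as in the recipe above ...
    stateful_ingestion:
      enabled: true              # only emit changes since the last run
      remove_stale_metadata: true
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns: ["urn:li:tag:pii"]   # illustrative tag URN; real PII tagging
                                     # would filter on column names or a
                                     # classification rule
```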
## Practice Problems

### Practice 1

Write a recipe for BigQuery that only ingests `*_fact` and `*_dim` tables, with auto-tagging.
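One hedged starting point for the table filter (a `table_pattern` allow/deny option exists on DataHub's warehouse sources, but check the BigQuery source docs for exact field names; all values here are illustrative):

```yaml
source:
  type: bigquery
  config:
    project_id: "my-gcp-project"       # illustrative project
    table_pattern:
      allow:
        - ".*_fact$"
        - ".*_dim$"
transformers:
  - type: simple_add_dataset_tags      # auto-tagging step
    config:
      tag_urns: ["urn:li:tag:warehouse-model"]   # illustrative tag URN
```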
## Quick Reference
| Source | Type | Lineage |
|---|---|---|
| Snowflake | snowflake | Yes |
| BigQuery | bigquery | Yes |
| dbt | dbt-cloud | Yes |
| Airflow | airflow | Yes |
| Kafka | kafka | No |