Ingestion Framework

Difficulty: Medium · 25 min read

Overview

Automated Metadata Collection

DataHub's ingestion framework pulls metadata from 50+ sources using a recipe pattern: a YAML config that declares a source, optional transformers, and a sink.
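Every recipe shares the same three-part shape. The sketch below is illustrative; the angle-bracket placeholders stand in for real connector and transformer names:

```yaml
source:            # where metadata is pulled from
  type: <connector-type>
  config: {}       # connector-specific settings
transformers:      # optional: modify metadata in-flight
  - type: <transformer-type>
sink:              # where metadata is pushed to
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```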

Core Concepts

Recipe

YAML config defining source, transformers, and sink.

Source

Connector for Snowflake, BigQuery, Kafka, dbt, etc.

Transformer

Middleware to modify metadata in-flight.

Sink

Destination: datahub-rest or datahub-kafka.

How It Works

Ingestion Recipe
# snowflake_recipe.yml
source:
  type: snowflake
  config:
    account_id: "mycompany.us-east-1"
    username: "DATAHUB_USER"
    password: "${SNOWFLAKE_PASSWORD}"
    include_table_lineage: true
transformers:
  - type: simple_add_dataset_ownership
    config:
      owner_urns: ["urn:li:corpuser:data-team"]
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
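The `${SNOWFLAKE_PASSWORD}` placeholder is resolved from the environment when the recipe is loaded, so secrets stay out of version control. A minimal Python sketch of that substitution (illustrative only, not DataHub's actual implementation):

```python
import os
import re

def expand_env(value: str) -> str:
    """Resolve ${VAR} placeholders from the environment, in the
    spirit of what the ingestion CLI does before parsing a recipe.
    Unset variables expand to an empty string in this sketch."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["SNOWFLAKE_PASSWORD"] = "s3cret"
print(expand_env("password: ${SNOWFLAKE_PASSWORD}"))  # password: s3cret
```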

Hands-On Tutorial

Run Ingestion
datahub ingest -c snowflake_recipe.yml
datahub ingest -c snowflake_recipe.yml --dry-run  # Preview

Best Practices

- Keep credentials out of recipes: reference environment variables (e.g. ${SNOWFLAKE_PASSWORD}) instead of hard-coding secrets.
- Preview a run with --dry-run before writing anything to the sink.
- Scope sources with allow/deny patterns so you ingest only the tables you need.

Practice Problems

Practice 1

Write a recipe for BigQuery that only ingests *_fact and *_dim tables with auto-tagging.
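One possible answer, sketched under the assumption that the bigquery source supports table_pattern allow filters and that the simple_add_dataset_tags transformer is available (verify field names against your DataHub version's docs; the project ID and tag URN are hypothetical):

```yaml
# bigquery_recipe.yml -- one possible solution
source:
  type: bigquery
  config:
    project_id: "my-gcp-project"        # hypothetical project
    table_pattern:
      allow:                            # regexes: only fact and dim tables
        - ".*_fact$"
        - ".*_dim$"
transformers:
  - type: simple_add_dataset_tags      # auto-tag every ingested table
    config:
      tag_urns: ["urn:li:tag:warehouse-model"]
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```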

Quick Reference

Source      Type        Lineage
Snowflake   snowflake   Yes
BigQuery    bigquery    Yes
dbt         dbt-cloud   Yes
Airflow     airflow     Yes
Kafka       kafka       No