Ingestion Framework

Difficulty: Medium · 25 min read

Overview

Automated Metadata Collection

DataHub's ingestion framework pulls metadata from 50+ sources using a recipe pattern: a YAML config that declares a source, optional transformers, and a sink.
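Every recipe shares the same three-part shape. The sketch below is illustrative; the angle-bracket placeholders stand in for real connector and transformer names:

```yaml
source:            # where metadata is pulled from
  type: <connector-type>
  config: {}       # connector-specific settings
transformers:      # optional: modify metadata in-flight
  - type: <transformer-type>
sink:              # where metadata is pushed to
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```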

Core Concepts

Recipe

YAML config defining source, transformers, and sink.

Source

Connector for Snowflake, BigQuery, Kafka, dbt, etc.

Transformer

Middleware to modify metadata in-flight.

Sink

Destination: datahub-rest or datahub-kafka.

How It Works

Ingestion Recipe
# snowflake_recipe.yml
source:
  type: snowflake
  config:
    account_id: "mycompany.us-east-1"
    username: "DATAHUB_USER"
    password: "${SNOWFLAKE_PASSWORD}"
    include_table_lineage: true
transformers:
  - type: simple_add_dataset_ownership
    config:
      owner_urns: ["urn:li:corpuser:data-team"]
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
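The `${SNOWFLAKE_PASSWORD}` placeholder is resolved from the environment when the recipe is loaded, so secrets stay out of version control. A minimal Python sketch of that substitution (illustrative only, not DataHub's actual implementation):

```python
import os
import re

def expand_env(value: str) -> str:
    """Resolve ${VAR} placeholders from the environment, in the
    spirit of what the ingestion CLI does before parsing a recipe.
    Unset variables expand to an empty string in this sketch."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["SNOWFLAKE_PASSWORD"] = "s3cret"
print(expand_env("password: ${SNOWFLAKE_PASSWORD}"))  # password: s3cret
```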

Hands-On Tutorial

Run Ingestion
datahub ingest -c snowflake_recipe.yml
datahub ingest -c snowflake_recipe.yml --dry-run  # Preview

Best Practices

- Keep credentials out of recipes: reference environment variables (e.g. ${SNOWFLAKE_PASSWORD}) instead of hard-coding secrets.
- Preview a run with --dry-run before writing anything to the sink.
- Scope sources with allow/deny patterns so you ingest only the tables you need.

Practice Problems

Practice 1

Write a recipe for BigQuery that only ingests *_fact and *_dim tables with auto-tagging.
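One possible answer, sketched under the assumption that the bigquery source supports table_pattern allow filters and that the simple_add_dataset_tags transformer is available (verify field names against your DataHub version's docs; the project ID and tag URN are hypothetical):

```yaml
# bigquery_recipe.yml -- one possible solution
source:
  type: bigquery
  config:
    project_id: "my-gcp-project"        # hypothetical project
    table_pattern:
      allow:                            # regexes: only fact and dim tables
        - ".*_fact$"
        - ".*_dim$"
transformers:
  - type: simple_add_dataset_tags      # auto-tag every ingested table
    config:
      tag_urns: ["urn:li:tag:warehouse-model"]
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```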

Quick Reference

Source      Type        Lineage
Snowflake   snowflake   Yes
BigQuery    bigquery    Yes
dbt         dbt-cloud   Yes
Airflow     airflow     Yes
Kafka       kafka       No