Data Quality & Observability

Hard 30 min read

Overview

Why This Matters

DataHub integrates with quality tools such as Great Expectations, dbt tests, and Monte Carlo to surface quality signals next to the rest of a dataset's metadata. You define assertions, monitor their results over time, and alert on failures. Quality scores are visible on dataset pages.
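
The define, monitor, and alert loop can be illustrated with a minimal sketch in plain Python. The `Assertion` class, `run_assertions`, and `quality_score` are hypothetical names used only for illustration; they are not part of DataHub or of any of the tools named above, and the score formula (fraction of passing assertions) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assertion:
    """Hypothetical stand-in for a quality check; not a DataHub class."""
    name: str
    check: Callable[[List[dict]], bool]


def run_assertions(rows: List[dict], assertions: List[Assertion]) -> Dict[str, bool]:
    """Evaluate every assertion against the rows and return name -> passed."""
    return {a.name: a.check(rows) for a in assertions}


def quality_score(results: Dict[str, bool]) -> float:
    """Fraction of assertions that passed (0.0 when there are none)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Sample rows invented for the example.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
assertions = [
    Assertion("row_count_at_least_1", lambda r: len(r) >= 1),
    Assertion("email_not_null", lambda r: all(x["email"] is not None for x in r)),
]

results = run_assertions(rows, assertions)
# Names of failing assertions are what an alert would report.
failed = [name for name, ok in results.items() if not ok]
```

In a real deployment the checks would run in the quality tool itself, and only the results would be reported to DataHub.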

Core Concepts

Data quality and observability in DataHub rests on a few concepts, covered in turn below: how it is configured, how it integrates with the rest of the platform, how it is automated, and how it is monitored.

Configuration

DataHub provides both UI-based and API-based configuration for data quality & observability. Most settings can be managed through the admin panel or programmatically via GraphQL.

Integration

Data quality metadata flows through DataHub's ingestion framework, search index, and event system, so changes propagate across the platform automatically.

Automation

Leverage DataHub Actions to automate data quality & observability workflows. Trigger actions on metadata changes, schedule periodic checks, and integrate with external systems.
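
The filter-then-act pattern behind such automation might look like the sketch below. It does not use the DataHub Actions framework itself: the event dictionary shape (`entity_urn`, `assertion`, `status`) and the `make_failure_alerter` helper are assumptions made for illustration, and a real action would be wired up through the framework's own configuration.

```python
from typing import Callable, Dict, List


def make_failure_alerter(notify: Callable[[str], None]) -> Callable[[Dict[str, str]], bool]:
    """Build a handler that notifies only on failed assertion events."""

    def handle(event: Dict[str, str]) -> bool:
        # Ignore anything that is not a failure (hypothetical status values).
        if event.get("status") != "FAILURE":
            return False
        notify(f"Assertion '{event['assertion']}' failed on {event['entity_urn']}")
        return True

    return handle


# Collect messages in a list instead of calling a real chat or paging system.
sent: List[str] = []
handle = make_failure_alerter(sent.append)

events = [
    {"entity_urn": "urn:li:dataset:(...)", "assertion": "email_not_null", "status": "SUCCESS"},
    {"entity_urn": "urn:li:dataset:(...)", "assertion": "row_count_at_least_1", "status": "FAILURE"},
]
handled = [handle(e) for e in events]
```

Passing the notifier in as a function keeps the handler easy to test and lets the same logic target email, chat, or a ticketing system.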

Monitoring

Track usage and effectiveness through DataHub's analytics. Monitor adoption metrics, coverage, and compliance with organizational standards.
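
One simple coverage metric is the share of datasets that have at least one assertion. The sketch below assumes you have already exported a mapping from dataset name to assertion count; the `assertion_coverage` and `uncovered` helpers and the sample counts are hypothetical, and DataHub's own analytics may define coverage differently.

```python
from typing import Dict, List


def assertion_coverage(assertion_counts: Dict[str, int]) -> float:
    """Share of datasets with at least one assertion (0.0 for an empty catalog)."""
    if not assertion_counts:
        return 0.0
    covered = sum(1 for count in assertion_counts.values() if count > 0)
    return covered / len(assertion_counts)


def uncovered(assertion_counts: Dict[str, int]) -> List[str]:
    """Datasets without any assertion, sorted for stable reporting."""
    return sorted(name for name, count in assertion_counts.items() if count == 0)


# Sample data invented for the example.
counts = {"orders": 3, "customers": 0, "payments": 1, "events": 0}
coverage = assertion_coverage(counts)
missing = uncovered(counts)
```

Tracking this number over time, together with the list of uncovered datasets, gives a concrete target for adoption work.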

How It Works

Configuration
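# Configure data quality & observability via DataHub CLI.
# Note: -d expects a path to a JSON file holding the aspect, not inline JSON.
# Here properties.json is assumed to contain {"description": "Configured via CLI"}.
datahub put --urn "urn:li:dataset:(...)" \
  --aspect "datasetProperties" \
  -d properties.json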

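# Or via Python SDK
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata for data quality & observability.
# emit_mcp takes a single proposal object, so the aspect is wrapped first.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(...)",
        aspect=DatasetPropertiesClass(
            description="Updated via SDK"
        ),
    )
)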

Architecture Integration

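When data quality & observability metadata is updated, DataHub's metadata service persists the change and emits a change event to Kafka. In current versions this is a Metadata Change Log (MCL) event; older releases used the legacy Metadata Change Event (MCE) and Metadata Audit Event (MAE) pair. Downstream consumers update the search index (Elasticsearch) and the graph index, so all views stay consistent in near real time.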
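
The fan-out described above can be pictured with a toy in-memory model. This is not a Kafka client and does not reflect DataHub's actual event schema: the `ChangeLog` class, the event fields, and the two consumers are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]


class ChangeLog:
    """In-memory stand-in for the Kafka topic; delivers each event to every consumer."""

    def __init__(self) -> None:
        self._consumers: List[Callable[[Event], None]] = []

    def subscribe(self, consumer: Callable[[Event], None]) -> None:
        self._consumers.append(consumer)

    def publish(self, event: Event) -> None:
        for consumer in self._consumers:
            consumer(event)


search_index: Dict[str, str] = {}
graph_edges: List[Tuple[str, str]] = []


def update_search(event: Event) -> None:
    # Searchable text is keyed by the entity identifier.
    search_index[str(event["urn"])] = str(event["description"])


def update_graph(event: Event) -> None:
    # One edge per upstream dependency of the changed entity.
    for upstream in event.get("upstreams", []):
        graph_edges.append((str(upstream), str(event["urn"])))


log = ChangeLog()
log.subscribe(update_search)
log.subscribe(update_graph)
log.publish({
    "urn": "dataset:orders",
    "description": "Updated via SDK",
    "upstreams": ["dataset:raw_orders"],
})
```

Because both consumers read the same event, the search and graph views converge on the same state without the writer knowing about either of them.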

Hands-On Tutorial

Step-by-Step Setup
# Step 1: Verify DataHub is running
curl -s http://localhost:8080/config | python3 -m json.tool

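# Step 2: Configure data quality & observability via GraphQL
# updateDataset returns a Dataset object, so the mutation needs a selection
# set such as { urn }. The empty input is a placeholder for real fields.
curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { updateDataset(urn: \"urn:li:dataset:(...)\", input: {}) { urn } }"}'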

# Step 3: Verify in the UI
# Navigate to http://localhost:9002 and check the entity page
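
Escaping quotes inside a curl payload is error-prone, so the Step 2 request body can instead be built with GraphQL variables. The sketch below only constructs the JSON body and does not send it. The input type name `DatasetUpdateInput` is an assumption that should be checked against your instance's GraphQL schema, and `build_request_body` is a hypothetical helper.

```python
import json

# Mutation text with variables instead of inlined, escaped arguments.
# Assumption: the input type is named DatasetUpdateInput in your schema.
UPDATE_DATASET = """
mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) {
  updateDataset(urn: $urn, input: $input) { urn }
}
"""


def build_request_body(urn: str, update: dict) -> str:
    """Serialize the mutation and its variables into a GraphQL request body."""
    return json.dumps({
        "query": UPDATE_DATASET,
        "variables": {"urn": urn, "input": update},
    })


# The URN is left elided, as in the tutorial; substitute a real dataset URN.
body = build_request_body("urn:li:dataset:(...)", {})
parsed = json.loads(body)
```

The resulting string can be posted to the GraphQL endpoint with any HTTP client, using the same `Content-Type: application/json` header as the curl command.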

Best Practices

Practice Problems

Practice 1

Design a data quality & observability strategy for a data team with 500 datasets across 8 databases. What do you prioritize? How do you measure success?

Practice 2

A new data engineer joins your team and needs to understand data quality & observability in DataHub. Create a 30-minute onboarding guide covering the essentials.

Practice 3

Your organization's data quality & observability adoption is at 30% after 3 months. Identify potential blockers and design an adoption acceleration plan.

Quick Reference

Feature          | Access                                      | Notes
UI Configuration | Settings → Data Quality & Observability     | Point-and-click setup
GraphQL API      | POST /api/graphql                           | Programmatic access
Python SDK       | pip install acryl-datahub                   | High-level client
CLI              | datahub put / datahub get                   | Command-line operations
Actions          | Event-driven triggers                       | Automation framework