Overview
API-First Platform
DataHub is built API-first. Everything you can do in the UI can be done via the GraphQL API (primary) or REST API (OpenAPI). This enables automation, CI/CD integration, and custom tooling.
Core Concepts
GraphQL API
The primary API for querying and mutating metadata. It supports search, entity CRUD, and lineage traversal, and is served at /api/graphql.
REST (OpenAPI)
RESTful endpoints for entity operations. Swagger docs at /openapi/swagger-ui.
Python SDK
High-level Python client wrapping both APIs. Install via pip install acryl-datahub.
Authentication
Token-based auth with personal access tokens (PATs), plus OIDC for single sign-on. A token carries the permissions of the user it was issued to.
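A minimal sketch of token-based auth against the GraphQL endpoint, using only the Python standard library. It builds a POST request to /api/graphql that carries a PAT in an `Authorization: Bearer` header but does not send it. The helper name `build_graphql_request`, the base URL, and the token value are hypothetical placeholders, not part of DataHub.

```python
import json
import urllib.request
from typing import Optional


def build_graphql_request(
    base_url: str, token: str, query: str, variables: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST to the GraphQL endpoint."""
    payload = {"query": query, "variables": variables or {}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The PAT is passed as a bearer token (placeholder value at call site)
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Example: an assumed local deployment and a placeholder token
req = build_graphql_request(
    "http://localhost:8080", "my-token", "{ search(input: {type: DATASET, query: \"revenue\"}) { total } }"
)
```

Passing `req` to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the header and payload easy to inspect or unit-test.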
How It Works
GraphQL Queries
# Search for datasets
query {
  search(input: { type: DATASET, query: "revenue", start: 0, count: 10 }) {
    total
    searchResults { entity { urn type ... on Dataset { name } } }
  }
}
# Get dataset with lineage
query {
  dataset(urn: "urn:li:dataset:(...)") {
    name
    properties { description }
    # owner is a union type (user or group), so select its fields via inline fragments
    ownership { owners { owner { ... on CorpUser { urn } ... on CorpGroup { urn } } } }
    lineage(input: { direction: UPSTREAM, count: 10 }) { relationships { entity { urn } } }
  }
}
# Add a tag
mutation {
  addTag(input: { tagUrn: "urn:li:tag:PII", resourceUrn: "urn:li:dataset:(...)" })
}
Hands-On Tutorial
Python SDK
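from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
# DataHubGraph expects a DatahubClientConfig object; add token="<PAT>" when auth is enabled
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# Run a GraphQL query and print the returned data
results = graph.execute_graphql('{ search(input: {type: DATASET, query: "revenue"}) { total } }')
print(results)
Best Practices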
- Use GraphQL as the primary API
- Use Personal Access Tokens for service accounts
- Paginate search results with start and count instead of requesting everything at once
- Use batch mutations (for example, batchAddTags) for bulk operations
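The pagination advice can be sketched as a small generator that advances `start` until `total` results have been seen. The `fetch_page` callable is a hypothetical stand-in for whatever issues the search query (raw HTTP or the SDK); it is assumed to return the `search` object with its `total` and `searchResults` fields.

```python
from typing import Callable, Dict, Iterator


def iter_search_results(
    fetch_page: Callable[[int, int], Dict], page_size: int = 100
) -> Iterator[Dict]:
    """Yield every search result, requesting one page of `page_size` at a time."""
    start = 0
    while True:
        # fetch_page(start, count) is assumed to return {"total": ..., "searchResults": [...]}
        page = fetch_page(start, page_size)
        results = page.get("searchResults", [])
        for result in results:
            yield result
        start += len(results)
        # Stop on an empty page or once every reported result has been consumed
        if not results or start >= page.get("total", 0):
            break
```

Because the generator only depends on `fetch_page`, it can be exercised with an in-memory fake before pointing it at a real server.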
Practice Problems
Practice 1
Write a script that finds all datasets without an owner and notifies via Slack.
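One possible starting point, offered as a sketch and not a full solution: the two helpers below cover the filtering and message-building steps only. They assume the search query selects `ownership { owners { ... } }` on each dataset, and that the notification is sent to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names are hypothetical; fetching the results and posting the payload are left to the reader.

```python
from typing import Dict, List


def datasets_without_owner(search_results: List[Dict]) -> List[str]:
    """Return URNs of entities whose ownership is missing or has no owners."""
    missing = []
    for result in search_results:
        entity = result.get("entity") or {}
        # ownership may be absent or null when no owner was ever assigned
        ownership = entity.get("ownership") or {}
        if not ownership.get("owners"):
            missing.append(entity.get("urn"))
    return missing


def slack_payload(urns: List[str]) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming-webhook message."""
    lines = "\n".join(f"- {urn}" for urn in urns)
    return {"text": f"{len(urns)} dataset(s) have no owner:\n{lines}"}
```

The payload can then be sent as a JSON POST to a webhook URL configured in Slack.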
Quick Reference
| Endpoint | Method | Purpose |
|---|---|---|
| /api/graphql | POST | All metadata operations |
| /openapi/v2/entity | GET/POST | REST CRUD |