GPU Orchestration

Part of Module 4: AI in Production

🎮 GPU/TPU Fundamentals

GPU vs CPU for ML

Meaning: GPUs excel at parallel matrix operations, making them ideal for deep learning workloads.
Example: Training ResNet-50: CPU takes 10 days → GPU (V100) takes 8 hours → 30x speedup.
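
A minimal timing sketch of the same idea (numbers vary entirely by hardware; assumes PyTorch installed with CUDA support):
# CPU vs GPU matrix multiply timing (illustrative sketch)
import time
import torch

x = torch.randn(4096, 4096)

t0 = time.perf_counter()
x @ x
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    xg = x.cuda()
    xg @ xg                      # warm-up so one-time kernel setup is excluded
    torch.cuda.synchronize()     # GPU work is asynchronous; wait before timing
    t0 = time.perf_counter()
    xg @ xg
    torch.cuda.synchronize()
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  ~{cpu_s / gpu_s:.0f}x faster")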

Hardware Comparison:

Type          Model      Memory      Use Case
Consumer      RTX 4090   24 GB       Development, Small Models
Data Center   A100       40/80 GB    Training, Large Models
Data Center   H100       80 GB       LLM Training
TPU           TPU v4     32 GB HBM   Large-scale Training

Key Metrics:

  • FLOPS: Floating point operations per second
  • Memory Bandwidth: Data transfer speed (see the worked example below)
  • Tensor Cores: Specialized for matrix multiply
  • NVLink: High-speed GPU interconnect
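
A back-of-the-envelope use of the first two metrics: compare a kernel's arithmetic intensity against the GPU's compute-to-bandwidth ratio to see whether it is compute-bound or bandwidth-bound. The peak figures below are approximate published A100 numbers; the rest is arithmetic.
# Roofline-style estimate for a square FP16 matmul (approximate A100 peaks)
peak_flops = 312e12           # ~312 TFLOPS FP16 Tensor Core peak
peak_bw    = 2.0e12           # ~2 TB/s HBM bandwidth (A100 80GB)

n = 4096
flops  = 2 * n**3             # multiply-adds in an n x n matmul
bytes_ = 3 * n * n * 2        # read A and B, write C, 2 bytes per FP16 element

arithmetic_intensity = flops / bytes_          # FLOPs per byte moved
machine_balance      = peak_flops / peak_bw    # FLOPs the GPU can do per byte delivered

print(f"intensity ~{arithmetic_intensity:.0f} flop/B vs balance ~{machine_balance:.0f} flop/B")
# intensity >> balance -> compute-bound; small or skinny matmuls flip this to bandwidth-bound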

Multi-GPU Strategies

Parallelism Types:

  • Data Parallel: Replicate the model; split each batch across GPUs
  • Model Parallel: Split the model's layers/parameters across GPUs
  • Pipeline Parallel: Split layers into sequential stages and stream micro-batches through them
  • Tensor Parallel: Split individual weight matrices (and their operations) across GPUs
# PyTorch Distributed Data Parallel
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Initialize the process group (one process per GPU, e.g. launched with torchrun)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create the model on this process's GPU and wrap it for gradient synchronization
model = MyModel().cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])

# Training loop (dataloader, criterion, optimizer defined as usual)
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

☸️ Kubernetes for ML Workloads

GPU Scheduling in Kubernetes

Meaning: Kubernetes can schedule and manage GPU resources for ML workloads using device plugins.
Example: Submit training job → K8s finds available GPU node → schedules pod → monitors resource usage → cleans up after completion.
# Kubernetes GPU Pod Specification
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs
        memory: "32Gi"
        cpu: "8"
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: models
      mountPath: /models

  volumes:  # backing volumes for the mounts above (example PVC names)
  - name: dataset
    persistentVolumeClaim:
      claimName: dataset-pvc
  - name: models
    persistentVolumeClaim:
      claimName: models-pvc

  nodeSelector:
    gpu-type: "a100"  # Schedule on A100 nodes

  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
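
The same request can also be made programmatically. A minimal sketch with the official `kubernetes` Python client, mirroring the pod above (the namespace is illustrative, error handling omitted, and it assumes a working kubeconfig):
# Submit a GPU pod with the Kubernetes Python client
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-pod"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="training",
                image="pytorch/pytorch:latest",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2", "memory": "32Gi", "cpu": "8"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)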

GPU Operators:

  • NVIDIA GPU Operator: Automates GPU node setup
  • Device Plugin: Exposes GPUs to kubelet
  • Container Toolkit: GPU support in containers
  • DCGM Exporter: GPU metrics for Prometheus

ML Operators

Kubeflow Training Operators:

  • TFJob: TensorFlow distributed training
  • PyTorchJob: PyTorch distributed training
  • MPIJob: Horovod/MPI training
  • XGBoostJob: XGBoost training
# PyTorchJob for Distributed Training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    
    Worker:
      replicas: 3  # 3 worker pods
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 2  # 2 GPUs per worker
            env:
            - name: NCCL_DEBUG
              value: "INFO"

📈 Auto-scaling Strategies

Horizontal Pod Autoscaling

Meaning: Automatically scale inference pods based on metrics like CPU, memory, or custom metrics.
Example: Inference service at 80% GPU utilization → HPA triggers → scales from 3 to 5 pods → load distributed.
# HPA with Custom Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  
  minReplicas: 2
  maxReplicas: 10
  
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  
  - type: External
    external:
      metric:
        name: request_latency_p99
      target:
        type: Value
        value: "100m"  # 100ms P99 latency
  
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50  # Scale up by 50%
        periodSeconds: 60

Cluster Autoscaling

Meaning: Automatically add/remove nodes from the cluster based on resource demands.
Example: Training job needs 8 GPUs → no available nodes → cluster autoscaler provisions new GPU node → job scheduled.

Autoscaling Strategies:

  • Reactive: Scale based on current demand
  • Predictive: Scale based on forecasted load
  • Scheduled: Scale based on time patterns
  • Burst: Quick scale for sudden spikes

Cost Optimization:

  • Use spot/preemptible instances for training (with checkpointing; see the sketch below)
  • Mixed instance types (CPU + GPU nodes)
  • Node pool priorities
  • Bin packing for efficient resource use
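
Spot capacity only pays off if jobs survive preemption, which means checkpointing regularly to persistent storage. A minimal save/resume sketch (the path and save frequency are illustrative; model, optimizer, and the training loop are assumed to exist as in the earlier examples):
# Periodic checkpointing so a preempted spot job can resume
import os
import torch

CKPT = "/models/checkpoint.pt"   # illustrative path on a persistent volume

start_epoch = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, num_epochs):
    train_one_epoch(model, dataloader, optimizer)   # training step assumed defined elsewhere
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        CKPT,
    )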

⚡ GPU Optimization Techniques

Memory Optimization

Techniques:

  • Gradient Checkpointing: Trade compute for memory
  • Mixed Precision: FP16/BF16 training
  • Gradient Accumulation: Simulate larger batches
  • Model Sharding: ZeRO optimization
  • Offloading: CPU/NVMe offload
# Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    
    # Mixed precision forward pass
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    
    # Scale loss and backward
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# DeepSpeed ZeRO Configuration
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
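
Gradient accumulation, listed among the techniques above, is a few extra lines in the training loop; a minimal sketch (model, dataloader, optimizer, criterion assumed as before):
# Gradient accumulation: simulate a larger batch on limited GPU memory
accumulation_steps = 4        # effective batch = per-step batch x 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs.cuda())
    loss = criterion(outputs, labels.cuda())
    (loss / accumulation_steps).backward()   # normalize so gradients match one large batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()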

Performance Profiling

Profiling Tools:

  • NVIDIA Nsight: GPU kernel analysis
  • PyTorch Profiler: Training bottlenecks
  • TensorBoard: Performance visualization
  • DCGM: Data center GPU monitoring
# PyTorch Profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # model and inputs are assumed to be defined and already on the GPU
    for _ in range(10):
        model(inputs)

# Print profiler results
print(prof.key_averages().table(
    sort_by="cuda_time_total", 
    row_limit=10
))

# Export a Chrome trace (open in chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")

Common Bottlenecks:

  • Data loading (CPU bottleneck; see the sketch below)
  • Small batch sizes (GPU underutilization)
  • Host-device transfers
  • Synchronization points
  • Memory fragmentation
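
The data-loading bottleneck is usually the first thing to check; a minimal sketch of the standard DataLoader mitigations (worker count and batch size are illustrative, dataset assumed defined elsewhere):
# Keep the GPU fed: parallel workers, pinned memory, async host-to-device copies
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # dataset assumed defined elsewhere
    batch_size=256,
    num_workers=8,            # parallel CPU workers for decoding/augmentation
    pin_memory=True,          # page-locked host memory enables async copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # avoid re-forking workers every epoch
)

for inputs, labels in loader:
    inputs = inputs.cuda(non_blocking=True)   # overlap copy with compute
    labels = labels.cuda(non_blocking=True)
    ...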

🏭 Production GPU Infrastructure

Multi-tenancy & Resource Sharing

GPU Sharing Strategies:

  • Time-slicing: Multiple workloads on one GPU
  • MIG (Multi-Instance GPU): Partition A100/H100
  • vGPU: Virtualized GPU resources
  • Queue-based: Job scheduling systems

Resource Quotas:

# Kubernetes ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    nvidia.com/gpu: "8"
    requests.memory: "256Gi"
    persistentvolumeclaims: "10"
  
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high", "medium"]

Monitoring & Alerting

Key Metrics:

  • GPU Utilization: Compute and memory %
  • Temperature: Thermal throttling
  • Power Draw: Performance state
  • ECC Errors: Memory errors
  • NVLink Traffic: Multi-GPU communication

Monitoring Stack:

  • DCGM Exporter: GPU metrics collection (queried in the sketch below)
  • Prometheus: Metrics storage
  • Grafana: Visualization dashboards
  • AlertManager: Alert routing
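
A minimal sketch of pulling a DCGM metric out of Prometheus over its HTTP API (the Prometheus URL is illustrative; DCGM_FI_DEV_GPU_UTIL is the exporter's GPU-utilization gauge):
# Query per-GPU utilization from Prometheus (DCGM exporter metrics)
import requests

PROM_URL = "http://prometheus.monitoring:9090"   # illustrative in-cluster address

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    gpu = result["metric"].get("gpu", "?")
    _, value = result["value"]                   # [timestamp, value-as-string]
    print(f"GPU {gpu}: {float(value):.0f}% utilized")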

✅ Best Practices

GPU Infrastructure Guidelines

Architecture Best Practices:

  • Separate training and inference clusters
  • Use node affinity for GPU types
  • Implement proper resource limits
  • Enable GPU monitoring from day one
  • Plan for multi-tenancy
  • Automate node provisioning

Cost Optimization:

  • Use spot instances for training (with checkpointing)
  • Reserved instances for inference
  • Right-size GPU types to workloads
  • Implement idle timeout policies
  • Share GPUs when possible
  • Consider cloud vs on-premise trade-offs

Common Pitfalls:

  • Not setting resource limits → OOM kills
  • Ignoring GPU driver versions
  • Missing CUDA compatibility checks
  • No GPU health monitoring
  • Inefficient data loading pipelines
  • Not using mixed precision training
