🎮 GPU/TPU Fundamentals
GPU vs CPU for ML
Meaning: GPUs excel at parallel matrix operations, making them ideal for deep learning workloads.
Example: Training ResNet-50: CPU takes 10 days → GPU (V100) takes 8 hours → 30x speedup.
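A quick way to verify the gap on your own hardware is to time a large matrix multiply on CPU vs GPU. A minimal sketch (matrix size and iteration count are arbitrary; actual speedups vary with GPU, dtype, and workload):
```python
# Rough CPU-vs-GPU matmul timing (illustrative; results depend on hardware)
import time
import torch

def time_matmul(device, n=4096, repeats=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                      # warm-up (CUDA init, kernel caching)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()            # wait for async GPU work to finish
    return (time.perf_counter() - start) / repeats

cpu_t = time_matmul("cpu")
if torch.cuda.is_available():
    gpu_t = time_matmul("cuda")
    print(f"CPU {cpu_t*1e3:.1f} ms | GPU {gpu_t*1e3:.1f} ms | ~{cpu_t/gpu_t:.0f}x speedup")
```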
Hardware Comparison:
| Type | Model | Memory | Use Case |
|---|---|---|---|
| Consumer | RTX 4090 | 24GB | Development, small models |
| Data Center | A100 | 40/80GB | Training, large models |
| Data Center | H100 | 80GB | LLM training |
| TPU | TPU v4 | 32GB HBM | Large-scale training |
Key Metrics:
- FLOPS: Floating point operations per second
- Memory Bandwidth: Data transfer speed
- Tensor Cores: Specialized for matrix multiply
- NVLink: High-speed GPU interconnect
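These metrics combine into a simple roofline check: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the hardware's FLOPS-to-bandwidth ratio. A minimal sketch, using approximate published A100 80GB figures (~312 TFLOPS FP16/BF16 tensor-core peak, ~2 TB/s HBM bandwidth) purely as illustrative constants:
```python
# Roofline check: compute-bound vs memory-bound (peak numbers are approximate A100 specs)
PEAK_FLOPS = 312e12          # ~312 TFLOPS FP16/BF16 with tensor cores
PEAK_BANDWIDTH = 2.0e12      # ~2 TB/s HBM bandwidth

def attainable_flops(intensity):
    """Attainable FLOPS for a kernel with the given FLOPs-per-byte intensity."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)

# Example: FP16 square matmul C = A @ B with N x N matrices
N = 4096
flops = 2 * N**3                 # one multiply-add per output element per inner index
bytes_moved = 3 * N * N * 2      # read A and B, write C, 2 bytes per FP16 element
intensity = flops / bytes_moved

bound = "compute" if intensity * PEAK_BANDWIDTH >= PEAK_FLOPS else "memory"
print(f"intensity: {intensity:.0f} FLOPs/byte -> "
      f"{attainable_flops(intensity)/1e12:.0f} TFLOPS ({bound}-bound)")
```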
Multi-GPU Strategies
Parallelism Types:
- Data Parallel: Split batch across GPUs
- Model Parallel: Split model across GPUs
- Pipeline Parallel: Split layers into stages
- Tensor Parallel: Split tensors across GPUs
```python
# PyTorch Distributed Data Parallel
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Initialize process group (one process per GPU, e.g. launched with torchrun)
dist.init_process_group(backend='nccl')

# Create model, move it to the GPU, and wrap it in DDP
model = MyModel().cuda()
model = DistributedDataParallel(model)

# Training loop
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
```
☸️ Kubernetes for ML Workloads
GPU Scheduling in Kubernetes
Meaning: Kubernetes can schedule and manage GPU resources for ML workloads using device plugins.
Example: Submit training job → K8s finds available GPU node → schedules pod → monitors resource usage → cleans up after completion.
```yaml
# Kubernetes GPU Pod Specification
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2        # Request 2 GPUs
        memory: "32Gi"
        cpu: "8"
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: models
      mountPath: /models
  volumes:                       # PVC names are illustrative
  - name: dataset
    persistentVolumeClaim:
      claimName: dataset-pvc
  - name: models
    persistentVolumeClaim:
      claimName: models-pvc
  nodeSelector:
    gpu-type: "a100"             # Schedule on A100 nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
GPU Operators:
- NVIDIA GPU Operator: Automates GPU node setup
- Device Plugin: Exposes GPUs to kubelet
- Container Toolkit: GPU support in containers
- DCGM Exporter: GPU metrics for Prometheus
ML Operators
Kubeflow Training Operators:
- TFJob: TensorFlow distributed training
- PyTorchJob: PyTorch distributed training
- MPIJob: Horovod/MPI training
- XGBoostJob: XGBoost training
```yaml
# PyTorchJob for Distributed Training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3                    # 3 worker pods
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 2    # 2 GPUs per worker
            env:
            - name: NCCL_DEBUG
              value: "INFO"
```
📈 Auto-scaling Strategies
Horizontal Pod Autoscaling
Meaning: Automatically scale inference pods based on metrics like CPU, memory, or custom metrics.
Example: Inference service at 80% GPU utilization → HPA triggers → scales from 3 to 5 pods → load distributed.
```yaml
# HPA with Custom Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  - type: External
    external:
      metric:
        name: request_latency_p99
      target:
        type: Value
        value: "100m"              # 100ms P99 latency
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50                  # Scale up by 50%
        periodSeconds: 60
```
Cluster Autoscaling
Meaning: Automatically add/remove nodes from the cluster based on resource demands.
Example: Training job needs 8 GPUs → no available nodes → cluster autoscaler provisions new GPU node → job scheduled.
Autoscaling Strategies:
- Reactive: Scale based on current demand
- Predictive: Scale based on forecasted load
- Scheduled: Scale based on time patterns (see the sketch after this list)
- Burst: Quick scale for sudden spikes
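A minimal sketch of the scheduled strategy, assuming the official `kubernetes` Python client and an inference Deployment named `model-server` in namespace `ml-serving` (the names and schedule are illustrative); in practice this is usually handled by a CronJob, KEDA, or a cloud scheduler rather than a long-running loop:
```python
# Scheduled scaling sketch (illustrative names and schedule)
import time
from datetime import datetime

from kubernetes import client, config

config.load_incluster_config()   # use config.load_kube_config() when running outside the cluster
apps = client.AppsV1Api()

BUSINESS_HOURS = range(8, 20)    # scale up 08:00-19:59 local time
DAY_REPLICAS, NIGHT_REPLICAS = 6, 2

while True:
    desired = DAY_REPLICAS if datetime.now().hour in BUSINESS_HOURS else NIGHT_REPLICAS
    apps.patch_namespaced_deployment(
        name="model-server",         # assumed Deployment name
        namespace="ml-serving",      # assumed namespace
        body={"spec": {"replicas": desired}},
    )
    time.sleep(300)                  # re-evaluate every 5 minutes
```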
Cost Optimization:
- Use spot/preemptible instances for training (see the checkpointing sketch below)
- Mixed instance types (CPU + GPU nodes)
- Node pool priorities
- Bin packing for efficient resource use
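Spot capacity can be reclaimed with only a short SIGTERM warning, so training must checkpoint and resume. A minimal sketch, reusing the `model`/`optimizer`/`dataloader` placeholders from the earlier training loops (the checkpoint path and `train_one_epoch` helper are illustrative):
```python
# Spot/preemptible training with checkpoint-and-resume (paths and helper names are illustrative)
import os
import signal

import torch

CKPT_PATH = "/models/checkpoint.pt"   # assumed persistent volume path
num_epochs = 90                       # illustrative
preempted = False

def handle_sigterm(signum, frame):
    """Clouds send SIGTERM shortly before reclaiming a spot instance."""
    global preempted
    preempted = True                  # finish the current epoch, save, then exit

signal.signal(signal.SIGTERM, handle_sigterm)

# Resume from the last checkpoint if a previous run was preempted
start_epoch = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])            # model/optimizer as in the loops above
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, num_epochs):
    train_one_epoch(model, optimizer, dataloader)   # assumed training helper
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)
    if preempted:
        break                         # exit cleanly; the job restarts on a new node
```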
⚡ GPU Optimization Techniques
Memory Optimization
Techniques:
- Gradient Checkpointing: Trade compute for memory
- Mixed Precision: FP16/BF16 training
- Gradient Accumulation: Simulate larger batches
- Model Sharding: ZeRO optimization
- Offloading: CPU/NVMe offload
```python
# Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()

    # Mixed precision forward pass
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    # Scale loss, run backward, and step the optimizer through the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
DeepSpeed ZeRO Configuration:
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```
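Gradient checkpointing and gradient accumulation from the list above are not shown in the snippet; a minimal sketch of both, assuming a model split into `backbone` and `head` submodules (the split and the accumulation factor are illustrative):
```python
# Gradient accumulation + gradient checkpointing (illustrative values and model split)
from torch.utils.checkpoint import checkpoint

accumulation_steps = 4                    # effective batch = batch_size * 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    inputs, labels = inputs.cuda(), labels.cuda()

    # Recompute the expensive block's activations during backward instead of storing them
    features = checkpoint(model.backbone, inputs, use_reentrant=False)
    outputs = model.head(features)

    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()                       # gradients accumulate across micro-batches

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```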
Performance Profiling
Profiling Tools:
- NVIDIA Nsight: GPU kernel analysis
- PyTorch Profiler: Training bottlenecks
- TensorBoard: Performance visualization
- DCGM: Data center GPU monitoring
```python
# PyTorch Profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for _ in range(10):
        model(inputs)

# Print profiler results
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export a Chrome trace (viewable in chrome://tracing, Perfetto, or TensorBoard)
prof.export_chrome_trace("trace.json")
```
Common Bottlenecks:
- Data loading (CPU bottleneck)
- Small batch sizes (GPU underutilization)
- Host-device transfers
- Synchronization points
- Memory fragmentation
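Several of these bottlenecks are mitigated in the input pipeline: parallel, pinned-memory data loading plus asynchronous host-to-device copies. A minimal sketch (worker count, batch size, and the `dataset` object are illustrative):
```python
# Mitigating data-loading and transfer bottlenecks (values are illustrative)
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=256,           # larger batches improve GPU utilization (memory permitting)
    num_workers=8,            # parallel CPU workers so the GPU is not starved
    pin_memory=True,          # page-locked host memory enables faster async copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for inputs, labels in loader:
    # non_blocking=True overlaps the host-to-device copy with GPU compute
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    ...
```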
🏭 Production GPU Infrastructure
Multi-tenancy & Resource Sharing
GPU Sharing Strategies:
- Time-slicing: Multiple workloads on one GPU
- MIG (Multi-Instance GPU): Partition A100/H100
- vGPU: Virtualized GPU resources
- Queue-based: Job scheduling systems
Resource Quotas:
```yaml
# Kubernetes ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # extended resources are quota'd via the requests. prefix
    requests.memory: "256Gi"
    persistentvolumeclaims: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high", "medium"]
```
Monitoring & Alerting
Key Metrics:
- GPU Utilization: Compute and memory %
- Temperature: Thermal throttling
- Power Draw: Performance state
- ECC Errors: Memory errors
- NVLink Traffic: Multi-GPU communication
Monitoring Stack:
- DCGM Exporter: GPU metrics collection
- Prometheus: Metrics storage
- Grafana: Visualization dashboards
- AlertManager: Alert routing
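As a concrete example, DCGM metrics scraped by Prometheus can be pulled over its HTTP query API; a minimal sketch, assuming a Prometheus service reachable at `http://prometheus:9090` and the standard dcgm-exporter metric `DCGM_FI_DEV_GPU_UTIL`:
```python
# Query GPU utilization from Prometheus (endpoint URL is an assumption)
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"

resp = requests.get(PROM_URL, params={"query": "avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, Hostname)"})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]            # [timestamp, value-as-string]
    print(f"host={labels.get('Hostname')} gpu={labels.get('gpu')} util={value}%")
```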
✅ Best Practices
GPU Infrastructure Guidelines
Architecture Best Practices:
- Separate training and inference clusters
- Use node affinity for GPU types
- Implement proper resource limits
- Enable GPU monitoring from day one
- Plan for multi-tenancy
- Automate node provisioning
Cost Optimization:
- Use spot instances for training (with checkpointing)
- Reserved instances for inference
- Right-size GPU types to workloads
- Implement idle timeout policies
- Share GPUs when possible
- Consider cloud vs on-premise trade-offs
Common Pitfalls:
- Not setting resource limits → OOM kills
- Ignoring GPU driver versions
- Missing CUDA compatibility checks
- No GPU health monitoring
- Inefficient data loading pipelines
- Not using mixed precision training