⚡ GPU Orchestration

Master GPU resource management and orchestration for AI workloads at scale


🎮 GPU Fundamentals

🖥️ GPU vs CPU for ML
GPUs excel at parallel processing, making them ideal for machine learning workloads that involve large matrix operations.
Example: Training a neural network on CPU might take days, while a GPU can complete it in hours.

Parallel Processing

Thousands of cores for simultaneous computations

Memory Bandwidth

High-speed memory for fast data access
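
A minimal timing sketch (assuming PyTorch and a CUDA-capable GPU are available) makes the gap behind these two cards concrete: run the same large matrix multiplication on CPU and then on GPU.

# Rough CPU-vs-GPU comparison for a large matmul (illustrative sketch)
import time
import torch

def time_matmul(device, n=4096):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for setup work to finish
    start = time.perf_counter()
    a @ b
    if device == "cuda":
        torch.cuda.synchronize()      # GPU kernels launch asynchronously
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f}s")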

Understanding GPU architecture helps optimize ML workloads for better performance and resource utilization.
GPU Model    Memory      Use Case
RTX 4090     24 GB       Development
A100         40/80 GB    Training
H100         80 GB       LLM Training
# Check GPU availability in PyTorch
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
Advanced GPU optimization involves understanding CUDA kernels, memory coalescing, and hardware-specific features.
# Advanced GPU memory management
import torch
from torch.cuda.amp import GradScaler

class GPUOptimizer:
    def __init__(self):
        self.scaler = GradScaler()
        # Allow TF32 matmuls and let cuDNN pick the fastest algorithms
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.benchmark = True

    def optimize_memory(self):
        torch.cuda.empty_cache()
        # Private API; the documented route is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
        torch.cuda.memory._set_allocator_settings('expandable_segments:True')

    def profile_kernel(self, func):
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA],
            record_shapes=True
        ) as prof:
            func()
        return prof.key_averages()

Performance Metrics

  • FLOPS: Floating point operations per second (estimated in the sketch after this list)
  • Memory bandwidth utilization
  • SM efficiency and occupancy
  • Tensor Core utilization
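
The FLOPS figure can be estimated directly by timing a large matmul (a sketch assuming PyTorch and the conventional 2*N^3 FLOP count for an N x N matrix multiplication); the other metrics come from the profiler, nvidia-smi, or DCGM counters.

# Estimate achieved TFLOPS from a timed matmul (illustrative sketch)
import time
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

# Warm up so cuBLAS selects its kernel before timing
for _ in range(3):
    a @ b
torch.cuda.synchronize()

start = time.perf_counter()
a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * n ** 3                 # multiply-adds for a square matmul
print(f"Achieved: {flops / elapsed / 1e12:.1f} TFLOPS")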
🔄 Multi-GPU Training
Multi-GPU training allows you to train larger models faster by distributing the workload across multiple GPUs.
Example: Split a large batch across 4 GPUs with data parallelism for close to 4x throughput (minus communication overhead).

Data Parallel

Split batches across GPUs

Model Parallel

Split model layers across GPUs

Implement distributed training strategies using PyTorch's DistributedDataParallel for efficient multi-GPU utilization.
# PyTorch Distributed Training Setup
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)
    # Model, dataloader, criterion, optimizer and num_epochs are defined elsewhere
    model = Model().to(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])

    # Training loop
    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            outputs = ddp_model(inputs.to(rank))
            loss = criterion(outputs, targets.to(rank))
            loss.backward()   # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()
Advanced distributed training with pipeline parallelism, tensor parallelism, and ZeRO optimization strategies.
# Advanced Multi-GPU with DeepSpeed
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

Parallelism Strategies

  • Data Parallel: Replicate model, split data
  • Model Parallel: Split model across devices
  • Pipeline Parallel: Split layers into stages
  • Tensor Parallel: Split individual operations
  • ZeRO: Optimizer state sharding
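
As a usage note for the DDP example above: each process is usually launched with torchrun, which populates RANK, WORLD_SIZE, and LOCAL_RANK in the environment (train.py below is a placeholder filename for that script).

# Single node, 4 GPUs (train.py is a placeholder for the DDP script above)
$ torchrun --nproc_per_node=4 train.py

# Two nodes, 4 GPUs each; run on every node, pointing at the rank-0 host
$ torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train.py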
📊 GPU Monitoring
Monitor GPU utilization, memory usage, and temperature to ensure optimal performance and prevent issues.
Example: Use nvidia-smi to check GPU status and track memory usage during training.
# Basic GPU monitoring commands
$ nvidia-smi
$ watch -n 1 nvidia-smi
$ nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
Set up comprehensive GPU monitoring with Prometheus and Grafana for production workloads.
# DCGM Exporter for Prometheus
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
spec:
  ports:
  - name: metrics
    port: 9400
  selector:
    app: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8
        env:
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
Implement advanced monitoring with custom metrics, alerting, and automated remediation for GPU clusters.
# Advanced GPU monitoring with custom metrics
import pynvml
from prometheus_client import Gauge, start_http_server

class GPUMonitor:
    def __init__(self):
        pynvml.nvmlInit()
        self.gpu_util = Gauge('gpu_utilization', 'GPU Utilization', ['gpu_id'])
        self.gpu_mem = Gauge('gpu_memory_used', 'GPU Memory Used', ['gpu_id'])
        self.gpu_temp = Gauge('gpu_temperature', 'GPU Temperature', ['gpu_id'])
        self.gpu_power = Gauge('gpu_power_draw', 'GPU Power Draw', ['gpu_id'])

    def collect_metrics(self):
        device_count = pynvml.nvmlDeviceGetCount()
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            self.gpu_util.labels(gpu_id=str(i)).set(util.gpu)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            self.gpu_mem.labels(gpu_id=str(i)).set(mem_info.used)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            self.gpu_temp.labels(gpu_id=str(i)).set(temp)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
            self.gpu_power.labels(gpu_id=str(i)).set(power)

☸️ Kubernetes Infrastructure

🎯 GPU Scheduling
Kubernetes can schedule and manage GPU resources for ML workloads using device plugins.
Example: Basic GPU Pod
Request GPU resources in your pod specification to ensure proper scheduling.
# Basic GPU pod request
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
Configure GPU node pools, device plugins, and resource management for production workloads.
# Kubernetes GPU Job with proper configuration
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  parallelism: 4
  template:
    spec:
      containers:
      - name: training
        image: pytorch/pytorch:latest
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: NCCL_DEBUG
          value: "INFO"
      restartPolicy: OnFailure
      nodeSelector:
        node.kubernetes.io/instance-type: "p4d.24xlarge"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
Advanced GPU scheduling with MIG (Multi-Instance GPU), time-slicing, and custom schedulers.
# MIG Configuration for GPU sharing
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      mixed:
        - device-filter: ["0x20B2"]
          devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 2
            "1g.5gb": 1
---
# Time-slicing configuration
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # 4 containers share 1 GPU
🚀 ML Operators
Kubernetes operators simplify the deployment and management of ML workloads by automating complex tasks.
Example: Kubeflow
Use Kubeflow to deploy and manage TensorFlow or PyTorch training jobs.

TFJob

TensorFlow distributed training

PyTorchJob

PyTorch distributed training
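
A minimal TFJob manifest might look like the sketch below (assuming the Kubeflow training operator is installed; mnist-training:latest is a hypothetical image). The operator expects the container to be named tensorflow, just as PyTorchJob expects pytorch.

# TFJob for distributed TensorFlow training (sketch)
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow            # container name the operator looks for
            image: mnist-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1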

Deploy production ML workloads using Kubeflow operators for automated training and serving.
# PyTorchJob for distributed training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 2
Build custom operators and controllers for specialized ML workflows and resource management.
# Custom Training Operator CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: trainingjobs.ml.company.com
spec:
  group: ml.company.com
  names:
    kind: TrainingJob
    plural: trainingjobs
    singular: trainingjob
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              model:
                type: string
              dataset:
                type: string
              gpuType:
                type: string
                enum: ["v100", "a100", "h100"]
              distributed:
                type: object
                properties:
                  workers:
                    type: integer
                  strategy:
                    type: string
                    enum: ["ddp", "horovod", "deepspeed"]
📈 Auto-scaling
Automatically scale GPU resources based on workload demands to optimize costs and performance.
Example: HPA
Scale inference pods from 2 to 10 replicas when utilization exceeds 70% (the basic HPA below keys on CPU utilization; GPU-based scaling uses the custom metrics shown in the next example).
# Basic HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Configure HPA with custom metrics and cluster autoscaling for dynamic GPU node provisioning.
# HPA with GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  - type: External
    external:
      metric:
        name: request_latency_p99
      target:
        type: Value
        value: "100m"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
Implement predictive auto-scaling with ML-based forecasting and multi-dimensional scaling strategies.
# Advanced auto-scaling with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-workload-scaler
spec:
  scaleTargetRef:
    name: gpu-deployment
  minReplicaCount: 1
  maxReplicaCount: 100
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_memory_utilization
      query: |
        avg(
          DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"gpu-deployment-.*"}
        )
      threshold: '75'
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: ml-requests
      topic: inference-requests
      lagThreshold: '100'
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          selectPolicy: Max
          policies:
          - type: Percent
            value: 100
            periodSeconds: 30

⚡ Optimization Techniques

💾
Memory Optimization
Optimize GPU memory usage to train larger models and process bigger batches.
Use mixed precision training to roughly halve activation and gradient memory while maintaining accuracy.
# Enable mixed precision training
from torch.cuda.amp import autocast

with autocast():
    output = model(input)
    loss = criterion(output, target)
Implement gradient checkpointing and memory-efficient attention mechanisms for large models.
# Gradient checkpointing and mixed precision
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

class CheckpointedModel(nn.Module):
    def forward(self, x):
        # Trade compute for memory: activations are recomputed during backward
        x = checkpoint(self.layer1, x)
        x = checkpoint(self.layer2, x)
        return self.layer3(x)

# Training loop (model here is assumed to return the loss directly)
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Advanced memory optimization with ZeRO, CPU offloading, and custom memory allocators.
# DeepSpeed ZeRO-3 configuration
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
🔥
Performance Tuning
Tune GPU performance by optimizing batch sizes, learning rates, and data loading pipelines.
Increase batch size to maximize GPU utilization while monitoring memory usage.

Batch Size

Find optimal batch size for GPU

Data Pipeline

Optimize data loading speed
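
For the Data Pipeline card, a common first step is tuning the PyTorch DataLoader so the GPU is never starved for input; a minimal sketch (dataset is a placeholder for any map-style Dataset):

# Keep the GPU fed: overlap data loading with compute
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # placeholder: any map-style Dataset
    batch_size=256,
    shuffle=True,
    num_workers=8,            # parallel CPU workers for decoding/augmentation
    pin_memory=True,          # page-locked host memory speeds up host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches each worker prepares in advance
)

for inputs, targets in loader:
    # non_blocking copies overlap with compute when pin_memory=True
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)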

Profile and optimize GPU kernels, implement efficient data pipelines, and use compiler optimizations.
# PyTorch profiling for optimization
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(inputs)

# Analyze results
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export for visualization
prof.export_chrome_trace("trace.json")
Custom CUDA kernels, tensor core optimization, and hardware-specific tuning for maximum performance.
# Advanced optimization with torch.compile
import torch
import torch._inductor.config as config

# Configure compilation
config.triton.unique_kernel_names = True
config.coordinate_descent_tuning = True
config.max_autotune = True

# Compile model with optimizations
model = torch.compile(
    model,
    mode="max-autotune",
    fullgraph=True,
    dynamic=False
)

# Custom CUDA kernel
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr,
):
    # Optimized matrix multiplication
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n
    # ... kernel implementation
🏭
Production Best Practices
Follow best practices for deploying GPU workloads in production environments.
Always set resource limits, implement health checks, and monitor GPU metrics.
  • Set appropriate resource requests and limits (see the sketch after this list)
  • Implement health checks and readiness probes
  • Monitor GPU utilization and memory
  • Use appropriate GPU types for workloads
  • Implement proper error handling
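
A minimal Deployment sketch tying the first three points together; the model-server image and the /ready and /healthz endpoints are assumptions, not any specific serving framework's API.

# Inference Deployment with limits and health checks (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        image: model-server:latest     # hypothetical inference image
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
        readinessProbe:
          httpGet:
            path: /ready               # assumed endpoint exposed by the server
            port: 8080
          initialDelaySeconds: 30
        livenessProbe:
          httpGet:
            path: /healthz             # assumed endpoint
            port: 8080
          periodSeconds: 15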
Implement production-grade GPU infrastructure with proper isolation, monitoring, and cost optimization.

Production Checklist

  • ✅ GPU driver compatibility matrix
  • ✅ Resource quotas and limits
  • ✅ Multi-tenancy configuration
  • ✅ Monitoring and alerting setup
  • ✅ Backup and recovery procedures
  • ✅ Cost optimization strategies
  • ✅ Security and access controls
Strategy             Description                    Use Case
Spot Instances       Use preemptible GPUs           Batch training
Reserved Instances   Long-term commitment           Production inference
Auto-scaling         Dynamic resource allocation    Variable workloads
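
For the Spot Instances row, steering batch training onto preemptible capacity usually comes down to a node selector plus a toleration; the sketch below uses GKE's spot-VM node label as one example, and the taint key is hypothetical since taints vary by cluster setup.

# Schedule batch training onto spot GPU nodes (sketch; labels/taints vary by provider)
apiVersion: v1
kind: Pod
metadata:
  name: spot-training
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # GKE's spot-VM node label; other clouds use different keys
  tolerations:
  - key: spot                           # hypothetical taint applied to the spot node pool
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1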
Enterprise-grade GPU orchestration with advanced scheduling, multi-cloud strategies, and disaster recovery.

Enterprise Architecture

  • Multi-Cloud Strategy: Deploy across AWS, GCP, Azure
  • Hybrid Cloud: On-premise + cloud GPU resources
  • Federation: Cross-cluster GPU scheduling
  • DR Planning: Failover and backup strategies
  • Compliance: HIPAA, SOC2, GDPR requirements
# Enterprise GPU resource management
apiVersion: v1
kind: ResourceQuota
metadata:
  name: enterprise-gpu-quota
spec:
  hard:
    nvidia.com/gpu: "100"
    nvidia.com/mig-1g.5gb: "200"
    requests.memory: "10Ti"
    requests.cpu: "1000"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["production", "critical"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-workload-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: gpu-inference