GPU Orchestration

Part of Module 4: AI in Production

🎮 GPU/TPU Fundamentals

GPU vs CPU for ML

Meaning: GPUs excel at parallel matrix operations, making them ideal for deep learning workloads.
Example: Training ResNet-50: CPU takes 10 days → GPU (V100) takes 8 hours → 30x speedup.
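
A minimal timing sketch of the same idea (numbers vary entirely by hardware; assumes PyTorch installed with CUDA support):
# CPU vs GPU matrix multiply timing (illustrative sketch)
import time
import torch

x = torch.randn(4096, 4096)

t0 = time.perf_counter()
x @ x
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    xg = x.cuda()
    xg @ xg                      # warm-up so one-time kernel setup is excluded
    torch.cuda.synchronize()     # GPU work is asynchronous; wait before timing
    t0 = time.perf_counter()
    xg @ xg
    torch.cuda.synchronize()
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  ~{cpu_s / gpu_s:.0f}x faster")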

Hardware Comparison:

Type          Model      Memory      Use Case
Consumer      RTX 4090   24 GB       Development, Small Models
Data Center   A100       40/80 GB    Training, Large Models
Data Center   H100       80 GB       LLM Training
TPU           TPU v4     32 GB HBM   Large-scale Training

Key Metrics:

  • FLOPS: Floating point operations per second
  • Memory Bandwidth: Data transfer speed (see the worked example below)
  • Tensor Cores: Specialized for matrix multiply
  • NVLink: High-speed GPU interconnect
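
A back-of-the-envelope use of the first two metrics: compare a kernel's arithmetic intensity against the GPU's compute-to-bandwidth ratio to see whether it is compute-bound or bandwidth-bound. The peak figures below are approximate published A100 numbers; the rest is arithmetic.
# Roofline-style estimate for a square FP16 matmul (approximate A100 peaks)
peak_flops = 312e12           # ~312 TFLOPS FP16 Tensor Core peak
peak_bw    = 2.0e12           # ~2 TB/s HBM bandwidth (A100 80GB)

n = 4096
flops  = 2 * n**3             # multiply-adds in an n x n matmul
bytes_ = 3 * n * n * 2        # read A and B, write C, 2 bytes per FP16 element

arithmetic_intensity = flops / bytes_          # FLOPs per byte moved
machine_balance      = peak_flops / peak_bw    # FLOPs the GPU can do per byte delivered

print(f"intensity ~{arithmetic_intensity:.0f} flop/B vs balance ~{machine_balance:.0f} flop/B")
# intensity >> balance -> compute-bound; small or skinny matmuls flip this to bandwidth-bound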

Multi-GPU Strategies

Parallelism Types:

  • Data Parallel: Replicate the model; split each batch across GPUs
  • Model Parallel: Split the model's layers/parameters across GPUs
  • Pipeline Parallel: Split layers into sequential stages and stream micro-batches through them
  • Tensor Parallel: Split individual weight matrices (and their operations) across GPUs
# PyTorch Distributed Data Parallel
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Initialize the process group (one process per GPU, e.g. launched with torchrun)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create the model on this process's GPU and wrap it for gradient synchronization
model = MyModel().cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])

# Training loop (dataloader, criterion, optimizer defined as usual)
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

☸️ Kubernetes for ML Workloads

GPU Scheduling in Kubernetes

Meaning: Kubernetes can schedule and manage GPU resources for ML workloads using device plugins.
Example: Submit training job → K8s finds available GPU node → schedules pod → monitors resource usage → cleans up after completion.
# Kubernetes GPU Pod Specification
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs
        memory: "32Gi"
        cpu: "8"
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: models
      mountPath: /models

  volumes:  # backing volumes for the mounts above (example PVC names)
  - name: dataset
    persistentVolumeClaim:
      claimName: dataset-pvc
  - name: models
    persistentVolumeClaim:
      claimName: models-pvc

  nodeSelector:
    gpu-type: "a100"  # Schedule on A100 nodes

  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
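
The same request can also be made programmatically. A minimal sketch with the official `kubernetes` Python client, mirroring the pod above (the namespace is illustrative, error handling omitted, and it assumes a working kubeconfig):
# Submit a GPU pod with the Kubernetes Python client
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-pod"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="training",
                image="pytorch/pytorch:latest",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2", "memory": "32Gi", "cpu": "8"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)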

GPU Operators:

  • NVIDIA GPU Operator: Automates GPU node setup
  • Device Plugin: Exposes GPUs to kubelet
  • Container Toolkit: GPU support in containers
  • DCGM Exporter: GPU metrics for Prometheus

ML Operators

Kubeflow Training Operators:

  • TFJob: TensorFlow distributed training
  • PyTorchJob: PyTorch distributed training
  • MPIJob: Horovod/MPI training
  • XGBoostJob: XGBoost training
# PyTorchJob for Distributed Training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    
    Worker:
      replicas: 3  # 3 worker pods
      template:
        spec:
          containers:
          - name: pytorch
            image: bert-training:latest
            resources:
              limits:
                nvidia.com/gpu: 2  # 2 GPUs per worker
            env:
            - name: NCCL_DEBUG
              value: "INFO"

📈 Auto-scaling Strategies

Horizontal Pod Autoscaling

Meaning: Automatically scale inference pods based on metrics like CPU, memory, or custom metrics.
Example: Inference service at 80% GPU utilization → HPA triggers → scales from 3 to 5 pods → load distributed.
# HPA with Custom Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  
  minReplicas: 2
  maxReplicas: 10
  
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  
  - type: External
    external:
      metric:
        name: request_latency_p99
      target:
        type: Value
        value: "100m"  # 100ms P99 latency
  
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50  # Scale up by 50%
        periodSeconds: 60

Cluster Autoscaling

Meaning: Automatically add/remove nodes from the cluster based on resource demands.
Example: Training job needs 8 GPUs → no available nodes → cluster autoscaler provisions new GPU node → job scheduled.

Autoscaling Strategies:

  • Reactive: Scale based on current demand
  • Predictive: Scale based on forecasted load
  • Scheduled: Scale based on time patterns
  • Burst: Quick scale for sudden spikes

Cost Optimization:

  • Use spot/preemptible instances for training (with checkpointing; see the sketch below)
  • Mixed instance types (CPU + GPU nodes)
  • Node pool priorities
  • Bin packing for efficient resource use
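
Spot capacity only pays off if jobs survive preemption, which means checkpointing regularly to persistent storage. A minimal save/resume sketch (the path and save frequency are illustrative; model, optimizer, and the training loop are assumed to exist as in the earlier examples):
# Periodic checkpointing so a preempted spot job can resume
import os
import torch

CKPT = "/models/checkpoint.pt"   # illustrative path on a persistent volume

start_epoch = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, num_epochs):
    train_one_epoch(model, dataloader, optimizer)   # training step assumed defined elsewhere
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        CKPT,
    )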

⚡ GPU Optimization Techniques

Memory Optimization

Techniques:

  • Gradient Checkpointing: Trade compute for memory
  • Mixed Precision: FP16/BF16 training
  • Gradient Accumulation: Simulate larger batches
  • Model Sharding: ZeRO optimization
  • Offloading: CPU/NVMe offload
# Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    
    # Mixed precision forward pass
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    
    # Scale loss and backward
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# DeepSpeed ZeRO Configuration
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
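
Gradient accumulation, listed among the techniques above, is a few extra lines in the training loop; a minimal sketch (model, dataloader, optimizer, criterion assumed as before):
# Gradient accumulation: simulate a larger batch on limited GPU memory
accumulation_steps = 4        # effective batch = per-step batch x 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs.cuda())
    loss = criterion(outputs, labels.cuda())
    (loss / accumulation_steps).backward()   # normalize so gradients match one large batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()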

Performance Profiling

Profiling Tools:

  • NVIDIA Nsight: GPU kernel analysis
  • PyTorch Profiler: Training bottlenecks
  • TensorBoard: Performance visualization
  • DCGM: Data center GPU monitoring
# PyTorch Profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # model and inputs are assumed to be defined and already on the GPU
    for _ in range(10):
        model(inputs)

# Print profiler results
print(prof.key_averages().table(
    sort_by="cuda_time_total", 
    row_limit=10
))

# Export a Chrome trace (open in chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")

Common Bottlenecks:

  • Data loading (CPU bottleneck; see the sketch below)
  • Small batch sizes (GPU underutilization)
  • Host-device transfers
  • Synchronization points
  • Memory fragmentation
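
The data-loading bottleneck is usually the first thing to check; a minimal sketch of the standard DataLoader mitigations (worker count and batch size are illustrative, dataset assumed defined elsewhere):
# Keep the GPU fed: parallel workers, pinned memory, async host-to-device copies
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # dataset assumed defined elsewhere
    batch_size=256,
    num_workers=8,            # parallel CPU workers for decoding/augmentation
    pin_memory=True,          # page-locked host memory enables async copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # avoid re-forking workers every epoch
)

for inputs, labels in loader:
    inputs = inputs.cuda(non_blocking=True)   # overlap copy with compute
    labels = labels.cuda(non_blocking=True)
    ...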

🏭 Production GPU Infrastructure

Multi-tenancy & Resource Sharing

GPU Sharing Strategies:

  • Time-slicing: Multiple workloads on one GPU
  • MIG (Multi-Instance GPU): Partition A100/H100
  • vGPU: Virtualized GPU resources
  • Queue-based: Job scheduling systems

Resource Quotas:

# Kubernetes ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    nvidia.com/gpu: "8"
    requests.memory: "256Gi"
    persistentvolumeclaims: "10"
  
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high", "medium"]

Monitoring & Alerting

Key Metrics:

  • GPU Utilization: Compute and memory %
  • Temperature: Thermal throttling
  • Power Draw: Performance state
  • ECC Errors: Memory errors
  • NVLink Traffic: Multi-GPU communication

Monitoring Stack:

  • DCGM Exporter: GPU metrics collection (queried in the sketch below)
  • Prometheus: Metrics storage
  • Grafana: Visualization dashboards
  • AlertManager: Alert routing
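
A minimal sketch of pulling a DCGM metric out of Prometheus over its HTTP API (the Prometheus URL is illustrative; DCGM_FI_DEV_GPU_UTIL is the exporter's GPU-utilization gauge):
# Query per-GPU utilization from Prometheus (DCGM exporter metrics)
import requests

PROM_URL = "http://prometheus.monitoring:9090"   # illustrative in-cluster address

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    gpu = result["metric"].get("gpu", "?")
    _, value = result["value"]                   # [timestamp, value-as-string]
    print(f"GPU {gpu}: {float(value):.0f}% utilized")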

✅ Best Practices

GPU Infrastructure Guidelines

Architecture Best Practices:

  • Separate training and inference clusters
  • Use node affinity for GPU types
  • Implement proper resource limits
  • Enable GPU monitoring from day one
  • Plan for multi-tenancy
  • Automate node provisioning

Cost Optimization:

  • Use spot instances for training (with checkpointing)
  • Reserved instances for inference
  • Right-size GPU types to workloads
  • Implement idle timeout policies
  • Share GPUs when possible
  • Consider cloud vs on-premise trade-offs

Common Pitfalls:

  • Not setting resource limits → OOM kills
  • Ignoring GPU driver versions
  • Missing CUDA compatibility checks
  • No GPU health monitoring
  • Inefficient data loading pipelines
  • Not using mixed precision training
