Performance Optimization in Go

Performance Fundamentals

Performance in Go

Go performance optimization focuses on three key areas: CPU efficiency, memory management, and concurrency. Understanding Go's runtime behavior is crucial for building high-performance applications.

Go performance optimization spans three areas:

  • CPU performance: algorithm efficiency, loop optimizations, function inlining, bounds check elimination, branch prediction
  • Memory performance: allocation reduction, garbage collection, object pooling, stack vs. heap allocation, cache efficiency
  • Concurrency: goroutine efficiency, channel optimization, lock contention, work distribution, context switching

Profiling and measurement tools: the CPU profiler, memory profiler, goroutine profiler, trace viewer, and benchmarks (go tool pprof, go test -bench, runtime/trace).

The optimization lifecycle: 1. Measure, 2. Profile, 3. Analyze, 4. Optimize, 5. Verify, 6. Repeat.

"Premature optimization is the root of all evil" - Donald Knuth. Always measure first, focus on bottlenecks, and maintain readability.

Measurement First

Always measure performance before optimizing. Use profiling tools to identify real bottlenecks rather than guessing where problems might be.

Runtime Behavior

Understanding Go's garbage collector, scheduler, and memory allocator behavior is key to writing performant applications.

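As a concrete starting point, the runtime exposes allocator and collector statistics through runtime.ReadMemStats; a minimal sketch of observing GC behavior (the workload is illustrative):

package main

import (
    "fmt"
    "runtime"
)

var sink []byte // package-level sink forces the allocations onto the heap

func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)

    // Allocate short-lived garbage to give the collector work.
    for i := 0; i < 100000; i++ {
        sink = make([]byte, 1024)
    }

    runtime.GC() // force a collection so the counters below are settled
    runtime.ReadMemStats(&after)

    fmt.Printf("GC cycles: %d\n", after.NumGC-before.NumGC)
    fmt.Printf("allocated: %d KiB\n", (after.TotalAlloc-before.TotalAlloc)/1024)
    fmt.Printf("GC pause:  %d ns\n", after.PauseTotalNs-before.PauseTotalNs)
}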

Optimization Strategy

Focus on algorithmic improvements first, then micro-optimizations. Maintain code readability while improving performance.

Profiling Go Applications

Use Go's built-in profiling tools to identify performance bottlenecks.

CPU Profiling

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }
    
    // Your application code here
    doWork()
}

// Analyze profile:
// go tool pprof cpu.prof
// (pprof) top
// (pprof) list functionName
// (pprof) web

// Memory profiling over HTTP
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func setupProfiling() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

// Access profiles:
// go tool pprof http://localhost:6060/debug/pprof/heap
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof http://localhost:6060/debug/pprof/goroutine
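
Profiles can also be generated directly from benchmarks, which is often the quickest route while iterating:

// go test -bench=. -cpuprofile=cpu.out
// go test -bench=. -memprofile=mem.out
// go tool pprof cpu.out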

Memory Optimization

Techniques for reducing memory allocation and garbage collection pressure.

// Avoid unnecessary allocations
// Bad: Creates new slice each call
func badConcat(strs []string) string {
    result := ""
    for _, s := range strs {
        result += s // Allocates new string each time
    }
    return result
}

// Good: Use strings.Builder
func goodConcat(strs []string) string {
    var builder strings.Builder
    for _, s := range strs {
        builder.WriteString(s)
    }
    return builder.String()
}

// Preallocate slices
// Bad
func badSlice() []int {
    var result []int
    for i := 0; i < 1000; i++ {
        result = append(result, i) // Multiple reallocations
    }
    return result
}

// Good
func goodSlice() []int {
    result := make([]int, 0, 1000) // Preallocate capacity
    for i := 0; i < 1000; i++ {
        result = append(result, i)
    }
    return result
}

// Object pooling (note: pooling a pointer to the slice, e.g. *[]byte,
// avoids the extra allocation of boxing the slice header on each Put)
var bufferPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 1024)
    },
}

func processWithPool(data []byte) {
    buf := bufferPool.Get().([]byte)
    defer bufferPool.Put(buf)
    
    // Use buffer
    copy(buf, data)
    // Process...
}

// Reduce allocations in hot paths
type Stats struct {
    count int64
    sum   int64
}

// Bad: Returns new struct (allocation)
func (s Stats) Add(value int64) Stats {
    return Stats{
        count: s.count + 1,
        sum:   s.sum + value,
    }
}

// Good: Modify in place (no allocation)
func (s *Stats) AddInPlace(value int64) {
    s.count++
    s.sum += value
}

// String interning for repeated strings
var internedStrings = make(map[string]string)
var internMu sync.RWMutex

func intern(s string) string {
    internMu.RLock()
    if interned, ok := internedStrings[s]; ok {
        internMu.RUnlock()
        return interned
    }
    internMu.RUnlock()
    
    internMu.Lock()
    defer internMu.Unlock()
    // Re-check: another goroutine may have interned s between
    // releasing the read lock and acquiring the write lock.
    if interned, ok := internedStrings[s]; ok {
        return interned
    }
    internedStrings[s] = s
    return s
}

Concurrency Optimization

Optimize concurrent code for better performance.

// Optimal goroutine pool size: for CPU-bound work, one worker
// per logical CPU is a sensible default
func optimalWorkers() int {
    return runtime.NumCPU()
}
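
A minimal worker-pool sketch built on optimalWorkers; Job and handleJob are hypothetical placeholders for your own job type and handler (requires "sync"):

func runPool(jobs <-chan Job) {
    var wg sync.WaitGroup
    for i := 0; i < optimalWorkers(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs { // drain until the channel is closed
                handleJob(job)     // hypothetical handler
            }
        }()
    }
    wg.Wait() // returns once the channel is closed and drained
}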

// Batching for reduced contention
type BatchProcessor struct {
    batch     []Item
    batchSize int
    mu        sync.Mutex
    process   func([]Item)
}

func (b *BatchProcessor) Add(item Item) {
    b.mu.Lock()
    b.batch = append(b.batch, item)
    
    if len(b.batch) >= b.batchSize {
        batch := b.batch
        b.batch = make([]Item, 0, b.batchSize)
        b.mu.Unlock()
        
        go b.process(batch)
    } else {
        b.mu.Unlock()
    }
}

// Lock-free data structures
type LockFreeCounter struct {
    value int64
}

func (c *LockFreeCounter) Increment() {
    atomic.AddInt64(&c.value, 1)
}

func (c *LockFreeCounter) Get() int64 {
    return atomic.LoadInt64(&c.value)
}

// Channel optimization
// Bad: Unbuffered channel forces a synchronous handoff on every send
func badChannel() {
    ch := make(chan int)
    go produce(ch)
    consume(ch)
}

// Good: Buffered channel lets the producer run ahead of the consumer
func goodChannel() {
    ch := make(chan int, 100)
    go produce(ch)
    consume(ch)
}

// Reduce lock granularity
type ShardedMap struct {
    shards [16]shard
}

type shard struct {
    mu    sync.RWMutex
    items map[string]interface{}
}

func NewShardedMap() *ShardedMap {
    m := &ShardedMap{}
    for i := range m.shards {
        m.shards[i].items = make(map[string]interface{})
    }
    return m
}

// fnv32 is an inline FNV-1a hash; it avoids the per-call allocation
// that hash/fnv's New32a would add
func fnv32(key string) uint32 {
    hash := uint32(2166136261)
    for i := 0; i < len(key); i++ {
        hash ^= uint32(key[i])
        hash *= 16777619
    }
    return hash
}

func (m *ShardedMap) getShard(key string) *shard {
    return &m.shards[fnv32(key)%16]
}

func (m *ShardedMap) Set(key string, value interface{}) {
    shard := m.getShard(key)
    shard.mu.Lock()
    shard.items[key] = value
    shard.mu.Unlock()
}

func (m *ShardedMap) Get(key string) (interface{}, bool) {
    shard := m.getShard(key)
    shard.mu.RLock()
    val, ok := shard.items[key]
    shard.mu.RUnlock()
    return val, ok
}
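
A brief usage sketch, assuming the NewShardedMap constructor defined above:

func main() {
    m := NewShardedMap()
    m.Set("user:42", "alice")
    if v, ok := m.Get("user:42"); ok {
        fmt.Println(v) // prints "alice"
    }
}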

Benchmarking

Write effective benchmarks to measure performance improvements.

// Basic benchmark
func BenchmarkFunction(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Code to benchmark
        result := expensiveOperation()
        _ = result // Prevent optimization
    }
}

// Benchmark with setup
func BenchmarkWithSetup(b *testing.B) {
    // Setup code (not timed)
    data := generateTestData()
    
    b.ResetTimer() // Reset timer after setup
    
    for i := 0; i < b.N; i++ {
        processData(data)
    }
}

// Parallel benchmark
func BenchmarkParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // Concurrent operation
            doWork()
        }
    })
}

// Sub-benchmarks
func BenchmarkSizes(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}
    
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
            data := make([]int, size)
            for i := range data {
                data[i] = rand.Intn(size) // unsorted input
            }
            tmp := make([]int, size)
            b.ResetTimer()
            
            for i := 0; i < b.N; i++ {
                copy(tmp, data) // re-copy so each iteration sorts unsorted data
                sort.Ints(tmp)
            }
        })
    }
}

// Memory allocation benchmark
func BenchmarkAllocation(b *testing.B) {
    b.ReportAllocs() // Report allocation statistics
    
    for i := 0; i < b.N; i++ {
        s := make([]int, 100)
        _ = s
    }
}

// Custom metrics
func BenchmarkCustom(b *testing.B) {
    var totalBytes int64
    
    for i := 0; i < b.N; i++ {
        data := processFile()
        totalBytes += int64(len(data))
    }
    
    b.SetBytes(totalBytes / int64(b.N))
}

// Run benchmarks:
// go test -bench=.
// go test -bench=. -benchmem
// go test -bench=. -benchtime=10s
// go test -bench=. -cpu=1,2,4,8
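
To compare before/after runs with statistical rigor, the usual companion is the benchstat tool (installable from golang.org/x/perf/cmd/benchstat):

// go test -bench=. -count=10 > old.txt
// (apply the optimization)
// go test -bench=. -count=10 > new.txt
// benchstat old.txt new.txt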

Optimization Techniques

Specific techniques for optimizing Go code.

// Bounds check elimination
func sum(nums []int) int {
    if len(nums) == 0 {
        return 0
    }
    
    total := 0
    // Compiler can eliminate bounds checks
    for i := range nums {
        total += nums[i]
    }
    return total
}

// Escape analysis optimization
// go build -gcflags="-m"

// Stack allocation (doesn't escape)
func stackAlloc() int {
    x := 42 // Allocated on stack
    return x
}

// Heap allocation (escapes)
func heapAlloc() *int {
    x := 42 // Escapes to heap
    return &x
}

// Inlining
// Small functions are inlined automatically
func add(a, b int) int { // Will be inlined
    return a + b
}

// Prevent inlining with //go:noinline
//go:noinline
func noInline(x int) int {
    return x * 2
}

// Fast paths for common cases
func processValue(v interface{}) string {
    // Fast path for common types
    switch val := v.(type) {
    case string:
        return val
    case int:
        return strconv.Itoa(val)
    default:
        // Slow path for other types
        return fmt.Sprintf("%v", v)
    }
}

// Table-driven alternatives
var dayNames = [7]string{
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
}

func getDayName(day int) string {
    if day < 0 || day > 6 {
        return "Invalid"
    }
    return dayNames[day] // Faster than switch
}

// SIMD-friendly code; assumes len(a) == len(b)
func vectorAdd(a, b []float64) []float64 {
    n := len(a)
    result := make([]float64, n)
    
    // Process in chunks for potential SIMD
    for i := 0; i < n-3; i += 4 {
        result[i] = a[i] + b[i]
        result[i+1] = a[i+1] + b[i+1]
        result[i+2] = a[i+2] + b[i+2]
        result[i+3] = a[i+3] + b[i+3]
    }
    
    // Handle remainder
    for i := n - n%4; i < n; i++ {
        result[i] = a[i] + b[i]
    }
    
    return result
}

Advanced Optimization Patterns

Advanced techniques for achieving maximum performance in Go applications.

// Zero-allocation string builder using unsafe. The returned string
// aliases the builder's buffer, so the builder must not be written
// to after String() is called; strings.Builder uses the same trick
// internally.
type FastBuilder struct {
    buf []byte
}

func (fb *FastBuilder) WriteString(s string) {
    fb.buf = append(fb.buf, s...)
}

func (fb *FastBuilder) String() string {
    return *(*string)(unsafe.Pointer(&fb.buf))
}

// Memory pool for hot path allocations
type BufferPool struct {
    pool sync.Pool
}

func NewBufferPool() *BufferPool {
    return &BufferPool{
        pool: sync.Pool{
            New: func() interface{} {
                return make([]byte, 0, 1024)
            },
        },
    }
}

func (bp *BufferPool) Get() []byte {
    return bp.pool.Get().([]byte)[:0]
}

func (bp *BufferPool) Put(buf []byte) {
    if cap(buf) <= 16384 { // don't pool oversized buffers; let the GC reclaim them
        bp.pool.Put(buf)
    }
}

// Lock-free counter using atomic operations
type AtomicCounter struct {
    value int64
}

func (ac *AtomicCounter) Increment() int64 {
    return atomic.AddInt64(&ac.value, 1)
}

func (ac *AtomicCounter) Get() int64 {
    return atomic.LoadInt64(&ac.value)
}

// High-performance hash map skeleton with linear probing
// (requires "hash/fnv" and "math/bits")
type FastMap struct {
    keys   []string
    values []interface{}
    mask   int
}

func NewFastMap(size int) *FastMap {
    // Round up to the next power of 2
    size = 1 << bits.Len(uint(size-1))
    return &FastMap{
        keys:   make([]string, size),
        values: make([]interface{}, size),
        mask:   size - 1,
    }
}

func (fm *FastMap) hash(key string) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) & fm.mask
}
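
The skeleton above omits the probing itself; a minimal sketch of Set and Get with linear probing, under two stated simplifications: no deletion or resizing, and the empty string reserved as the free-slot marker:

func (fm *FastMap) Set(key string, value interface{}) {
    i := fm.hash(key)
    // Probe forward until we find the key or a free slot.
    // This sketch assumes the table never fills up.
    for fm.keys[i] != "" && fm.keys[i] != key {
        i = (i + 1) & fm.mask
    }
    fm.keys[i] = key
    fm.values[i] = value
}

func (fm *FastMap) Get(key string) (interface{}, bool) {
    i := fm.hash(key)
    for fm.keys[i] != "" {
        if fm.keys[i] == key {
            return fm.values[i], true
        }
        i = (i + 1) & fm.mask
    }
    return nil, false
}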

Performance Case Studies

Real-World Optimization Examples

Learn from actual performance improvements achieved through systematic profiling and optimization.

Before: Slow JSON Processing

// Inefficient: Multiple passes, allocations
func ProcessJSONSlow(data []byte) ([]Result, error) {
    var items []map[string]interface{}
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }
    
    results := []Result{}
    for _, item := range items {
        if name, ok := item["name"].(string); ok {
            results = append(results, Result{Name: name})
        }
    }
    return results, nil
}

Performance: 850ms for 100k records

After: Optimized Processing

// Efficient: Direct unmarshaling, pre-allocated
type JSONItem struct {
    Name string `json:"name"`
}

func ProcessJSONFast(data []byte) ([]Result, error) {
    var items []JSONItem
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }
    
    results := make([]Result, 0, len(items))
    for _, item := range items {
        results = append(results, Result{Name: item.Name})
    }
    return results, nil
}

Performance: 95ms for 100k records (9x faster)

Optimization       Before      After       Improvement   Key Technique
HTTP Handler       45k req/s   165k req/s  3.7x          Buffer pooling, reduced allocations
Database Queries   2.8s        280ms       10x           Connection pooling, prepared statements
CSV Processing     4.5s        520ms       8.7x          Memory mapping, zero-copy parsing
Cache Lookups      150μs       12μs        12.5x         Lock-free data structures

Performance Best Practices

Measurement Strategy

  • Profile production workloads
  • Use realistic data patterns
  • Measure multiple dimensions
  • Consider percentiles over averages (see the sketch after this list)
  • Validate in production environment
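
A minimal sketch of reporting tail latency as percentiles (nearest-rank method) rather than a mean; the sample values are illustrative:

package main

import (
    "fmt"
    "math"
    "sort"
    "time"
)

// percentile returns the p-quantile (0 < p <= 1) of the samples
// using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    idx := int(math.Ceil(p*float64(len(sorted)))) - 1
    if idx < 0 {
        idx = 0
    }
    return sorted[idx]
}

func main() {
    samples := []time.Duration{
        3 * time.Millisecond, 5 * time.Millisecond,
        4 * time.Millisecond, 120 * time.Millisecond, // one outlier
    }
    fmt.Println("p50:", percentile(samples, 0.50)) // unaffected by the outlier
    fmt.Println("p99:", percentile(samples, 0.99)) // exposes the outlier
}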

Optimization Priorities

  1. Algorithm complexity improvements
  2. Hot path memory allocation reduction
  3. I/O and concurrency optimization
  4. Data structure selection
  5. Micro-optimizations (last resort)

Go-Specific Techniques

  • Leverage escape analysis knowledge
  • Use sync.Pool for frequent allocations
  • Understand GC tuning parameters (sketch after this list)
  • Profile goroutine scheduling
  • Consider unsafe for critical paths
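
For GC tuning specifically, the two first-line knobs are GOGC and (since Go 1.19) GOMEMLIMIT, settable via environment variables or runtime/debug; a sketch with illustrative values:

package main

import "runtime/debug"

func main() {
    // Equivalent to GOGC=200: let the heap grow further between
    // collections, trading memory for fewer GC cycles.
    debug.SetGCPercent(200)

    // Equivalent to GOMEMLIMIT=4GiB: a soft limit that makes the
    // collector run more aggressively as usage approaches it.
    debug.SetMemoryLimit(4 << 30)
}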

Quick Performance Wins

  • Pre-allocate slices: make([]T, 0, expectedSize)
  • Use strings.Builder: For string concatenation
  • Buffer I/O: bufio.Reader/Writer wrappers
  • Reuse objects: sync.Pool for temporary allocations
  • Choose strconv: Over fmt for conversions (benchmarks below)
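
A pair of illustrative micro-benchmarks behind the strconv advice: strconv.Itoa converts directly, while fmt.Sprintf also pays for format-string parsing and interface boxing:

import (
    "fmt"
    "strconv"
    "testing"
)

func BenchmarkItoa(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = strconv.Itoa(i)
    }
}

func BenchmarkSprintf(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = fmt.Sprintf("%d", i)
    }
}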

Performance Anti-Patterns

  • Premature optimization: Optimizing before measuring
  • Micro-benchmark tunnel vision: Ignoring real-world workload patterns
  • Memory vs. CPU trade-off blindness: Optimizing one resource at the expense of another
  • Optimizing the wrong bottleneck: Tuning code that is not on a proven hot path
  • Readability sacrifice: Complex code for marginal gains

Performance Challenges

Hands-On Performance Projects

Master performance optimization through these challenging real-world scenarios:

1. High-Throughput Logger

Design a logging system handling 1M+ entries/second with minimal GC pressure and consistent latency.

Difficulty: Beginner · Focus: Lock-free design

2. Memory-Efficient Cache

Build an LRU cache supporting millions of entries with < 50 bytes overhead per entry and concurrent access.

Difficulty: Intermediate · Focus: Concurrency

3. Ultra-Fast JSON Parser

Create a domain-specific JSON parser 10x faster than encoding/json for your data schema.

Difficulty: Advanced · Focus: Zero-copy parsing

4. Distributed Load Balancer

Implement a load balancer handling 100k+ connections with < 1ms routing latency and minimal memory per connection.

Difficulty: Expert · Focus: Networking