Performance Optimization in Go

Performance Fundamentals

Performance in Go

Go performance optimization focuses on three key areas: CPU efficiency, memory management, and concurrency. Understanding Go's runtime behavior is crucial for building high-performance applications.

Go performance optimization spans three areas:

  • CPU performance: algorithm efficiency, loop optimizations, function inlining, bounds check elimination, branch prediction
  • Memory performance: allocation reduction, garbage collection, object pooling, stack vs. heap allocation, cache efficiency
  • Concurrency: goroutine efficiency, channel optimization, lock contention, work distribution, context switching

Profiling and measurement tools: the CPU profiler, memory profiler, goroutine profiler, trace viewer, and benchmarks (go tool pprof, go test -bench, runtime/trace).

The optimization lifecycle: 1. Measure, 2. Profile, 3. Analyze, 4. Optimize, 5. Verify, 6. Repeat.

"Premature optimization is the root of all evil" - Donald Knuth. Always measure first, focus on bottlenecks, and maintain readability.

Measurement First

Always measure performance before optimizing. Use profiling tools to identify real bottlenecks rather than guessing where problems might be.

Runtime Behavior

Understanding Go's garbage collector, scheduler, and memory allocator behavior is key to writing performant applications.

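As a concrete starting point, the runtime exposes allocator and collector statistics through runtime.ReadMemStats; a minimal sketch of observing GC behavior (the workload is illustrative):

package main

import (
    "fmt"
    "runtime"
)

var sink []byte // package-level sink forces the allocations onto the heap

func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)

    // Allocate short-lived garbage to give the collector work.
    for i := 0; i < 100000; i++ {
        sink = make([]byte, 1024)
    }

    runtime.GC() // force a collection so the counters below are settled
    runtime.ReadMemStats(&after)

    fmt.Printf("GC cycles: %d\n", after.NumGC-before.NumGC)
    fmt.Printf("allocated: %d KiB\n", (after.TotalAlloc-before.TotalAlloc)/1024)
    fmt.Printf("GC pause:  %d ns\n", after.PauseTotalNs-before.PauseTotalNs)
}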

Optimization Strategy

Focus on algorithmic improvements first, then micro-optimizations. Maintain code readability while improving performance.

Profiling Go Applications

Use Go's built-in profiling tools to identify performance bottlenecks.

CPU Profiling

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }
    
    // Your application code here
    doWork()
}

// Analyze profile:
// go tool pprof cpu.prof
// (pprof) top
// (pprof) list functionName
// (pprof) web

// Memory profiling over HTTP
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func setupProfiling() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

// Access profiles:
// go tool pprof http://localhost:6060/debug/pprof/heap
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof http://localhost:6060/debug/pprof/goroutine
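
Profiles can also be generated directly from benchmarks, which is often the quickest route while iterating:

// go test -bench=. -cpuprofile=cpu.out
// go test -bench=. -memprofile=mem.out
// go tool pprof cpu.out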

Memory Optimization

Techniques for reducing memory allocation and garbage collection pressure.

// Avoid unnecessary allocations
// Bad: Creates new slice each call
func badConcat(strs []string) string {
    result := ""
    for _, s := range strs {
        result += s // Allocates new string each time
    }
    return result
}

// Good: Use strings.Builder
func goodConcat(strs []string) string {
    var builder strings.Builder
    for _, s := range strs {
        builder.WriteString(s)
    }
    return builder.String()
}

// Preallocate slices
// Bad
func badSlice() []int {
    var result []int
    for i := 0; i < 1000; i++ {
        result = append(result, i) // Multiple reallocations
    }
    return result
}

// Good
func goodSlice() []int {
    result := make([]int, 0, 1000) // Preallocate capacity
    for i := 0; i < 1000; i++ {
        result = append(result, i)
    }
    return result
}

// Object pooling (note: pooling a pointer to the slice, e.g. *[]byte,
// avoids the extra allocation of boxing the slice header on each Put)
var bufferPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 1024)
    },
}

func processWithPool(data []byte) {
    buf := bufferPool.Get().([]byte)
    defer bufferPool.Put(buf)
    
    // Use buffer
    copy(buf, data)
    // Process...
}

// Reduce allocations in hot paths
type Stats struct {
    count int64
    sum   int64
}

// Bad: Returns new struct (allocation)
func (s Stats) Add(value int64) Stats {
    return Stats{
        count: s.count + 1,
        sum:   s.sum + value,
    }
}

// Good: Modify in place (no allocation)
func (s *Stats) AddInPlace(value int64) {
    s.count++
    s.sum += value
}

// String interning for repeated strings
var internedStrings = make(map[string]string)
var internMu sync.RWMutex

func intern(s string) string {
    internMu.RLock()
    if interned, ok := internedStrings[s]; ok {
        internMu.RUnlock()
        return interned
    }
    internMu.RUnlock()
    
    internMu.Lock()
    defer internMu.Unlock()
    // Re-check: another goroutine may have interned s between
    // releasing the read lock and acquiring the write lock.
    if interned, ok := internedStrings[s]; ok {
        return interned
    }
    internedStrings[s] = s
    return s
}

Concurrency Optimization

Optimize concurrent code for better performance.

// Optimal goroutine pool size: for CPU-bound work, one worker
// per logical CPU is a sensible default
func optimalWorkers() int {
    return runtime.NumCPU()
}
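
A minimal worker-pool sketch built on optimalWorkers; Job and handleJob are hypothetical placeholders for your own job type and handler (requires "sync"):

func runPool(jobs <-chan Job) {
    var wg sync.WaitGroup
    for i := 0; i < optimalWorkers(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs { // drain until the channel is closed
                handleJob(job)     // hypothetical handler
            }
        }()
    }
    wg.Wait() // returns once the channel is closed and drained
}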

// Batching for reduced contention
type BatchProcessor struct {
    batch     []Item
    batchSize int
    mu        sync.Mutex
    process   func([]Item)
}

func (b *BatchProcessor) Add(item Item) {
    b.mu.Lock()
    b.batch = append(b.batch, item)
    
    if len(b.batch) >= b.batchSize {
        batch := b.batch
        b.batch = make([]Item, 0, b.batchSize)
        b.mu.Unlock()
        
        go b.process(batch)
    } else {
        b.mu.Unlock()
    }
}

// Lock-free data structures
type LockFreeCounter struct {
    value int64
}

func (c *LockFreeCounter) Increment() {
    atomic.AddInt64(&c.value, 1)
}

func (c *LockFreeCounter) Get() int64 {
    return atomic.LoadInt64(&c.value)
}

// Channel optimization
// Bad: Unbuffered channel forces a synchronous handoff on every send
func badChannel() {
    ch := make(chan int)
    go produce(ch)
    consume(ch)
}

// Good: Buffered channel lets the producer run ahead of the consumer
func goodChannel() {
    ch := make(chan int, 100)
    go produce(ch)
    consume(ch)
}

// Reduce lock granularity
type ShardedMap struct {
    shards [16]shard
}

type shard struct {
    mu    sync.RWMutex
    items map[string]interface{}
}

func NewShardedMap() *ShardedMap {
    m := &ShardedMap{}
    for i := range m.shards {
        m.shards[i].items = make(map[string]interface{})
    }
    return m
}

// fnv32 is an inline FNV-1a hash; it avoids the per-call allocation
// that hash/fnv's New32a would add
func fnv32(key string) uint32 {
    hash := uint32(2166136261)
    for i := 0; i < len(key); i++ {
        hash ^= uint32(key[i])
        hash *= 16777619
    }
    return hash
}

func (m *ShardedMap) getShard(key string) *shard {
    return &m.shards[fnv32(key)%16]
}

func (m *ShardedMap) Set(key string, value interface{}) {
    shard := m.getShard(key)
    shard.mu.Lock()
    shard.items[key] = value
    shard.mu.Unlock()
}

func (m *ShardedMap) Get(key string) (interface{}, bool) {
    shard := m.getShard(key)
    shard.mu.RLock()
    val, ok := shard.items[key]
    shard.mu.RUnlock()
    return val, ok
}
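
A brief usage sketch, assuming the NewShardedMap constructor defined above:

func main() {
    m := NewShardedMap()
    m.Set("user:42", "alice")
    if v, ok := m.Get("user:42"); ok {
        fmt.Println(v) // prints "alice"
    }
}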

Benchmarking

Write effective benchmarks to measure performance improvements.

// Basic benchmark
func BenchmarkFunction(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Code to benchmark
        result := expensiveOperation()
        _ = result // Prevent optimization
    }
}

// Benchmark with setup
func BenchmarkWithSetup(b *testing.B) {
    // Setup code (not timed)
    data := generateTestData()
    
    b.ResetTimer() // Reset timer after setup
    
    for i := 0; i < b.N; i++ {
        processData(data)
    }
}

// Parallel benchmark
func BenchmarkParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // Concurrent operation
            doWork()
        }
    })
}

// Sub-benchmarks
func BenchmarkSizes(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}
    
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
            data := make([]int, size)
            for i := range data {
                data[i] = rand.Intn(size) // unsorted input
            }
            tmp := make([]int, size)
            b.ResetTimer()
            
            for i := 0; i < b.N; i++ {
                copy(tmp, data) // re-copy so each iteration sorts unsorted data
                sort.Ints(tmp)
            }
        })
    }
}

// Memory allocation benchmark
func BenchmarkAllocation(b *testing.B) {
    b.ReportAllocs() // Report allocation statistics
    
    for i := 0; i < b.N; i++ {
        s := make([]int, 100)
        _ = s
    }
}

// Custom metrics
func BenchmarkCustom(b *testing.B) {
    var totalBytes int64
    
    for i := 0; i < b.N; i++ {
        data := processFile()
        totalBytes += int64(len(data))
    }
    
    b.SetBytes(totalBytes / int64(b.N))
}

// Run benchmarks:
// go test -bench=.
// go test -bench=. -benchmem
// go test -bench=. -benchtime=10s
// go test -bench=. -cpu=1,2,4,8
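
To compare before/after runs with statistical rigor, the usual companion is the benchstat tool (installable from golang.org/x/perf/cmd/benchstat):

// go test -bench=. -count=10 > old.txt
// (apply the optimization)
// go test -bench=. -count=10 > new.txt
// benchstat old.txt new.txt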

Optimization Techniques

Specific techniques for optimizing Go code.

// Bounds check elimination
func sum(nums []int) int {
    if len(nums) == 0 {
        return 0
    }
    
    total := 0
    // Compiler can eliminate bounds checks
    for i := range nums {
        total += nums[i]
    }
    return total
}

// Escape analysis optimization
// go build -gcflags="-m"

// Stack allocation (doesn't escape)
func stackAlloc() int {
    x := 42 // Allocated on stack
    return x
}

// Heap allocation (escapes)
func heapAlloc() *int {
    x := 42 // Escapes to heap
    return &x
}

// Inlining
// Small functions are inlined automatically
func add(a, b int) int { // Will be inlined
    return a + b
}

// Prevent inlining with //go:noinline
//go:noinline
func noInline(x int) int {
    return x * 2
}

// Fast paths for common cases
func processValue(v interface{}) string {
    // Fast path for common types
    switch val := v.(type) {
    case string:
        return val
    case int:
        return strconv.Itoa(val)
    default:
        // Slow path for other types
        return fmt.Sprintf("%v", v)
    }
}

// Table-driven alternatives
var dayNames = [7]string{
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
}

func getDayName(day int) string {
    if day < 0 || day > 6 {
        return "Invalid"
    }
    return dayNames[day] // Faster than switch
}

// SIMD-friendly code; assumes len(a) == len(b)
func vectorAdd(a, b []float64) []float64 {
    n := len(a)
    result := make([]float64, n)
    
    // Process in chunks for potential SIMD
    for i := 0; i < n-3; i += 4 {
        result[i] = a[i] + b[i]
        result[i+1] = a[i+1] + b[i+1]
        result[i+2] = a[i+2] + b[i+2]
        result[i+3] = a[i+3] + b[i+3]
    }
    
    // Handle remainder
    for i := n - n%4; i < n; i++ {
        result[i] = a[i] + b[i]
    }
    
    return result
}

Advanced Optimization Patterns

Advanced techniques for achieving maximum performance in Go applications.

// Zero-allocation string builder using unsafe. The returned string
// aliases the builder's buffer, so the builder must not be written
// to after String() is called; strings.Builder uses the same trick
// internally.
type FastBuilder struct {
    buf []byte
}

func (fb *FastBuilder) WriteString(s string) {
    fb.buf = append(fb.buf, s...)
}

func (fb *FastBuilder) String() string {
    return *(*string)(unsafe.Pointer(&fb.buf))
}

// Memory pool for hot path allocations
type BufferPool struct {
    pool sync.Pool
}

func NewBufferPool() *BufferPool {
    return &BufferPool{
        pool: sync.Pool{
            New: func() interface{} {
                return make([]byte, 0, 1024)
            },
        },
    }
}

func (bp *BufferPool) Get() []byte {
    return bp.pool.Get().([]byte)[:0]
}

func (bp *BufferPool) Put(buf []byte) {
    if cap(buf) <= 16384 { // don't pool oversized buffers; let the GC reclaim them
        bp.pool.Put(buf)
    }
}

// Lock-free counter using atomic operations
type AtomicCounter struct {
    value int64
}

func (ac *AtomicCounter) Increment() int64 {
    return atomic.AddInt64(&ac.value, 1)
}

func (ac *AtomicCounter) Get() int64 {
    return atomic.LoadInt64(&ac.value)
}

// High-performance hash map skeleton with linear probing
// (requires "hash/fnv" and "math/bits")
type FastMap struct {
    keys   []string
    values []interface{}
    mask   int
}

func NewFastMap(size int) *FastMap {
    // Round up to the next power of 2
    size = 1 << bits.Len(uint(size-1))
    return &FastMap{
        keys:   make([]string, size),
        values: make([]interface{}, size),
        mask:   size - 1,
    }
}

func (fm *FastMap) hash(key string) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) & fm.mask
}
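
The skeleton above omits the probing itself; a minimal sketch of Set and Get with linear probing, under two stated simplifications: no deletion or resizing, and the empty string reserved as the free-slot marker:

func (fm *FastMap) Set(key string, value interface{}) {
    i := fm.hash(key)
    // Probe forward until we find the key or a free slot.
    // This sketch assumes the table never fills up.
    for fm.keys[i] != "" && fm.keys[i] != key {
        i = (i + 1) & fm.mask
    }
    fm.keys[i] = key
    fm.values[i] = value
}

func (fm *FastMap) Get(key string) (interface{}, bool) {
    i := fm.hash(key)
    for fm.keys[i] != "" {
        if fm.keys[i] == key {
            return fm.values[i], true
        }
        i = (i + 1) & fm.mask
    }
    return nil, false
}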

Performance Case Studies

Real-World Optimization Examples

Learn from actual performance improvements achieved through systematic profiling and optimization.

Before: Slow JSON Processing

// Inefficient: Multiple passes, allocations
func ProcessJSONSlow(data []byte) ([]Result, error) {
    var items []map[string]interface{}
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }
    
    results := []Result{}
    for _, item := range items {
        if name, ok := item["name"].(string); ok {
            results = append(results, Result{Name: name})
        }
    }
    return results, nil
}

Performance: 850ms for 100k records

After: Optimized Processing

// Efficient: Direct unmarshaling, pre-allocated
type JSONItem struct {
    Name string `json:"name"`
}

func ProcessJSONFast(data []byte) ([]Result, error) {
    var items []JSONItem
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }
    
    results := make([]Result, 0, len(items))
    for _, item := range items {
        results = append(results, Result{Name: item.Name})
    }
    return results, nil
}

Performance: 95ms for 100k records (9x faster)

Optimization       Before      After       Improvement   Key Technique
HTTP Handler       45k req/s   165k req/s  3.7x          Buffer pooling, reduced allocations
Database Queries   2.8s        280ms       10x           Connection pooling, prepared statements
CSV Processing     4.5s        520ms       8.7x          Memory mapping, zero-copy parsing
Cache Lookups      150μs       12μs        12.5x         Lock-free data structures

Performance Best Practices

Measurement Strategy

  • Profile production workloads
  • Use realistic data patterns
  • Measure multiple dimensions
  • Consider percentiles over averages (see the sketch after this list)
  • Validate in production environment
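
A minimal sketch of reporting tail latency as percentiles (nearest-rank method) rather than a mean; the sample values are illustrative:

package main

import (
    "fmt"
    "math"
    "sort"
    "time"
)

// percentile returns the p-quantile (0 < p <= 1) of the samples
// using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    idx := int(math.Ceil(p*float64(len(sorted)))) - 1
    if idx < 0 {
        idx = 0
    }
    return sorted[idx]
}

func main() {
    samples := []time.Duration{
        3 * time.Millisecond, 5 * time.Millisecond,
        4 * time.Millisecond, 120 * time.Millisecond, // one outlier
    }
    fmt.Println("p50:", percentile(samples, 0.50)) // unaffected by the outlier
    fmt.Println("p99:", percentile(samples, 0.99)) // exposes the outlier
}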

Optimization Priorities

  1. Algorithm complexity improvements
  2. Hot path memory allocation reduction
  3. I/O and concurrency optimization
  4. Data structure selection
  5. Micro-optimizations (last resort)

Go-Specific Techniques

  • Leverage escape analysis knowledge
  • Use sync.Pool for frequent allocations
  • Understand GC tuning parameters (sketch after this list)
  • Profile goroutine scheduling
  • Consider unsafe for critical paths
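
For GC tuning specifically, the two first-line knobs are GOGC and (since Go 1.19) GOMEMLIMIT, settable via environment variables or runtime/debug; a sketch with illustrative values:

package main

import "runtime/debug"

func main() {
    // Equivalent to GOGC=200: let the heap grow further between
    // collections, trading memory for fewer GC cycles.
    debug.SetGCPercent(200)

    // Equivalent to GOMEMLIMIT=4GiB: a soft limit that makes the
    // collector run more aggressively as usage approaches it.
    debug.SetMemoryLimit(4 << 30)
}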

Quick Performance Wins

  • Pre-allocate slices: make([]T, 0, expectedSize)
  • Use strings.Builder: For string concatenation
  • Buffer I/O: bufio.Reader/Writer wrappers
  • Reuse objects: sync.Pool for temporary allocations
  • Choose strconv: Over fmt for conversions (benchmarks below)
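
A pair of illustrative micro-benchmarks behind the strconv advice: strconv.Itoa converts directly, while fmt.Sprintf also pays for format-string parsing and interface boxing:

import (
    "fmt"
    "strconv"
    "testing"
)

func BenchmarkItoa(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = strconv.Itoa(i)
    }
}

func BenchmarkSprintf(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = fmt.Sprintf("%d", i)
    }
}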

Performance Anti-Patterns

  • Premature optimization: Optimizing before measuring
  • Micro-benchmark tunnel vision: Ignoring real-world workload patterns
  • Memory vs. CPU trade-off blindness: Optimizing one resource at the expense of another
  • Optimizing the wrong bottleneck: Tuning code that is not on a proven hot path
  • Readability sacrifice: Complex code for marginal gains

Performance Challenges

Hands-On Performance Projects

Master performance optimization through these challenging real-world scenarios:

1. High-Throughput Logger

Design a logging system handling 1M+ entries/second with minimal GC pressure and consistent latency.

Difficulty: Beginner · Focus: Lock-free design

2. Memory-Efficient Cache

Build an LRU cache supporting millions of entries with < 50 bytes overhead per entry and concurrent access.

Difficulty: Intermediate · Focus: Concurrency

3. Ultra-Fast JSON Parser

Create a domain-specific JSON parser 10x faster than encoding/json for your data schema.

Difficulty: Advanced · Focus: Zero-copy parsing

4. Distributed Load Balancer

Implement a load balancer handling 100k+ connections with < 1ms routing latency and minimal memory per connection.

Difficulty: Expert · Focus: Networking