Performance Fundamentals
Performance in Go
Go performance optimization focuses on three key areas: CPU efficiency, memory management, and concurrency. Understanding Go's runtime behavior is crucial for building high-performance applications.
Measurement First
Always measure performance before optimizing. Use profiling tools to identify real bottlenecks rather than guessing where problems might be.
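A typical measure-first workflow captures benchmark results before and after a change and compares them statistically. The commands below are a sketch assuming a package with benchmarks and the golang.org/x/perf/cmd/benchstat tool installed.

// Baseline run (before the change); -count gives benchstat enough samples
// go test -bench=. -benchmem -count=10 > old.txt
//
// Apply the change, then run again
// go test -bench=. -benchmem -count=10 > new.txt
//
// Compare the two runs; benchstat reports deltas and statistical significance
// go install golang.org/x/perf/cmd/benchstat@latest
// benchstat old.txt new.txt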
Runtime Behavior
Understanding Go's garbage collector, scheduler, and memory allocator behavior is key to writing performant applications.
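The runtime exposes much of this behavior directly. The snippet below is a minimal sketch that prints allocator, GC, and scheduler statistics (the printRuntimeStats name is illustrative).

import (
    "fmt"
    "runtime"
    "time"
)

// printRuntimeStats dumps a few runtime metrics that are useful when
// reasoning about GC, scheduler, and allocator behavior.
func printRuntimeStats() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // stops the world briefly; fine for diagnostics

    fmt.Printf("heap alloc:      %d KiB\n", m.HeapAlloc/1024)
    fmt.Printf("total GC cycles: %d\n", m.NumGC)
    fmt.Printf("total GC pause:  %s\n", time.Duration(m.PauseTotalNs))
    fmt.Printf("goroutines:      %d\n", runtime.NumGoroutine())
    fmt.Printf("GOMAXPROCS:      %d\n", runtime.GOMAXPROCS(0)) // 0 = query only
}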
Optimization Strategy
Focus on algorithmic improvements first, then micro-optimizations. Maintain code readability while improving performance.
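As a hypothetical illustration of why algorithmic changes come first: replacing repeated linear scans with a set lookup changes the complexity from O(n·m) to O(n+m), a win no micro-optimization of the inner loop could match.

// containsSlow checks membership with a linear scan: O(n·m) overall
// when called for m needles against a haystack of n entries.
func containsSlow(haystack []string, needle string) bool {
    for _, s := range haystack {
        if s == needle {
            return true
        }
    }
    return false
}

// buildIndex converts the haystack into a set once: O(n) to build,
// then O(1) per lookup afterwards.
func buildIndex(haystack []string) map[string]struct{} {
    index := make(map[string]struct{}, len(haystack))
    for _, s := range haystack {
        index[s] = struct{}{}
    }
    return index
}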
Profiling Go Applications
Use Go's built-in profiling tools to identify performance bottlenecks.
CPU Profiling
package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }

    // Your application code here
    doWork()
}

// Analyze profile:
// go tool pprof cpu.prof
// (pprof) top
// (pprof) list functionName
// (pprof) web

// Memory profiling over HTTP (typically in a separate file): import
// "net/http" and blank-import "net/http/pprof" to register the
// /debug/pprof handlers:
//
//   import (
//       "net/http"
//       _ "net/http/pprof"
//   )

func setupProfiling() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

// Access profiles:
// go tool pprof http://localhost:6060/debug/pprof/heap
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof http://localhost:6060/debug/pprof/goroutine
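CPU and heap profiles say little about scheduler behavior. For that, the standard runtime/trace package is the usual tool; the sketch below (doWork stands in for your workload, as above) writes a trace that is then analyzed with go tool trace.

import (
    "log"
    "os"
    "runtime/trace"
)

func runWithTrace() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    doWork() // the code whose scheduling behavior you want to inspect
}

// Analyze the trace (goroutine scheduling, GC, blocking):
// go tool trace trace.out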
Memory Optimization
Techniques for reducing memory allocation and garbage collection pressure.
// Avoid unnecessary allocations

// Bad: allocates a new string on every iteration
func badConcat(strs []string) string {
    result := ""
    for _, s := range strs {
        result += s // allocates a new string each time
    }
    return result
}

// Good: use strings.Builder
func goodConcat(strs []string) string {
    var builder strings.Builder
    for _, s := range strs {
        builder.WriteString(s)
    }
    return builder.String()
}

// Preallocate slices

// Bad
func badSlice() []int {
    var result []int
    for i := 0; i < 1000; i++ {
        result = append(result, i) // multiple reallocations as the slice grows
    }
    return result
}

// Good
func goodSlice() []int {
    result := make([]int, 0, 1000) // preallocate capacity
    for i := 0; i < 1000; i++ {
        result = append(result, i)
    }
    return result
}

// Object pooling
var bufferPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 1024)
    },
}

func processWithPool(data []byte) {
    buf := bufferPool.Get().([]byte)
    defer bufferPool.Put(buf)

    // Use buffer
    copy(buf, data)
    // Process...
}

// Reduce allocations in hot paths
type Stats struct {
    count int64
    sum   int64
}

// Bad: returns a new struct (allocates if the value escapes)
func (s Stats) Add(value int64) Stats {
    return Stats{
        count: s.count + 1,
        sum:   s.sum + value,
    }
}

// Good: modify in place (no allocation)
func (s *Stats) AddInPlace(value int64) {
    s.count++
    s.sum += value
}

// String interning for repeated strings
var internedStrings = make(map[string]string)
var internMu sync.RWMutex

func intern(s string) string {
    internMu.RLock()
    if interned, ok := internedStrings[s]; ok {
        internMu.RUnlock()
        return interned
    }
    internMu.RUnlock()

    internMu.Lock()
    // Re-check under the write lock so concurrent callers return the same copy
    if interned, ok := internedStrings[s]; ok {
        internMu.Unlock()
        return interned
    }
    internedStrings[s] = s
    internMu.Unlock()
    return s
}
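Claims like these are worth verifying rather than assuming. A pair of allocation-reporting benchmarks is enough; this sketch reuses badConcat and goodConcat from above with an illustrative input.

import "testing"

var words = []string{"alpha", "beta", "gamma", "delta", "epsilon"}

func BenchmarkBadConcat(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = badConcat(words)
    }
}

func BenchmarkGoodConcat(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = goodConcat(words)
    }
}

// Run with: go test -bench=Concat -benchmem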
Concurrency Optimization
Optimize concurrent code for better performance.
// Optimal goroutine pool size
func optimalWorkers() int {
    return runtime.NumCPU()
}

// Batching for reduced contention
type BatchProcessor struct {
    batch     []Item
    batchSize int
    mu        sync.Mutex
    process   func([]Item)
}

func (b *BatchProcessor) Add(item Item) {
    b.mu.Lock()
    b.batch = append(b.batch, item)
    if len(b.batch) >= b.batchSize {
        batch := b.batch
        b.batch = make([]Item, 0, b.batchSize)
        b.mu.Unlock()
        go b.process(batch)
    } else {
        b.mu.Unlock()
    }
}

// Lock-free data structures
type LockFreeCounter struct {
    value int64
}

func (c *LockFreeCounter) Increment() {
    atomic.AddInt64(&c.value, 1)
}

func (c *LockFreeCounter) Get() int64 {
    return atomic.LoadInt64(&c.value)
}

// Channel optimization

// Bad: unbuffered channel causes blocking on every send
func badChannel() {
    ch := make(chan int)
    go produce(ch)
    consume(ch)
}

// Good: buffered channel reduces contention
func goodChannel() {
    ch := make(chan int, 100)
    go produce(ch)
    consume(ch)
}

// Reduce lock granularity
type ShardedMap struct {
    shards [16]shard
}

type shard struct {
    mu    sync.RWMutex
    items map[string]interface{}
}

// NewShardedMap initializes every shard's map (a nil map would panic on Set).
func NewShardedMap() *ShardedMap {
    m := &ShardedMap{}
    for i := range m.shards {
        m.shards[i].items = make(map[string]interface{})
    }
    return m
}

// fnv32 hashes the key with FNV-1a (hash/fnv) to pick a shard.
func fnv32(key string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32()
}

func (m *ShardedMap) getShard(key string) *shard {
    hash := fnv32(key)
    return &m.shards[hash%16]
}

func (m *ShardedMap) Set(key string, value interface{}) {
    shard := m.getShard(key)
    shard.mu.Lock()
    shard.items[key] = value
    shard.mu.Unlock()
}

func (m *ShardedMap) Get(key string) (interface{}, bool) {
    shard := m.getShard(key)
    shard.mu.RLock()
    val, ok := shard.items[key]
    shard.mu.RUnlock()
    return val, ok
}
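optimalWorkers above only chooses a pool size. The sketch below (with a hypothetical handle function and int jobs for illustration) shows one common way to run that many workers over a shared job channel.

import (
    "runtime"
    "sync"
)

// runPool fans jobs out to runtime.NumCPU() workers and waits for them all.
func runPool(jobs []int, handle func(int)) {
    workers := runtime.NumCPU()
    jobCh := make(chan int, len(jobs)) // buffered: the producer never blocks

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobCh {
                handle(job)
            }
        }()
    }

    for _, j := range jobs {
        jobCh <- j
    }
    close(jobCh)
    wg.Wait()
}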
Benchmarking
Write effective benchmarks to measure performance improvements.
// Basic benchmark
func BenchmarkFunction(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Code to benchmark
        result := expensiveOperation()
        _ = result // Prevent optimization
    }
}

// Benchmark with setup
func BenchmarkWithSetup(b *testing.B) {
    // Setup code (not timed)
    data := generateTestData()
    b.ResetTimer() // Reset timer after setup

    for i := 0; i < b.N; i++ {
        processData(data)
    }
}

// Parallel benchmark
func BenchmarkParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // Concurrent operation
            doWork()
        }
    })
}

// Sub-benchmarks
func BenchmarkSizes(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
            data := make([]int, size)
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                sort.Ints(data)
            }
        })
    }
}

// Memory allocation benchmark
func BenchmarkAllocation(b *testing.B) {
    b.ReportAllocs() // Report allocation statistics
    for i := 0; i < b.N; i++ {
        s := make([]int, 100)
        _ = s
    }
}

// Custom metrics
func BenchmarkCustom(b *testing.B) {
    var totalBytes int64
    for i := 0; i < b.N; i++ {
        data := processFile()
        totalBytes += int64(len(data))
    }
    b.SetBytes(totalBytes / int64(b.N))
}

// Run benchmarks:
// go test -bench=.
// go test -bench=. -benchmem
// go test -bench=. -benchtime=10s
// go test -bench=. -cpu=1,2,4,8
Optimization Techniques
Specific techniques for optimizing Go code.
// Bounds check elimination
func sum(nums []int) int {
    if len(nums) == 0 {
        return 0
    }
    total := 0
    // Compiler can eliminate bounds checks
    for i := range nums {
        total += nums[i]
    }
    return total
}

// Escape analysis optimization
// go build -gcflags="-m"

// Stack allocation (doesn't escape)
func stackAlloc() int {
    x := 42 // Allocated on stack
    return x
}

// Heap allocation (escapes)
func heapAlloc() *int {
    x := 42 // Escapes to heap
    return &x
}

// Inlining
// Small functions are inlined automatically
func add(a, b int) int { // Will be inlined
    return a + b
}

// Prevent inlining with //go:noinline
//go:noinline
func noInline(x int) int {
    return x * 2
}

// Fast paths for common cases
func processValue(v interface{}) string {
    // Fast path for common types
    switch val := v.(type) {
    case string:
        return val
    case int:
        return strconv.Itoa(val)
    default:
        // Slow path for other types
        return fmt.Sprintf("%v", v)
    }
}

// Table-driven alternatives
var dayNames = [7]string{
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
}

func getDayName(day int) string {
    if day < 0 || day > 6 {
        return "Invalid"
    }
    return dayNames[day] // Faster than switch
}

// SIMD-friendly code
func vectorAdd(a, b []float64) []float64 {
    n := len(a)
    result := make([]float64, n)

    // Process in chunks for potential SIMD
    for i := 0; i < n-3; i += 4 {
        result[i] = a[i] + b[i]
        result[i+1] = a[i+1] + b[i+1]
        result[i+2] = a[i+2] + b[i+2]
        result[i+3] = a[i+3] + b[i+3]
    }
    // Handle remainder
    for i := n - n%4; i < n; i++ {
        result[i] = a[i] + b[i]
    }
    return result
}
Advanced Optimization Patterns
Advanced techniques for achieving maximum performance in Go applications.
// (Imports used below: unsafe, sync, sync/atomic, math/bits, hash/fnv)

// Zero-allocation string builder using unsafe
type FastBuilder struct {
    buf []byte
}

func (fb *FastBuilder) WriteString(s string) {
    fb.buf = append(fb.buf, s...)
}

func (fb *FastBuilder) String() string {
    // Reinterprets the byte slice header as a string header without copying.
    return *(*string)(unsafe.Pointer(&fb.buf))
}

// Memory pool for hot path allocations
type BufferPool struct {
    pool sync.Pool
}

func NewBufferPool() *BufferPool {
    return &BufferPool{
        pool: sync.Pool{
            New: func() interface{} {
                return make([]byte, 0, 1024)
            },
        },
    }
}

func (bp *BufferPool) Get() []byte {
    return bp.pool.Get().([]byte)[:0]
}

func (bp *BufferPool) Put(buf []byte) {
    if cap(buf) <= 16384 { // Prevent memory leaks from oversized buffers
        bp.pool.Put(buf)
    }
}

// Lock-free counter using atomic operations
type AtomicCounter struct {
    value int64
}

func (ac *AtomicCounter) Increment() int64 {
    return atomic.AddInt64(&ac.value, 1)
}

func (ac *AtomicCounter) Get() int64 {
    return atomic.LoadInt64(&ac.value)
}

// High-performance hash map with linear probing
type FastMap struct {
    keys   []string
    values []interface{}
    mask   int
}

func NewFastMap(size int) *FastMap {
    // Round up to next power of 2
    size = int(1 << uint(64-bits.LeadingZeros(uint(size-1))))
    return &FastMap{
        keys:   make([]string, size),
        values: make([]interface{}, size),
        mask:   size - 1,
    }
}

func (fm *FastMap) hash(key string) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) & fm.mask
}
Performance Case Studies
Real-World Optimization Examples
Learn from actual performance improvements achieved through systematic profiling and optimization.
Before: Slow JSON Processing
// Inefficient: Multiple passes, allocations
func ProcessJSONSlow(data []byte) ([]Result, error) {
    var items []map[string]interface{}
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }

    results := []Result{}
    for _, item := range items {
        if name, ok := item["name"].(string); ok {
            results = append(results, Result{Name: name})
        }
    }
    return results, nil
}
Performance: 850ms for 100k records
After: Optimized Processing
// Efficient: Direct unmarshaling, pre-allocated
type JSONItem struct {
    Name string `json:"name"`
}

func ProcessJSONFast(data []byte) ([]Result, error) {
    var items []JSONItem
    if err := json.Unmarshal(data, &items); err != nil {
        return nil, err
    }

    results := make([]Result, 0, len(items))
    for _, item := range items {
        results = append(results, Result{Name: item.Name})
    }
    return results, nil
}
Performance: 95ms for 100k records (9x faster)
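Numbers like these are normally backed by a benchmark. The sketch below assumes the ProcessJSONSlow/ProcessJSONFast functions and the Result type from above, and generates an illustrative input payload.

import (
    "fmt"
    "strings"
    "testing"
)

// makeJSON builds an input payload of n records for the benchmarks.
func makeJSON(n int) []byte {
    var sb strings.Builder
    sb.WriteString("[")
    for i := 0; i < n; i++ {
        if i > 0 {
            sb.WriteString(",")
        }
        fmt.Fprintf(&sb, `{"name":"item-%d"}`, i)
    }
    sb.WriteString("]")
    return []byte(sb.String())
}

func BenchmarkProcessJSONSlow(b *testing.B) {
    data := makeJSON(100000)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := ProcessJSONSlow(data); err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkProcessJSONFast(b *testing.B) {
    data := makeJSON(100000)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := ProcessJSONFast(data); err != nil {
            b.Fatal(err)
        }
    }
}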
| Optimization | Before | After | Improvement | Key Technique |
|---|---|---|---|---|
| HTTP Handler | 45k req/s | 165k req/s | 3.7x | Buffer pooling, reduced allocations |
| Database Queries | 2.8s | 280ms | 10x | Connection pooling, prepared statements |
| CSV Processing | 4.5s | 520ms | 8.7x | Memory mapping, zero-copy parsing |
| Cache Lookups | 150μs | 12μs | 12.5x | Lock-free data structures |
Performance Best Practices
Measurement Strategy
- Profile production workloads
- Use realistic data patterns
- Measure multiple dimensions
- Consider percentiles over averages (see the sketch after this list)
- Validate in production environment
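On the percentile point: a small helper like the hypothetical latencyPercentile below reports tail latency, which averages routinely hide.

import (
    "sort"
    "time"
)

// latencyPercentile returns the p-th percentile (0 < p <= 100) of the
// recorded latencies. The input slice is sorted in place.
func latencyPercentile(latencies []time.Duration, p float64) time.Duration {
    if len(latencies) == 0 {
        return 0
    }
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    idx := int(float64(len(latencies))*p/100.0) - 1
    if idx < 0 {
        idx = 0
    }
    return latencies[idx]
}

// Typical use: report p50, p95, and p99 rather than the mean.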
Optimization Priorities
- Algorithm complexity improvements
- Hot path memory allocation reduction
- I/O and concurrency optimization
- Data structure selection
- Micro-optimizations (last resort)
Go-Specific Techniques
- Leverage escape analysis knowledge
- Use sync.Pool for frequent allocations
- Understand GC tuning parameters (see the sketch after this list)
- Profile goroutine scheduling
- Consider unsafe for critical paths
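On GC tuning: GOGC and GOMEMLIMIT (Go 1.19+) are the two main knobs, settable as environment variables or at runtime via runtime/debug. The snippet below is a sketch of the programmatic form; the values shown are illustrative, not recommendations.

import "runtime/debug"

func tuneGC() {
    // GOGC: how much the heap may grow (as a percentage of live data)
    // before the next GC cycle. Higher values trade memory for fewer cycles.
    debug.SetGCPercent(200) // equivalent to GOGC=200

    // GOMEMLIMIT: a soft ceiling on the runtime's total memory use (Go 1.19+).
    // The GC runs more aggressively as the limit is approached.
    debug.SetMemoryLimit(4 << 30) // equivalent to GOMEMLIMIT=4GiB
}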
Quick Performance Wins
- Pre-allocate slices: make([]T, 0, expectedSize)
- Use strings.Builder: For string concatenation
- Buffer I/O: bufio.Reader/Writer wrappers
- Reuse objects: sync.Pool for temporary allocations
- Choose strconv: Over fmt for conversions (see the sketch after this list)
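Two of these wins in code form; a sketch with illustrative helper names (formatID, writeLines).

import (
    "bufio"
    "os"
    "strconv"
)

// strconv skips the reflection and formatting machinery inside fmt.
func formatID(id int) string {
    return strconv.Itoa(id) // faster than fmt.Sprintf("%d", id)
}

// writeLines wraps the file in a bufio.Writer so each WriteString call
// hits an in-memory buffer instead of issuing a syscall.
func writeLines(path string, lines []string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    w := bufio.NewWriter(f)
    for _, line := range lines {
        if _, err := w.WriteString(line + "\n"); err != nil {
            return err
        }
    }
    return w.Flush() // buffered data is only written out on Flush
}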
Performance Anti-Patterns
- Premature optimization: Optimizing without measuring first
- Micro-benchmark tunnel vision: Ignoring real-world workload patterns
- Memory vs CPU trade-off blindness: Optimizing one resource while ignoring the others
- Optimizing the wrong bottleneck: Tuning code that profiling never identified as hot
- Readability sacrifice: Accepting complex code for marginal gains
Performance Challenges
Hands-On Performance Projects
Master performance optimization through these challenging real-world scenarios:
1. High-Throughput Logger (Beginner, Lock-Free)
Design a logging system handling 1M+ entries/second with minimal GC pressure and consistent latency.
2. Memory-Efficient Cache (Intermediate, Concurrent)
Build an LRU cache supporting millions of entries with < 50 bytes of overhead per entry and concurrent access.
3. Ultra-Fast JSON Parser (Advanced, Zero-Copy)
Create a domain-specific JSON parser 10x faster than encoding/json for your data schema.
4. Distributed Load Balancer (Expert, Network)
Implement a load balancer handling 100k+ connections with < 1ms routing latency and minimal memory per connection.