Performance Fundamentals
Performance in Go
Go performance optimization focuses on three key areas: CPU efficiency, memory management, and concurrency. Understanding Go's runtime behavior is crucial for building high-performance applications.
Measurement First
Always measure performance before optimizing. Use profiling tools to identify real bottlenecks rather than guessing where problems might be.
Runtime Behavior
Understanding Go's garbage collector, scheduler, and memory allocator behavior is key to writing performant applications.
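As a concrete example, the runtime exposes many of these counters and settings directly; a short illustrative sketch (assuming the fmt and runtime packages) that prints scheduler and GC state:
// Inspect scheduler and GC state at runtime
func printRuntimeStats() {
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // how many OS threads run Go code at once
    fmt.Println("goroutines:", runtime.NumGoroutine())

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("heap alloc: %d bytes, GC cycles: %d\n", m.HeapAlloc, m.NumGC)
}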
Optimization Strategy
Focus on algorithmic improvements first, then micro-optimizations. Maintain code readability while improving performance.
Profiling Go Applications
Use Go's built-in profiling tools to identify performance bottlenecks.
CPU Profiling
package main
import (
"flag"
"log"
"os"
"runtime/pprof"
)
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")
func main() {
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal(err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
log.Fatal(err)
}
defer pprof.StopCPUProfile()
}
// Your application code here
doWork()
}
// Analyze profile:
// go tool pprof cpu.prof
// (pprof) top
// (pprof) list functionName
// (pprof) web
// Memory profiling via net/http/pprof
// (also needs "log" and "net/http"; the blank import registers the /debug/pprof handlers)
import (
    "log"
    "net/http"
    _ "net/http/pprof"
)

func setupProfiling() {
    // Serve the pprof endpoints on a separate, localhost-only port
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}
// Access profiles:
// go tool pprof http://localhost:6060/debug/pprof/heap
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof http://localhost:6060/debug/pprof/goroutine
Memory Optimization
Techniques for reducing memory allocation and garbage collection pressure.
// Avoid unnecessary allocations
// Bad: Creates new slice each call
func badConcat(strs []string) string {
result := ""
for _, s := range strs {
result += s // Allocates new string each time
}
return result
}
// Good: Use strings.Builder
func goodConcat(strs []string) string {
var builder strings.Builder
for _, s := range strs {
builder.WriteString(s)
}
return builder.String()
}
// Preallocate slices
// Bad
func badSlice() []int {
var result []int
for i := 0; i < 1000; i++ {
result = append(result, i) // Multiple reallocations
}
return result
}
// Good
func goodSlice() []int {
result := make([]int, 0, 1000) // Preallocate capacity
for i := 0; i < 1000; i++ {
result = append(result, i)
}
return result
}
// Object pooling
var bufferPool = sync.Pool{
New: func() interface{} {
return make([]byte, 1024)
},
}
func processWithPool(data []byte) {
buf := bufferPool.Get().([]byte)
defer bufferPool.Put(buf)
// Use buffer
copy(buf, data)
// Process...
}
// Reduce allocations in hot paths
type Stats struct {
count int64
sum int64
}
// Bad: Returns new struct (allocation)
func (s Stats) Add(value int64) Stats {
return Stats{
count: s.count + 1,
sum: s.sum + value,
}
}
// Good: Modify in place (no allocation)
func (s *Stats) AddInPlace(value int64) {
s.count++
s.sum += value
}
// String interning for repeated strings
var internedStrings = make(map[string]string)
var internMu sync.RWMutex
func intern(s string) string {
    internMu.RLock()
    if interned, ok := internedStrings[s]; ok {
        internMu.RUnlock()
        return interned
    }
    internMu.RUnlock()
    internMu.Lock()
    defer internMu.Unlock()
    // Re-check under the write lock: another goroutine may have interned s
    // between the RUnlock and Lock above.
    if interned, ok := internedStrings[s]; ok {
        return interned
    }
    internedStrings[s] = s
    return s
}
Concurrency Optimization
Optimize concurrent code for better performance.
// A sensible worker-pool size for CPU-bound work;
// I/O-bound workloads often benefit from more workers than CPUs
func optimalWorkers() int {
    return runtime.NumCPU()
}
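A typical use is capping a worker pool at that size. A minimal sketch (runPool and the handle callback are hypothetical names; assumes a sync import, and the caller closes the jobs channel when done):
// Fan work out to runtime.NumCPU() workers; for CPU-bound jobs, more
// goroutines than CPUs mostly adds scheduling overhead.
func runPool(jobs <-chan int, handle func(int)) {
    var wg sync.WaitGroup
    for i := 0; i < optimalWorkers(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                handle(j)
            }
        }()
    }
    wg.Wait()
}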
// Batching for reduced contention
type BatchProcessor struct {
batch []Item
batchSize int
mu sync.Mutex
process func([]Item)
}
func (b *BatchProcessor) Add(item Item) {
b.mu.Lock()
b.batch = append(b.batch, item)
if len(b.batch) >= b.batchSize {
batch := b.batch
b.batch = make([]Item, 0, b.batchSize)
b.mu.Unlock()
go b.process(batch)
} else {
b.mu.Unlock()
}
}
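Items left below batchSize would otherwise sit in the buffer indefinitely; a small illustrative Flush method (not part of the original type) handles shutdown:
// Flush processes any remaining items synchronously; call it during shutdown.
func (b *BatchProcessor) Flush() {
    b.mu.Lock()
    batch := b.batch
    b.batch = make([]Item, 0, b.batchSize)
    b.mu.Unlock()
    if len(batch) > 0 {
        b.process(batch)
    }
}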
// Lock-free data structures
type LockFreeCounter struct {
value int64
}
func (c *LockFreeCounter) Increment() {
atomic.AddInt64(&c.value, 1)
}
func (c *LockFreeCounter) Get() int64 {
return atomic.LoadInt64(&c.value)
}
// Channel optimization
// Bad: unbuffered channel makes every send block until a receiver is ready
func badChannel() {
ch := make(chan int)
go produce(ch)
consume(ch)
}
// Good: buffered channel decouples producer and consumer bursts
func goodChannel() {
ch := make(chan int, 100)
go produce(ch)
consume(ch)
}
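For completeness, minimal hypothetical implementations of the produce and consume helpers used in both snippets above:
func produce(ch chan<- int) {
    for i := 0; i < 1000; i++ {
        ch <- i
    }
    close(ch) // let the consumer's range loop finish
}

func consume(ch <-chan int) {
    for v := range ch {
        _ = v // do something with v
    }
}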
// Reduce lock granularity
type ShardedMap struct {
    shards [16]shard
}
type shard struct {
    mu    sync.RWMutex
    items map[string]interface{}
}
// NewShardedMap initializes each shard's map; without this, Set would panic on a nil map.
func NewShardedMap() *ShardedMap {
    m := &ShardedMap{}
    for i := range m.shards {
        m.shards[i].items = make(map[string]interface{})
    }
    return m
}
// fnv32 is a small FNV-1a hash used only to pick a shard.
func fnv32(key string) uint32 {
    h := uint32(2166136261)
    for i := 0; i < len(key); i++ {
        h ^= uint32(key[i])
        h *= 16777619
    }
    return h
}
func (m *ShardedMap) getShard(key string) *shard {
    hash := fnv32(key)
    return &m.shards[hash%16]
}
func (m *ShardedMap) Set(key string, value interface{}) {
shard := m.getShard(key)
shard.mu.Lock()
shard.items[key] = value
shard.mu.Unlock()
}
func (m *ShardedMap) Get(key string) (interface{}, bool) {
shard := m.getShard(key)
shard.mu.RLock()
val, ok := shard.items[key]
shard.mu.RUnlock()
return val, ok
}
Benchmarking
Write effective benchmarks to measure performance improvements.
// Basic benchmark
func BenchmarkFunction(b *testing.B) {
for i := 0; i < b.N; i++ {
// Code to benchmark
result := expensiveOperation()
_ = result // Prevent optimization
}
}
// Benchmark with setup
func BenchmarkWithSetup(b *testing.B) {
// Setup code (not timed)
data := generateTestData()
b.ResetTimer() // Reset timer after setup
for i := 0; i < b.N; i++ {
processData(data)
}
}
// Parallel benchmark
func BenchmarkParallel(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
// Concurrent operation
doWork()
}
})
}
// Sub-benchmarks
func BenchmarkSizes(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
            data := make([]int, size)
            for i := range data {
                data[i] = size - i // descending values; all-zero data would make the sort trivial
            }
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                // Sort a copy so every iteration sees unsorted input
                // (the copy cost is included in the timing)
                tmp := append([]int(nil), data...)
                sort.Ints(tmp)
            }
        })
    }
}
// Memory allocation benchmark
func BenchmarkAllocation(b *testing.B) {
b.ReportAllocs() // Report allocation statistics
for i := 0; i < b.N; i++ {
s := make([]int, 100)
_ = s
}
}
// Custom metrics
func BenchmarkCustom(b *testing.B) {
var totalBytes int64
for i := 0; i < b.N; i++ {
data := processFile()
totalBytes += int64(len(data))
}
b.SetBytes(totalBytes / int64(b.N))
}
// Run benchmarks:
// go test -bench=.
// go test -bench=. -benchmem
// go test -bench=. -benchtime=10s
// go test -bench=. -cpu=1,2,4,8
Optimization Techniques
Specific techniques for optimizing Go code.
// Bounds check elimination
func sum(nums []int) int {
if len(nums) == 0 {
return 0
}
total := 0
// Compiler can eliminate bounds checks
for i := range nums {
total += nums[i]
}
return total
}
// Escape analysis optimization
// go build -gcflags="-m"
// Stack allocation (doesn't escape)
func stackAlloc() int {
x := 42 // Allocated on stack
return x
}
// Heap allocation (escapes)
func heapAlloc() *int {
x := 42 // Escapes to heap
return &x
}
// Inlining
// Small functions are inlined automatically
func add(a, b int) int { // Will be inlined
return a + b
}
// Prevent inlining with //go:noinline
//go:noinline
func noInline(x int) int {
return x * 2
}
// Fast paths for common cases
func processValue(v interface{}) string {
// Fast path for common types
switch val := v.(type) {
case string:
return val
case int:
return strconv.Itoa(val)
default:
// Slow path for other types
return fmt.Sprintf("%v", v)
}
}
// Table-driven alternatives
var dayNames = [7]string{
"Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
}
func getDayName(day int) string {
if day < 0 || day > 6 {
return "Invalid"
}
return dayNames[day] // Faster than switch
}
// SIMD-friendly code
func vectorAdd(a, b []float64) []float64 {
n := len(a)
result := make([]float64, n)
// Process in chunks for potential SIMD
for i := 0; i < n-3; i += 4 {
result[i] = a[i] + b[i]
result[i+1] = a[i+1] + b[i+1]
result[i+2] = a[i+2] + b[i+2]
result[i+3] = a[i+3] + b[i+3]
}
// Handle remainder
for i := n - n%4; i < n; i++ {
result[i] = a[i] + b[i]
}
return result
}
Advanced Optimization Patterns
Advanced techniques for achieving maximum performance in Go applications.
// Zero-allocation string builder using unsafe
type FastBuilder struct {
buf []byte
}
func (fb *FastBuilder) WriteString(s string) {
fb.buf = append(fb.buf, s...)
}
func (fb *FastBuilder) String() string {
    // Reinterprets the slice header as a string header: no copy is made,
    // so the builder must not be written to after calling String.
    return *(*string)(unsafe.Pointer(&fb.buf))
}
// Memory pool for hot path allocations
type BufferPool struct {
pool sync.Pool
}
func NewBufferPool() *BufferPool {
return &BufferPool{
pool: sync.Pool{
New: func() interface{} {
return make([]byte, 0, 1024)
},
},
}
}
func (bp *BufferPool) Get() []byte {
return bp.pool.Get().([]byte)[:0]
}
func (bp *BufferPool) Put(buf []byte) {
if cap(buf) <= 16384 { // don't retain oversized buffers in the pool
bp.pool.Put(buf)
}
}
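Usage follows a get/append/put cycle; a short hypothetical example (assumes a strconv import):
func encodeWithPool(bp *BufferPool, values []int) string {
    buf := bp.Get() // length 0, capacity reused from a previous call
    for _, v := range values {
        buf = strconv.AppendInt(buf, int64(v), 10)
        buf = append(buf, ',')
    }
    out := string(buf) // copies, so the buffer can be returned safely
    bp.Put(buf)
    return out
}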
// Lock-free counter using atomic operations
type AtomicCounter struct {
value int64
}
func (ac *AtomicCounter) Increment() int64 {
return atomic.AddInt64(&ac.value, 1)
}
func (ac *AtomicCounter) Get() int64 {
return atomic.LoadInt64(&ac.value)
}
// High-performance hash map with linear probing
// (requires the "hash/fnv" and "math/bits" imports)
type FastMap struct {
    keys []string
    values []interface{}
    mask int
}
func NewFastMap(size int) *FastMap {
    // Round size up to the next power of two so that masking works as a modulus
    size = int(1 << uint(bits.UintSize-bits.LeadingZeros(uint(size-1))))
return &FastMap{
keys: make([]string, size),
values: make([]interface{}, size),
mask: size - 1,
}
}
func (fm *FastMap) hash(key string) int {
h := fnv.New32a()
h.Write([]byte(key))
return int(h.Sum32()) & fm.mask
}
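The type above only covers storage and hashing; a sketch of Set and Get using linear probing follows (illustrative only: the empty string marks a free slot, the table never resizes, and there is no delete, so it assumes the map never fills up):
// Set inserts or updates a key, probing linearly from the hashed slot.
func (fm *FastMap) Set(key string, value interface{}) {
    i := fm.hash(key)
    for {
        if fm.keys[i] == "" || fm.keys[i] == key {
            fm.keys[i] = key
            fm.values[i] = value
            return
        }
        i = (i + 1) & fm.mask // wrap around to the next slot
    }
}

// Get walks the same probe sequence until it finds the key or an empty slot.
func (fm *FastMap) Get(key string) (interface{}, bool) {
    i := fm.hash(key)
    for {
        if fm.keys[i] == key {
            return fm.values[i], true
        }
        if fm.keys[i] == "" {
            return nil, false
        }
        i = (i + 1) & fm.mask
    }
}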
Performance Case Studies
Real-World Optimization Examples
Learn from actual performance improvements achieved through systematic profiling and optimization.
Before: Slow JSON Processing
// Inefficient: Multiple passes, allocations
func ProcessJSONSlow(data []byte) ([]Result, error) {
var items []map[string]interface{}
if err := json.Unmarshal(data, &items); err != nil {
return nil, err
}
results := []Result{}
for _, item := range items {
if name, ok := item["name"].(string); ok {
results = append(results, Result{Name: name})
}
}
return results, nil
}
Performance: 850ms for 100k records
After: Optimized Processing
// Efficient: Direct unmarshaling, pre-allocated
type JSONItem struct {
Name string `json:"name"`
}
func ProcessJSONFast(data []byte) ([]Result, error) {
var items []JSONItem
if err := json.Unmarshal(data, &items); err != nil {
return nil, err
}
results := make([]Result, 0, len(items))
for _, item := range items {
results = append(results, Result{Name: item.Name})
}
return results, nil
}
Performance: 95ms for 100k records (9x faster)
| Optimization | Before | After | Improvement | Key Technique |
|---|---|---|---|---|
| HTTP Handler | 45k req/s | 165k req/s | 3.7x | Buffer pooling, reduced allocations |
| Database Queries | 2.8s | 280ms | 10x | Connection pooling, prepared statements |
| CSV Processing | 4.5s | 520ms | 8.7x | Memory mapping, zero-copy parsing |
| Cache Lookups | 150μs | 12μs | 12.5x | Lock-free data structures |
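The database row above credits connection pooling and prepared statements; with the standard database/sql package that pattern looks roughly like the sketch below (driver name, DSN, table, and query are placeholders; assumes database/sql and time imports plus a registered driver):
// setupDB configures the connection pool once at startup (sizes are illustrative).
func setupDB(dsn string) (*sql.DB, error) {
    db, err := sql.Open("postgres", dsn) // driver and DSN are placeholders
    if err != nil {
        return nil, err
    }
    db.SetMaxOpenConns(50)
    db.SetMaxIdleConns(50)
    db.SetConnMaxIdleTime(5 * time.Minute)
    return db, nil
}

// Prepare once at startup, reuse for every request instead of re-parsing the SQL:
//   userByID, _ = db.Prepare("SELECT name FROM users WHERE id = $1")
var userByID *sql.Stmt

func lookupUser(id int) (string, error) {
    var name string
    err := userByID.QueryRow(id).Scan(&name)
    return name, err
}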
Performance Best Practices
Measurement Strategy
- Profile production workloads
- Use realistic data patterns
- Measure multiple dimensions
- Consider percentiles over averages (see the sketch after this list)
- Validate in production environment
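Percentiles capture the tail latency that averages hide. A minimal sketch for computing them from collected samples (the percentile helper is hypothetical, not a library API; assumes sort and time imports):
// percentile returns the p-th percentile (0-100) of the given latencies.
func percentile(latencies []time.Duration, p float64) time.Duration {
    if len(latencies) == 0 {
        return 0
    }
    sorted := append([]time.Duration(nil), latencies...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    idx := int(float64(len(sorted)-1) * p / 100.0)
    return sorted[idx]
}

// Usage: p50 := percentile(samples, 50); p99 := percentile(samples, 99)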
Optimization Priorities
- Algorithm complexity improvements
- Hot path memory allocation reduction
- I/O and concurrency optimization
- Data structure selection
- Micro-optimizations (last resort)
Go-Specific Techniques
- Leverage escape analysis knowledge
- Use sync.Pool for frequent allocations
- Understand GC tuning parameters (see the sketch after this list)
- Profile goroutine scheduling
- Consider unsafe for critical paths
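For instance, the collector's two main knobs can be set from code as well as through the GOGC and GOMEMLIMIT environment variables; a minimal sketch (the memory limit requires Go 1.19+, and the values here are illustrative):
import "runtime/debug"

// tuneGC trades heap size for GC frequency and caps total memory use.
func tuneGC() {
    debug.SetGCPercent(200)       // let the heap grow to roughly 3x live data before collecting (default 100)
    debug.SetMemoryLimit(4 << 30) // soft runtime-wide limit of 4 GiB
}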
Quick Performance Wins
- Pre-allocate slices: make([]T, 0, expectedSize)
- Use strings.Builder: For string concatenation
- Buffer I/O: bufio.Reader/Writer wrappers
- Reuse objects: sync.Pool for temporary allocations
- Choose strconv: Over fmt for conversions (see the sketch after this list)
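A small illustrative sketch combining buffered I/O with strconv (assumes bufio, io, and strconv imports):
// Buffered output plus strconv instead of fmt for the conversion
func writeInts(w io.Writer, values []int) error {
    bw := bufio.NewWriter(w)
    for _, v := range values {
        if _, err := bw.WriteString(strconv.Itoa(v)); err != nil {
            return err
        }
        if err := bw.WriteByte('\n'); err != nil {
            return err
        }
    }
    return bw.Flush() // don't forget to flush buffered writers
}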
Performance Anti-Patterns
- Premature optimization: Optimizing before measuring anything
- Micro-benchmark tunnel vision: Ignoring real-world workload patterns
- Memory vs. CPU trade-off blindness: Improving one resource while silently regressing another
- Optimizing the wrong bottleneck: Spending effort outside the proven hot paths
- Readability sacrifice: Complex code for marginal gains
Performance Challenges
Hands-On Performance Projects
Master performance optimization through these challenging real-world scenarios:
1. High-Throughput Logger (Beginner, Lock-Free)
Design a logging system handling 1M+ entries/second with minimal GC pressure and consistent latency.
2. Memory-Efficient Cache (Intermediate, Concurrent)
Build an LRU cache supporting millions of entries with < 50 bytes overhead per entry and concurrent access.
3. Ultra-Fast JSON Parser (Advanced, Zero-Copy)
Create a domain-specific JSON parser 10x faster than encoding/json for your data schema.
4. Distributed Load Balancer (Expert, Network)
Implement a load balancer handling 100k+ connections with < 1ms routing latency and minimal memory per connection.