Strings and Runes in Go

📚 Comprehensive Guide ⏱️ 30 min read 🎯 Intermediate Level

Understanding Strings in Go

String Internals

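Strings in Go are immutable sequences of bytes, typically representing UTF-8 encoded text. Under the hood, a string is a small header containing a pointer to the underlying bytes and a length. Copying or passing a string copies only this header, which keeps strings memory-efficient. Because the bytes never change, multiple goroutines can read the same string without synchronization; assigning to a shared string variable still needs the usual protection.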

String Structure

  • Immutable: Once created, string content cannot be changed
  • UTF-8 by default: Go source code is UTF-8, so string literals are UTF-8 (a string value can still hold arbitrary bytes)
  • Slice-like: Strings support slicing with byte indices
  • Comparable: Strings can be compared with == and < operators

// String internals (conceptual)
type stringStruct struct {
    ptr *byte  // Pointer to underlying bytes
    len int    // Length in bytes (not runes!)
}
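
The slice-like and comparable properties listed above can be sketched with a few lines of code. The helper name firstBytes is an assumption made for illustration, and it assumes n is not negative. Slice indices are byte offsets, so the caller is responsible for not cutting through a multi-byte character.

```go
package main

import "fmt"

// firstBytes (hypothetical helper) returns the first n bytes of s, or s itself
// when s is shorter. The result shares memory with s; no bytes are copied.
// It assumes n >= 0.
func firstBytes(s string, n int) string {
	if n > len(s) {
		return s
	}
	return s[:n]
}

func main() {
	s := "Hello, 世界"
	fmt.Println(len(s))             // 13
	fmt.Println(firstBytes(s, 5))   // Hello
	fmt.Println(s[7:10])            // 世
	fmt.Println("apple" < "banana") // true
}
```

Comparison with < is byte-wise, which matches code point order for valid UTF-8 but is not a locale-aware collation.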

String Basics

Creating and Manipulating Strings

Go provides multiple ways to create strings, from literals to conversions. Understanding the difference between byte length and rune count is crucial for correct string handling.

// String literals
s1 := "Hello, World!"           // Interpreted string
s2 := `Line 1
Line 2`                          // Raw string (preserves newlines)

// Escape sequences in interpreted strings
escaped := "Tab:\t Quote:\" Newline:\n Unicode:\u4e16"

// String concatenation
greeting := "Hello" + ", " + "World"
repeated := strings.Repeat("Go", 3)  // "GoGoGo"

// String length (bytes vs runes)
ascii := "Hello"
chinese := "世界"
emoji := "Hello 👋"

fmt.Println(len(ascii))        // 5 bytes
fmt.Println(len(chinese))      // 6 bytes (3 bytes per character)
fmt.Println(len(emoji))        // 10 bytes (emoji is 4 bytes)

fmt.Println(utf8.RuneCountInString(ascii))   // 5 runes
fmt.Println(utf8.RuneCountInString(chinese)) // 2 runes
fmt.Println(utf8.RuneCountInString(emoji))   // 7 runes

String Immutability

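Strings cannot be modified in place. Any operation that appears to modify a string actually creates a new string. This makes a shared string value safe to read from multiple goroutines, but it also means that repeated modification allocates repeatedly, so the performance implications are worth keeping in mind.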

⚠️ Immutability Implications

  • String concatenation in loops can be inefficient
  • Use strings.Builder for multiple concatenations
  • Convert to []byte for in-place modifications
  • Substring operations share underlying memory

// Strings are immutable
s := "hello"
// s[0] = 'H'  // Compile error!

// Create new string instead
s = "H" + s[1:]  // "Hello"

// For mutable operations, use []byte
b := []byte(s)
b[0] = 'J'
s = string(b)  // "Jello"

// Efficient string building
var builder strings.Builder
for i := 0; i < 1000; i++ {
    builder.WriteString("Go")
}
result := builder.String()  // Efficient!

Runes and Unicode

Understanding Runes

A rune is Go's type for a Unicode code point, aliased to int32. Runes allow you to work with individual Unicode characters regardless of their byte representation. This is essential for proper internationalization and text processing.

Rune Facts

  • Type rune is an alias for int32
  • Represents a single Unicode code point
  • Can represent any Unicode character (over 1 million possible values)
  • Rune literals use single quotes: 'A', '世', '👋'

// Rune basics
var r1 rune = 'A'          // 65
var r2 rune = '世'         // 19990
var r3 rune = '👋'         // 128075
var r4 rune = '\n'         // 10 (newline)
var r5 rune = '\u4e16'     // 19990 (世 in Unicode)

// Rune to string conversion
s := string(r2)             // "世"
s = string([]rune{r1, r2, r3}) // "A世👋"

// String to rune slice
text := "Hello, 世界"
runes := []rune(text)
fmt.Println(len(runes))     // 9 runes
fmt.Println(len(text))      // 13 bytes

// Iterate over runes
for i, r := range text {
    fmt.Printf("%d: %c (%U)\n", i, r, r)
}
// Output:
// 0: H (U+0048)
// 1: e (U+0065)
// 2: l (U+006C)
// 3: l (U+006C)
// 4: o (U+006F)
// 5: , (U+002C)
// 6:   (U+0020)
// 7: 世 (U+4E16)
// 10: 界 (U+754C)
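
A common use of the []rune conversion is reversing text by code point rather than by byte. The sketch below assumes a hypothetical helper named reverseRunes. It treats each code point independently, so it does not keep combining marks attached to their base characters.

```go
package main

import "fmt"

// reverseRunes (hypothetical helper) returns s with its code points in
// reverse order. Reversing the raw bytes instead would corrupt any
// multi-byte character.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	fmt.Println(reverseRunes("Hello, 世界")) // 界世 ,olleH
}
```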

UTF-8 Encoding

How UTF-8 Works

UTF-8 is a variable-length encoding where ASCII characters use 1 byte, and other characters use 2-4 bytes. Go's native UTF-8 support makes it excellent for international applications.

UTF-8 Encoding Rules

  • 1 byte: U+0000 to U+007F (ASCII)
  • 2 bytes: U+0080 to U+07FF
  • 3 bytes: U+0800 to U+FFFF (rest of the Basic Multilingual Plane, including most CJK characters)
  • 4 bytes: U+10000 to U+10FFFF (emoji, rare characters)

// UTF-8 encoding examples
func examineUTF8(s string) {
    fmt.Printf("String: %s\n", s)
    fmt.Printf("Bytes: % x\n", []byte(s))
    fmt.Printf("Byte count: %d\n", len(s))
    fmt.Printf("Rune count: %d\n", utf8.RuneCountInString(s))
    
    // Decode UTF-8 manually
    for i := 0; i < len(s); {
        r, size := utf8.DecodeRuneInString(s[i:])
        fmt.Printf("  %c: %d bytes\n", r, size)
        i += size
    }
}

examineUTF8("A")     // 1 byte:  41
examineUTF8("€")     // 3 bytes: e2 82 ac
examineUTF8("世")    // 3 bytes: e4 b8 96
examineUTF8("👋")    // 4 bytes: f0 9f 91 8b

// Validate UTF-8
valid := utf8.ValidString("Hello, 世界")  // true
invalid := []byte{0xff, 0xfe, 0xfd}
valid = utf8.Valid(invalid)               // false
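
To make the four length rules concrete, here is a minimal sketch that encodes a code point by hand and checks the result against utf8.EncodeRune. The function name encodeRune is an assumption for illustration, and the sketch assumes its input is a valid code point (not a surrogate, not negative, not above U+10FFFF). Real code should use the unicode/utf8 package.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encodeRune (hypothetical helper) applies the UTF-8 length rules by hand.
// It assumes r is a valid Unicode code point.
func encodeRune(r rune) []byte {
	switch {
	case r <= 0x7F: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | (byte(r) & 0x3F),
		}
	case r <= 0xFFFF: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | (byte(r>>12) & 0x3F),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'A', '€', '世', '👋'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		same := bytes.Equal(encodeRune(r), buf[:n])
		fmt.Printf("%c: % x (matches unicode/utf8: %t)\n", r, encodeRune(r), same)
	}
}
```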

String Operations

Common String Functions

The strings package provides a rich set of functions for string manipulation. They treat their input as UTF-8 encoded text, and any index they return is a byte offset, not a character position.

// Searching and checking
s := "Hello, Go Programming"

contains := strings.Contains(s, "Go")        // true
hasPrefix := strings.HasPrefix(s, "Hello")  // true
hasSuffix := strings.HasSuffix(s, "ing")    // true
index := strings.Index(s, "Go")             // 7
lastIndex := strings.LastIndex(s, "o")      // 12
count := strings.Count(s, "o")              // 3

// Transformation
upper := strings.ToUpper(s)                  // "HELLO, GO PROGRAMMING"
lower := strings.ToLower(s)                  // "hello, go programming"
title := strings.Title(s)                    // "Hello, Go Programming" (deprecated since Go 1.18; see golang.org/x/text/cases)
trimmed := strings.TrimSpace("  hello  ")  // "hello"

// Replacement
replaced := strings.Replace(s, "o", "0", -1)  // "Hell0, G0 Pr0gramming"
replacedN := strings.Replace(s, "o", "0", 2) // "Hell0, G0 Programming"

// Splitting and joining
parts := strings.Split(s, ", ")             // ["Hello", "Go Programming"]
fields := strings.Fields(s)                  // ["Hello,", "Go", "Programming"]
joined := strings.Join(parts, " | ")        // "Hello | Go Programming"
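
Several of these functions combine naturally. As a rough sketch, the hypothetical helper wordCounts below uses strings.Fields, strings.Trim, and strings.ToLower to count word frequencies. The punctuation set it strips is an assumption chosen for the example, not a complete tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// wordCounts (hypothetical helper) splits text on whitespace, strips a small
// assumed set of surrounding punctuation, lowercases each word, and counts it.
func wordCounts(text string) map[string]int {
	counts := make(map[string]int)
	for _, field := range strings.Fields(text) {
		word := strings.ToLower(strings.Trim(field, ".,;:!?\"'"))
		if word != "" {
			counts[word]++
		}
	}
	return counts
}

func main() {
	fmt.Println(wordCounts("Go is fun. Go, go!")) // map[fun:1 go:3 is:1]
}
```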

String Builder for Efficiency

When building strings dynamically, especially in loops, strings.Builder provides much better performance than repeated concatenation.

// Inefficient string concatenation
func inefficientConcat(words []string) string {
    result := ""
    for _, word := range words {
        result += word + " "  // Creates new string each time!
    }
    return result
}

// Efficient with strings.Builder
func efficientConcat(words []string) string {
    var builder strings.Builder
    builder.Grow(len(words) * 10)  // Pre-allocate (rough guess: about 10 bytes per word)
    
    for _, word := range words {
        builder.WriteString(word)
        builder.WriteByte(' ')
    }
    return builder.String()
}

// Builder methods
var b strings.Builder
b.WriteString("Hello")        // Write string
b.WriteByte(' ')              // Write single byte
b.WriteRune('世')             // Write rune
b.Write([]byte{'!'})         // Write byte slice
fmt.Println(b.String())       // "Hello 世!"
fmt.Println(b.Len())          // 10 bytes
fmt.Println(b.Cap())          // Capacity

String Conversions

Between Strings and Other Types

Converting between strings and other types is common in Go. Understanding the cost and semantics of these conversions is important for writing efficient code.

// String to/from []byte
s := "Hello"
bytes := []byte(s)           // Allocates new slice
s2 := string(bytes)          // Allocates new string

// String to/from []rune
runes := []rune(s)           // Useful for character-level ops
s3 := string(runes)

// Number conversions with strconv
import "strconv"

// String to numbers
i, err := strconv.Atoi("123")              // 123
i64, err := strconv.ParseInt("123", 10, 64) // base 10, 64-bit
f64, err := strconv.ParseFloat("3.14", 64)  // 3.14
b, err := strconv.ParseBool("true")         // true

// Numbers to string
s = strconv.Itoa(123)                        // "123"
s = strconv.FormatInt(123, 10)              // "123" (base 10)
s = strconv.FormatFloat(3.14, 'f', 2, 64)  // "3.14"
s = strconv.FormatBool(true)                // "true"

// Quote and unquote
quoted := strconv.Quote("Hello\nWorld")     // "\"Hello\\nWorld\""
unquoted, err := strconv.Unquote(quoted)     // "Hello\nWorld"
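
The snippet above leaves err unchecked for brevity. In real code each conversion error should be handled, and errors.Is can distinguish an out-of-range value from malformed input. The following is a minimal sketch; the helper name parsePort and its error messages are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort (hypothetical helper) parses s as a base-10 unsigned 16-bit
// integer and wraps the strconv error with some context.
func parsePort(s string) (int, error) {
	n, err := strconv.ParseUint(s, 10, 16)
	if err != nil {
		if errors.Is(err, strconv.ErrRange) {
			return 0, fmt.Errorf("port %q is out of range: %w", s, err)
		}
		return 0, fmt.Errorf("port %q is not a valid number: %w", s, err)
	}
	return int(n), nil
}

func main() {
	for _, in := range []string{"8080", "70000", "http"} {
		port, err := parsePort(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("port:", port)
	}
}
```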

Advanced String Techniques

String Interning

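String interning can reduce memory usage when dealing with many duplicate strings. The language itself has no interning syntax. Go 1.23 added the unique package to the standard library for canonicalizing comparable values, and on any version you can implement a simple interner with a map.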

// Simple string interning
type StringIntern struct {
    mu    sync.RWMutex
    table map[string]string
}

func NewStringIntern() *StringIntern {
    return &StringIntern{
        table: make(map[string]string),
    }
}

func (si *StringIntern) Intern(s string) string {
    si.mu.RLock()
    if interned, ok := si.table[s]; ok {
        si.mu.RUnlock()
        return interned
    }
    si.mu.RUnlock()
    
    si.mu.Lock()
    defer si.mu.Unlock()
    
    // Double-check after acquiring write lock
    if interned, ok := si.table[s]; ok {
        return interned
    }
    
    si.table[s] = s
    return s
}
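
One possible refinement is to store a copy made with strings.Clone (Go 1.18+), so the table never keeps a large source string alive just because a short token was sliced from it. The sketch below is a simplified, single-goroutine variant; the names pool, newPool, intern, and size are assumptions for illustration, and it is not safe for concurrent use, unlike the mutex-guarded version above.

```go
package main

import (
	"fmt"
	"strings"
)

// pool is a hypothetical single-goroutine interner.
type pool struct {
	table map[string]string
}

func newPool() *pool {
	return &pool{table: make(map[string]string)}
}

// intern returns the canonical copy of s. On first sight it stores a clone,
// so the table does not pin whatever larger string s was sliced from.
func (p *pool) intern(s string) string {
	if v, ok := p.table[s]; ok {
		return v
	}
	c := strings.Clone(s)
	p.table[c] = c
	return c
}

// size reports how many distinct strings the pool holds.
func (p *pool) size() int {
	return len(p.table)
}

func main() {
	p := newPool()
	line := "GET /a GET /b GET /a"
	for _, tok := range strings.Fields(line) {
		_ = p.intern(tok)
	}
	fmt.Println(p.size()) // 3
}
```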

Regular Expressions

Go's regexp package provides powerful pattern matching capabilities. Compile patterns once and reuse them for better performance.

import "regexp"

// Compile patterns (do this once)
emailRegex := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
phoneRegex := regexp.MustCompile(`^\d{3}-\d{3}-\d{4}$`)

// Pattern matching
isEmail := emailRegex.MatchString("user@example.com")  // true
isPhone := phoneRegex.MatchString("123-456-7890")     // true

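// Find matches (the anchored emailRegex above only matches a whole string,
// so searching inside longer text needs an unanchored pattern)
emailFinder := regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)
text := "Contact: john@example.com or jane@example.org"
matches := emailFinder.FindAllString(text, -1)
// ["john@example.com", "jane@example.org"]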

// Replace with regex
re := regexp.MustCompile(`\b(\w+)@(\w+\.\w+)\b`)
result := re.ReplaceAllString(text, "$1 [at] $2")
// "Contact: john [at] example.com or jane [at] example.org"

// Submatches
re = regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
match := re.FindStringSubmatch("2024-03-15")
// ["2024-03-15", "2024", "03", "15"]
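
Numbered submatches become hard to read as patterns grow. Named groups, written (?P<name>...), can be paired with SubexpNames to build a map instead. This is a minimal sketch; the helper name parseDate and the group names are assumptions for illustration, and the pattern checks only the shape of the date, not whether it is a real calendar date.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe is compiled once at package level and reused.
var dateRe = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)

// parseDate (hypothetical helper) returns the named groups of the first date
// found in s, or nil when nothing matches.
func parseDate(s string) map[string]string {
	m := dateRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	parts := make(map[string]string)
	for i, name := range dateRe.SubexpNames() {
		if name != "" { // index 0 is the whole match and has no name
			parts[name] = m[i]
		}
	}
	return parts
}

func main() {
	fmt.Println(parseDate("Released on 2024-03-15")) // map[day:15 month:03 year:2024]
}
```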

Performance Comparison

Operation     | String                     | []byte                    | []rune                    | Notes
Indexing      | O(1), yields a byte        | O(1)                      | O(1)                      | String indexing returns bytes, not runes
Length        | O(1), in bytes             | O(1)                      | O(1)                      | len() on a string returns the byte count
Iteration     | O(n), yields runes         | O(n)                      | O(n)                      | Range on a string decodes UTF-8
Concatenation | O(n+m)                     | amortized O(m) via append | amortized O(m) via append | String concatenation allocates a new string
Modification  | Not possible               | O(1) per element          | O(1) per element          | Strings are immutable
Memory        | 1-4 bytes per rune (UTF-8) | 1 byte per element        | 4 bytes per rune          | UTF-8 vs fixed-width
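
The indexing row is the one that most often surprises people: s[i] is constant time but yields a byte, while finding the n-th character requires decoding from the start. A minimal sketch, assuming a hypothetical helper named runeAt:

```go
package main

import "fmt"

// runeAt (hypothetical helper) returns the n-th code point of s (zero-based)
// by decoding from the start, which costs O(n). The second result is false
// when s has fewer than n+1 code points or n is negative.
func runeAt(s string, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	s := "Hello, 世界"
	fmt.Println(s[7]) // 228 (0xe4, the first byte of 世)

	r, ok := runeAt(s, 7)
	fmt.Printf("%c %t\n", r, ok) // 世 true
}
```

If many random accesses by character position are needed, converting once to []rune trades memory for O(1) lookups.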

Best Practices

String and Rune Guidelines

  • Use strings for text: Default to strings for human-readable text
  • Use []byte for I/O: Network and file operations often use []byte
  • Use []rune for Unicode: When you need character-level operations
  • Validate UTF-8: Always validate external input
  • Preallocate builders: Use Grow() when size is known
  • Cache regex: Compile patterns once, reuse many times

Common Pitfalls

⚠️ String Gotchas

  • Byte vs Rune indexing: s[i] returns byte, not rune
  • Substring memory: Substrings keep entire string in memory
  • Range iteration: Index jumps by rune byte size
  • Invalid UTF-8: range and []rune conversion yield U+FFFD for each invalid byte instead of failing
  • Concatenation performance: Use Builder in loops

// Pitfall: Substring memory retention
func getFirstWord(s string) string {
    i := strings.Index(s, " ")
    if i == -1 {
        return s
    }
    // This keeps entire string in memory!
    return s[:i]
}

// Solution: Copy to new string
func getFirstWordCopy(s string) string {
    i := strings.Index(s, " ")
    if i == -1 {
        return s
    }
    return strings.Clone(s[:i])  // Forces a copy (Go 1.18+)
}

// Pitfall: Treating the range index as a character index
s := "Go世界"
for i, r := range s {
    // i is the byte offset of r, not its position: 0, 1, 2, 5
    fmt.Printf("Byte %d: %c\n", i, r)
}

// Correct character indexing
runes := []rune(s)
for i, r := range runes {
    fmt.Printf("Char %d: %c\n", i, r)
}
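
Two of the gotchas above, byte indexing and invalid UTF-8, often meet when text is truncated to a byte limit. The sketch below assumes a hypothetical helper named truncateBytes that backs up to the nearest rune boundary, and it uses strings.ToValidUTF8 to replace invalid bytes in untrusted input.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateBytes (hypothetical helper) returns at most max bytes of s without
// cutting through a multi-byte character. It assumes s is valid UTF-8.
func truncateBytes(s string, max int) string {
	if max < 0 {
		max = 0
	}
	if len(s) <= max {
		return s
	}
	// Back up while the byte at the cut point is a continuation byte.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	fmt.Println(truncateBytes("世界", 4)) // 世

	// Replace each run of invalid bytes with a visible marker.
	fmt.Println(strings.ToValidUTF8("a\xffb", "?")) // a?b
}
```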

Unicode Normalization

Handling Unicode Equivalence

Unicode normalization ensures that equivalent strings have the same representation. This is crucial for string comparison and searching in international applications.

import "golang.org/x/text/unicode/norm"

// Different representations of "café", written with escapes so the difference is visible
s1 := "caf\u00e9"   // é as a single precomposed code point (U+00E9)
s2 := "cafe\u0301"  // e followed by a combining acute accent (U+0065 U+0301)

fmt.Println(s1 == s2)                    // false!
fmt.Println(len(s1), len(s2))          // 5 6 (different byte lengths)

// Normalize to NFC (Canonical Composition)
n1 := norm.NFC.String(s1)
n2 := norm.NFC.String(s2)
fmt.Println(n1 == n2)                    // true

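// Normalize for comparison (still case-sensitive)
func equalNormalized(s1, s2 string) bool {
    return norm.NFC.String(s1) == norm.NFC.String(s2)
}

// Case-insensitive comparison with normalization.
// strings.ToLower only approximates Unicode case folding;
// golang.org/x/text/cases offers a dedicated folding transform.
func equalNormalizedFold(s1, s2 string) bool {
    s1 = strings.ToLower(norm.NFC.String(s1))
    s2 = strings.ToLower(norm.NFC.String(s2))
    return s1 == s2
}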

🎯 Practice Exercises

Exercise 1: Unicode Text Processor

Build a text processor that counts words, characters, and bytes in multiple languages.

Exercise 2: String Template Engine

Create a simple template engine that replaces placeholders with values efficiently.

Exercise 3: Text Search with Highlighting

Implement case-insensitive search that preserves original case when highlighting matches.

Exercise 4: CSV Parser

Write a CSV parser that handles quoted fields, escapes, and UTF-8 correctly.