Understanding Strings in Go
String Internals
Strings in Go are immutable sequences of bytes, typically representing UTF-8 encoded text. Under the hood, a string is a small struct containing a pointer to an underlying byte array and a length. Copying or slicing a string therefore copies only this header, not the bytes, and because the bytes never change, multiple goroutines can read the same string data without synchronization.
String Structure
- Immutable: Once created, string content cannot be changed
- UTF-8 by default: Go source code is UTF-8, string literals are UTF-8
- Slice-like: Strings support slicing with byte indices
- Comparable: Strings can be compared with == and < operators
// String internals (conceptual)
type stringStruct struct {
	ptr *byte // Pointer to underlying bytes
	len int   // Length in bytes (not runes!)
}
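The slice-like and comparable properties listed above can be seen in a short sketch. The helper name validSlice and the sample strings are illustrative assumptions, not part of the original text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validSlice reports whether the byte range s[lo:hi] is still valid UTF-8.
// Slicing works on byte indices, so a cut can land in the middle of a character.
func validSlice(s string, lo, hi int) bool {
	return utf8.ValidString(s[lo:hi])
}

func main() {
	s := "h\u00e9llo" // "héllo": 'é' occupies bytes 1 and 2

	fmt.Println(len(s))              // 6
	fmt.Println(validSlice(s, 0, 1)) // true: "h"
	fmt.Println(validSlice(s, 0, 2)) // false: cuts 'é' in half
	fmt.Println(validSlice(s, 0, 3)) // true: "hé"

	// Comparison is byte-wise, so upper-case ASCII sorts before lower-case.
	fmt.Println("apple" < "banana") // true
	fmt.Println("Zebra" < "apple")  // true: 'Z' (0x5A) < 'a' (0x61)
}
```

Because ordering is byte-wise, it is not a locale-aware sort order.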
String Basics
Creating and Manipulating Strings
Go provides multiple ways to create strings, from literals to conversions. Understanding the difference between byte length and rune count is crucial for correct string handling.
// String literals
s1 := "Hello, World!" // Interpreted string
s2 := `Line 1
Line 2` // Raw string (preserves newlines)

// Escape sequences in interpreted strings
escaped := "Tab:\t Quote:\" Newline:\n Unicode:\u4e16"

// String concatenation
greeting := "Hello" + ", " + "World"
repeated := strings.Repeat("Go", 3) // "GoGoGo"

// String length (bytes vs runes)
ascii := "Hello"
chinese := "世界"
emoji := "Hello 👋"

fmt.Println(len(ascii))   // 5 bytes
fmt.Println(len(chinese)) // 6 bytes (3 bytes per character)
fmt.Println(len(emoji))   // 10 bytes (emoji is 4 bytes)

fmt.Println(utf8.RuneCountInString(ascii))   // 5 runes
fmt.Println(utf8.RuneCountInString(chinese)) // 2 runes
fmt.Println(utf8.RuneCountInString(emoji))   // 7 runes
String Immutability
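Strings cannot be modified in place. Any operation that appears to modify a string actually creates a new string. This makes string data safe to share between goroutines (reassigning a shared string variable still needs synchronization), but it requires awareness of the performance implications.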
⚠️ Immutability Implications
- String concatenation in loops can be inefficient
- Use strings.Builder for multiple concatenations
- Convert to []byte for in-place modifications
- Substring operations share underlying memory
// Strings are immutable
s := "hello"
// s[0] = 'H' // Compile error!

// Create new string instead
s = "H" + s[1:] // "Hello"

// For mutable operations, use []byte
b := []byte(s)
b[0] = 'J'
s = string(b) // "Jello"

// Efficient string building
var builder strings.Builder
for i := 0; i < 1000; i++ {
	builder.WriteString("Go")
}
result := builder.String() // Efficient!
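As a small illustration of the []byte route, the sketch below shows that the conversion copies the data, so the original string is untouched. The function name capitalizeFirst is hypothetical, and it assumes ASCII input for simplicity.

```go
package main

import "fmt"

// capitalizeFirst returns a copy of s with its first byte upper-cased.
// Assumption: only an ASCII lowercase first letter is handled; any other
// input is returned unchanged.
func capitalizeFirst(s string) string {
	if len(s) == 0 || s[0] < 'a' || s[0] > 'z' {
		return s
	}
	b := []byte(s)    // copies the string's bytes into a mutable slice
	b[0] -= 'a' - 'A' // mutate the copy in place
	return string(b)  // copies again into a new immutable string
}

func main() {
	original := "hello"
	changed := capitalizeFirst(original)

	fmt.Println(original) // hello (untouched)
	fmt.Println(changed)  // Hello
}
```

Each conversion is a copy in general, so for many small edits it is usually cheaper to convert once, do all the edits on the slice, and convert back once.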
Runes and Unicode
Understanding Runes
A rune is Go's type for a Unicode code point, aliased to int32. Runes allow you to work with individual Unicode characters regardless of their byte representation. This is essential for proper internationalization and text processing.
Rune Facts
- Type rune is an alias for int32
- Represents a single Unicode code point
- Can represent any Unicode character (over 1 million possible values)
- Rune literals use single quotes: 'A', '世', '👋'
// Rune basics
var r1 rune = 'A'      // 65
var r2 rune = '世'      // 19990
var r3 rune = '👋'      // 128075
var r4 rune = '\n'     // 10 (newline)
var r5 rune = '\u4e16' // 19990 (世 in Unicode)

// Rune to string conversion
s := string(r2)                // "世"
s = string([]rune{r1, r2, r3}) // "A世👋"

// String to rune slice
text := "Hello, 世界"
runes := []rune(text)
fmt.Println(len(runes)) // 9 runes
fmt.Println(len(text))  // 13 bytes

// Iterate over runes
for i, r := range text {
	fmt.Printf("%d: %c (%U)\n", i, r, r)
}
// Output:
// 0: H (U+0048)
// 1: e (U+0065)
// 2: l (U+006C)
// 3: l (U+006C)
// 4: o (U+006F)
// 5: , (U+002C)
// 6:   (U+0020)
// 7: 世 (U+4E16)
// 10: 界 (U+754C)
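A common case where the byte/rune distinction matters is reversing text. The following is a minimal sketch; the function names reverseRunes and reverseBytes are illustrative and not from any library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverseRunes reverses s code point by code point.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// reverseBytes reverses s byte by byte, which scrambles multi-byte characters.
func reverseBytes(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	s := "Go\u4e16\u754c" // "Go世界"

	fmt.Println(reverseRunes(s))                   // 界世oG
	fmt.Println(utf8.ValidString(reverseBytes(s))) // false
}
```

Note that a rune is a code point, not necessarily a user-perceived character: combining marks are separate runes, so reverseRunes can still detach an accent from its base letter.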
UTF-8 Encoding
How UTF-8 Works
UTF-8 is a variable-length encoding where ASCII characters use 1 byte, and other characters use 2-4 bytes. Go's native UTF-8 support makes it excellent for international applications.
UTF-8 Encoding Rules
- 1 byte: U+0000 to U+007F (ASCII)
- 2 bytes: U+0080 to U+07FF
- 3 bytes: U+0800 to U+FFFF (most other characters in everyday use, including CJK)
- 4 bytes: U+10000 to U+10FFFF (emoji, rare characters)
// UTF-8 encoding examples
func examineUTF8(s string) {
	fmt.Printf("String: %s\n", s)
	fmt.Printf("Bytes: % x\n", []byte(s))
	fmt.Printf("Byte count: %d\n", len(s))
	fmt.Printf("Rune count: %d\n", utf8.RuneCountInString(s))

	// Decode UTF-8 manually
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf(" %c: %d bytes\n", r, size)
		i += size
	}
}

examineUTF8("A")  // 1 byte: 41
examineUTF8("€")  // 3 bytes: e2 82 ac
examineUTF8("世") // 3 bytes: e4 b8 96
examineUTF8("👋") // 4 bytes: f0 9f 91 8b

// Validate UTF-8
valid := utf8.ValidString("Hello, 世界") // true
invalid := []byte{0xff, 0xfe, 0xfd}
valid = utf8.Valid(invalid) // false
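The range boundaries in the encoding rules can be checked directly with the standard unicode/utf8 package. This is a sketch; the helper name encode is an assumption introduced for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes for a single code point.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// First and last code point of each length class.
	boundaries := []rune{0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF}
	for _, r := range boundaries {
		fmt.Printf("%U: %d byte(s): % x\n", r, utf8.RuneLen(r), encode(r))
	}
}
```

utf8.RuneLen returns -1 for values that cannot be encoded, such as surrogate halves or anything above U+10FFFF.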
String Operations
Common String Functions
The strings package provides a rich set of functions for string manipulation. These functions are optimized and handle UTF-8 correctly.
// Searching and checking
s := "Hello, Go Programming"
contains := strings.Contains(s, "Go")      // true
hasPrefix := strings.HasPrefix(s, "Hello") // true
hasSuffix := strings.HasSuffix(s, "ing")   // true
index := strings.Index(s, "Go")            // 7
lastIndex := strings.LastIndex(s, "o")     // 12
count := strings.Count(s, "o")             // 3

// Transformation
upper := strings.ToUpper(s)               // "HELLO, GO PROGRAMMING"
lower := strings.ToLower(s)               // "hello, go programming"
title := strings.Title(s)                 // "Hello, Go Programming" (deprecated since Go 1.18)
trimmed := strings.TrimSpace(" hello ")   // "hello"

// Replacement
replaced := strings.Replace(s, "o", "0", -1) // "Hell0, G0 Pr0gramming"
replacedN := strings.Replace(s, "o", "0", 2) // "Hell0, G0 Programming"

// Splitting and joining
parts := strings.Split(s, ", ")      // ["Hello", "Go Programming"]
fields := strings.Fields(s)          // ["Hello,", "Go", "Programming"]
joined := strings.Join(parts, " | ") // "Hello | Go Programming"
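One detail worth keeping in mind with the search functions: the positions they return are byte offsets, not character positions. A minimal sketch of converting one to the other follows; runeIndex is a hypothetical helper, not part of the strings package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex returns the position of substr in s counted in runes,
// or -1 if substr is not present.
func runeIndex(s, substr string) int {
	i := strings.Index(s, substr) // byte offset
	if i < 0 {
		return -1
	}
	return utf8.RuneCountInString(s[:i]) // runes before the match
}

func main() {
	s := "\u4e16\u754c Go" // "世界 Go"

	fmt.Println(strings.Index(s, "Go")) // 7 (bytes: 3 + 3 + 1)
	fmt.Println(runeIndex(s, "Go"))     // 3 (runes: 世, 界, space)
}
```

The byte offset is the right value for slicing (s[i:]), while the rune offset is what a user would call the "column".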
String Builder for Efficiency
When building strings dynamically, especially in loops, strings.Builder provides much better performance than repeated concatenation.
// Inefficient string concatenation
func inefficientConcat(words []string) string {
	result := ""
	for _, word := range words {
		result += word + " " // Creates new string each time!
	}
	return result
}

// Efficient with strings.Builder
func efficientConcat(words []string) string {
	var builder strings.Builder
	builder.Grow(len(words) * 10) // Pre-allocate capacity
	for _, word := range words {
		builder.WriteString(word)
		builder.WriteByte(' ')
	}
	return builder.String()
}

// Builder methods
var b strings.Builder
b.WriteString("Hello")  // Write string
b.WriteByte(' ')        // Write single byte
b.WriteRune('世')        // Write rune
b.Write([]byte{'!'})    // Write byte slice
fmt.Println(b.String()) // "Hello 世!"
fmt.Println(b.Len())    // 10 bytes
fmt.Println(b.Cap())    // Capacity
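When the final size can be computed up front, the builder can be sized exactly rather than estimated. The sketch below assumes a hypothetical helper joinWords; it is one way to do this, not the only one.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords joins words with single spaces, growing the builder exactly once.
func joinWords(words []string) string {
	if len(words) == 0 {
		return ""
	}
	n := len(words) - 1 // one separator between each pair of words
	for _, w := range words {
		n += len(w)
	}

	var b strings.Builder
	b.Grow(n) // a single allocation covers the whole result
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	words := []string{"strings", "are", "immutable"}

	fmt.Println(joinWords(words))                             // strings are immutable
	fmt.Println(joinWords(words) == strings.Join(words, " ")) // true
}
```

For plain joining, strings.Join is the simpler choice. A hand-written builder loop pays off mainly when pieces are transformed or filtered while being written.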
String Conversions
Between Strings and Other Types
Converting between strings and other types is common in Go. Understanding the cost and semantics of these conversions is important for writing efficient code.
import "strconv"

// String to/from []byte
s := "Hello"
bs := []byte(s)  // Allocates new slice (named bs to avoid shadowing package bytes)
s2 := string(bs) // Allocates new string

// String to/from []rune
runes := []rune(s)  // Useful for character-level ops
s3 := string(runes)

// Number conversions with strconv
// String to numbers
i, err := strconv.Atoi("123")               // 123
i64, err := strconv.ParseInt("123", 10, 64) // base 10, 64-bit
f64, err := strconv.ParseFloat("3.14", 64)  // 3.14
b, err := strconv.ParseBool("true")         // true

// Numbers to string
s = strconv.Itoa(123)                     // "123"
s = strconv.FormatInt(123, 10)            // "123" (base 10)
s = strconv.FormatFloat(3.14, 'f', 2, 64) // "3.14"
s = strconv.FormatBool(true)              // "true"

// Quote and unquote
quoted := strconv.Quote("Hello\nWorld")  // "\"Hello\\nWorld\""
unquoted, err := strconv.Unquote(quoted) // "Hello\nWorld"
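The err values returned by the parsing functions carry more detail than success or failure. The following sketch distinguishes syntax errors from range errors; the function name classify and its messages are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// classify parses s as an 8-bit signed integer and describes the outcome.
func classify(s string) string {
	v, err := strconv.ParseInt(s, 10, 8)
	switch {
	case err == nil:
		return fmt.Sprintf("ok: %d", v)
	case errors.Is(err, strconv.ErrRange):
		// On overflow ParseInt still returns the nearest representable value.
		return fmt.Sprintf("out of range, clamped to %d", v)
	case errors.Is(err, strconv.ErrSyntax):
		return "not a number"
	default:
		return "unexpected error: " + err.Error()
	}
}

func main() {
	fmt.Println(classify("42"))  // ok: 42
	fmt.Println(classify("300")) // out of range, clamped to 127
	fmt.Println(classify("12a")) // not a number
}
```

The bitSize argument (8 here) only constrains the accepted range; the result is always returned as int64 and must be converted by the caller.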
Advanced String Techniques
String Interning
String interning can reduce memory usage when dealing with many duplicate strings. Before Go 1.23 the standard library had no interning facility (the unique package now covers this use case), but a simple interner is easy to implement with a map.
// Simple string interning
type StringIntern struct {
	mu    sync.RWMutex
	table map[string]string
}

func NewStringIntern() *StringIntern {
	return &StringIntern{
		table: make(map[string]string),
	}
}

func (si *StringIntern) Intern(s string) string {
	si.mu.RLock()
	if interned, ok := si.table[s]; ok {
		si.mu.RUnlock()
		return interned
	}
	si.mu.RUnlock()

	si.mu.Lock()
	defer si.mu.Unlock()

	// Double-check after acquiring write lock
	if interned, ok := si.table[s]; ok {
		return interned
	}
	si.table[s] = s
	return s
}
Regular Expressions
Go's regexp package provides powerful pattern matching capabilities. Compile patterns once and reuse them for better performance.
import "regexp"

// Compile patterns (do this once)
emailRegex := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
phoneRegex := regexp.MustCompile(`^\d{3}-\d{3}-\d{4}$`)

// Pattern matching
isEmail := emailRegex.MatchString("user@example.com") // true
isPhone := phoneRegex.MatchString("123-456-7890")     // true

// Find matches. The anchored emailRegex (^...$) only matches a whole string,
// so searching inside longer text needs an unanchored pattern.
emailFinder := regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)
text := "Contact: john@example.com or jane@example.org"
matches := emailFinder.FindAllString(text, -1)
// ["john@example.com", "jane@example.org"]

// Replace with regex
re := regexp.MustCompile(`\b(\w+)@(\w+\.\w+)\b`)
result := re.ReplaceAllString(text, "$1 [at] $2")
// "Contact: john [at] example.com or jane [at] example.org"

// Submatches
re = regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
match := re.FindStringSubmatch("2024-03-15")
// ["2024-03-15", "2024", "03", "15"]
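Numbered submatches become hard to read as patterns grow. As a sketch of an alternative, named capture groups can be collected into a map; the pattern, the variable dateRe, and the helper parseDate are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe is compiled once at package level and reused by every call.
var dateRe = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)

// parseDate extracts the named groups from the first date found in s.
// ok is false when s contains no date.
func parseDate(s string) (fields map[string]string, ok bool) {
	m := dateRe.FindStringSubmatch(s)
	if m == nil {
		return nil, false
	}
	fields = make(map[string]string)
	for i, name := range dateRe.SubexpNames() {
		if name != "" { // index 0 (whole match) and unnamed groups have empty names
			fields[name] = m[i]
		}
	}
	return fields, true
}

func main() {
	f, ok := parseDate("released 2024-03-15")
	fmt.Println(ok, f["year"], f["month"], f["day"]) // true 2024 03 15
}
```

A package-level MustCompile panics at program start if the pattern is invalid, which suits fixed patterns. For patterns supplied at runtime, regexp.Compile returns an error instead.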
Performance Comparison
Operation | String | []byte | []rune | Notes |
---|---|---|---|---|
Indexing | O(1) bytes | O(1) | O(1) | String indexing returns bytes, not runes |
Length | O(1) bytes | O(1) | O(1) | String len() returns byte count |
Iteration | O(n) runes | O(n) | O(n) | Range on string decodes UTF-8 |
Concatenation | O(n+m) | Amortized O(m) append | Amortized O(m) append | String concat allocates a new string; append can reuse spare capacity |
Modification | Impossible | O(1) | O(1) | Strings are immutable |
Memory | Compact (UTF-8, 1-4 bytes/char) | Same bytes as the string (1-4 bytes/char) | 4 bytes/char | UTF-8 vs fixed-width |
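The indexing and length rows of the table can be illustrated with a short sketch. The helper names nthByte and nthRune are hypothetical.

```go
package main

import "fmt"

// nthByte returns the n-th byte of s in O(1).
func nthByte(s string, n int) byte { return s[n] }

// nthRune returns the n-th code point of s. The []rune conversion decodes
// the whole string first, so this costs O(len(s)) time and memory.
func nthRune(s string, n int) rune { return []rune(s)[n] }

func main() {
	s := "\u4e16\u754c" // "世界": 2 runes, 6 bytes

	fmt.Println(len(s), len([]byte(s)), len([]rune(s))) // 6 6 2
	fmt.Printf("%#x\n", nthByte(s, 1))                  // 0xb8, the middle byte of '世'
	fmt.Printf("%c\n", nthRune(s, 1))                   // 界
}
```

The O(1) indexing on []rune is only cheap once the slice exists, so when many character-level lookups are needed it is better to convert once and keep the slice.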
Best Practices
String and Rune Guidelines
- Use strings for text: Default to strings for human-readable text
- Use []byte for I/O: Network and file operations often use []byte
- Use []rune for Unicode: When you need character-level operations
- Validate UTF-8: Always validate external input
- Preallocate builders: Use Grow() when size is known
- Cache regex: Compile patterns once, reuse many times
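For the "validate UTF-8" guideline, one possible approach is to check input at the boundary and replace anything invalid. This is a sketch under the assumption that substituting U+FFFD is acceptable for the application; rejecting the input outright is an equally valid policy. The function name sanitize is illustrative.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize converts external bytes to a string. Valid UTF-8 is returned
// unchanged; otherwise each run of invalid bytes is replaced by U+FFFD.
func sanitize(input []byte) (clean string, wasValid bool) {
	if utf8.Valid(input) {
		return string(input), true
	}
	return strings.ToValidUTF8(string(input), "\uFFFD"), false
}

func main() {
	clean, ok := sanitize([]byte("caf\xc3\xa9")) // valid UTF-8 bytes for "café"
	fmt.Println(clean, ok)                       // café true

	clean, ok = sanitize([]byte("ab\xff\xfecd")) // two invalid bytes in a row
	fmt.Println(clean == "ab\uFFFDcd", ok)       // true false
}
```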
Common Pitfalls
⚠️ String Gotchas
- Byte vs Rune indexing: s[i] returns byte, not rune
- Substring memory: Substrings keep entire string in memory
- Range iteration: Index jumps by rune byte size
- Invalid UTF-8: range loops and []rune conversion silently substitute U+FFFD for invalid bytes
- Concatenation performance: Use Builder in loops
// Pitfall: Substring memory retention
func getFirstWord(s string) string {
	i := strings.Index(s, " ")
	if i == -1 {
		return s
	}
	// This keeps entire string in memory!
	return s[:i]
}

// Solution: Copy to new string
func getFirstWordCopy(s string) string {
	i := strings.Index(s, " ")
	if i == -1 {
		return s
	}
	return strings.Clone(s[:i]) // Forces copy (Go 1.18+)
}

// Pitfall: Treating the range index as a character position
s := "h\u00e9llo" // "héllo"
for i, r := range s {
	// i is byte index, not rune index! It jumps from 1 to 3 after 'é'
	fmt.Printf("Byte %d: %c\n", i, r)
}

// Correct character indexing
runes := []rune(s)
for i, r := range runes {
	fmt.Printf("Char %d: %c\n", i, r)
}
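The invalid UTF-8 pitfall is easy to miss because nothing fails: a range loop simply yields U+FFFD for each bad byte. The sketch below shows this and one way to tell a bad byte from a genuine U+FFFD in the text; countInvalid is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalid returns how many bytes of s fail to decode as UTF-8.
func countInvalid(s string) int {
	n := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			n++ // a genuine encoded U+FFFD has size 3, so size 1 means a bad byte
		}
		i += size
	}
	return n
}

func main() {
	s := "a\xffb" // 0xff can never appear in valid UTF-8
	for i, r := range s {
		fmt.Printf("%d: %U\n", i, r)
	}
	// Output of the loop:
	// 0: U+0061
	// 1: U+FFFD
	// 2: U+0062

	fmt.Println(countInvalid(s))        // 1
	fmt.Println(countInvalid("\uFFFD")) // 0: a real replacement character is valid
}
```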
Unicode Normalization
Handling Unicode Equivalence
Unicode normalization ensures that equivalent strings have the same representation. This is crucial for string comparison and searching in international applications.
import "golang.org/x/text/unicode/norm"

// Different representations of "café"
s1 := "caf\u00e9"  // é as single character (U+00E9)
s2 := "cafe\u0301" // e + combining acute accent (U+0065 U+0301)

fmt.Println(s1 == s2)         // false!
fmt.Println(len(s1), len(s2)) // 5 6: different byte lengths

// Normalize to NFC (Canonical Composition)
n1 := norm.NFC.String(s1)
n2 := norm.NFC.String(s2)
fmt.Println(n1 == n2) // true

// Normalize for comparison (case-sensitive)
func equalNormalized(s1, s2 string) bool {
	return norm.NFC.String(s1) == norm.NFC.String(s2)
}

// Case-insensitive comparison with normalization
func equalFoldCase(s1, s2 string) bool {
	s1 = strings.ToLower(norm.NFC.String(s1))
	s2 = strings.ToLower(norm.NFC.String(s2))
	return s1 == s2
}
🎯 Practice Exercises
Exercise 1: Unicode Text Processor
Build a text processor that counts words, characters, and bytes in multiple languages.
Exercise 2: String Template Engine
Create a simple template engine that replaces placeholders with values efficiently.
Exercise 3: Text Search with Highlighting
Implement case-insensitive search that preserves original case when highlighting matches.
Exercise 4: CSV Parser
Write a CSV parser that handles quoted fields, escapes, and UTF-8 correctly.