String scanning is the basic operation behind parsing, tokenization, validation, search, and text normalization.
String scanning is the basic operation behind parsing, tokenization, validation, search, and text normalization. The algorithm reads a string from left to right, maintains a small state, and decides what to do with each character or byte.
The main difference between string scanning and array traversal is representation. A string may be treated as bytes, Unicode code points, or higher level text units. The correct choice depends on the problem.
Problem
Given a string s, process its contents in order.
Examples include:
count digits
find the first separator
parse an integer
split into tokens
validate parenthesesA correct scanner must define what one step means. In Go, indexing a string gives bytes. Ranging over a string gives Unicode code points as rune values.
Byte Scanning
Use byte scanning when the input format is ASCII or byte-oriented.
func CountDigitsASCII(s string) int {
count := 0
for i := 0; i < len(s); i++ {
if s[i] >= '0' && s[i] <= '9' {
count++
}
}
return count
}Invariant:
Before iteration i, count equals the number of ASCII digits in s[0:i].This is fast and simple. It is correct for formats such as HTTP headers, JSON punctuation, base64, log keys, and protocol tokens when the grammar is byte based.
Rune Scanning
Use rune scanning when the operation depends on Unicode code points.
func CountRunes(s string) int {
count := 0
for range s {
count++
}
return count
}A range loop decodes UTF-8 and advances by the width of each encoded rune.
func CountUnicodeLetters(s string) int {
count := 0
for _, r := range s {
if unicode.IsLetter(r) {
count++
}
}
return count
}This requires:
import "unicode"Finding a Character
For byte-oriented separators:
func IndexByte(s string, sep byte) int {
for i := 0; i < len(s); i++ {
if s[i] == sep {
return i
}
}
return -1
}Invariant:
If the loop reaches index i, sep does not occur in s[0:i].For rune-oriented scanning:
func IndexRune(s string, target rune) int {
for i, r := range s {
if r == target {
return i
}
}
return -1
}The returned index is a byte offset, because Go strings are indexed by byte.
Parsing an Integer
A scanner often maintains a state variable.
func ParseNonnegativeInt(s string) (int, bool) {
if len(s) == 0 {
return 0, false
}
value := 0
for i := 0; i < len(s); i++ {
c := s[i]
if c < '0' || c > '9' {
return 0, false
}
value = value*10 + int(c-'0')
}
return value, true
}Invariant:
Before iteration i, value is the integer represented by s[0:i].For production code, add overflow checks before multiplication and addition.
Skipping Whitespace
Many scanners repeatedly skip ignored characters.
func SkipSpaces(s string, i int) int {
for i < len(s) && s[i] == ' ' {
i++
}
return i
}This byte version recognizes only ASCII space. For Unicode whitespace:
func SkipUnicodeSpaces(s string, i int) int {
for i < len(s) {
r, size := utf8.DecodeRuneInString(s[i:])
if !unicode.IsSpace(r) {
break
}
i += size
}
return i
}This requires:
import (
"unicode"
"unicode/utf8"
)Token Scanning
A tokenizer scans spans rather than individual characters.
func FieldsASCII(s string) []string {
var out []string
i := 0
for i < len(s) {
for i < len(s) && s[i] == ' ' {
i++
}
start := i
for i < len(s) && s[i] != ' ' {
i++
}
if start < i {
out = append(out, s[start:i])
}
}
return out
}Invariant:
Before each outer iteration, all complete fields ending before i have been emitted.This pattern is the basis for hand-written lexers.
State Machine Scanning
When the meaning of a character depends on context, use explicit states.
Example: validate a simple identifier: first character must be a letter or underscore, later characters may also be digits.
func IsIdentASCII(s string) bool {
if len(s) == 0 {
return false
}
first := s[0]
if !isLetterASCII(first) && first != '_' {
return false
}
for i := 1; i < len(s); i++ {
c := s[i]
if !isLetterASCII(c) && !isDigitASCII(c) && c != '_' {
return false
}
}
return true
}
func isLetterASCII(c byte) bool {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}
func isDigitASCII(c byte) bool {
return '0' <= c && c <= '9'
}The state is implicit: position 0 uses one rule, later positions use another.
Balanced Parentheses
String scanning often uses a small stack or counter.
For one bracket type, a counter is enough.
func BalancedParens(s string) bool {
depth := 0
for i := 0; i < len(s); i++ {
switch s[i] {
case '(':
depth++
case ')':
depth--
if depth < 0 {
return false
}
}
}
return depth == 0
}Invariant:
After scanning s[0:i], depth equals open parentheses minus closed parentheses.The check depth < 0 rejects a close parenthesis that has no matching open parenthesis.
Byte Offsets and Rune Counts
In Go, these are different:
s := "aπb"
fmt.Println(len(s)) // 4 bytes
fmt.Println(utf8.RuneCountInString(s)) // 3 runesThe character π uses two bytes in UTF-8. Therefore byte indices cannot always be used as character positions.
Use byte offsets for parsers, protocols, and slicing. Use rune counts for user-visible character counts when code points are sufficient.
Complexity
A single scan is linear in the number of bytes or runes inspected.
Time: O(n)
Space: O(1)Tokenization may allocate output slices or substrings:
Extra space: O(k)where k is the number of emitted tokens.
Common Pitfalls
Indexing a string by s[i] reads a byte, not a Unicode character.
Returning a rune loop index returns a byte offset, not a rune index.
Using Unicode-aware predicates such as unicode.IsLetter with byte scanning gives incorrect results.
Slicing at arbitrary byte offsets can split a UTF-8 encoding. This is acceptable for byte protocols but invalid for user text.
Ignoring overflow in integer parsing can silently produce wrong values.
Treating strings as mutable arrays is incorrect in Go. To modify text, convert to []byte or []rune, depending on the operation.
Takeaway
String scanning is traversal with a chosen text model. Use byte scanning for formats and protocols. Use rune scanning for Unicode code point logic. Make the scanner state explicit, maintain an invariant over the consumed prefix, and be precise about whether indices are byte offsets or character positions.