A clear explanation of validating a byte sequence as UTF-8 using bit masks and a continuation-byte counter.
Problem Restatement
We are given an integer array data.
Each integer represents one byte of data. Only the least significant 8 bits are used.
We need to decide whether this byte sequence is a valid UTF-8 encoding.
A UTF-8 character can be 1 to 4 bytes long.
The valid byte patterns are:
| Number of bytes | Pattern |
|---|---|
| 1 byte | 0xxxxxxx |
| 2 bytes | 110xxxxx 10xxxxxx |
| 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
A continuation byte must start with:
10So a byte starting with 10 cannot begin a new character.
Input and Output
| Item | Meaning |
|---|---|
| Input | Integer array data |
| Output | True if it is valid UTF-8, otherwise False |
| Byte rule | Only the lowest 8 bits of each integer are used |
| Character length | 1 to 4 bytes |
Example function shape:
def validUtf8(data: list[int]) -> bool:
...Examples
Example 1:
data = [197, 130, 1]In binary:
197 = 11000101
130 = 10000010
1 = 00000001The first byte starts with 110, so it begins a 2-byte character.
The next byte starts with 10, so it is a valid continuation byte.
The last byte starts with 0, so it is a valid 1-byte character.
So the answer is:
TrueExample 2:
data = [235, 140, 4]In binary:
235 = 11101011
140 = 10001100
4 = 00000100The first byte starts with 1110, so it begins a 3-byte character.
The second byte starts with 10, so it is a valid continuation byte.
The third byte starts with 00, but a continuation byte must start with 10.
So the answer is:
FalseFirst Thought: Convert to Binary Strings
One direct solution is to convert every byte to an 8-character binary string.
Then we can check prefixes like:
"0"
"110"
"1110"
"11110"
"10"This is easy to read, but it does unnecessary string work.
Since the problem is about byte prefixes, bit operations fit better.
Key Insight
We scan bytes from left to right and keep one state:
remainingremaining means how many continuation bytes we still need.
If remaining == 0, the current byte must begin a new UTF-8 character.
If remaining > 0, the current byte must be a continuation byte starting with 10.
The byte prefixes can be checked with right shifts.
For one byte b:
| Check | Meaning |
|---|---|
b >> 7 == 0b0 | 0xxxxxxx, 1-byte character |
b >> 5 == 0b110 | 110xxxxx, start of 2-byte character |
b >> 4 == 0b1110 | 1110xxxx, start of 3-byte character |
b >> 3 == 0b11110 | 11110xxx, start of 4-byte character |
b >> 6 == 0b10 | 10xxxxxx, continuation byte |
Algorithm
Initialize:
remaining = 0For each number in data:
Keep only the lowest 8 bits:
byte = num & 0xFFIf
remaining == 0, this byte starts a new character:- If it starts with
0, it is a 1-byte character. - If it starts with
110, setremaining = 1. - If it starts with
1110, setremaining = 2. - If it starts with
11110, setremaining = 3. - Otherwise, return
False.
- If it starts with
If
remaining > 0, this byte must start with10.- If not, return
False. - Otherwise, decrement
remaining.
- If not, return
At the end, return whether:
remaining == 0If remaining > 0, the input ended in the middle of a multi-byte character.
Correctness
The variable remaining stores exactly how many continuation bytes are required to complete the current UTF-8 character.
When remaining == 0, the next byte must be the first byte of a character. The algorithm accepts exactly the valid first-byte prefixes: 0, 110, 1110, and 11110. It rejects invalid prefixes such as 10 and prefixes beginning with five or more 1 bits.
When a multi-byte starting byte is found, the algorithm sets remaining to the exact number of continuation bytes required by that character.
When remaining > 0, the current byte must be a continuation byte. UTF-8 continuation bytes must start with 10, and the algorithm checks exactly that condition. Each valid continuation byte reduces remaining by one.
If any required continuation byte is missing or malformed, the algorithm returns False.
After all bytes are processed, the encoding is valid only if no continuation bytes are still required. Therefore returning remaining == 0 is correct.
Complexity
Let n = len(data).
| Metric | Value | Why |
|---|---|---|
| Time | O(n) | Each byte is processed once |
| Space | O(1) | Only a counter is stored |
Implementation
class Solution:
def validUtf8(self, data: list[int]) -> bool:
remaining = 0
for num in data:
byte = num & 0xFF
if remaining == 0:
if byte >> 7 == 0b0:
remaining = 0
elif byte >> 5 == 0b110:
remaining = 1
elif byte >> 4 == 0b1110:
remaining = 2
elif byte >> 3 == 0b11110:
remaining = 3
else:
return False
else:
if byte >> 6 != 0b10:
return False
remaining -= 1
return remaining == 0Code Explanation
We start with no pending continuation bytes:
remaining = 0Each integer is reduced to one byte:
byte = num & 0xFFWhen no continuation bytes are pending, the current byte must begin a new character:
if remaining == 0:A byte beginning with 0 is a complete 1-byte character:
if byte >> 7 == 0b0:
remaining = 0A byte beginning with 110 starts a 2-byte character, so one continuation byte is needed:
elif byte >> 5 == 0b110:
remaining = 1Similarly:
elif byte >> 4 == 0b1110:
remaining = 2
elif byte >> 3 == 0b11110:
remaining = 3Any other starting pattern is invalid:
else:
return FalseWhen continuation bytes are pending, the byte must begin with 10:
if byte >> 6 != 0b10:
return FalseThen we consume one required continuation byte:
remaining -= 1At the end, all characters must be complete:
return remaining == 0Testing
def test_solution():
s = Solution()
assert s.validUtf8([197, 130, 1]) is True
assert s.validUtf8([235, 140, 4]) is False
assert s.validUtf8([0]) is True
assert s.validUtf8([127]) is True
assert s.validUtf8([128]) is False
assert s.validUtf8([194, 128]) is True
assert s.validUtf8([194]) is False
assert s.validUtf8([224, 160, 128]) is True
assert s.validUtf8([240, 144, 128, 128]) is True
assert s.validUtf8([255]) is False
assert s.validUtf8([248, 130, 130, 130]) is False
assert s.validUtf8([197, 130, 1, 235, 140, 4]) is False
print("all tests passed")
test_solution()Test meaning:
| Test | Why |
|---|---|
[197, 130, 1] | Valid 2-byte character followed by 1-byte character |
[235, 140, 4] | Invalid continuation byte |
[0], [127] | Valid 1-byte boundaries |
[128] | Continuation byte cannot start a character |
[194, 128] | Valid 2-byte character |
[194] | Missing continuation byte |
[224, 160, 128] | Valid 3-byte character |
[240, 144, 128, 128] | Valid 4-byte character |
[255] | Invalid leading byte |
[248, 130, 130, 130] | UTF-8 allows at most 4-byte characters |