Skip to content

LeetCode 393: UTF-8 Validation

A clear explanation of validating a byte sequence as UTF-8 using bit masks and a continuation-byte counter.

Problem Restatement

We are given an integer array data.

Each integer represents one byte of data. Only the least significant 8 bits are used.

We need to decide whether this byte sequence is a valid UTF-8 encoding.

A UTF-8 character can be 1 to 4 bytes long.

The valid byte patterns are:

Number of bytesPattern
1 byte0xxxxxxx
2 bytes110xxxxx 10xxxxxx
3 bytes1110xxxx 10xxxxxx 10xxxxxx
4 bytes11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A continuation byte must start with:

10

So a byte starting with 10 cannot begin a new character.

Input and Output

ItemMeaning
InputInteger array data
OutputTrue if it is valid UTF-8, otherwise False
Byte ruleOnly the lowest 8 bits of each integer are used
Character length1 to 4 bytes

Example function shape:

def validUtf8(data: list[int]) -> bool:
    ...

Examples

Example 1:

data = [197, 130, 1]

In binary:

197 = 11000101
130 = 10000010
1   = 00000001

The first byte starts with 110, so it begins a 2-byte character.

The next byte starts with 10, so it is a valid continuation byte.

The last byte starts with 0, so it is a valid 1-byte character.

So the answer is:

True

Example 2:

data = [235, 140, 4]

In binary:

235 = 11101011
140 = 10001100
4   = 00000100

The first byte starts with 1110, so it begins a 3-byte character.

The second byte starts with 10, so it is a valid continuation byte.

The third byte starts with 00, but a continuation byte must start with 10.

So the answer is:

False

First Thought: Convert to Binary Strings

One direct solution is to convert every byte to an 8-character binary string.

Then we can check prefixes like:

"0"
"110"
"1110"
"11110"
"10"

This is easy to read, but it does unnecessary string work.

Since the problem is about byte prefixes, bit operations fit better.

Key Insight

We scan bytes from left to right and keep one state:

remaining

remaining means how many continuation bytes we still need.

If remaining == 0, the current byte must begin a new UTF-8 character.

If remaining > 0, the current byte must be a continuation byte starting with 10.

The byte prefixes can be checked with right shifts.

For one byte b:

CheckMeaning
b >> 7 == 0b00xxxxxxx, 1-byte character
b >> 5 == 0b110110xxxxx, start of 2-byte character
b >> 4 == 0b11101110xxxx, start of 3-byte character
b >> 3 == 0b1111011110xxx, start of 4-byte character
b >> 6 == 0b1010xxxxxx, continuation byte

Algorithm

Initialize:

remaining = 0

For each number in data:

  1. Keep only the lowest 8 bits:

    byte = num & 0xFF
  2. If remaining == 0, this byte starts a new character:

    • If it starts with 0, it is a 1-byte character.
    • If it starts with 110, set remaining = 1.
    • If it starts with 1110, set remaining = 2.
    • If it starts with 11110, set remaining = 3.
    • Otherwise, return False.
  3. If remaining > 0, this byte must start with 10.

    • If not, return False.
    • Otherwise, decrement remaining.

At the end, return whether:

remaining == 0

If remaining > 0, the input ended in the middle of a multi-byte character.

Correctness

The variable remaining stores exactly how many continuation bytes are required to complete the current UTF-8 character.

When remaining == 0, the next byte must be the first byte of a character. The algorithm accepts exactly the valid first-byte prefixes: 0, 110, 1110, and 11110. It rejects invalid prefixes such as 10 and prefixes beginning with five or more 1 bits.

When a multi-byte starting byte is found, the algorithm sets remaining to the exact number of continuation bytes required by that character.

When remaining > 0, the current byte must be a continuation byte. UTF-8 continuation bytes must start with 10, and the algorithm checks exactly that condition. Each valid continuation byte reduces remaining by one.

If any required continuation byte is missing or malformed, the algorithm returns False.

After all bytes are processed, the encoding is valid only if no continuation bytes are still required. Therefore returning remaining == 0 is correct.

Complexity

Let n = len(data).

MetricValueWhy
TimeO(n)Each byte is processed once
SpaceO(1)Only a counter is stored

Implementation

class Solution:
    def validUtf8(self, data: list[int]) -> bool:
        remaining = 0

        for num in data:
            byte = num & 0xFF

            if remaining == 0:
                if byte >> 7 == 0b0:
                    remaining = 0
                elif byte >> 5 == 0b110:
                    remaining = 1
                elif byte >> 4 == 0b1110:
                    remaining = 2
                elif byte >> 3 == 0b11110:
                    remaining = 3
                else:
                    return False
            else:
                if byte >> 6 != 0b10:
                    return False

                remaining -= 1

        return remaining == 0

Code Explanation

We start with no pending continuation bytes:

remaining = 0

Each integer is reduced to one byte:

byte = num & 0xFF

When no continuation bytes are pending, the current byte must begin a new character:

if remaining == 0:

A byte beginning with 0 is a complete 1-byte character:

if byte >> 7 == 0b0:
    remaining = 0

A byte beginning with 110 starts a 2-byte character, so one continuation byte is needed:

elif byte >> 5 == 0b110:
    remaining = 1

Similarly:

elif byte >> 4 == 0b1110:
    remaining = 2
elif byte >> 3 == 0b11110:
    remaining = 3

Any other starting pattern is invalid:

else:
    return False

When continuation bytes are pending, the byte must begin with 10:

if byte >> 6 != 0b10:
    return False

Then we consume one required continuation byte:

remaining -= 1

At the end, all characters must be complete:

return remaining == 0

Testing

def test_solution():
    s = Solution()

    assert s.validUtf8([197, 130, 1]) is True
    assert s.validUtf8([235, 140, 4]) is False

    assert s.validUtf8([0]) is True
    assert s.validUtf8([127]) is True
    assert s.validUtf8([128]) is False

    assert s.validUtf8([194, 128]) is True
    assert s.validUtf8([194]) is False

    assert s.validUtf8([224, 160, 128]) is True
    assert s.validUtf8([240, 144, 128, 128]) is True

    assert s.validUtf8([255]) is False
    assert s.validUtf8([248, 130, 130, 130]) is False

    assert s.validUtf8([197, 130, 1, 235, 140, 4]) is False

    print("all tests passed")

test_solution()

Test meaning:

TestWhy
[197, 130, 1]Valid 2-byte character followed by 1-byte character
[235, 140, 4]Invalid continuation byte
[0], [127]Valid 1-byte boundaries
[128]Continuation byte cannot start a character
[194, 128]Valid 2-byte character
[194]Missing continuation byte
[224, 160, 128]Valid 3-byte character
[240, 144, 128, 128]Valid 4-byte character
[255]Invalid leading byte
[248, 130, 130, 130]UTF-8 allows at most 4-byte characters