# LeetCode 393: UTF-8 Validation

## Problem Restatement

We are given an integer array `data`.

Each integer represents one byte of data. Only the least significant 8 bits are used.

We need to decide whether this byte sequence is a valid UTF-8 encoding.

A UTF-8 character can be 1 to 4 bytes long.

The valid byte patterns are:

| Number of bytes | Pattern |
|---|---|
| 1 byte | `0xxxxxxx` |
| 2 bytes | `110xxxxx 10xxxxxx` |
| 3 bytes | `1110xxxx 10xxxxxx 10xxxxxx` |
| 4 bytes | `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` |

A continuation byte must start with:

```text
10
```

So a byte starting with `10` cannot begin a new character.

## Input and Output

| Item | Meaning |
|---|---|
| Input | Integer array `data` |
| Output | `True` if it is valid UTF-8, otherwise `False` |
| Byte rule | Only the lowest 8 bits of each integer are used |
| Character length | 1 to 4 bytes |

Example function shape:

```python
def validUtf8(data: list[int]) -> bool:
    ...
```

## Examples

Example 1:

```python
data = [197, 130, 1]
```

In binary:

```text
197 = 11000101
130 = 10000010
1   = 00000001
```

The first byte starts with `110`, so it begins a 2-byte character.

The next byte starts with `10`, so it is a valid continuation byte.

The last byte starts with `0`, so it is a valid 1-byte character.

So the answer is:

```python
True
```

Example 2:

```python
data = [235, 140, 4]
```

In binary:

```text
235 = 11101011
140 = 10001100
4   = 00000100
```

The first byte starts with `1110`, so it begins a 3-byte character.

The second byte starts with `10`, so it is a valid continuation byte.

The third byte starts with `00`, but a continuation byte must start with `10`.

So the answer is:

```python
False
```

## First Thought: Convert to Binary Strings

One direct solution is to convert every byte to an 8-character binary string.

Then we can check prefixes like:

```python
"0"
"110"
"1110"
"11110"
"10"
```

This is easy to read, but it does unnecessary string work.

Since the problem is about byte prefixes, bit operations fit better.

## Key Insight

We scan bytes from left to right and keep one state:

```python
remaining
```

`remaining` means how many continuation bytes we still need.

If `remaining == 0`, the current byte must begin a new UTF-8 character.

If `remaining > 0`, the current byte must be a continuation byte starting with `10`.

The byte prefixes can be checked with right shifts.

For one byte `b`:

| Check | Meaning |
|---|---|
| `b >> 7 == 0b0` | `0xxxxxxx`, 1-byte character |
| `b >> 5 == 0b110` | `110xxxxx`, start of 2-byte character |
| `b >> 4 == 0b1110` | `1110xxxx`, start of 3-byte character |
| `b >> 3 == 0b11110` | `11110xxx`, start of 4-byte character |
| `b >> 6 == 0b10` | `10xxxxxx`, continuation byte |

## Algorithm

Initialize:

```python
remaining = 0
```

For each number in `data`:

1. Keep only the lowest 8 bits:
   ```python id="d1f95d"
   byte = num & 0xFF
   ```

2. If `remaining == 0`, this byte starts a new character:
   - If it starts with `0`, it is a 1-byte character.
   - If it starts with `110`, set `remaining = 1`.
   - If it starts with `1110`, set `remaining = 2`.
   - If it starts with `11110`, set `remaining = 3`.
   - Otherwise, return `False`.

3. If `remaining > 0`, this byte must start with `10`.
   - If not, return `False`.
   - Otherwise, decrement `remaining`.

At the end, return whether:

```python
remaining == 0
```

If `remaining > 0`, the input ended in the middle of a multi-byte character.

## Correctness

The variable `remaining` stores exactly how many continuation bytes are required to complete the current UTF-8 character.

When `remaining == 0`, the next byte must be the first byte of a character. The algorithm accepts exactly the valid first-byte prefixes: `0`, `110`, `1110`, and `11110`. It rejects invalid prefixes such as `10` and prefixes beginning with five or more `1` bits.

When a multi-byte starting byte is found, the algorithm sets `remaining` to the exact number of continuation bytes required by that character.

When `remaining > 0`, the current byte must be a continuation byte. UTF-8 continuation bytes must start with `10`, and the algorithm checks exactly that condition. Each valid continuation byte reduces `remaining` by one.

If any required continuation byte is missing or malformed, the algorithm returns `False`.

After all bytes are processed, the encoding is valid only if no continuation bytes are still required. Therefore returning `remaining == 0` is correct.

## Complexity

Let `n = len(data)`.

| Metric | Value | Why |
|---|---|---|
| Time | `O(n)` | Each byte is processed once |
| Space | `O(1)` | Only a counter is stored |

## Implementation

```python
class Solution:
    def validUtf8(self, data: list[int]) -> bool:
        remaining = 0

        for num in data:
            byte = num & 0xFF

            if remaining == 0:
                if byte >> 7 == 0b0:
                    remaining = 0
                elif byte >> 5 == 0b110:
                    remaining = 1
                elif byte >> 4 == 0b1110:
                    remaining = 2
                elif byte >> 3 == 0b11110:
                    remaining = 3
                else:
                    return False
            else:
                if byte >> 6 != 0b10:
                    return False

                remaining -= 1

        return remaining == 0
```

## Code Explanation

We start with no pending continuation bytes:

```python
remaining = 0
```

Each integer is reduced to one byte:

```python
byte = num & 0xFF
```

When no continuation bytes are pending, the current byte must begin a new character:

```python
if remaining == 0:
```

A byte beginning with `0` is a complete 1-byte character:

```python
if byte >> 7 == 0b0:
    remaining = 0
```

A byte beginning with `110` starts a 2-byte character, so one continuation byte is needed:

```python
elif byte >> 5 == 0b110:
    remaining = 1
```

Similarly:

```python
elif byte >> 4 == 0b1110:
    remaining = 2
elif byte >> 3 == 0b11110:
    remaining = 3
```

Any other starting pattern is invalid:

```python
else:
    return False
```

When continuation bytes are pending, the byte must begin with `10`:

```python
if byte >> 6 != 0b10:
    return False
```

Then we consume one required continuation byte:

```python
remaining -= 1
```

At the end, all characters must be complete:

```python
return remaining == 0
```

## Testing

```python
def test_solution():
    s = Solution()

    assert s.validUtf8([197, 130, 1]) is True
    assert s.validUtf8([235, 140, 4]) is False

    assert s.validUtf8([0]) is True
    assert s.validUtf8([127]) is True
    assert s.validUtf8([128]) is False

    assert s.validUtf8([194, 128]) is True
    assert s.validUtf8([194]) is False

    assert s.validUtf8([224, 160, 128]) is True
    assert s.validUtf8([240, 144, 128, 128]) is True

    assert s.validUtf8([255]) is False
    assert s.validUtf8([248, 130, 130, 130]) is False

    assert s.validUtf8([197, 130, 1, 235, 140, 4]) is False

    print("all tests passed")

test_solution()
```

Test meaning:

| Test | Why |
|---|---|
| `[197, 130, 1]` | Valid 2-byte character followed by 1-byte character |
| `[235, 140, 4]` | Invalid continuation byte |
| `[0]`, `[127]` | Valid 1-byte boundaries |
| `[128]` | Continuation byte cannot start a character |
| `[194, 128]` | Valid 2-byte character |
| `[194]` | Missing continuation byte |
| `[224, 160, 128]` | Valid 3-byte character |
| `[240, 144, 128, 128]` | Valid 4-byte character |
| `[255]` | Invalid leading byte |
| `[248, 130, 130, 130]` | UTF-8 allows at most 4-byte characters |

