A clear explanation of the Valid Phone Numbers shell problem using grep and regular expressions.
Problem Restatement
We are given a text file named file.txt.
Each line contains one phone number candidate.
We need to print only the valid phone numbers.
A valid phone number must match one of these two formats:
xxx-xxx-xxxxor:
(xxx) xxx-xxxxHere, each x is a digit.
For example:
987-123-4567
123 456 7890
(123) 456-7890The valid lines are:
987-123-4567
(123) 456-7890Input and Output
| Item | Meaning |
|---|---|
| Input | A file named file.txt |
| Output | All valid phone number lines |
| Format 1 | xxx-xxx-xxxx |
| Format 2 | (xxx) xxx-xxxx |
| Language | Bash shell script |
Examples
Input file:
987-123-4567
123 456 7890
(123) 456-7890Output:
987-123-4567
(123) 456-7890The first line is valid because it has three digits, a dash, three digits, a dash, then four digits.
The second line is invalid because it uses spaces instead of the required dash format.
The third line is valid because it has the area code inside parentheses, followed by a space, then three digits, a dash, and four digits.
First Thought: Match the Valid Pattern
This is a pattern matching problem.
We do not need to transform the file. We only need to print lines that match one of two exact regular expressions.
The two valid formats are:
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]and:
([0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]In regular expression form, we can write repeated digits more compactly as:
[0-9]{3}-[0-9]{3}-[0-9]{4}and:
\([0-9]{3}\) [0-9]{3}-[0-9]{4}The parentheses must be escaped because parentheses have special meaning in extended regular expressions.
Key Insight
We must match the whole line, not just part of the line.
That is why we use:
^for the beginning of the line, and:
$for the end of the line.
Without these anchors, a line like this could be incorrectly accepted:
abc 987-123-4567 xyzIt contains a valid-looking phone number inside the line, but the whole line is not a valid phone number.
Algorithm
Use grep with extended regular expressions.
The full pattern is:
^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$This means:
| Part | Meaning |
|---|---|
^ | Start of the line |
( | Start of a regex group |
$[0-9]{3}$ | Match (xxx) |
| ` | ` |
[0-9]{3}- | Match xxx- |
) | End of the regex group |
[0-9]{3} | Match the next three digits |
- | Match a dash |
[0-9]{4} | Match the final four digits |
$ | End of the line |
So the first group chooses between the two allowed prefixes:
(xxx) or:
xxx-Then the rest is shared:
xxx-xxxxCorrectness
The regular expression starts with ^, so matching must begin at the first character of the line.
The first group matches exactly one of the two valid prefixes: either an area code in parentheses followed by one space, or three digits followed by a dash.
After that, the expression requires three digits, one dash, and four digits.
The expression ends with $, so no extra characters may appear after the final digit.
Therefore, a line is printed if and only if the whole line has one of the two valid phone number formats.
Complexity
Let n be the number of lines in the file, and let m be the maximum line length.
| Metric | Value | Why |
|---|---|---|
| Time | O(nm) | grep checks each line against the pattern |
| Space | O(1) | The command streams the file line by line |
Implementation
# Read from the file file.txt and output all valid phone numbers to stdout.
grep -E '^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$' file.txtCode Explanation
The command uses grep -E:
grep -EThe -E option enables extended regular expressions. This lets us use {3}, {4}, grouping, and | more cleanly.
The first valid prefix is:
\([0-9]{3}\) This matches an opening parenthesis, three digits, a closing parenthesis, and one space.
The second valid prefix is:
[0-9]{3}-This matches three digits followed by a dash.
The shared suffix is:
[0-9]{3}-[0-9]{4}This matches three digits, a dash, and four digits.
The anchors:
^and:
$make sure the whole line matches the phone number format.
Testing
Create a sample file:
cat > file.txt << 'EOF'
987-123-4567
123 456 7890
(123) 456-7890
EOFRun the solution:
grep -E '^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$' file.txtExpected output:
987-123-4567
(123) 456-7890Test invalid extra characters:
cat > file.txt << 'EOF'
abc 987-123-4567
987-123-4567 xyz
987-123-4567
EOFExpected output:
987-123-4567Test both valid formats:
cat > file.txt << 'EOF'
111-222-3333
(111) 222-3333
111 222 3333
(111)-222-3333
EOFExpected output:
111-222-3333
(111) 222-3333