A clear explanation of the Word Frequency shell problem using Unix text-processing tools.
## Problem Restatement
We are given a text file named `words.txt`.
We need to write a Bash script that counts how many times each word appears in the file.
The output should list each word followed by its frequency, sorted from highest to lowest.
The problem assumes that:
| Rule | Meaning |
|---|---|
| Characters | The file contains only lowercase letters and spaces |
| Words | Each word contains lowercase letters only |
| Separators | Words are separated by one or more whitespace characters |
| Ties | Frequency counts are unique, so tie handling is not needed |
Example input:

```
the day is sunny the the
the sunny is is
```

Expected output:

```
the 4
is 3
sunny 2
day 1
```

## Input and Output
| Item | Meaning |
|---|---|
| Input | A file named `words.txt` |
| Output | Each word and its frequency |
| Order | Descending by frequency |
| Language | Bash shell script |
## Key Insight
The important detail is that `uniq -c` only counts adjacent duplicate lines.
So before counting, we must make equal words adjacent.
That gives us the pipeline:
- Put each word on its own line.
- Sort the words alphabetically.
- Count adjacent equal words.
- Sort the counts in descending order.
- Print the word first, then the count.
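The five steps map one-to-one onto commands. Here is a sketch that prints the intermediate result after each stage, using an inline sample string rather than `words.txt` (the sample is my own, chosen so that all counts are distinct):

```shell
sample='the the the is is day'

# Steps 1+2: one word per line, then sorted so equal words are adjacent.
echo "$sample" | tr -s ' ' '\n' | sort

# Step 3: count adjacent duplicates (count first, word second).
echo "$sample" | tr -s ' ' '\n' | sort | uniq -c

# Steps 4+5: descending numeric sort, then swap the columns.
echo "$sample" | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'
# the 3
# is 2
# day 1
```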
## Algorithm
Start with the file:

```bash
cat words.txt
```

Convert spaces into newlines:

```bash
tr -s ' ' '\n'
```

The `-s` option squeezes repeated spaces, so multiple spaces behave like one separator.
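A quick demonstration of the squeezing behavior (a sketch using `printf` input):

```shell
# Without -s, each repeated space would become its own newline, producing
# empty lines; -s collapses each run of spaces into a single separator.
printf 'a  b   c\n' | tr -s ' ' '\n'
# a
# b
# c
```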
Then sort the words:
sortNow equal words are next to each other.
Count repeated adjacent words:
uniq -cThis produces output like:
1 day
3 is
2 sunny
4 theSort by count in descending numeric order:
sort -nrFinally, swap the columns so the word appears before the count:
awk '{ print $2, $1 }'Correctness
After `tr -s ' ' '\n'`, every word from the file appears on its own line.
After `sort`, all equal words are grouped together, so each group contains exactly the occurrences of one word.
`uniq -c` counts the size of each group, so it computes the correct frequency for every distinct word.
`sort -nr` then orders the counted rows by descending numeric frequency.
Finally, `awk '{ print $2, $1 }'` changes the display format from `count word` to `word count`.
Therefore, the pipeline prints every word with its correct frequency in descending order of frequency.
## Complexity
Let n be the number of words in the file.
| Metric | Value | Why |
|---|---|---|
| Time | O(n log n) | Sorting dominates the pipeline |
| Space | O(n) | Sorting may need storage proportional to the input size |
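Since sorting all n words dominates, it is worth noting a hash-based variant: counting in an awk associative array is O(n), and only the k distinct words need sorting afterwards. This is a sketch of an alternative, not part of the original pipeline:

```shell
# Count words in an associative array (O(n) in the number of words),
# then sort only the k distinct "word count" pairs by the numeric
# count in field 2, descending.
tr -s ' ' '\n' < words.txt |
  awk 'NF { count[$1]++ } END { for (w in count) print w, count[w] }' |
  sort -k2,2nr
```

The `NF` guard skips any empty lines that `tr` might emit, so they are never counted as words.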
## Implementation
```bash
# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'
```

## Code Explanation
The command starts by reading the file:

```bash
cat words.txt
```

Then it normalizes the input so each word appears on a separate line:

```bash
tr -s ' ' '\n'
```

Next, it sorts the words:

```bash
sort
```

This step is required because `uniq -c` only counts duplicate lines that are next to each other.

Then it counts each group:

```bash
uniq -c
```

The result has the count first and the word second.

Then it sorts by frequency:

```bash
sort -nr
```

The `-n` option means numeric sort.
The `-r` option means reverse order, so larger counts come first.

Finally:

```bash
awk '{ print $2, $1 }'
```

prints the second field first, then the first field.

So this:

```
4 the
```

becomes:

```
the 4
```

## Testing
Create a sample file:

```bash
cat > words.txt << 'EOF'
the day is sunny the the
the sunny is is
EOF
```

Run the solution:

```bash
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'
```

Expected output:

```
the 4
is 3
sunny 2
day 1
```

Another test with repeated spaces:

```bash
cat > words.txt << 'EOF'
a  b   a
c  b a
EOF
```

Expected output:

```
a 3
b 2
c 1
```
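The checks above can be automated with a simple comparison against the expected output. A small sketch (the PASS/FAIL labels are my own, not part of the problem):

```shell
# Build the first sample file and compare the pipeline's output
# against the expected word frequency list.
cat > words.txt << 'EOF'
the day is sunny the the
the sunny is is
EOF

actual=$(cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }')
expected='the 4
is 3
sunny 2
day 1'

[ "$actual" = "$expected" ] && echo PASS || echo FAIL
```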