LeetCode 192: Word Frequency

A clear explanation of the Word Frequency shell problem using Unix text-processing tools.

Problem Restatement

We are given a text file named words.txt.

We need to write a Bash script that counts how many times each word appears in the file.

The output should show each word followed by its frequency, sorted from highest frequency to lowest frequency.

The problem assumes that:

| Rule | Meaning |
| --- | --- |
| Characters | The file contains only lowercase letters and spaces |
| Words | Each word contains lowercase letters only |
| Separators | Words are separated by one or more whitespace characters |
| Ties | Frequency counts are unique, so tie handling is not needed |

Example input:

the day is sunny the the
the sunny is is

Expected output:

the 4
is 3
sunny 2
day 1

Input and Output

| Item | Meaning |
| --- | --- |
| Input | A file named words.txt |
| Output | Each word and its frequency |
| Order | Descending by frequency |
| Language | Bash shell script |

Key Insight

The important detail is that uniq -c only counts adjacent duplicate lines.

So before counting, we must make equal words adjacent.
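
A small throwaway demo (not part of the solution) makes this concrete. Without sorting, uniq -c treats the two runs of the word "the" as unrelated; with sorting, it counts them together:

# unsorted: reports "the" twice, once per run
printf 'the\nis\nthe\n' | uniq -c

# sorted first: reports "2 the" as expected
printf 'the\nis\nthe\n' | sort | uniq -c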

That gives us the pipeline:

  1. Put each word on its own line.
  2. Sort the words alphabetically.
  3. Count adjacent equal words.
  4. Sort the counts in descending order.
  5. Print the word first, then the count.

Algorithm

Start with the file:

cat words.txt

Convert spaces into newlines:

tr -s ' ' '\n'

The -s option squeezes repeated output characters, so a run of spaces collapses into a single newline instead of producing a series of empty lines.
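
For the sample file, the first two stages (cat words.txt | tr -s ' ' '\n') produce one word per line, in file order:

the
day
is
sunny
the
the
the
sunny
is
is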

Then sort the words:

sort

Now equal words are next to each other.
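
For the sample, the sorted stream groups the duplicates:

day
is
is
is
sunny
sunny
the
the
the
the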

Count repeated adjacent words:

uniq -c

This produces output like:

1 day
3 is
2 sunny
4 the

Sort by count in descending numeric order:

sort -nr
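
For the sample, this reorders the counted rows to:

4 the
3 is
2 sunny
1 day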

Finally, swap the columns so the word appears before the count:

awk '{ print $2, $1 }'

Correctness

After tr -s ' ' '\n', every word from the file appears on its own line.

After sort, all equal words are grouped together. Therefore, each group contains exactly all occurrences of one word.

The command uniq -c counts the size of each group, so it computes the correct frequency for every distinct word.

Then sort -nr orders the counted rows by descending numeric frequency.

Finally, awk '{ print $2, $1 }' changes the display format from count word to word count. The leading blanks that uniq -c uses to pad its counts are harmless here: awk's default field splitting skips leading whitespace, so $1 is always the count and $2 the word.

Therefore, the pipeline prints every word with its correct frequency in descending frequency order.

Complexity

Let n be the number of words in the file.

| Metric | Value | Why |
| --- | --- | --- |
| Time | O(n log n) | Sorting dominates the pipeline |
| Space | O(n) | Sorting may need storage proportional to the input size |
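
The sort-based counting is simple but not asymptotically optimal. As an alternative sketch (assuming any POSIX awk), an associative array counts every word in one expected-O(n) pass, leaving only the distinct words for the final sort:

# count words with a hash table, then sort only the distinct words
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print w, count[w] }' words.txt | sort -k2,2nr

The iteration order of for (w in count) is unspecified, which is fine here: the trailing sort fixes the order, and the problem guarantees unique counts so there are no ties to break.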

Implementation

# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'
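
The leading cat is convenient for reading the pipeline left to right, but it is not required; input redirection does the same job with one fewer process:

tr -s ' ' '\n' < words.txt | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'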

Code Explanation

The command starts by reading the file:

cat words.txt

Then it normalizes the input so each word appears on a separate line:

tr -s ' ' '\n'

Next, it sorts the words:

sort

This step is required because uniq -c only counts duplicate lines that are next to each other.

Then it counts each group:

uniq -c

The result has the count first and the word second.

Then it sorts by frequency:

sort -nr

The -n option means numeric sort.

The -r option means reverse order, so larger counts come first.

Finally:

awk '{ print $2, $1 }'

prints the second field (the word) first, then the first field (the count), separated by a single space, awk's default output field separator.

So this:

4 the

becomes:

the 4

Testing

Create a sample file:

cat > words.txt << 'EOF'
the day is sunny the the
the sunny is is
EOF

Run the solution:

cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'

Expected output:

the 4
is 3
sunny 2
day 1

Another test with repeated spaces:

cat > words.txt << 'EOF'
a   b a
c b   a
EOF

Expected output:

a 3
b 2
c 1
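
As a final sanity check, a file containing a single word should produce that word with a count of 1:

printf 'hello\n' > words.txt
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'

Expected output:

hello 1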