A clear explanation of Find Median Given Frequency of Numbers using cumulative frequency and SQL window functions.
Problem Restatement
We are given a Numbers table.
| Column | Type | Meaning |
|---|---|---|
num | int | The number value |
frequency | int | How many times this number appears |
Each row stores a number and its frequency.
Conceptually, the table represents a decompressed list of numbers. For example, if num = 0 and frequency = 7, then 0 appears seven times in the full list.
We need to report the median of all numbers after decompression.
The result column must be named:
medianThe LeetCode statement asks to round the median to one decimal point.
Input and Output
| Item | Meaning |
|---|---|
| Input | Numbers table |
| Output | Median of the decompressed numbers |
| Output column | median |
| Avoid | Expanding frequency rows into many rows |
Expected output:
medianExample
Input:
| num | frequency |
|---|---|
| 0 | 7 |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
The decompressed list is:
0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3There are 12 values.
The middle positions are 6 and 7 if we use 1-based indexing.
Both values are 0.
So the median is:
0.0Output:
| median |
|---|
| 0.0 |
First Thought: Decompress the Table
A direct idea is to expand every row into frequency rows.
For example:
| num | frequency |
|---|---|
| 2 | 3 |
would become:
| value |
|---|
| 2 |
| 2 |
| 2 |
Then we could sort the expanded rows and find the middle value.
This is easy to understand, but it is not a good database solution. Frequencies can be large, so the expanded table may be much larger than the original table.
We should compute the median from the frequency distribution directly.
Key Insight
After sorting by num, each row covers a range of positions in the decompressed list.
For the example:
| num | frequency | Position range |
|---|---|---|
| 0 | 7 | 1 through 7 |
| 1 | 1 | 8 through 8 |
| 2 | 3 | 9 through 11 |
| 3 | 1 | 12 through 12 |
The median position or positions must fall inside one or two of these ranges.
So the task becomes:
- Compute cumulative frequency by ascending
num. - Compute the total frequency.
- Select the row or rows whose cumulative range contains the median position.
- Average their
numvalues.
Cumulative Frequency
For each row, define:
running_frequency = SUM(frequency) OVER (ORDER BY num)This is the last decompressed position occupied by the current num.
The first position occupied by the current row is:
running_frequency - frequency + 1So the row covers:
running_frequency - frequency + 1
through
running_frequencyMedian Position Rule
Let:
total = SUM(frequency) OVER ()For an odd total, there is one median position.
For an even total, there are two middle positions.
A compact SQL trick is to select rows where:
total / 2.0 BETWEEN running_frequency - frequency AND running_frequencyThis works because the boundary expression identifies the row or rows crossing the middle of the distribution.
Another common equivalent approach computes cumulative sums from both directions and keeps rows that overlap the center. The Doocs solution uses ascending and descending cumulative frequencies for this idea.
Algorithm
- Sort rows by
num. - Compute:
- Running cumulative frequency.
- Total frequency.
- Keep rows whose frequency range contains the middle of the decompressed data.
- Return the average of those
numvalues. - Round to one decimal point.
Correctness
Sorting by num gives the same order as the decompressed sorted list.
For each row, the cumulative frequency tells us the final position of that number in the decompressed sorted list. Subtracting frequency gives the position just before that number’s block begins.
The median is determined only by the middle position or two middle positions in this sorted decompressed list.
The query keeps exactly the number row or rows whose cumulative frequency interval crosses the middle. If the total count is odd, one row is selected. If the total count is even and the two middle values are different, two rows are selected. If the two middle values are the same, one row may still represent both middle positions.
Averaging the selected num values therefore gives the median. Rounding formats it as required.
Complexity
Let n be the number of rows in Numbers.
| Metric | Value | Why |
|---|---|---|
| Time | O(n log n) | The rows are sorted by num for the window function |
| Space | O(n) | The database may materialize window results |
This avoids creating SUM(frequency) decompressed rows.
Implementation
WITH NumbersMetadata AS (
SELECT
num,
frequency,
SUM(frequency) OVER (
ORDER BY num
) AS running_frequency,
SUM(frequency) OVER () AS total_frequency
FROM Numbers
)
SELECT
ROUND(AVG(num), 1) AS median
FROM NumbersMetadata
WHERE total_frequency / 2.0
BETWEEN running_frequency - frequency
AND running_frequency;Code Explanation
The CTE computes metadata for every number:
WITH NumbersMetadata AS (...)The running frequency gives the right edge of the number’s interval in the decompressed sorted list:
SUM(frequency) OVER (
ORDER BY num
) AS running_frequencyThe total frequency gives the decompressed list length:
SUM(frequency) OVER () AS total_frequencyThe WHERE clause keeps the row or rows that contain the median boundary:
WHERE total_frequency / 2.0
BETWEEN running_frequency - frequency
AND running_frequencyFinally, the query averages the selected values and rounds the result:
ROUND(AVG(num), 1) AS medianAlternative Two-Direction Window Solution
This version computes cumulative frequency from both sides.
WITH t AS (
SELECT
num,
frequency,
SUM(frequency) OVER (ORDER BY num ASC) AS left_count,
SUM(frequency) OVER (ORDER BY num DESC) AS right_count,
SUM(frequency) OVER () AS total_count
FROM Numbers
)
SELECT
ROUND(AVG(num), 1) AS median
FROM t
WHERE left_count >= total_count / 2
AND right_count >= total_count / 2;The condition keeps numbers that lie across the center of the decompressed sorted distribution.
Testing
Create sample data:
CREATE TABLE Numbers (
num INT PRIMARY KEY,
frequency INT
);
INSERT INTO Numbers (num, frequency) VALUES
(0, 7),
(1, 1),
(2, 3),
(3, 1);Expected result:
| median |
|---|
| 0.0 |
Additional cases:
| Numbers table | Expected median | Why |
|---|---|---|
(1, 1), (2, 1), (3, 1) | 2.0 | Odd total count |
(1, 1), (2, 1) | 1.5 | Even count, two different middle values |
(1, 2), (2, 2) | 1.5 | Middle positions split across two values |
(5, 10) | 5.0 | Only one distinct number |
(1, 1), (10, 1), (100, 1), (1000, 1) | 55.0 | Average of 10 and 100 |