Tuesday, April 12, 2016

Calculating the Median in SQL Using Window Functions

Numerous approaches to calculating the median in SQL have been presented in books and online. Here's another one that is short and simple, and makes use of window functions. This example is specific to PostgreSQL, but could be adapted to any other DBMS that supports window functions. To illustrate this method, I'll use a set of test data in a table named median_test:

create table median_test (id text, value double precision);

insert into median_test 
    (id, value)
select
    id,
    value
from
    (select 
        generate_series(1,100) as num, 
        chr(cast(random()*10 as integer)+65) as id, 
        random() as value
    ) as dd;

To support the median calculation, two columns are added using window functions, one of which is the row number of the ordered values (but doubled), and the other of which is the total number of rows. The window functions allow these to be calculated for each data frame, which is determined here by values of the id column. In this example the additional columns are added in a temporary view, but this could instead be a subquery, a common table expression, or even an actual table.

create or replace temporary view d as
select
    id,
    value,
    row_number() over(partition by id order by value) * 2 as rownum2,
    count(*) over (partition by id) as rows
from 
    median_test;

The median calculation is then carried out by averaging either the two central values when there is an even number of values, or the one central value when there is an odd number of values.

select 
    id,
    avg(value) as median
from
    d
where 
    rownum2 in (rows, rows+1, rows+2)
group by
    id;

When there is an odd number of rows, the single median value will have a value of rownum2 equal to rows+1. When there is an even number of rows, the two central rows will have values of rownum2 equal to either rows or rows+2.

No comments:

Post a Comment