SQL Escapades: OVER and PARTITION BY
On a handful of occasions I’ve needed to write a query that returns only one record per group of records. Like, needing to retrieve one customer record per address so we can send some mail.
Typically, you can accomplish these types of queries using GROUP BY
statements. Maybe something like this:
SELECT street_address, city, state
FROM customers
GROUP BY street_address, city, state
But what if we need to include the customers’ first and last names in the results? We can’t add the name field to our SELECT
because it’s not in our GROUP BY
clause. And we don’t want to run an aggregate function on the names because we can’t guarantee the functions will return a first and last name that match. This problem is that a grouping query isn’t want we want to do. We don’t want to get back aggregate data. We want to get back data specific to one customer – we just want one per address.
So what do we do?
OVER
The OVER
clause in SQL is a feature that allows you to retrieve aggregate data together with data from individual rows. For example, if you want to return all sales records and (for some reason) the total revenue of all sales with each record, you could do this:
SELECT *, sum(price) OVER() AS total_revenue
FROM sales
While this might be useful occasionally, it’s not helpful to our goal of retrieving one customer per address. We need…
PARTITION BY
The OVER
function doesn’t have to be empty, as above. It can take arguments! PARTITION BY
is one such argument that will run the aggregate function against the table grouped by the the given partitioning column.
So say we want to get revenue subtotals by year instead of a grand total for our sales records. In Postgres we could to this:
SELECT *, sum(price) OVER(PARTITION BY date_part('year', created_at)) AS monthly_subtotal
FROM sales
Putting It Together
Let’s get back to our original question of retrieving one customer per address. Have you figured it out yet? We can partition by the customer’s address! Then it’s a simple matter of applying the row_number()
aggregate function to the partitioned results to select only one customer per address.
SELECT results.*
FROM (
SELECT c.*, row_number() OVER(PARTITION BY street_address, city, state) AS row_num
FROM customers c
) results
WHERE results.row_number = 1
Voila! We’ve got one full customer record per address!
Let’s do another example to drive it home. Say we’re running a news database with “stories” that each belong to a single “category”. Let’s write a query to retrieve the five most recent story titles from each category.
SELECT title, row_number() OVER(PARTITION BY category_id ORDER BY created_at DESC) AS row_num
FROM stories
WHERE row_num <= 5
I’d love to see other takes on how to do these queries, so please contact me if you’ve got other ideas!