SQL Escapades: OVER and PARTITION BY
On a handful of occasions I’ve needed to write a query that returns only one record per group of records. Like, needing to retrieve one customer record per address so we can send some mail.
Typically, you can accomplish these types of queries using
GROUP BY statements. Maybe something like this:
SELECT street_address, city, state FROM customers GROUP BY street_address, city, state
But what if we need to include the customers’ first and last names in the results? We can’t add the name field to our
SELECT because it’s not in our
GROUP BY clause. And we don’t want to run an aggregate function on the names because we can’t guarantee the functions will return a first and last name that match. This problem is that a grouping query isn’t want we want to do. We don’t want to get back aggregate data. We want to get back data specific to one customer – we just want one per address.
So what do we do?
OVER clause in SQL is a feature that allows you to retrieve aggregate data together with data from individual rows. For example, if you want to return all sales records and (for some reason) the total revenue of all sales with each record, you could do this:
SELECT *, sum(price) OVER() AS total_revenue FROM sales
While this might be useful occasionally, it’s not helpful to our goal of retrieving one customer per address. We need…
OVER function doesn’t have to be empty, as above. It can take arguments!
PARTITION BY is one such argument that will run the aggregate function against the table grouped by the the given partitioning column.
So say we want to get revenue subtotals by year instead of a grand total for our sales records. In Postgres we could to this:
SELECT *, sum(price) OVER(PARTITION BY date_part('year', created_at)) AS monthly_subtotal FROM sales
Putting It Together
Let’s get back to our original question of retrieving one customer per address. Have you figured it out yet? We can partition by the customer’s address! Then it’s a simple matter of applying the
row_number() aggregate function to the partitioned results to select only one customer per address.
SELECT results.* FROM ( SELECT c.*, row_number() OVER(PARTITION BY street_address, city, state) AS row_num FROM customers c ) results WHERE results.row_number = 1
Voila! We’ve got one full customer record per address!
Let’s do another example to drive it home. Say we’re running a news database with “stories” that each belong to a single “category”. Let’s write a query to retrieve the five most recent story titles from each category.
SELECT title, row_number() OVER(PARTITION BY category_id ORDER BY created_at DESC) AS row_num FROM stories WHERE row_num <= 5
I’d love to see other takes on how to do these queries, so please contact me if you’ve got other ideas!