Groups and Subqueries
This reading covers two important topics: computing summary values on multiple rows by grouping the rows, and nesting queries inside of queries. Of course, the two can be used in combination as well.
We'll also look at a few other topics.
- The LIKE Operator
- Distinct
- Reserved Words
- GROUP BY
- Uniformity Across a Group
- Aggregate Functions
- HAVING
- WHERE versus HAVING
- Actors that have at least 2 actor credits:
- Movies directed by people older than 50:
- Subquery Use and Syntax
- Correlated Sub-queries
- Subqueries that Return Tables
- Subquery versus Join
- Negating / Complementing Queries
- Summary
The LIKE Operator¶
We can test for equality easily:
select * from person where name = 'George Clooney';
But looking for everyone named "George" might be trickier. However, MySQL
has both full-blown regular expressions (which we might discuss later
in the course) and simple wildcard characters. This is triggered by
using the LIKE
operator instead of =
:
select * from person where name LIKE 'George%';
select * from move where title LIKE '%Potter%';
The %
character is a wildcard: it matches zero or more of any
character. Thus, the first expression matches any string starting with
'George'. The second expression matches any string that contains
'Potter': it starts with any string of characters, has Potter next,
and concludes with any string of characters.
Distinct¶
Before we get into grouping, let's look at a related notion: finding
the distinct values of something. For example, there are over a
thousand movies in our database. We could ask for the release
year
of each and get over a thousand results (years). Obviously, there
would be lots of repeats. For example, as of this writing, there are 8
movies in our database released in 1983. There are even more repeats
of more recent years.
select `release` from movie;
Suppose we just want to know the different release years in the
database, without repetitions. SQL provides an easy way to do that,
using the distinct
keyword:
select distinct `release` from movie;
The first query (above) returns results with repetitions. The second only reports each different value once.
Reserved Words¶
Every programming language has reserved words: special keywords of
the language. In Python, these are words like if
, for
or def
. In
Java, they are similar (but not def
). SQL is no different.
As it turns out, they changed the language with release 4.0 such that
release
became a reserved word; it hadn't been before, and I had
already built the WMDB and I didn't want to rename the column (and fix
all my examples).
But SQL has a very cool feature where if you put a reserved word in backquotes (backticks — see the tilde key on your keyboard), you can use it as a regular word, for columns and such.
So, in order to use release
as a normal column name, we have to put
it in backquotes like this:
select `release` from movie;
Groups
So far, we've used the SELECT statement to retrieve rows. In this section, we'll discuss grouping the rows, and returning just one representative per group.
Imagine a database of sales data:
region | date | sales |
---|---|---|
south | 2019-09-09 | $11,000 |
south | 2019-10-01 | $9,000 |
east | 2019-09-09 | $7,000 |
north | 2019-09-13 | $15,000 |
... |
We can easily imagine asking questions like:
- what was the average sales for each region?
- what was the total sales for each month?
The idea is if we just have four regions (north, south, east and west) the answer to the first question would be a table with one row per region, and the average of the sales column, regardless of how many rows are averaged. Similarly, the answer to the second question might be a table with 12 rows (one for each month) and the total sales, summed over all rows falling in that month.
GROUP BY¶
The MySQL GROUP BY clause of the SELECT statement gives us this ability. It groups all the selected rows by the distinct values in the grouping condition (like the four values for 'region' or the 12 values for the month), and then aggregating the underlying rows of the group to yield an aggregate value. For example, the first question above might be answered with a query like this:
SELECT region, avg(sales)
FROM sales_data
GROUP BY region;
That says to report the region and to average the sales column for all the rows in each group, grouping by all the distinct values of the 'region' column. Even if the table has a million rows in it, there will be only four rows in the result, with the database doing all the arithmetic for us.
Here's the solution to the other question:
SELECT month(date), sum(sales)
FROM sales_data
GROUP BY month(date);
Here, we are using a date function to extract the month number from each date, and grouping by all the distinct values of that.
Note that we can combine the GROUP BY clause with other clauses. It comes after the WHERE clause. So, if we are only interested in average sales in December, by region:
SELECT region, avg(sales)
FROM sales_data
WHERE month(date) = 12
GROUP BY region
Suppose we are interested in how many sales there were. We can also count the rows in the group:
SELECT region, avg(sales), count(sales)
FROM sales_data
WHERE month(date) = 12
GROUP BY region
Uniformity Across a Group¶
Imagine the following query, which counts the number of recent movies in our database, grouped by year.
select `release`, count(*)
from movie
where `release` > 2010
group by `release`;
The output is something like this:
release count(*)
2011 40
2012 41
2013 47
2014 38
2015 45
2016 40
2017 25
2018 31
2019 26
2020 7
So, the database has 40 movies released in 2011, and 41 released in 2012 and so forth. This is a perfectly sensible and useful query.
Let's add a column to it:
select title, `release`, count(*)
from movie
where `release` > 2010
group by `release`;
This query asks, in addition to the release year and how many are in the group, for the title of that movie. Hunh? What movie?
How can the group of 40 movies from 2011 have a title?
This query makes no sense. Indeed, many versions of MySQL forbid this kind of query; you'll get a run-time error if you try it. Ours is not configured that way. Instead, it returns a value of an arbitrary member of the group (possibly the first). So we get output like this:
title release count(*)
Jane Eyre 2011 40
Lincoln 2012 41
The Palace 2013 47
Interstellar 2014 38
Jurassic World 2015 45
Dreamland 2016 40
Wonder Woman 2017 25
The fact that you are allowed to do this doesn't mean you should.
When grouping, the columns should values that are uniform (the same) for every member of the group.
Aggregate Functions¶
So far, we've computed properties of a group with functions like
count()
and avg()
. These are called aggregate functions,
since they work on a group or aggregate of rows.
The most popular are:
- count()
- sum()
- avg()
- min()
- max()
There are many others. You can learn more about aggregate functions
HAVING¶
The WHERE clause lets us select or filter rows but sometimes we want to filter groups. To do that, we can use a HAVING clause.
For example, if we are only interested in sales averages by month only if there have been at least 10 sales that month, we could do this:
SELECT month(date),sum(sales)
FROM sales_data
GROUP BY month(date)
HAVING count(sales)>=10;
In general, the syntax is:
SELECT cols
FROM tables
WHERE boolean
GROUP BY something
HAVING boolean on group
WHERE versus HAVING¶
A lot of people get confused about WHERE versus HAVING, because they are both filters and we put in boolean conditionals. However, remember that a group has properties that individual rows don't: properties like count, sum, average and so forth. A single row of our sales_data table doesn't have a count (or it's vacuously 1). The HAVING clause is intended to filter groups and so it comes after the GROUP BY clause.
Sub-Queries
The SQL standard defines a syntax for nested queries or subqueries. Indeed, that's where the term "Structured Query Language" comes from: so far, our language has been describable by a regular expression, but now we can put queries inside queries, so we'll need context-free grammars to describe the language.
Of course, we don't really care what class of formal language SQL falls into. We care about expressivity. There are queries we can't easily express without this ability. Try the following:
Actors that have at least 2 actor credits:¶
Take a minute to try to wrap your mind around this:
-- actors who have at least two credits
use wmdb;
select nm, name
from person
where nm in (
select nm
from credit
group by nm
having count(*) >= 2);
The inner query (parenthesized SELECT statement) returns the NM of people who have two or more acting credits. The outer query returns the nm and name of anyone whose NM is returned by the inner query. We report both NM and name in case there are two actors with the same name who both have more than two credits.
Movies directed by people older than 50:¶
Here's a complex nested query:
-- Movies directed by people over 50
use wmdb;
select title
from movie
where director in (
select nm
from person
where birthdate is not null
and birthdate < date_sub(current_date(),
INTERVAL 50 YEAR));
This is scary looking, but let's take it one part at a time. The inner
query finds the NM of all people over 50. (The date_sub
function
takes today's date and subtracts 50 years. People whose birthdate is
before that date are older than 50.) Think of it as a list of NM
values. George Clooney is older than 50 and if he directed a movie,
his NM of 123 would be in that list.
The outer query just finds movies whose director ID is in the list of NM values returned by the inner query.
Note the use of the date functions here, so this query continue to work over time. You can look them up at MySQL Reference Manual, 12.5 Date and Time Functions.
Note also the removal of the people with a NULL birthdate.
Subquery Use and Syntax¶
The basic rules for subqueries are these. We'll see examples afterward.
- If the subquery returns a scalar (a single value), you can use it wherever the scalar can be.
- If the subquery returns a column of data, you can look through that
column using
IN
orNOT IN
- If the subquery returns a table of data, you must give an alias (temporary name) for that data and then use it like any table.
Returning a Scalar (single) Value
The following uses a subquery to look up Peter Jackson's NM (a scalar value) and then uses that NM in the main query.
use wmdb;
select *
from movie
where director = (select nm from person
where name='Peter Jackson');
However, if the subquery returns more than one value, you can't successfully compare a single value with multiple values, so you get no results. As an example, we look up the NM value of people with names starting with "John" and, since that returns multiple values, we get no results.
use wmdb;
select *
from movie
where director = (select nm from person
where name like 'John%');
(Indeed, the first query wouldn't work if there were more than one person named Peter Jackson in our database.)
Returning a Single Column
Here we return the TT values of every row in the credit
table. Since
all of those entries are acting credits, those TT values are movies
with at least one actor (that the database knows about). Therefore,
the outer query gives all data about movies that have at least one
known actor.
use wmdb;
select *
from movie
where tt in (select tt from credit);
What's cool about that is that we can easily invert the logic, in order to list movies without actors. (Maybe the movie is early in production and hasn't hired anyone yet, or maybe the database is incomplete.)
use wmdb;
select *
from movie
where tt not in (select tt from credit);
EXISTS and NOT EXISTS
Subqueries allow you to use a new operator in the WHERE clause: EXISTS and NOT EXISTS. For example, here's how to find out people who don't have movie credits. The inner subquery finds acting credits for the person in the outer query. The condition asks that no such acting credits exist. Thus, the person doesn't have acting credits. (Maybe they are a director.)
use wmdb;
-- people who are not actors (have no acting credits)
select name
from person
where not exists
(select *
from credit
where credit.nm = person.nm);
Remember the EXISTS and NOT EXISTS operators don't care what is returned by the query; they just care about whether zero rows are returned.
Correlated Sub-queries¶
The inner query in the last example is an important variation on our examples. Most of our examples of subqueries have stood alone and can be tested by themselves. Like this:
SELECT tt FROM credit;
But the subquery in the last section uses a value from the outer query:
SELECT tt
FROM credit
WHERE credit.nm = person.nm
That subquery can't be tested by itself, since it refers to a value from the outer query. This is called a correlated subquery. They are perfectly kosher, but just a little harder to test.
Subqueries that Return Tables¶
If the subquery returns more than one column, you can use it in the
FROM clause as if it were a table. In MySQL, they must have a
temporary name or alias, which we do with the AS
keyword. Here,
we've used T1
as the temporary name of the result of the subquery.
-- titles of movies added by scott (id of 1)
-- along with the name of the director
use wmdb;
select name,title
from person, (select title,director
from movie where addedby=1) as t1
where nm=t1.director;
Subquery versus Join¶
Subqueries can do many things that Joins can do, and vice versa. In
fact, MySQL didn't add subqueries to its capabilities until version 4
(as I recall). I started using it back in version 3, which is how I
ended up with the problem with release
becoming a reserved word.
Consider trying to find:
Actors who were directed by Clint Eastwood
Here's a solution using JOIN:
-- actors who have been directed by Clint Eastwood,
-- implemented as a join
use wmdb;
select distinct p1.name
from person as p1,credit,movie,person as p2
where p1.nm=credit.nm
and credit.tt=movie.tt
and movie.director=p2.nm
and p2.name = 'Clint Eastwood';
We used DISTINCT here, because if someone was directed by Clint more than once, we only want to return them once. (For example, if we changed the query to talk about Peter Jackson movies, Orlando Bloom was in all three of the Lord of the Rings movies, but we only want to list him once.)
Here's a solution using subqueries:
-- actors who have been directed by Clint Eastwood,
-- implemented using subqueries
use wmdb;
select name
from person
where nm in (
select nm
from credit
where tt in (
select tt
from movie
where director = (
select nm
from person
where name = 'Clint Eastwood')));
Here, we don't need the DISTINCT keyword, because the outer query is just referring to the Person table, where a person only occurs once. If they were directed by Clint more than once, their NM would appear multiple times in the results of the subquery, but that wouldn't affect the number of times they were printed by the outer query. (Again, think of Orlando Bloom in Peter Jackson movies.)
Which kind of query should you choose? You should choose the query that is clearest and easiest to understand and get right. Let the database worry about which is more efficient.
However, there can be other considerations. For example, if we want to know the titles of the Clint Eastwood movies the actors were in, the version with JOIN can easily do that, since the Movie table is part of the outer query. The version with subqueries can be adapted to do that as well (changing some of the subqueries to return more information), but the adaptation process is a little harder.
Negating / Complementing Queries¶
One advantage of subqueries is that it's usually easier to find the complement of a set using subqueries. So, to find actors who were not directed by Clint Eastwood (that is, no movie they every acted in was directed by Clint), you can negate the subquery that returns the list of Clint Eastwood movies. It's harder to negate the query using join.
Summary¶
- rows can be grouped by column values using a
GROUP BY
clause - all values must be uniform over the group
- groups can be filtered using a
HAVING
clause - subqueries allow SQL queries to be nested
- The inner query can return
- a single value (a scalar)
- a column, searched by
IN
orNOT IN
- several columns, used as a temporary table, with an alias