Groups and Subqueries

This reading covers two important topics: computing summary values on multiple rows by grouping the rows, and nesting queries inside of queries. Of course, the two can be used in combination as well.

We'll also look at a few other topics.

The LIKE Operator

We can test for equality easily:

select * from person where name = 'George Clooney';

But looking for everyone named "George" might be trickier. However, MySQL has both full-blown regular expressions (which we might discuss later in the course) and simple wildcard characters. This is triggered by using the LIKE operator instead of =:

select * from person where name LIKE 'George%';
select * from move where title LIKE '%Potter%';

The % character is a wildcard: it matches zero or more of any character. Thus, the first expression matches any string starting with 'George'. The second expression matches any string that contains 'Potter': it starts with any string of characters, has Potter next, and concludes with any string of characters.

Distinct

Before we get into grouping, let's look at a related notion: finding the distinct values of something. For example, there are over a thousand movies in our database. We could ask for the release year of each and get over a thousand results (years). Obviously, there would be lots of repeats. For example, as of this writing, there are 8 movies in our database released in 1983. There are even more repeats of more recent years.

select `release` from movie;

Suppose we just want to know the different release years in the database, without repetitions. SQL provides an easy way to do that, using the distinct keyword:

select distinct `release` from movie;

The first query (above) returns results with repetitions. The second only reports each different value once.

Reserved Words

Every programming language has reserved words: special keywords of the language. In Python, these are words like if, for or def. In Java, they are similar (but not def). SQL is no different.

As it turns out, they changed the language with release 4.0 such that release became a reserved word; it hadn't been before, and I had already built the WMDB and I didn't want to rename the column (and fix all my examples).

But SQL has a very cool feature where if you put a reserved word in backquotes (backticks — see the tilde key on your keyboard), you can use it as a regular word, for columns and such.

So, in order to use release as a normal column name, we have to put it in backquotes like this:

select `release` from movie;

Groups

So far, we've used the SELECT statement to retrieve rows. In this section, we'll discuss grouping the rows, and returning just one representative per group.

Imagine a database of sales data:

regiondatesales
south2019-09-09$11,000
south2019-10-01$9,000
east2019-09-09$7,000
north2019-09-13$15,000
...

We can easily imagine asking questions like:

  • what was the average sales for each region?
  • what was the total sales for each month?

The idea is if we just have four regions (north, south, east and west) the answer to the first question would be a table with one row per region, and the average of the sales column, regardless of how many rows are averaged. Similarly, the answer to the second question might be a table with 12 rows (one for each month) and the total sales, summed over all rows falling in that month.

GROUP BY

The MySQL GROUP BY clause of the SELECT statement gives us this ability. It groups all the selected rows by the distinct values in the grouping condition (like the four values for 'region' or the 12 values for the month), and then aggregating the underlying rows of the group to yield an aggregate value. For example, the first question above might be answered with a query like this:

SELECT region, avg(sales)
FROM sales_data
GROUP BY region;

That says to report the region and to average the sales column for all the rows in each group, grouping by all the distinct values of the 'region' column. Even if the table has a million rows in it, there will be only four rows in the result, with the database doing all the arithmetic for us.

Here's the solution to the other question:

SELECT month(date), sum(sales)
FROM sales_data
GROUP BY month(date);

Here, we are using a date function to extract the month number from each date, and grouping by all the distinct values of that.

Note that we can combine the GROUP BY clause with other clauses. It comes after the WHERE clause. So, if we are only interested in average sales in December, by region:

SELECT region, avg(sales)
FROM sales_data
WHERE month(date) = 12
GROUP BY region

Suppose we are interested in how many sales there were. We can also count the rows in the group:

SELECT region, avg(sales), count(sales)
FROM sales_data
WHERE month(date) = 12
GROUP BY region

Uniformity Across a Group

Imagine the following query, which counts the number of recent movies in our database, grouped by year.

select `release`, count(*) 
from movie
where `release` > 2010
group by `release`;

The output is something like this:

release count(*)
2011    40
2012    41
2013    47
2014    38
2015    45
2016    40
2017    25
2018    31
2019    26
2020    7

So, the database has 40 movies released in 2011, and 41 released in 2012 and so forth. This is a perfectly sensible and useful query.

Let's add a column to it:

select title, `release`, count(*) 
from movie
where `release` > 2010
group by `release`;

This query asks, in addition to the release year and how many are in the group, for the title of that movie. Hunh? What movie?

How can the group of 40 movies from 2011 have a title?

This query makes no sense. Indeed, many versions of MySQL forbid this kind of query; you'll get a run-time error if you try it. Ours is not configured that way. Instead, it returns a value of an arbitrary member of the group (possibly the first). So we get output like this:

title        release count(*)
Jane Eyre       2011    40
Lincoln         2012    41
The Palace      2013    47
Interstellar    2014    38
Jurassic World  2015    45
Dreamland       2016    40
Wonder Woman    2017    25

The fact that you are allowed to do this doesn't mean you should.

When grouping, the columns should values that are uniform (the same) for every member of the group.

Aggregate Functions

So far, we've computed properties of a group with functions like count() and avg(). These are called aggregate functions, since they work on a group or aggregate of rows.

The most popular are:

  • count()
  • sum()
  • avg()
  • min()
  • max()

There are many others. You can learn more about aggregate functions

HAVING

The WHERE clause lets us select or filter rows but sometimes we want to filter groups. To do that, we can use a HAVING clause.

For example, if we are only interested in sales averages by month only if there have been at least 10 sales that month, we could do this:

SELECT month(date),sum(sales)
FROM sales_data
GROUP BY month(date)
HAVING count(sales)>=10;

In general, the syntax is:

SELECT cols
FROM tables
WHERE boolean
GROUP BY something
HAVING boolean on group

WHERE versus HAVING

A lot of people get confused about WHERE versus HAVING, because they are both filters and we put in boolean conditionals. However, remember that a group has properties that individual rows don't: properties like count, sum, average and so forth. A single row of our sales_data table doesn't have a count (or it's vacuously 1). The HAVING clause is intended to filter groups and so it comes after the GROUP BY clause.

Sub-Queries

The SQL standard defines a syntax for nested queries or subqueries. Indeed, that's where the term "Structured Query Language" comes from: so far, our language has been describable by a regular expression, but now we can put queries inside queries, so we'll need context-free grammars to describe the language.

Of course, we don't really care what class of formal language SQL falls into. We care about expressivity. There are queries we can't easily express without this ability. Try the following:

Actors that have at least 2 actor credits:

Take a minute to try to wrap your mind around this:

-- actors who have at least two credits

use wmdb;

select nm, name
from person
where nm in (
    select nm
    from credit
    group by nm
        having count(*) >= 2);

The inner query (parenthesized SELECT statement) returns the NM of people who have two or more acting credits. The outer query returns the nm and name of anyone whose NM is returned by the inner query. We report both NM and name in case there are two actors with the same name who both have more than two credits.

Movies directed by people older than 50:

Here's a complex nested query:

-- Movies directed by people over 50

use wmdb;

select title
from movie
where director in (
    select nm
    from person
    where birthdate is not null
          and birthdate < date_sub(current_date(),
                                   INTERVAL 50 YEAR));

This is scary looking, but let's take it one part at a time. The inner query finds the NM of all people over 50. (The date_sub function takes today's date and subtracts 50 years. People whose birthdate is before that date are older than 50.) Think of it as a list of NM values. George Clooney is older than 50 and if he directed a movie, his NM of 123 would be in that list.

The outer query just finds movies whose director ID is in the list of NM values returned by the inner query.

Note the use of the date functions here, so this query continue to work over time. You can look them up at MySQL Reference Manual, 12.5 Date and Time Functions.

Note also the removal of the people with a NULL birthdate.

Subquery Use and Syntax

The basic rules for subqueries are these. We'll see examples afterward.

  • If the subquery returns a scalar (a single value), you can use it wherever the scalar can be.
  • If the subquery returns a column of data, you can look through that column using IN or NOT IN
  • If the subquery returns a table of data, you must give an alias (temporary name) for that data and then use it like any table.

Returning a Scalar (single) Value

The following uses a subquery to look up Peter Jackson's NM (a scalar value) and then uses that NM in the main query.

use wmdb;

select *
from movie
where director = (select nm from person
                  where name='Peter Jackson');

However, if the subquery returns more than one value, you can't successfully compare a single value with multiple values, so you get no results. As an example, we look up the NM value of people with names starting with "John" and, since that returns multiple values, we get no results.

use wmdb;

select *
from movie
where director = (select nm from person
                  where name like 'John%');

(Indeed, the first query wouldn't work if there were more than one person named Peter Jackson in our database.)

Returning a Single Column

Here we return the TT values of every row in the credit table. Since all of those entries are acting credits, those TT values are movies with at least one actor (that the database knows about). Therefore, the outer query gives all data about movies that have at least one known actor.

use wmdb;

select *
from movie
where tt in (select tt from credit);

What's cool about that is that we can easily invert the logic, in order to list movies without actors. (Maybe the movie is early in production and hasn't hired anyone yet, or maybe the database is incomplete.)

use wmdb;

select *
from movie
where tt not in (select tt from credit);

EXISTS and NOT EXISTS

Subqueries allow you to use a new operator in the WHERE clause: EXISTS and NOT EXISTS. For example, here's how to find out people who don't have movie credits. The inner subquery finds acting credits for the person in the outer query. The condition asks that no such acting credits exist. Thus, the person doesn't have acting credits. (Maybe they are a director.)

use wmdb;

-- people who are not actors (have no acting credits)

select name
from person
where not exists
    (select *
     from credit
     where credit.nm = person.nm);

Remember the EXISTS and NOT EXISTS operators don't care what is returned by the query; they just care about whether zero rows are returned.

Correlated Sub-queries

The inner query in the last example is an important variation on our examples. Most of our examples of subqueries have stood alone and can be tested by themselves. Like this:

SELECT tt FROM credit;

But the subquery in the last section uses a value from the outer query:

SELECT tt
FROM credit
WHERE credit.nm = person.nm

That subquery can't be tested by itself, since it refers to a value from the outer query. This is called a correlated subquery. They are perfectly kosher, but just a little harder to test.

Subqueries that Return Tables

If the subquery returns more than one column, you can use it in the FROM clause as if it were a table. In MySQL, they must have a temporary name or alias, which we do with the AS keyword. Here, we've used T1 as the temporary name of the result of the subquery.

-- titles of movies added by scott (id of 1)
-- along with the name of the director

use wmdb;

select name,title
from person, (select title,director
              from movie where addedby=1) as t1
where nm=t1.director;

Subquery versus Join

Subqueries can do many things that Joins can do, and vice versa. In fact, MySQL didn't add subqueries to its capabilities until version 4 (as I recall). I started using it back in version 3, which is how I ended up with the problem with release becoming a reserved word.

Consider trying to find:

Actors who were directed by Clint Eastwood

Here's a solution using JOIN:

-- actors who have been directed by Clint Eastwood,
-- implemented as a join

use wmdb;

select distinct p1.name
from person as p1,credit,movie,person as p2
where p1.nm=credit.nm
    and credit.tt=movie.tt
    and movie.director=p2.nm
    and p2.name = 'Clint Eastwood';

We used DISTINCT here, because if someone was directed by Clint more than once, we only want to return them once. (For example, if we changed the query to talk about Peter Jackson movies, Orlando Bloom was in all three of the Lord of the Rings movies, but we only want to list him once.)

Here's a solution using subqueries:

-- actors who have been directed by Clint Eastwood,
-- implemented using subqueries

use wmdb;

select name
from person
where nm in (
    select nm
    from credit
    where tt in (
        select tt
        from movie
        where director = (
            select nm
            from person
            where name = 'Clint Eastwood')));

Here, we don't need the DISTINCT keyword, because the outer query is just referring to the Person table, where a person only occurs once. If they were directed by Clint more than once, their NM would appear multiple times in the results of the subquery, but that wouldn't affect the number of times they were printed by the outer query. (Again, think of Orlando Bloom in Peter Jackson movies.)

Which kind of query should you choose? You should choose the query that is clearest and easiest to understand and get right. Let the database worry about which is more efficient.

However, there can be other considerations. For example, if we want to know the titles of the Clint Eastwood movies the actors were in, the version with JOIN can easily do that, since the Movie table is part of the outer query. The version with subqueries can be adapted to do that as well (changing some of the subqueries to return more information), but the adaptation process is a little harder.

Negating / Complementing Queries

One advantage of subqueries is that it's usually easier to find the complement of a set using subqueries. So, to find actors who were not directed by Clint Eastwood (that is, no movie they every acted in was directed by Clint), you can negate the subquery that returns the list of Clint Eastwood movies. It's harder to negate the query using join.

Summary

  • rows can be grouped by column values using a GROUP BY clause
  • all values must be uniform over the group
  • groups can be filtered using a HAVING clause
  • subqueries allow SQL queries to be nested
  • The inner query can return
    • a single value (a scalar)
    • a column, searched by IN or NOT IN
    • several columns, used as a temporary table, with an alias