Does GROUP BY only provide distinct rows? Can there be rows with same information?

One row per group. If we group the students in this class by class year, we get 2 groups: juniors and seniors.

Each row might have, say, average GPA or average height.

So the HAVING clause can only be used if the GROUP BY clause is being used?

Correct

Do we only use aggregate functions in queries that use GROUP BY?

We can also compute an aggregate function the base case group of everyone, so we could compute the average GPA or height of people in this class by just not grouping.

Can you explain again what "all values must be uniform over the group" means?

Great question, because it's something people are often confused by. Suppose we group the students in this class by class year, so two groups: juniors and seniors. What information could we report? We could report the size of the group (the count), the average/min/max of various quantities.

How about things like name and birthdate? What is the birthdate of the juniors? What would that mean?

It's not meaningful because "birthdate" is not uniform over the group.

We can ask questions like class color, because that is uniform over the group.

In MySQL, sometimes that gives an error and sometimes that gives the birthdate of an arbitrary member of the group. Our version of MySQL does the latter, which is not ideal.

Can you use GROUP BY with multiple columns?

Yes. So if we group by class year and then by birth month, we could get 2*12 or 24 groups.

Could we maybe see an example of WHERE versus HAVING? / I'm still confused on the difference between HAVING and WHERE... in the sales example I could see myself using WHERE in place of HAVING.

Good question. Let's look back at the section on having

WHERE clauses look at a single row at a time. HAVING clauses look at a single GROUP at a time.

If there were a column for "COUNT" in the table, then, sure, we can use it in the WHERE clause. Maybe the SOUTH region had 17 sales that totaled $11,000 on 2019-09-09. We can use that.

But otherwise, there isn't a COUNT property of a single row of the sales table. (Or it's vacuously equal to 1.)

But a GROUP BY clause can group all the sales, say by region, and then report the COUNT or number of sales in each region.

If we are only interested in regions that have more than 10 sales, we can use a HAVING clause to do that.

The HAVING clause is mostly used with Aggregate functions, since those are properties of the GROUP. But let's pick a property like HEIGHT. Can a person have an average height? Sorta, but not really.

But groups can have an average height. And we could use the HAVING clause to choose groups whose average height has some property.

select name
from person
where not exists
    (select *
     from credit
     where credit.nm = person.nm);

For the query above, why don't we have to give the subquery a temporary name or alias with AS, since it returns more than one column? Is this only required with FROM?

Right. In this case, we don't need to give it a name, because it's used only by the NOT EXISTS.

Could you review the code from the 'subqueries that return tables' section? I think I'm confused on this because I'm not sure why we need aliases for tables / where to use them. / I would appreciate the opportunity to go over subqueries that return tables as well as joins vs. subqueries with you during class.

Sure. Here's the subquery with table example.


select name,title
from person, (select title,director
              from movie where addedby=1) as t1
where nm=t1.director;

Here, we need an alias (T1) because we're going to refer to it in the WHERE clause.

Let's think about T1: those are titles and director ID of movies that Scott added. It's a subset of the whole Movie table.

The outer query then joins that with the Person table to provide the name of the director.

You could also do that without a subquery:


    SELECT name, title
    FROM person INNER JOIN movie ON (person.nm=movie.director)
    WHERE movie.addedby = 1;

Note (and this is important), that we can't just say addedby=1 in the WHERE clause, because there are two addedby columns: one in person and one in movie, and the query is about movies added by Scott, whether or not the director was added by Scott.

We could, however, have abbreviated the person.nm=movie.director to nm=director, since both of those columns are unique in the join.

Can you give an example of when to use or avoid joins compared to subqueries? / Can you please explain more about when it is best to use a join and subqueries? Could you expand upon how to choose between subqueries and join with some examples? Is there any real advantage of using a subquery over join or vice-versa? How does one decide where to use what in cases where both can be applicable?

Use them when that's how you think about the query. The importance is about clarity, not efficiency.

SQL is a fourth generation programming language, where you say what you want rather than how to compute it. (Though definitions vary.)

Could you give another example for what negating/complementing a query means? How is it harder to negate using join over subqueries? / Can we go over an example of how to find the complement of a data set using the idea of subqueries?

Usually, this arises where the subquery finds a set of entities fitting some condition and the outer query uses (as part of its criteria) whether something is IN or NOT IN that set.

Let's use the queries above, modifying them a little. So, now the query is "names of people who directed movies added by Scott." Using subqueries:


select name
from person
where nm IN (select director from movie where addedby=1);

The complement (names of people who did NOT direct a movie added by Scott):


select name
from person
where nm NOT IN (select director from movie where addedby=1);

With Joins:


    SELECT name
    FROM person INNER JOIN movie ON (person.nm=movie.director)
    WHERE movie.addedby <> 1;

What's wrong with that? What if "Alice" directed two movies: one added by Scott and the other not added by Scott.

Complementing using joins is left as an exercise....

Can you "stack" aggregate functions, i.e. find the minimum of a set of averages that were calculated from a table.

Yes!

Does using subqueries work like nested loops and cause a longer run time?

Sometimes.

Why do we let the database worry about efficiency? Is it not big of a difference or do we care about that later?

We are going to trust the query optimizer.