Yes. You can compute a value from the row and then group by the values of that. We'll see an example today, where we group by the values of a boolean expression.
You can do that with ALTER TABLE. We'll do that later in the course.
Excellent point! Let's try it in the WMDB:
use wmdb; select count(*) from person; select count(*) from person where birthdate is null; select count(birthdate) from person; select count(birthdate) from person where birthdate is not null;
We could group students in this class by year, by dorm, by age, by country, by ...
Now, imagine computing a property of the group, say average GPA or average height.
Do seniors have better grades than juniors? Are people from the US taller than people from China?
Here's a common mistake:
select name from students group by year
Let's think about what that means
SELECT region, avg(sales) FROM sales_data GROUP BY region;For this example, how would the dates be arranged/aggregated?"
Does it matter? Why would it matter? The dates are "squeezed out" when we group by region, because we end up with one aggregate value per region, averaging over all dates.
Interesting question. If I have a group of just three numbers: 1,2,3 I can define what I mean by the sum but how would you define the subtraction?
Readability is sometimes a matter of taste. Many students find subqueries more readable, but not all. Decide what works for you.
select distinct p1.name from person as p1,credit,movie,person as p2 where p1.nm=credit.nm and credit.tt=movie.tt and movie.director=p2.nm and p2.name = 'Clint Eastwood';"
This is an important example, because it really takes some getting used to.
We need to use the Person table twice because one of them will be Clint and the other will be an actor in a Clint Eastwood movie.
A generalization of this might be pairs of actors (co-stars) or pairs of pets (our breeding pairs example).
-- actors who have been directed by Clint Eastwood, -- implemented as a join use wmdb; select distinct p1.name from person as p1,credit,movie,person as p2 where p1.nm=credit.nm and credit.tt=movie.tt and movie.director=p2.nm and p2.name = 'Clint Eastwood';Why do to check the person column twice in the from clause? When would we put the same column twice in the from clause? Why do we even need to check that movie.director = p2.nm when each movie has a unique director?"
We need to use the PERSON table twice because we have two people in the result. When matching items, we need to enforce the join conditions.
In the "Subquery versus Join" section of the reading, could you go over the first solution (the one using JOIN)? I am having trouble figuring where/what is being joined.
I don't think I understand the ""-- actors who have been directed by Clint Eastwood, -- implemented as a join"" example, can we please go over this in class? Thanks.
I'll go over this as many times as people need.
Very often, the way we structure a query yields duplicates. For example, if we list actors in Peter Jackson movies, we'll get Orlando Bloom like six times. But maybe we only want to list him once.
Sure, we'll see some.
Great question! In general, databases are too large to test a query "by hand" so you have to have confidence in your logic. That's part of our goal in this course.
Nevertheless, testing your assumptions with various queries, particularly simpler ones, is helpful.