Joins
One of the most important operations on relational databases is to join two tables.
Joining is how relational databases represent things like lists.
- Vet Office
- WMDB
- Join is dynamic
- Primary Keys
- Keys and Foreign Keys
- Joins
- Mappings
- Movie Directors
- Foreign Keys
- Backwards Representation!
- Misconceptions
- Queries with 1:N Relationships
- Looking up Movie Directors
- Aside about Cross Products
- Self-Crosses
- Many-Many Relationships
- Represent Many-Many Relationships with an Intermediate Table
- Other Attributes in an Intermediate Table
- Primary Key in the Intermediate Table
- Looking up Acting Credits
- Special Join Notations
- Join ON
- Join USING a shared field
- Natural Joins
- Summary
Vet Office¶
Suppose each customer of a vet office has one or more pets. That is, we might think of them having a list of pets.
Still, each pet will have a lot of information associated with it: a very
wide database row. That row will exist in a table where each row is about
a single pet. We might call that table pet
We will have a separate table of owners, with their name, address,
phone number, how much they owe the vet office, and so forth. That
information will exist in a table where each row is about a single
owner. Each owner probably has an Owner ID number, which we will call
oid
. We might call that table owner
.
Each pet will keep track of its owner by storing the OID of its owner.
So, to have a query that reports information on both a pet and its owner, we need to join the two tables. For example, to list the name of both the pet and its owner, we might do something like this:
SELECT owner.name, pet.name
FROM owner INNER JOIN pet USING (oid);
The results might look like this:
MariaDB [cs304_db]> select owner.name, pet.name from owner inner join pet using (oid);
+------------------+-----------------------+
| name | name |
+------------------+-----------------------+
| Homer Simpson | Santa's Little Helper |
| Homer Simpson | Snowball |
| Homer Simpson | Snowball II |
| Harry Potter | Hedwig |
| Ron Weasley | Scabbers |
| Hermione Granger | Crookshanks |
+------------------+-----------------------+
Our goal for this reading is to understand and feel comfortable with queries like that one. If that query already makes perfect sense, great, but that's unlikely. Hopefully, it will make sense by the end.
WMDB¶
I strongly suggest you take a brief detour to read about the WMDB. It describes several tables that must be joined for maximum usefulness. We will be using it a lot this semester; it's a good idea to get familiar with it.
Join is dynamic¶
Note that join is a dynamic or virtual operation. A join
"connects" one table with another (such as the owner
and pet
tables) temporarily, just for the duration of one query. It uses
the underlying data structures, but it does not modify them.
Primary Keys¶
A primary key is a column (or set of columns) that uniquely identifies
a row. In the IMDB and the WMDB, there may possibly be more than one
"George Clooney," but there cannot be more than one row with
NM=123. NM
is the primary key.
You can probably think of many examples of a key from real life. For example, social security numbers were invented as a unique ID for each American eligible for social security. Similarly, B-numbers at Wellesley.
A key does not have to be a number. Your email address uniquely identifies you.
(For maximum efficiency, keys should be short and fixed-length, but they need not be.)
Keys and Foreign Keys¶
The pet
table needs a unique identifier for each pet. This will just be
an arbitrary number, unique to each pet. This will be a column in the
pet
table; we'll call it pid
.
Because the pid
uniquely identifies a pet, the pid
will be the
primary key to the pet
table.
Similarly, the owner
table needs a unique identifier for each pet
owner. We'll call that oid
and it will also be the primary key for
that table.
Now, what connects those two tables? How do we know that Hedwig's owner is Harry, while Crookshank's owner is Hermione? The answer is that
each pet record will have the OID of the pet's owner
Because the OID is a key in another table, it's called a foreign key.
Note that the OID is probably not a key in the pet
table. For example,
if Homer Simpson has three pets (Santa's Little Helper, Snowball and
Snowball II), each of those pets will have Homer's OID in their records,
to represent the fact that Homer is their owner. Obviously, if three
records in the pet
table have Homer's OID in them, the OID can't be a
unique identifer for a row.
If you recall your terminology from math classes, the mapping between owners and pets is a one to many mapping, since each owner can have many pets but each pet has one owner.
(At least the "only one owner" is a rule as we have described this. Obviously, we could make room in our representation to have Marge Simpson as a co-owner of the Simpson pets, but we won't do that. We'll stick to this one-to-many (1:N) mapping.)
Joins¶
If we want a query that draws information from both the owner
table and
the pet
table at the same time, that's called a join.
Remember that joining is a virtual operation: the joining happens during a query and does not affect the representation of either table.
A join connects one table with another, say pet
and owner
.
A join connects a row of one table with a row in another, say "Crookshanks" and "Hermione", or "Santa's Little Helper" and "Homer Simpson".
The result of a join is a "wider" set of columns, drawing columns from both tables. Thus, it's as if we had one wide table with all the columns of the two tables, and we are selecting from that wide virtual table.
Which rows do we choose? If it's a join, we typically have a join
condition, which is a boolean condition that says that the row from one
table belongs with the row from the other table. For example, we want the
row where the data from the owner
table is about Harry Potter and the
data from the pet
table is about Hedwig.
Mappings¶
These connections (mappings) can be:
- one-to-one (rare)
- one-to-many (common)
- many-to-many (common)
We'll use this terminology a lot.
Movie Directors¶
In the WMDB, a movie has only one director, but a person can direct more than one movie.
This is a one-to-many (1:N) relationship
Here's a picture of some connections between Peter Jackson and three movies he directed (the three Lord of the Rings movies) and also Kathryn Bigelow and a movie she directed ("The Hurt Locker"):
Foreign Keys¶
We can implement a 1:N relationship by storing the key for the person (the director) in the row for that movie in the movie table:
nm | Name |
---|---|
1392 | Peter Jackson |
941 | Kathryn Bigelow |
tt | director | Title |
---|---|---|
120737 | 1392 | Fellowship of the Ring |
167261 | 1392 | Two Towers |
167262 | 1392 | Return of the King |
887912 | 941 | The Hurt Locker |
Backwards Representation!¶
- Compared to what we are used to in programming, this is backwards!
- We would usually represent a person and have a list of the movies they directed stored with the person.
- We don't have lists in relational databases.
- Here, the person has no list, but each movie has one director id
- The director id is a foreign key: a key in some other table
- Notice director can't be a key in the movie table, because there are repeats
- Director can be a key in the person table, since it's their ID (NM)
Misconceptions¶
This representation is the opposite of what we would probably do in most programming situations. Consequently, this a common misconception in this database class: you may be tempted to store a list of related items in a row when there's a 1-many mapping. But this is not how relational databases are intended to work.
In a one-many mapping, each of the many keeps a pointer to the one.
There are many reasons for this counter-intuitive situation. Here are a few:
- Database tables are most efficiently operated if elements are of fixed size. Storing a list of movies in the director would make that a variable-length field.
- It's efficient to look up something using its primary key. The primary key for the person table is the NM, so looking up the director of a movie is guaranteed to be efficient.
- Databases are implemented so that joins are as efficient as possible. (That doesn't make joins free, and some newer non-relational databases technologies eschew joins. We'll talk about those much later.)
We will re-visit this concept much later in the course when we discuss normalization.
Queries with 1:N Relationships¶
Imagine we have a query that selects from two tables at the same time:
SELECT * FROM person, movie LIMIT 10;
This pairs every person with every movie. (A cross-product.)
But, if we use a WHERE clause to choose just the pairs we want, we get sensible results:
SELECT * FROM person, movie
WHERE nm = director;
Looking up Movie Directors¶
This example lists all movie,director pairs in the WMDB:
select * from person,movie
where nm = director
This example lists Peter Jackson and Kathryn Bigelow movies:
select * from person,movie
where nm = director
and nm in (941, 1392);
Aside about Cross Products¶
As you probably remember from math class, a cross product is every possible pair from two sets.
If set A = {a,b,c} and set B = {1,2,3}, then
A x B = {(a,1), (a,2), (a,3), (b,1), (b,2), (b,3), (c,1) ,(c,2), (c,3)}
That 9 pairs! (Why 9? 3x3).
Above, when we had a query like
select * from person,movie
where nm = director;
That's a subset of this simpler query that simply pairs every movie with every person: a cross-product:
select * from person,movie;
Self-Crosses¶
We can even cross a table with itself!
If we did
select A.name, B.name
from person A, person B
we cross the person
table with itself. Here, we give a temporary
name (an alias) to each occurrence of the person
table, so that we
can refer to them. Here, we just had aliases A
and B
.
The query above would find all possible pairs of people. Maybe they could be co-stars?
You might notice that every person will be paired with themself as well. If we want to forbid those nonsense pairs, we could do this:
select A.name, B.name
from person A, person B
where A.nm <> B.nm
Many-Many Relationships¶
We now understand how to use join to handle one-to-many mappings, what about many-to-many mappings? Our prototypical example of that for this course will be acting credits. As you know, IMDB will tell you the cast of any movie (a list of people who act in that movie) and and the filmography of any actor (a list of the movies they have acted in). Together, those are a many-to-many mapping between people and movies.
Here's a diagram of two actors (Jimmy Stewart and Katharine Hepburn) and three movies they were in (It's a Wonderful Life, The Philadelphia Story, and The Lion in Winter). They were both in The Philadelphia Story.
Represent Many-Many Relationships with an Intermediate Table¶
With an intermediate table, we can represent a many-many relationship. The table holds pairs of foreign keys.
A pair (P,M) means that Person P acted in Movie M.
Here's a picture of an intermediate table holding the four facts that Jimmy Stewart was in both It's a Wonderful Life and The Philadelphia Story, and Katharine Hepburn was in The Philadelphia Story and the Lion in Winter:
This intermediate table is called the credit
table in the WMDB.
Other Attributes in an Intermediate Table¶
We can add information to the intermediate table that belongs to the pair (P,M) rather than just to one or the other. For example, in the WMDB, we could store the name of the role that the actor played in that movie.
For example, Katharine Hepburn played Eleanor of Aquitaine in "The
Lion in Winter". The name "Eleanor of Aquitaine" doesn't belong in
either Hepburn's entry in the person
table, since she played many
parts, nor in the "Lion in Winter" entry in the movie
table, since
there are many roles in that movie. It belongs to the pair.
What other attributes can you think of for the (P,M) pairs?
Primary Key in the Intermediate Table¶
Note that to uniquely specify a row in the credit
table we need
two values: an NM and a TT. We'll learn more about this when we talk
about creating tables.
Looking up Acting Credits¶
To use the credit table, we employ the cross-product idea again, with both the movie and person tables.
This time, though, the NM and TT fields are ambiguous, because they appear in both tables. So we have to use the table name in the expression:
SELECT *
FROM person, credit, movie
WHERE person.nm = credit.nm
AND credit.tt = movie.tt;
Special Join Notations¶
Joins are so common and important in Relational Databases that there is special support for them in the SQL language. Both notations move the join condition (the condition that says which rows to match between tables) out of the WHERE clause and into the FROM clause. Typically, the WHERE clause will have other interesting stuff in it, such as conditions on the pets or owners other than the join condition, so it's nice to separate that.
Join ON¶
The first option is to allow you to write a general boolean as the join condition, using the ON keyword, followed by the boolean condition.
This example joins movies with their directors:
SELECT * FROM person INNER JOIN movie ON (person.nm = movie.director)
(An INNER JOIN
always omits rows that don't have a match in the
other table. It's in contrast to an outer join
where you could
also list movies without directors and people who have never
directed. We will talk more about outer joins later in the course, but
for now, we'll focus on inner joins.)
Another example is the vet office:
SELECT * FROM owner INNER JOIN pet ON (owner.oid = pet.oid) ...
We could then add conditions in the WHERE clause about whether the animal needs shots, whether the owner has asked for notifications, and so forth.
Join USING a shared field¶
Joining two tables using equality on a field that has the same name in both tables is so common that there's a special syntax for that. The following partial query means the same thing as the preceding one:
SELECT * FROM owner INNER JOIN pet USING (oid) ...
Very nice and succinct! We could find the cast of every movie named "Hamlet" like this:
SELECT name
FROM person
INNER JOIN credit USING (nm)
INNER JOIN movie USING (tt)
WHERE title = 'Hamlet';
Natural Joins¶
We learned that the using
syntax allows us to specify a join
condition that means equality on a column whose name appears in both
tables. A special case of this is a natural join, which is equality
on all such columns. Consider:
mysql> use cs304_db; select * from owner natural join pet;
Empty set (0.00 sec)
Why no results? Because we are asking not only that the OID fields be equal, but also that the NAME fields be equal, and it happens that none of the pets has the same name as its owner.
I strongly advise avoiding NATURAL JOIN, since you can inadvertently
require equality of columns you didn't intend to. Better to list the
shared field, via the using
keyword.
Summary¶
- joins are a virtual connection between two tables
- each row is connected to one row in the other: a cross product
- we can think of a JOIN as a subset of the cross product
- we can use the WHERE clause to specify the join condition
- in a one:many mapping, each of the many specifies the one, typically with a key. Think of the director example
- a key uniquely defines a row in a table
- a many:many mapping is implemented with an intermediate table with pairs of foreign keys. Think of the credit example
- a foreign key is a key in some other table
- special syntaxes:
- A inner join B ON (boolean expression)
- A inner join B USING (shared-column)