Joins

One of the most important operations on relational databases is to join two tables.

Joining is how relational databases represent things like lists.

Vet Office¶

Suppose each customer of a vet office has one or more pets. That is, we might think of them having a list of pets.

Still, each pet will have a lot of information associated with it: a very wide database row. That row will exist in a table where each row is about a single pet. We might call that table pet

We will have a separate table of owners, with their name, address, phone number, how much they owe the vet office, and so forth. That information will exist in a table where each row is about a single owner. Each owner probably has an Owner ID number, which we will call oid. We might call that table owner.

Each pet will keep track of its owner by storing the OID of its owner.

So, to have a query that reports information on both a pet and its owner, we need to join the two tables. For example, to list the name of both the pet and its owner, we might do something like this:

SELECT owner.name, pet.name
FROM owner INNER JOIN pet USING (oid);

The results might look like this:

MariaDB [cs304_db]> select owner.name, pet.name from owner inner join pet using (oid);
+------------------+-----------------------+
| name             | name                  |
+------------------+-----------------------+
| Homer Simpson    | Santa's Little Helper |
| Homer Simpson    | Snowball              |
| Homer Simpson    | Snowball II           |
| Harry Potter     | Hedwig                |
| Ron Weasley      | Scabbers              |
| Hermione Granger | Crookshanks           |
+------------------+-----------------------+

Our goal for this reading is to understand and feel comfortable with queries like that one. If that query already makes perfect sense, great, but that's unlikely. Hopefully, it will make sense by the end.

WMDB¶

I strongly suggest you take a brief detour to read about the WMDB. It describes several tables that must be joined for maximum usefulness. We will be using it a lot this semester; it's a good idea to get familiar with it.

Join is dynamic¶

Note that join is a dynamic or virtual operation. A join "connects" one table with another (such as the owner and pet tables) temporarily, just for the duration of one query. It uses the underlying data structures, but it does not modify them.

Primary Keys¶

A primary key is a column (or set of columns) that uniquely identifies a row. In the IMDB and the WMDB, there may possibly be more than one "George Clooney," but there cannot be more than one row with NM=123. NM is the primary key.

You can probably think of many examples of a key from real life. For example, social security numbers were invented as a unique ID for each American eligible for social security. Similarly, B-numbers at Wellesley.

A key does not have to be a number. Your email address uniquely identifies you.

(For maximum efficiency, keys should be short and fixed-length, but they need not be.)

Keys and Foreign Keys¶

The pet table needs a unique identifier for each pet. This will just be an arbitrary number, unique to each pet. This will be a column in the pet table; we'll call it pid.

Because the pid uniquely identifies a pet, the pid will be the primary key to the pet table.

Similarly, the owner table needs a unique identifier for each pet owner. We'll call that oid and it will also be the primary key for that table.

Now, what connects those two tables? How do we know that Hedwig's owner is Harry, while Crookshank's owner is Hermione? The answer is that

each pet record will have the OID of the pet's owner

Because the OID is a key in another table, it's called a foreign key.

Note that the OID is probably not a key in the pet table. For example, if Homer Simpson has three pets (Santa's Little Helper, Snowball and Snowball II), each of those pets will have Homer's OID in their records, to represent the fact that Homer is their owner. Obviously, if three records in the pet table have Homer's OID in them, the OID can't be a unique identifer for a row.

If you recall your terminology from math classes, the mapping between owners and pets is a one to many mapping, since each owner can have many pets but each pet has one owner.

(At least the "only one owner" is a rule as we have described this. Obviously, we could make room in our representation to have Marge Simpson as a co-owner of the Simpson pets, but we won't do that. We'll stick to this one-to-many (1:N) mapping.)

Joins¶

If we want a query that draws information from both the owner table and the pet table at the same time, that's called a join.

Remember that joining is a virtual operation: the joining happens during a query and does not affect the representation of either table.

A join connects one table with another, say pet and owner.

A join connects a row of one table with a row in another, say "Crookshanks" and "Hermione", or "Santa's Little Helper" and "Homer Simpson".

The result of a join is a "wider" set of columns, drawing columns from both tables. Thus, it's as if we had one wide table with all the columns of the two tables, and we are selecting from that wide virtual table.

Which rows do we choose? If it's a join, we typically have a join condition, which is a boolean condition that says that the row from one table belongs with the row from the other table. For example, we want the row where the data from the owner table is about Harry Potter and the data from the pet table is about Hedwig.

Mappings¶

These connections (mappings) can be:

one-to-one (rare)
one-to-many (common)
many-to-many (common)

We'll use this terminology a lot.

Movie Directors¶

In the WMDB, a movie has only one director, but a person can direct more than one movie.

This is a one-to-many (1:N) relationship

Here's a picture of some connections between Peter Jackson and three movies he directed (the three Lord of the Rings movies) and also Kathryn Bigelow and a movie she directed ("The Hurt Locker"):

Movie Directors

Foreign Keys¶

We can implement a 1:N relationship by storing the key for the person (the director) in the row for that movie in the movie table:

nm	Name
1392	Peter Jackson
941	Kathryn Bigelow

tt	director	Title
120737	1392	Fellowship of the Ring
167261	1392	Two Towers
167262	1392	Return of the King
887912	941	The Hurt Locker

Backwards Representation!¶

Compared to what we are used to in programming, this is backwards!
We would usually represent a person and have a list of the movies they directed stored with the person.
We don't have lists in relational databases.
Here, the person has no list, but each movie has one director id
The director id is a foreign key: a key in some other table
Notice director can't be a key in the movie table, because there are repeats
Director can be a key in the person table, since it's their ID (NM)

Misconceptions¶

This representation is the opposite of what we would probably do in most programming situations. Consequently, this a common misconception in this database class: you may be tempted to store a list of related items in a row when there's a 1-many mapping. But this is not how relational databases are intended to work.

In a one-many mapping, each of the many keeps a pointer to the one.

There are many reasons for this counter-intuitive situation. Here are a few:

Database tables are most efficiently operated if elements are of fixed size. Storing a list of movies in the director would make that a variable-length field.
It's efficient to look up something using its primary key. The primary key for the person table is the NM, so looking up the director of a movie is guaranteed to be efficient.
Databases are implemented so that joins are as efficient as possible. (That doesn't make joins free, and some newer non-relational databases technologies eschew joins. We'll talk about those much later.)

We will re-visit this concept much later in the course when we discuss normalization.

Queries with 1:N Relationships¶

Imagine we have a query that selects from two tables at the same time:

SELECT * FROM person, movie LIMIT 10;

This pairs every person with every movie. (A cross-product.)

But, if we use a WHERE clause to choose just the pairs we want, we get sensible results:

SELECT * FROM person, movie
WHERE nm = director;

Looking up Movie Directors¶

This example lists all movie,director pairs in the WMDB:

select * from person,movie
where nm = director

This example lists Peter Jackson and Kathryn Bigelow movies:

select * from person,movie
where nm = director
and nm in (941, 1392);

Aside about Cross Products¶

As you probably remember from math class, a cross product is every possible pair from two sets.

If set A = {a,b,c} and set B = {1,2,3}, then

A x B = {(a,1), (a,2), (a,3),
         (b,1), (b,2), (b,3),
         (c,1) ,(c,2), (c,3)}

That 9 pairs! (Why 9? 3x3).

Above, when we had a query like

select * from person,movie
where nm = director;

That's a subset of this simpler query that simply pairs every movie with every person: a cross-product:

select * from person,movie;

Self-Crosses¶

We can even cross a table with itself!

If we did

select A.name, B.name 
from person A, person B

we cross the person table with itself. Here, we give a temporary name (an alias) to each occurrence of the person table, so that we can refer to them. Here, we just had aliases A and B.

The query above would find all possible pairs of people. Maybe they could be co-stars?

You might notice that every person will be paired with themself as well. If we want to forbid those nonsense pairs, we could do this:

select A.name, B.name 
from person A, person B
where A.nm <> B.nm

Many-Many Relationships¶

We now understand how to use join to handle one-to-many mappings, what about many-to-many mappings? Our prototypical example of that for this course will be acting credits. As you know, IMDB will tell you the cast of any movie (a list of people who act in that movie) and and the filmography of any actor (a list of the movies they have acted in). Together, those are a many-to-many mapping between people and movies.

Here's a diagram of two actors (Jimmy Stewart and Katharine Hepburn) and three movies they were in (It's a Wonderful Life, The Philadelphia Story, and The Lion in Winter). They were both in The Philadelphia Story.

actors and movies

Represent Many-Many Relationships with an Intermediate Table¶

With an intermediate table, we can represent a many-many relationship. The table holds pairs of foreign keys.

A pair (P,M) means that Person P acted in Movie M.

Here's a picture of an intermediate table holding the four facts that Jimmy Stewart was in both It's a Wonderful Life and The Philadelphia Story, and Katharine Hepburn was in The Philadelphia Story and the Lion in Winter:

actors, credits and movies

This intermediate table is called the credit table in the WMDB.

Other Attributes in an Intermediate Table¶

We can add information to the intermediate table that belongs to the pair (P,M) rather than just to one or the other. For example, in the WMDB, we could store the name of the role that the actor played in that movie.

For example, Katharine Hepburn played Eleanor of Aquitaine in "The Lion in Winter". The name "Eleanor of Aquitaine" doesn't belong in either Hepburn's entry in the person table, since she played many parts, nor in the "Lion in Winter" entry in the movie table, since there are many roles in that movie. It belongs to the pair.

actor roles in the credit table

What other attributes can you think of for the (P,M) pairs?

Primary Key in the Intermediate Table¶

Note that to uniquely specify a row in the credit table we need two values: an NM and a TT. We'll learn more about this when we talk about creating tables.

Looking up Acting Credits¶

To use the credit table, we employ the cross-product idea again, with both the movie and person tables.

This time, though, the NM and TT fields are ambiguous, because they appear in both tables. So we have to use the table name in the expression:

SELECT * 
FROM person, credit, movie
WHERE person.nm = credit.nm 
  AND credit.tt = movie.tt;

Special Join Notations¶

Joins are so common and important in Relational Databases that there is special support for them in the SQL language. Both notations move the join condition (the condition that says which rows to match between tables) out of the WHERE clause and into the FROM clause. Typically, the WHERE clause will have other interesting stuff in it, such as conditions on the pets or owners other than the join condition, so it's nice to separate that.

Join ON¶

The first option is to allow you to write a general boolean as the join condition, using the ON keyword, followed by the boolean condition.

This example joins movies with their directors:

SELECT * FROM person INNER JOIN movie ON (person.nm = movie.director)

(An INNER JOIN always omits rows that don't have a match in the other table. It's in contrast to an outer join where you could also list movies without directors and people who have never directed. We will talk more about outer joins later in the course, but for now, we'll focus on inner joins.)

Another example is the vet office:

SELECT * FROM owner INNER JOIN pet ON (owner.oid = pet.oid) ...

We could then add conditions in the WHERE clause about whether the animal needs shots, whether the owner has asked for notifications, and so forth.

Join USING a shared field¶

Joining two tables using equality on a field that has the same name in both tables is so common that there's a special syntax for that. The following partial query means the same thing as the preceding one:

SELECT * FROM owner INNER JOIN pet USING (oid) ...

Very nice and succinct! We could find the cast of every movie named "Hamlet" like this:

SELECT name 
FROM person 
    INNER JOIN credit USING (nm)
    INNER JOIN movie USING (tt)
WHERE title = 'Hamlet';

Natural Joins¶

We learned that the using syntax allows us to specify a join condition that means equality on a column whose name appears in both tables. A special case of this is a natural join, which is equality on all such columns. Consider:

mysql> use cs304_db; select * from owner natural join pet;
    Empty set (0.00 sec)

Why no results? Because we are asking not only that the OID fields be equal, but also that the NAME fields be equal, and it happens that none of the pets has the same name as its owner.

I strongly advise avoiding NATURAL JOIN, since you can inadvertently require equality of columns you didn't intend to. Better to list the shared field, via the using keyword.

Summary¶

joins are a virtual connection between two tables
each row is connected to one row in the other: a cross product
we can think of a JOIN as a subset of the cross product
we can use the WHERE clause to specify the join condition
in a one:many mapping, each of the many specifies the one, typically with a key. Think of the director example
a key uniquely defines a row in a table
a many:many mapping is implemented with an intermediate table with pairs of foreign keys. Think of the credit example
a foreign key is a key in some other table
special syntaxes:
- A inner join B ON (boolean expression)
- A inner join B USING (shared-column)