Referential Integrity
We've seen that JOINs are essential to relational databases. Most
joins are based on the idea of a foreign key: a value that is a key
in some other table. For example, the OID (owner id) in the pet
table is a foreign key into the owner table, where oid is
the primary key. The pet table refers to the owner table.
Referential Integrity is the idea that these references should all
have valid values: If the oid for a pet ("Fluffy") is 456, then 456
should be a valid key in the owner table. (The DBMS can't ensure
that 456 is the correct owner, but it should at least be the key for
some owner.)
Today's topic is referential integrity.
Readings¶
The main topics for today are:
- Indexes
- Referential integrity
- Query Execution
Indexes¶
We know that most tables should specify a primary key --- a column or set of columns that uniquely specify a row. Furthermore, the primary key is how the rows are organized in the underlying data structure, and searching the table by a primary key is guaranteed to be the fastest way to search it.
Some examples of primary keys we had from the WMDB was the numeric ID for
movies (tt) or people (nm). However, we know that we will often want
to search those tables by the title of the movie or the name of the
person, even if those search strings are not unique. How can we make those
searches faster? We can build an index.
An index is a pretty intuitive concept, because we are all used to using
an index at the back of a book. Finding page 254 is fast; so a page number
is like a primary key. The index might tell you that the name Clooney
appears on page 254, so you can find information about him faster by using
the index than by leafing through the whole book.
Let's talk about indexes in terms of computer science concepts that
you already know. Suppose we have an array of songs, where each song
is represented with an object with fields like title, artist, length,
etc. Suppose it's an enormous array of up to 1 million songs:
song[1_000_000] The primary key is the array index: song[123_456]
is a very fast way to find that song, given its location in the array.
If we had a little hashtable on the side that hashed "Feel it Still"
to 123_456, we can look up the song by title almost as quickly,
since hashtables are O(1) lookup in the average case. Otherwise, we'd
have to iterate through the whole array, searching for the song by
title, which is an O(n) operation.
When we ask MySQL to construct an index on some field (column), it sets up something like that that hashtable mapping song titles to array locations.
Syntactically, we can have MySQL build an maintain an index on a
particular column just by requesting it by adding an index (col)
clause to the create table.
Here's how we might have done it in the WMDB to index by the person's name. In principle, we could add additional indexes for other columns as well, such as by birthdate or who added the person. That could speed up queries using those columns.
use scottdb;
drop table if exists person9;
create table person9 (
nm int unsigned,
name varchar(40),
birthdate date,
primary key (nm),
index(name)
);
drop table if exists movie9;
create table movie9 (
tt int unsigned,
title varchar(40),
director int unsigned,
primary key (tt),
index(title)
);
Foreign Keys and Referential Integrity¶
We often have tables that are knitted together by a common key, as in our
vet office, where the rows of the pet table refer to the owner table
using a foreign key of the oid (owner id). Another example, described
thoroughly below, would be a music database, where one table lists the
albums and another lists the tracks. There should be an owner for every
pet, and there should be an album for every track. It would be weird and
meaningless to have tracks in the second table that didn't belong to an
album, right? (Well, some music is released only as singles, but we'll
ignore that.) Avoiding this is what we mean by referential integrity.
Databases ought to enforce this foreign key constraint, but historically, MySQL hasn't bothered. (That linked webpage describes MySQL's deviation from SQL standards.) However, in newer versions, including ours, you can accomplish this.
Well, ought
is a strong word. Whether you use the DBMS to enforce
referential integrity is up to you, but the concept and the goal are both
important: you don't want to have orphaned, meaningless, irrelevant data
in your database. At best, it's a waste of space and time, and at worst,
it will cause your queries to break or produce flawed results. So, even if
you don't make the DBMS enforce this integrity constraint, you can use
your scripts to ensure that when a track is added, it belongs to a album.
To have MySQL enforce referential integrity, here's what you have to do:
- Request that both tables be represented using InnoDB. (This has to do with the lower-level representation of tables and files and won't concern us.)
- Request that INDEXes be built in each table for the shared key. This allows the integrity to be checked quickly
- Specify the foreign key (it's a foreign key because it's a key in the other table, not necessarily in this one) and the action to take under different circumstances.
What does that last part mean? There are two things that might happen in the "main" table (the one referenced).
- The key might be changed (updated). Depending on what the key is and how the database is used, this might be unlikely. For example, if the key is someone's SSN, it's unlikely to change (though possible). But people do change their names and addresses, and those might be keys. People might close a bank account and re-open another one, and we might want to think of that as changing a key.
- The key might be deleted. This might create "orphans" in the referencing table unless we decide what to do.
Here are some options about what to do when a key is updated or deleted:
- cascade
- delete all the would-be orphan rows in the other table, too
- restrict
- prevent the deletion if it would create orphans. This means that you'd have to delete the rows from the subordinate table, first.
- set null
- this sets the key fields in the subordinate table to null. This might be useful if you want to keep the orphans, but don't want any confusion about what parent they belong to.
- no action
- We'll figure out what this does.
Another aspect of referential integrity is that you cannot insert a record into a subordinate table without having a valid key in the main table. That is, no pets without owners, no tracks without CDs, etc.
Music Database Example¶
In this example, we have a database of music, consisting of "tracks" and
"albums". There's a 1:many mapping between albums and tracks, because each
track belongs to one album, but an album can have many tracks. Each track
will have a foreign key into the album table.
The goal of referential integrity in this case is to make sure that we don't have tracks that don't belong to any album. That is, every track has an album, so the foreign key is valid.
Here's how to do it.
use cs304_db;
-- drop in reverse order
drop table if exists track;
drop table if exists album;
create table album (
aid int auto_increment not null primary key,
title varchar(50),
artist varchar(50)
)
ENGINE = InnoDB;
create table track (
aid int not null,
tid tinyint not null, -- track id, like 1, 2, 3
title varchar(50),
primary key (aid,tid),
-- necessary for ref integ
INDEX (aid),
foreign key (aid) references album(aid)
on update restrict
on delete restrict
)
ENGINE = InnoDB;
First, you will notice that we drop the track table before the album
table. That's because the point of referential integrity is to make sure
that all keys are valid, and if we dropped the album table first, it
would immediately render all the foreign keys in the track table
invalid. So MySQL prevents that. You'll get a referential integrity error
if you try to drop them in the same order that you create them.
The album table looks pretty normal except that it now ends with a
engine = InnoDB clause. This determines the way that MySQL
represents the table (the underlying data structure). The InnoDB
type has additional bookkeeping stuff in the representation to handle
(among other things), referential integrity checking. It's a little
like how in Java we can choose to represent a list as a <LinedList>
or <ArrayList>; both implement the List interface, but with
performance differences. We'll see another use of InnoDB later in
the semester.
The track table as a few new pieces. First, it creates an index on
aid, the foreign key:
...
INDEX (id), -- necessary for referential integrity checking
This allows MySQL to find the matching tracks quickly. For example, if we
were to to delete 21 from the album table, MySQL needs to find out if
there are any tracks from that album in this table, so that it could
prevent the album deletion or delete the tracks or whatever behavior we
need.
Next, it has the foreign key clause:
...
foreign key (aid) references album(aid)
on update restrict
on delete restrict
)
This clause says that if the album is either deleted or its id is updated and there are tracks in this table, the operation should be disallowed.
To test this, try to delete an album that has tracks. It should be refused. Delete the tracks, then try to delete the cd. It will succeed. Another thing to try is to change "on delete restrict" to "on delete cascade" and then try.
That's the basics of referential integrity.
If you're curious about the data in our album and track tables, here's the data:
use cs304_db;
insert into album(aid,title,artist)
values
(1,'21','Adele'),
(2,'Lemonade','Beyonce'),
(3,'Shatter Me','Lindsey Stirling');
-- https://en.wikipedia.org/wiki/21_(Adele_album)#Track_listing
-- https://en.wikipedia.org/wiki/Lemonade_(Beyonc%C3%A9_album)#Track_listing
-- https://en.wikipedia.org/wiki/Shatter_Me_(album)#Track_listing
insert into track(aid,tid,title)
values
(1,1,'Rolling in the Deep'),
(1,2,'Rumour Has It'),
(1,3,'Turning Tables'),
(1,5,'Set Fire to the Rain'),
(1,11,'Someone Like You'),
(2,1,'Pray You Catch Me'),
(2,2,'Hold Up'),
(3,1,'Beyond the Veil'),
(3,4,'Shatter Me');
select artist,album.title,track.title
from album inner join track using (aid);
and you can try some queries in the webdb database.
More on Referential Integrity¶
Here are some useful web links on Referential Integrity if you'd like a different author. All of these are pretty short, but should help to orient you to the basic concepts. Let me know if you find them confusing or find better articles.
- Referential Integrity at About.com
- Referential Integrity in MySQL
- An introduction to foreign keys and referential integrity in MySQL
Choosing a Query Syntax¶
When you could use either joins or subqueries, which should you choose? Considerations:
- Expressivity. Which one seems easiest to get correct?
- Negation. Notice that listing actors who don't have a movie credit is difficult with the join method, but it's easy using the subquery method and the EXISTS operator: just insert a NOT.
- Efficiency: You should probably not consider efficiency.
Efficiency of Query Execution (skippable)¶
In most cases, you should not care about efficiency. High performance databases will choose an execution plan that is efficient for the query, regardless of how you write it.
How do they do that? We'll discuss that on a later day, but essentially they:
- analyze the underlying algebraic structure,
- consider the available indices to see what operations are efficient.
- estimate the number of tuples at various stages of the execution
- estimate the execution costs (time/space) of different execution plans, and
- choose the best
A few of those steps need some elaboration. What do we mean, for example, by available indices? Well, as we saw earlier, MySQL will create and maintain indexes for us if we request them, and it can use them to make queries faster.
For example, if there is an index in movie on director, it's either
O(1) or O(log N) to look up the movies directed by a particular id, but if
there's no index, it's O(n), which is huge.
Another step above was estimate the number of tuples at various
stages
. For example, a cross-product of the person and movie
tables could be huge, while an inner join using director would be much
smaller, since even successful directors only direct a few dozen movies.
Some of those decisions require human assistance, so experienced SQL programmers can intervene and give hints, but we will only be able to introduce the topic in this course.