Referential Integrity

We've seen that JOINs are essential to relational databases. Most joins are based on the idea of a foreign key: a value that is a key in some other table. For example, the OID (owner id) in the pet table is a foreign key into the owner table, where oid is the primary key. The pet table refers to the owner table.

Referential Integrity is the idea that these references should all have valid values: If the oid for a pet ("Fluffy") is 456, then 456 should be a valid key in the owner table. (The DBMS can't ensure that 456 is the correct owner, but it should at least be the key for some owner.)

Today's topic is referential integrity.

Readings

The main topics for today are:

  • Indexes
  • Referential integrity
  • Query Execution

Indexes

We know that most tables should specify a primary key --- a column or set of columns that uniquely specify a row. Furthermore, the primary key is how the rows are organized in the underlying data structure, and searching the table by a primary key is guaranteed to be the fastest way to search it.

Some examples of primary keys we had from the WMDB was the numeric ID for movies (tt) or people (nm). However, we know that we will often want to search those tables by the title of the movie or the name of the person, even if those search strings are not unique. How can we make those searches faster? We can build an index.

An index is a pretty intuitive concept, because we are all used to using an index at the back of a book. Finding page 254 is fast; so a page number is like a primary key. The index might tell you that the name Clooney appears on page 254, so you can find information about him faster by using the index than by leafing through the whole book.

Let's talk about indexes in terms of computer science concepts that you already know. Suppose we have an array of songs, where each song is represented with an object with fields like title, artist, length, etc. Suppose it's an enormous array of up to 1 million songs: song[1_000_000] The primary key is the array index: song[123_456] is a very fast way to find that song, given its location in the array. If we had a little hashtable on the side that hashed "Feel it Still" to 123_456, we can look up the song by title almost as quickly, since hashtables are O(1) lookup in the average case. Otherwise, we'd have to iterate through the whole array, searching for the song by title, which is an O(n) operation.

When we ask MySQL to construct an index on some field (column), it sets up something like that that hashtable mapping song titles to array locations.

Syntactically, we can have MySQL build an maintain an index on a particular column just by requesting it by adding an index (col) clause to the create table.

Here's how we might have done it in the WMDB to index by the person's name. In principle, we could add additional indexes for other columns as well, such as by birthdate or who added the person. That could speed up queries using those columns.

use scottdb;

drop table if exists person9;
create table person9 (
       nm int unsigned,
       name varchar(40),
       birthdate date,
       primary key (nm),
       index(name)
);

drop table if exists movie9;
create table movie9 (
       tt int unsigned,
       title varchar(40),
       director int unsigned,
       primary key (tt),
       index(title)
);

Foreign Keys and Referential Integrity

We often have tables that are knitted together by a common key, as in our vet office, where the rows of the pet table refer to the owner table using a foreign key of the oid (owner id). Another example, described thoroughly below, would be a music database, where one table lists the albums and another lists the tracks. There should be an owner for every pet, and there should be an album for every track. It would be weird and meaningless to have tracks in the second table that didn't belong to an album, right? (Well, some music is released only as singles, but we'll ignore that.) Avoiding this is what we mean by referential integrity.

Databases ought to enforce this foreign key constraint, but historically, MySQL hasn't bothered. (That linked webpage describes MySQL's deviation from SQL standards.) However, in newer versions, including ours, you can accomplish this.

Well, ought is a strong word. Whether you use the DBMS to enforce referential integrity is up to you, but the concept and the goal are both important: you don't want to have orphaned, meaningless, irrelevant data in your database. At best, it's a waste of space and time, and at worst, it will cause your queries to break or produce flawed results. So, even if you don't make the DBMS enforce this integrity constraint, you can use your scripts to ensure that when a track is added, it belongs to a album.

To have MySQL enforce referential integrity, here's what you have to do:

  1. Request that both tables be represented using InnoDB. (This has to do with the lower-level representation of tables and files and won't concern us.)
  2. Request that INDEXes be built in each table for the shared key. This allows the integrity to be checked quickly
  3. Specify the foreign key (it's a foreign key because it's a key in the other table, not necessarily in this one) and the action to take under different circumstances.

What does that last part mean? There are two things that might happen in the "main" table (the one referenced).

  1. The key might be changed (updated). Depending on what the key is and how the database is used, this might be unlikely. For example, if the key is someone's SSN, it's unlikely to change (though possible). But people do change their names and addresses, and those might be keys. People might close a bank account and re-open another one, and we might want to think of that as changing a key.
  2. The key might be deleted. This might create "orphans" in the referencing table unless we decide what to do.

Here are some options about what to do when a key is updated or deleted:

cascade
delete all the would-be orphan rows in the other table, too
restrict
prevent the deletion if it would create orphans. This means that you'd have to delete the rows from the subordinate table, first.
set null
this sets the key fields in the subordinate table to null. This might be useful if you want to keep the orphans, but don't want any confusion about what parent they belong to.
no action
We'll figure out what this does.

Another aspect of referential integrity is that you cannot insert a record into a subordinate table without having a valid key in the main table. That is, no pets without owners, no tracks without CDs, etc.

Music Database Example

In this example, we have a database of music, consisting of "tracks" and "albums". There's a 1:many mapping between albums and tracks, because each track belongs to one album, but an album can have many tracks. Each track will have a foreign key into the album table.

The goal of referential integrity in this case is to make sure that we don't have tracks that don't belong to any album. That is, every track has an album, so the foreign key is valid.

Here's how to do it.

use cs304_db;

-- drop in reverse order
drop table if exists track;
drop table if exists album;

create table album ( 
    aid int auto_increment not null primary key, 
    title varchar(50),
    artist varchar(50)
    ) 
ENGINE = InnoDB;

create table track ( 
    aid int not null, 
    tid tinyint not null, -- track id, like 1, 2, 3
    title varchar(50), 
    primary key (aid,tid),
    -- necessary for ref integ
    INDEX (aid),
    foreign key (aid) references album(aid) 
        on update restrict 
        on delete restrict
    ) 
ENGINE = InnoDB;

First, you will notice that we drop the track table before the album table. That's because the point of referential integrity is to make sure that all keys are valid, and if we dropped the album table first, it would immediately render all the foreign keys in the track table invalid. So MySQL prevents that. You'll get a referential integrity error if you try to drop them in the same order that you create them.

The album table looks pretty normal except that it now ends with a engine = InnoDB clause. This determines the way that MySQL represents the table (the underlying data structure). The InnoDB type has additional bookkeeping stuff in the representation to handle (among other things), referential integrity checking. It's a little like how in Java we can choose to represent a list as a <LinedList> or <ArrayList>; both implement the List interface, but with performance differences. We'll see another use of InnoDB later in the semester.

The track table as a few new pieces. First, it creates an index on aid, the foreign key:


      ...
    INDEX (id),     -- necessary for referential integrity checking 

This allows MySQL to find the matching tracks quickly. For example, if we were to to delete 21 from the album table, MySQL needs to find out if there are any tracks from that album in this table, so that it could prevent the album deletion or delete the tracks or whatever behavior we need.

Next, it has the foreign key clause:


    ...
    foreign key (aid) references album(aid) 
        on update restrict 
        on delete restrict  
    ) 

This clause says that if the album is either deleted or its id is updated and there are tracks in this table, the operation should be disallowed.

To test this, try to delete an album that has tracks. It should be refused. Delete the tracks, then try to delete the cd. It will succeed. Another thing to try is to change "on delete restrict" to "on delete cascade" and then try.

That's the basics of referential integrity.

If you're curious about the data in our album and track tables, here's the data:

use cs304_db;

insert into album(aid,title,artist)
values
    (1,'21','Adele'),
    (2,'Lemonade','Beyonce'),
    (3,'Shatter Me','Lindsey Stirling');

-- https://en.wikipedia.org/wiki/21_(Adele_album)#Track_listing
-- https://en.wikipedia.org/wiki/Lemonade_(Beyonc%C3%A9_album)#Track_listing
-- https://en.wikipedia.org/wiki/Shatter_Me_(album)#Track_listing

insert into track(aid,tid,title)
values
    (1,1,'Rolling in the Deep'),
    (1,2,'Rumour Has It'),
    (1,3,'Turning Tables'),
    (1,5,'Set Fire to the Rain'),
    (1,11,'Someone Like You'),
    (2,1,'Pray You Catch Me'),
    (2,2,'Hold Up'),
    (3,1,'Beyond the Veil'),
    (3,4,'Shatter Me');

select artist,album.title,track.title
from album inner join track using (aid);

and you can try some queries in the webdb database.

More on Referential Integrity

Here are some useful web links on Referential Integrity if you'd like a different author. All of these are pretty short, but should help to orient you to the basic concepts. Let me know if you find them confusing or find better articles.

Choosing a Query Syntax

When you could use either joins or subqueries, which should you choose? Considerations:

  • Expressivity. Which one seems easiest to get correct?
  • Negation. Notice that listing actors who don't have a movie credit is difficult with the join method, but it's easy using the subquery method and the EXISTS operator: just insert a NOT.
  • Efficiency: You should probably not consider efficiency.

Efficiency of Query Execution (skippable)

In most cases, you should not care about efficiency. High performance databases will choose an execution plan that is efficient for the query, regardless of how you write it.

How do they do that? We'll discuss that on a later day, but essentially they:

  • analyze the underlying algebraic structure,
  • consider the available indices to see what operations are efficient.
  • estimate the number of tuples at various stages of the execution
  • estimate the execution costs (time/space) of different execution plans, and
  • choose the best

A few of those steps need some elaboration. What do we mean, for example, by available indices? Well, as we saw earlier, MySQL will create and maintain indexes for us if we request them, and it can use them to make queries faster.

For example, if there is an index in movie on director, it's either O(1) or O(log N) to look up the movies directed by a particular id, but if there's no index, it's O(n), which is huge.

Another step above was estimate the number of tuples at various stages. For example, a cross-product of the person and movie tables could be huge, while an inner join using director would be much smaller, since even successful directors only direct a few dozen movies.

Some of those decisions require human assistance, so experienced SQL programmers can intervene and give hints, but we will only be able to introduce the topic in this course.