Entity Relationships Diagrams¶

ER diagrams are a useful way to picture the entities that we are storing in our database (e.g. people, movies, pets, owners, albums, tracks) and the relationships among them. These are consequently called Entity Relationship Diagrams or ER Diagrams.

Diagrams are useful because people often think pictorially, and they find pictures helpful in clarifying how things interact. Furthermore, diagrams can helpful explaining the entities and relationships to non-technical people (such as managers) who can't read a line of SQL.

There are a variety of diagram techniques that are commonly used. This semester, Fall of 2022, we are going to switch to a new one. If you're interested in the old one, you're welcome to read more about the former ER diagrams. As always, there are advantages and disadvantages, but I think the advantages outweigh the disadvantages.

The reason for the change is that I've recently learned of a very cool website in which you can type in SQL code (similar, but not identical, to the SQL code we've been learning) and the website will draw the diagram for you! You can then export the diagram (as PNG or PDF), send someone a link to the page, or whatever.

Furthermore, you can export the SQL code that you typed in. You can even import SQL code you've already written, and it will create the diagram from it. You can even edit the SQL more. All in all, this is very exciting and I think it will make your life easier in CS 304.

The website is dbdiagram.io. You can create a free account just by giving them your email address. There's also a paid account if you need to keep your diagrams private. A free account will be sufficient for this course.

Entity-Relationship Model
Terminology Pedantry
Keys
Vet Office
ENUMs
Exporting SQL
One-to-Many Versus Many-to-One
Defining One-to-Many Relationships by Dragging
Many-to-Many Relationships
One-to-One Relationships
ISA Hierarchies
Auto Increment
Conclusion

Entity-Relationship Model¶

Here's an ER diagram for the WMDB, using dbdiagram.io. Here's the direct link to WMDB diagram

You can see each box represents an entity. The box gives the name of the entity and a list of its attributes (fields, columns) and their datatypes. You also see lines connecting the boxes. Those lines represent relationships between the entities.

Let's first talk about the entities (boxes) and then the relationships (lines).

The person box represents, of course, that entity. In the database, it'll be stored as a table. Notice that the nm field is bold. That's because it's the primary key of the table.

The movie box is similar. It has different fields, of course, but the first field, tt is bold, because it's the primary key.

Now let's turn to one of the lines.

There is a line connecting the director field of the movie entity with the nm field of the person entity. That means that the values of the director field will be nm values. For example, if there's a movie where the director field is 17, that means that 17 is the nm (the id) of some person, and the person with that nm directed that movie. Thus, the line represents the "directs" relationship between movies and people.

There's more information in the diagram for that line. At the director end of the line, there's an asterisk, while at the nm end, there's the number 1. That means this is a one-to-many relationship: a person can direct many movies, but a movie only has 1 director.

This representation takes some getting used to: we sometimes think of a star (asterisk) as meaning "many", and it does, but here it does not mean that a movie has many directors. Instead it means that the person-movie relationship is 1-many, with the 1 at the "person" end and the many ("*") at the "movie" end.

Take some time to review the picture. You'll notice that all of our relationships are 1-many. That's actually very common. Later, we'll learn about 1-1 and many-many relationships.

I hope you'll agree that the diagram is helpful and easier to read and understand than the SQL code.

Terminology Pedantry¶

We've talked a lot about entities and relationships. "Oppenheimer" is an entity and so is Barbie, and they are both in the Movie table, which is an entity set. That's the traditional terminology. So the boxes above are entity sets.

It's also traditional to use singular words for the names of the sets. That often makes queries easier to read:

-- movies directed by people born in 1961
select name, title 
from person inner join movie on (person.nm = movie.director)
where year(person.birthdate) = 1961;
...

because the where clause is talking about a single person not multiple people.

Similarly, the connection between Greta Gerwig and Barbie is a relationship (Greta Gerwig directed "Barbie"). Also, Christopher Nolan directed "Oppenheimer"; that's another relationship. So the line between People and Movie represents a relationships set.

Thus, we can distinguish between a particular entity and an entity set and between a particular relationship and a relationship set. However, in practice, there are lots of contexts were the words are used less carefully, and we figure out what is meant by context.

Keys¶

Before we go further into ER diagrams, let's talk a bit about keys. Entities usually have keys. A key is

a set of attributes that uniquely specifies an entity. E.g. SSN, or (FirstName,LastName,ClassYear)
no subset of those attributes works

You'll notice that we say a set of attributes, so it might take more than one attribute to uniquely specify an entity. Most of the time, we'll have a single attribute, but later in this reading, we'll see examples of composite keys.

Other observations:

The key should be unique by definition, not by happenstance. A person's name could be unique at Wellesley, until someone else with that name matriculates.
There may be more than one key. Can you think of two or more keys for a Wellesley student?
The database should enforce the key constraint. That is, the database should prevent someone from entering another student with the same B-number as an existing student.

Vet Office¶

Now let's create our first diagram, for a vet office. Recall that there were two entities: owners and pets. Let's keep it simple and just have a few attributes for each. You already know about attributes, and our focus right now is the ER diagram.

In the dark area to the left, type in the following:

table owner {
    oid int [pk]
    name varchar(50)
}

You'll see that a box is created with two fields, with oid in bold because it's the primary key (pk). Excellent! Like this:

vet office step 1

Note that the "code" we typed in doesn't match our SQL syntax. It is, in fact, Database Markup Language (DBML). See DBML.org. For example, the attributes are surrounded by { braces } instead of ( parentheses ). In many ways, the DBML syntax is simpler and easier, but the differences will be occasionally annoying or confusing. Fortunately the syntax highlighting in dbdiagram.io is quite good, so when I forget and type a comma at the end of the line, it will immediately alert me to my mistake, and it's easily fixed.

Now, let's create the pet table. Type in the following:

table pet {
    pid int [pk]
    oid int [ref: > owner.oid]
    name varchar(30)
}

This creates another box and connects the two with a line for the 1-to-many relationship. Perfect! You can drag the boxes around to get a nicer layout, maybe like this:

vet office step 2

A syntax note: the syntax owner.oid means the oid field (attribute) in the owner table. That dot syntax is actually standard: you can use the same syntax in SQL to specify exactly which field you mean.

By default, the lines are gray, but I like them to be bold and blue, so I toggle the "highlight" option in the diagram area. You might do the same.

ENUMs¶

When we define our pets, we can use the ENUM datatype for certain columns. For example, we might have a vet practice that handles cats, dogs, owls, rabbits and rats, as well as others:

create table pet (
    pid integer auto_increment,
    kind enum('cat','dog','other','owl','rabbit','rat'),
    ...
);

As an aside, that I put the kinds in alphabetical order.¹

We can also do that in dbdiagram.io:

vet office with inline enum

We can also define the ENUM type separately:

vet office with separate enum

Defining it separately is useful for two reasons:

if the enum is lengthy, it removes that clutter from the table definition, and
if the enum is used in several places, we can just use the name of the enum, like pet_kind above.

Notice that if if we hover over the pet_kind, the web app shows us the definition, which is very cool.

Exporting SQL¶

Let's export our diagram as MySQL syntax.

First, give the diagram a name by typing in the box at the upper left. I've named mine "Vet Office Simple". Next, in the area above the diagram, you'll see an icon that has a downward-pointing arrow that suggests downloading. Click that and choose "export to MySQL". That will download a file like Vet Office Simple.sql. Open that file and you'll see code like this:

CREATE TABLE `owner` (
  `oid` int PRIMARY KEY,
  `name` varchar(50)
);

CREATE TABLE `pet` (
  `pid` int PRIMARY KEY,
  `oid` int,
  `name` varchar(30)
);

ALTER TABLE `pet` ADD FOREIGN KEY (`oid`) REFERENCES `owner` (`oid`);

Most of this code you already understand. One thing is that the exported MySQL code puts backticks around all the identifiers. That's because it's a machine and doesn't mind a little superflous typing, while humans tend to be lazier and only use the backticks when necessary, as with the release column of the movie table, which happens to (now) be a reserved word in MySQL. So, we coded queries like:

-- Movies released in the 1990s
select title, director, `release` 
from movie 
where `release` like '199%';

While a very careful coder might write:

-- Movies released in the 1990s
select `title`, `director`, `release` 
from `movie` 
where `release` like '199%';

Don't let the extra, optional backticks bother you.

Notice how shorthands like [pk] turned into the proper syntax for MySQL. That's nice.

The exported SQL doesn't have the use statement at the top, or the drop table if exists syntax that we use to allow us to revise an existing table by replacing it. You can always edit the downloaded .sql file to add those if you'd like.

The ALTER TABLE syntax is new. The ALTER TABLE SQL statement allows us to change the definition of a table after it has been created. This is particularly useful when the table is full of data and we want to add a column or something like that. Here, the table isn't full of data (it's brand new), but they use the alter table syntax anyhow.

Here, we are adding a foreign key constraint that the DBMS will enforce: values of oid in the pet table must be the oid for some entity in the owner table. In other words, if you specify that the oid for "Fluffy" is 17, the 17 must identify some entity in the owner table, hopefully Hagrid. The DBMS can't ensure that you enter the correct value, but it can at least ensure that you enter a valid value.

Note that we would typically write the SQL like this:

CREATE TABLE `owner` (
  `oid` int PRIMARY KEY,
  `name` varchar(50)
);

CREATE TABLE `pet` (
  `pid` int PRIMARY KEY,
  `oid` int,
  `name` varchar(30)
  foreign key (oid) references owner(oid) -- FK constraint here
);

with the foreign key constraint listed in the referencing table, but either syntax is fine. The dbdiagram website uses the alter table syntax, and it works perfectly well.

Try loading the SQL code into MySQL! (You might need to add the use <db> and drop table if exists clauses.)

One-to-Many Versus Many-to-One¶

Notice that the line in our diagram has the 1 next to owner.oid and the * next to the pet.oid, creating a correct 1 owner to many pets mapping. The diagram software gets the correct direction because we defined the forign key like this:

[ref: > owner.oid]

If we had used < instead of >, it would be backwards:

[ref: < owner.oid]

Try it! Change the angle bracket direction and see what happens to the diagram.

Mnemonic: the angle bracket points to the one. The wide end is towards the many.

Defining One-to-Many Relationships by Dragging¶

You can also define the relationship by dragging:

Delete the [ref: > owner.oid] clause and see the line disappear.
drag from the owner.oid to the pet.oid
see the line appear, with the 1 and the asterisk, and the alter table syntax to define the constraint

You can get the opposite meaning by dragging the other way.

Many-to-Many Relationships¶

One of the important relationships in the WMDB is the many-to-many relationship between people (actors) and movies. A person can act in many movies and a movie can have many actors.

In the WMDB, we implemented that many-to-many relationship using the credit table. An entry in the credit table is a pair of foreign keys: an nm to specify the actor and a tt to specify the movie. An (nm,tt) pair like (123,120780) means that person 123 (George Clooney) acted in movie 120780 (Out of Sight).

One way to do that in dbdiagram is to define the three tables, exactly as you'd expect from WMDB.

Create a new diagram, say called "WMDB credit" and enter the following code:

table person {
    nm int [pk]
    name varchar(50)
}

table movie {
    tt int [pk]
    title varchar(60)
}

table credit {
    nm int [ref: > person.nm]
    tt int [ref: > movie.tt]
}

The result looks like this (once I rearranged the boxes a bit):

WMDB credit table

Notice the two one-to-many relationships that implement the many-to-many relationship.

The credit table makes sense (in my opinion) and has a reasonable name. Sometimes, though, we just want to focus on the many-to-many relationship, instead of thinking of it as two one-to-many relationships.

So, create a new diagram, maybe called "WMDB many-many" with code like this:

table person {
  nm int [pk]
  name varchar(40)
}

table movie {
  tt int [pk]
  title varchar(60)
}

Ref: "movie"."tt" <> "person"."nm"

You should see something like this diagram:

WMDB many-to-many

Notice the line with an asterisk at both ends, showing a many-to-many relationship.

Notice the syntax for the references between the two tables:

Ref: "movie"."tt" <> "person"."nm"

The <> means many-to-many, while < means 1-to-many and > means many-to-1.

Try exporting the SQL code. You might get something like this:

CREATE TABLE `person` (
  `nm` int PRIMARY KEY,
  `name` varchar(40)
);

CREATE TABLE `movie` (
  `tt` int PRIMARY KEY,
  `title` varchar(60)
);

CREATE TABLE `movie_person` (
  `movie_tt` int NOT NULL,
  `person_nm` int NOT NULL,
  PRIMARY KEY (`movie_tt`, `person_nm`)
);

ALTER TABLE `movie_person` ADD FOREIGN KEY (`movie_tt`) REFERENCES `movie` (`tt`);

ALTER TABLE `movie_person` ADD FOREIGN KEY (`person_nm`) REFERENCES `person` (`nm`);

Some important things to note:

The downloaded code automatically creates an intermediate table! In this case, the table is named movie_person, just combining the two table names.
The primary key of this intermediate table is the combination of the primary keys of both participating tables.
There are two foreign key constraints created.

This is just like our credit table!

This technique for many-to-many relationship is fine, and you are welcome to use it for your own ER diagrams. Two small drawbacks of this technique are that:

you are limited to exactly one such relationship, but in principle there could be several many-to-many relationships between two entities, and
you might want to put additional information in the intermediate table

For people and movies, where the people entity could include movie-goers and critics as well as actors, we can think of many-to-many relationships like:

person P likes movie M
person P rates movie M (the intermediate table probably has a rating attribute, maybe some number of stars)
person P acts in movie M (the intermediate credit table could have the name of the role/character, the salary, etc.)
person P has seen movie M (the intermediate table could even have a count: I've seen Shrek at least 8 times)

I'm sure you can come up with other examples.

One-to-One Relationships¶

One-to-one relationships are relatively rare in databases, because if we have a one-to-one relationship between two entities, we often would just combine the two into a single entity. Still, there are occasions when we might want two tables.

Imagine we have a user table, with some common attributes and some odd or unusual ones:

table user {
    uid int [pk]
    username varchar(20)
    password char(60)
    address varchar(100)
    favorite_color varchar(20)
    favorite_song varchar(20)
    favorite_movie varchar(20)
    favorite_food varchar(20)
    favorite_book varchar(20)
}

Suppose in our application we need the first three all the time, the address occasionally, and the favorites almost never. Why drag the less-used attributes around all the time when they are rarely needed? Instead, we could put them in another table and join with that table on those rare occasions when we need that additional stuff.

Create a new diagram and enter the following code:

table user {
    uid int [pk]
    username varchar(20)
    password char(60)
    address varchar(100)
}

table favorites {
    uid int [ref: - user.uid]
    favorite_color varchar(20)
    favorite_song varchar(20)
    favorite_movie varchar(20)
    favorite_food varchar(20)
    favorite_book varchar(20)
}

The result might look like this:

user favorites

Notice the 1 on both ends of the line, indicating a one-to-one relationship. As you've guessed, the - in the ref note is what does the trick:

table favorites {
    uid int [ref: - user.uid]

And the exported SQL code looks like this

CREATE TABLE `user` (
  `uid` int PRIMARY KEY,
  `username` varchar(20),
  `password` char(60),
  `address` varchar(100)
);

CREATE TABLE `favorites` (
  `uid` int PRIMARY KEY,
  `favorite_color` varchar(20),
  `favorite_song` varchar(20),
  `favorite_movie` varchar(20),
  `favorite_food` varchar(20),
  `favorite_book` varchar(20)
);

ALTER TABLE `favorites` ADD FOREIGN KEY (`uid`) REFERENCES `user` (`uid`);

ISA Hierarchies¶

Very often in the world, there are entities that have subtypes. The subtypes have all the attributes of the supertype, but maybe with some extra attributes or different relationships.

This subtype situation is pronounced as if it were the two-word phrase "is a" smushed together. So a "cat is a mammal" becomes "cat ISA mammal". Some additional examples:

An Employee ISA Person, as are Professors, Staff, and Students.
A Professor ISA Employee, as is a Staff worker.
An offcampus student ISA student

In some ER diagrams, ISA hierarchies are pictured using triangles:

ER diagram of ISA hierarchy

Since we are using dbdiagram.io, we won't be using triangles. Instead, we'll work at a lower level, implementing the ISA hierarchy using tables

We have two choices:

the subtypes can all be combined into one large table, with NULL columns for the irrelevant columns (so the dorm column is NULL for an off-campus student). Or,
the subtypes can be put in different tables that are dynamically combined using JOIN and a shared key.

The tradeoffs are between the time cost of joins and the space cost of storing lots of NULLs. We will not worry excessively about which to do in this class. Think about the issue, make your best guess, and go ahead.

Let's consider a small ISA hierarchy where there are two kinds of students: on-campus students, which have a dorm attribute, and off-campus students, which have a local-address attribute. We will also add a boolean is_on_campus attribute that all students have, though we could infer that information from whether the dorm is null or not.

Combining them all is trivial. We just enter the one table as shown below.

Notice that we defined the enums separately:

enum yes_no {
   no
   yes
} 

enum dorm_type {
   BAT
   BEB
   CAZ
   CED
   CER
}

table student {
    sid char(9) [pk, note: 'this is the b-number']
    name varchar(40)
    is_on_campus yes_no
    dorm dorm_type [note: 'null for off-campus students']
    local_address varchar(80) [note: 'null for on-campus students']
}

This is simple and easy. The only downside is extra columns and storing lots of nulls. But that's pretty minor.

Here's a screenshot of the dbdiagram. You can see again that hovering over an enum shows you the possible values, which is cool. Notice also how you can add notes to the table code! There's a lot more to DBML that we won't cover.

student isa combined

Here's the export SQL. Again, no surprises, but we can see what happens to the notes:

CREATE TABLE `student` (
  `sid` char(9) PRIMARY KEY COMMENT 'this is the b-number',
  `name` varchar(40),
  `is_on_campus` enum,
  `dorm` varchar(30) COMMENT 'null for off-campus students',
  `local_address` varchar(80) COMMENT 'null for on-campus students'
);

The alternative representation as multiple tables is almost as easy. We still probably want to have a boolean is_on_campus attribute so we know where to look for the additional information. Otherwise, it's what you'd expect.

enum yes_no {
   no
   yes
} 

enum dorm_type {
   BAT
   BEB
   CAZ
   CED
   CER
}

table student {
    sid char(9) [pk, note: 'this is the b-number']
    name varchar(40)
    is_on_campus yes_no
}

table on_campus_student {
    sid char(9) [pk, ref: - student.sid]
    dorm dorm_type
}

table off_campus_student {
    sid char(9) [pk, ref: - student.sid]
    local_address varchar(80)
}

We used two one-to-one relationships here, yielding the following diagram:

student isa separate tables

The SQL is like this:

CREATE TABLE `student` (
  `sid` char(9) PRIMARY KEY COMMENT 'this is the b-number',
  `name` varchar(40),
  `is_on_campus` ENUM ('no', 'yes')
);

CREATE TABLE `on_campus_student` (
  `sid` char(9) PRIMARY KEY,
  `dorm` ENUM ('BAT', 'BEB', 'CAZ', 'CED', 'CER')
);

CREATE TABLE `off_campus_student` (
  `sid` char(9) PRIMARY KEY,
  `local_address` varchar(80)
);

ALTER TABLE `on_campus_student` ADD FOREIGN KEY (`sid`) REFERENCES `student` (`sid`);

ALTER TABLE `off_campus_student` ADD FOREIGN KEY (`sid`) REFERENCES `student` (`sid`);

The downside here is that we have to do a join if we want the additional information, and we have to store the SID in each additional table, to enable the join. And if we don't often need the dorm/local-address information, we can usually skip the join. And, we don't have to store any nulls in the main table.

Auto Increment¶

One feature of MySQL (and many other DBMSs) is auto_increment fields. These are really useful for things like the oid (owner id) or pid (pet id) in a vet office example. You insert a new owner or new pet, and the database table assigns it a unique number (basically just incrementing an internal counter).

That feature is included in DBML. Just add increment to the column features, like this:

table owner {
    oid int [pk, increment]
    name varchar(50)
}

There are other column settings.

Conclusion¶

Having a picture is a nice way to understand and communicate the design of your database. Lines between entities make the connections clear and obvious, instead of requiring the reader to infer the relationships from foreign key constraints.

The dbdiagram.io website allows you to type in SQL-like syntax to define your entities and relationships, move the boxes around to yield a nice-looking layout, and export the result as a picture or as MySQL code.

Internally, enums like this are stored as small integers, so the first thing in the set (cat) is zero, the second thing (dog) is 1, and so forth. This is extremely space-efficient. However, when you sort the rows by the enum type (kind in this case), they are sorted by the numerical value, so if you want cat to preceed dog in the sorted output, put them in sorted order in the enum. ↩