Entity Relationships Diagrams¶
ER diagrams are a useful way to picture the entities that we are storing in our database (e.g. people, movies, pets, owners, albums, tracks) and the relationships among them. These are consequently called Entity Relationship Diagrams or ER Diagrams.
Diagrams are useful because people often think pictorially, and they find pictures helpful in clarifying how things interact. Furthermore, diagrams can helpful explaining the entities and relationships to non-technical people (such as managers) who can't read a line of SQL.
There are a variety of diagram techniques that are commonly used. This semester, Fall of 2022, we are going to switch to a new one. If you're interested in the old one, you're welcome to read more about the former ER diagrams. As always, there are advantages and disadvantages, but I think the advantages outweigh the disadvantages.
The reason for the change is that I've recently learned of a very cool website in which you can type in SQL code (similar, but not identical, to the SQL code we've been learning) and the website will draw the diagram for you! You can then export the diagram (as PNG or PDF), send someone a link to the page, or whatever.
Furthermore, you can export the SQL code that you typed in. You can even import SQL code you've already written, and it will create the diagram from it. You can even edit the SQL more. All in all, this is very exciting and I think it will make your life easier in CS 304.
The website is dbdiagram.io. You can create a free account just by giving them your email address. There's also a paid account if you need to keep your diagrams private. A free account will be sufficient for this course.
Entity-Relationship Model¶
Here's an ER diagram for the WMDB, using dbdiagram.io. Here's the direct link to WMDB diagram
You can see each box represents an entity. The box gives the name of the entity and a list of its attributes (fields, columns) and their datatypes. You also see lines connecting the boxes. Those lines represent relationships between the entities.
Let's first talk about the entities (boxes) and then the relationships (lines).
The person box represents, of course, that entity. In the
database, it'll be stored as a table. Notice that the nm
field is
bold. That's because it's the primary key of the table.
The movie box is similar. It has different fields, of course, but
the first field, tt
is bold, because it's the primary key.
Now let's turn to one of the lines.
There is a line connecting the director
field of the movie
entity with the nm
field of the person entity. That means that
the values of the director
field will be nm
values. For example,
if there's a movie where the director field is 17, that means that 17
is the nm
(the id) of some person, and the person with that nm
directed that movie. Thus, the line represents the "directs"
relationship between movies and people.
There's more information in the diagram for that line. At the
director
end of the line, there's an asterisk, while at the nm
end, there's the number 1. That means this is a one-to-many
relationship: a person can direct many movies, but a movie only has 1
director.
This representation takes some getting used to: we sometimes think of a star (asterisk) as meaning "many", and it does, but here it does not mean that a movie has many directors. Instead it means that the person-movie relationship is 1-many, with the 1 at the "person" end and the many ("*") at the "movie" end.
Take some time to review the picture. You'll notice that all of our relationships are 1-many. That's actually very common. Later, we'll learn about 1-1 and many-many relationships.
I hope you'll agree that the diagram is helpful and easier to read and understand than the SQL code.
Terminology Pedantry¶
We've talked a lot about entities and relationships. "Oppenheimer" is
an entity and so is Barbie, and they are both in the
Movie
table, which is an entity set. That's the
traditional terminology. So the boxes above are entity sets.
It's also traditional to use singular words for the names of the sets. That often makes queries easier to read:
-- movies directed by people born in 1961
select name, title
from person inner join movie on (person.nm = movie.director)
where year(person.birthdate) = 1961;
...
because the where
clause is talking about a single person
not
multiple people
.
Similarly, the connection between Greta Gerwig and Barbie is a relationship (Greta Gerwig directed "Barbie"). Also, Christopher Nolan directed "Oppenheimer"; that's another relationship. So the line between People and Movie represents a relationships set.
Thus, we can distinguish between a particular entity and an entity set and between a particular relationship and a relationship set. However, in practice, there are lots of contexts were the words are used less carefully, and we figure out what is meant by context.
Keys¶
Before we go further into ER diagrams, let's talk a bit about keys. Entities usually have keys. A key is
- a set of attributes that uniquely specifies an entity. E.g. SSN, or (FirstName,LastName,ClassYear)
- no subset of those attributes works
You'll notice that we say a set of attributes, so it might take more than one attribute to uniquely specify an entity. Most of the time, we'll have a single attribute, but later in this reading, we'll see examples of composite keys.
Other observations:
- The key should be unique by definition, not by happenstance. A person's name could be unique at Wellesley, until someone else with that name matriculates.
- There may be more than one key. Can you think of two or more keys for a Wellesley student?
- The database should enforce the key constraint. That is, the database should prevent someone from entering another student with the same B-number as an existing student.
Vet Office¶
Now let's create our first diagram, for a vet office. Recall that there were two entities: owners and pets. Let's keep it simple and just have a few attributes for each. You already know about attributes, and our focus right now is the ER diagram.
Log in to dbdiagram.io, and create a new diagram. You can name it whatever you like.
In the dark area to the left, type in the following:
table owner {
oid int [pk]
name varchar(50)
}
You'll see that a box is created with two fields, with oid
in bold
because it's the primary key (pk
). Excellent! Like this:
Note that the "code" we typed in doesn't match our SQL syntax. It is, in fact, Database Markup Language (DBML). See DBML.org. For example, the attributes are surrounded by { braces } instead of ( parentheses ). In many ways, the DBML syntax is simpler and easier, but the differences will be occasionally annoying or confusing. Fortunately the syntax highlighting in dbdiagram.io is quite good, so when I forget and type a comma at the end of the line, it will immediately alert me to my mistake, and it's easily fixed.
Now, let's create the pet
table. Type in the following:
table pet {
pid int [pk]
oid int [ref: > owner.oid]
name varchar(30)
}
This creates another box and connects the two with a line for the 1-to-many relationship. Perfect! You can drag the boxes around to get a nicer layout, maybe like this:
A syntax note: the syntax owner.oid
means the oid
field
(attribute) in the owner
table. That dot syntax is actually
standard: you can use the same syntax in SQL to specify exactly which
field you mean.
By default, the lines are gray, but I like them to be bold and blue, so I toggle the "highlight" option in the diagram area. You might do the same.
ENUMs¶
When we define our pets, we can use the ENUM datatype for certain columns. For example, we might have a vet practice that handles cats, dogs, owls, rabbits and rats, as well as others:
create table pet (
pid integer auto_increment,
kind enum('cat','dog','other','owl','rabbit','rat'),
...
);
As an aside, that I put the kinds in alphabetical order.1
We can also do that in dbdiagram.io
:
We can also define the ENUM type separately:
Defining it separately is useful for two reasons:
- if the enum is lengthy, it removes that clutter from the table definition, and
- if the enum is used in several places, we can just use the name of the enum, like
pet_kind
above.
Notice that if if we hover over the pet_kind
, the web app shows us
the definition, which is very cool.
Exporting SQL¶
Let's export our diagram as MySQL syntax.
First, give the diagram a name by typing in the box at the upper
left. I've named mine "Vet Office Simple". Next, in the area above the
diagram, you'll see an icon that has a downward-pointing arrow that
suggests downloading. Click that and choose "export to MySQL". That
will download a file like Vet Office Simple.sql
. Open that file and
you'll see code like this:
CREATE TABLE `owner` (
`oid` int PRIMARY KEY,
`name` varchar(50)
);
CREATE TABLE `pet` (
`pid` int PRIMARY KEY,
`oid` int,
`name` varchar(30)
);
ALTER TABLE `pet` ADD FOREIGN KEY (`oid`) REFERENCES `owner` (`oid`);
Most of this code you already understand. One thing is that the
exported MySQL code puts backticks around all the identifiers. That's
because it's a machine and doesn't mind a little superflous typing,
while humans tend to be lazier and only use the backticks when
necessary, as with the release
column of the movie
table, which
happens to (now) be a reserved word in MySQL. So, we coded queries like:
-- Movies released in the 1990s
select title, director, `release`
from movie
where `release` like '199%';
While a very careful coder might write:
-- Movies released in the 1990s
select `title`, `director`, `release`
from `movie`
where `release` like '199%';
Don't let the extra, optional backticks bother you.
Notice how shorthands like [pk]
turned into the proper syntax for
MySQL. That's nice.
The exported SQL doesn't have the use
statement at the top, or the
drop table if exists
syntax that we use to allow us to revise an
existing table by replacing it. You can always edit the downloaded
.sql
file to add those if you'd like.
The ALTER TABLE
syntax is new. The ALTER
TABLE SQL statement allows us
to change the definition of a table after it has been created. This is
particularly useful when the table is full of data and we want to add
a column or something like that. Here, the table isn't full of data
(it's brand new), but they use the alter table
syntax anyhow.
Here, we are adding a foreign key constraint that the DBMS will
enforce: values of oid
in the pet
table must be the oid
for
some entity in the owner
table. In other words, if you specify that
the oid
for "Fluffy" is 17, the 17 must identify some entity in the
owner
table, hopefully Hagrid. The DBMS can't ensure that you enter
the correct value, but it can at least ensure that you enter a valid
value.
Note that we would typically write the SQL like this:
CREATE TABLE `owner` (
`oid` int PRIMARY KEY,
`name` varchar(50)
);
CREATE TABLE `pet` (
`pid` int PRIMARY KEY,
`oid` int,
`name` varchar(30)
foreign key (oid) references owner(oid) -- FK constraint here
);
with the foreign key constraint listed in the referencing table, but
either syntax is fine. The dbdiagram website uses the alter table
syntax, and it works perfectly well.
Try loading the SQL code into MySQL! (You might need to add the use
<db>
and drop table if exists
clauses.)
One-to-Many Versus Many-to-One¶
Notice that the line in our diagram has the 1
next to owner.oid
and the *
next to the pet.oid
, creating a correct 1 owner to many
pets mapping. The diagram software gets the correct direction because
we defined the forign key like this:
[ref: > owner.oid]
If we had used <
instead of >
, it would be backwards:
[ref: < owner.oid]
Try it! Change the angle bracket direction and see what happens to the diagram.
Mnemonic: the angle bracket points to the one. The wide end is towards the many.
Defining One-to-Many Relationships by Dragging¶
You can also define the relationship by dragging:
- Delete the
[ref: > owner.oid]
clause and see the line disappear. - drag from the
owner.oid
to thepet.oid
- see the line appear, with the 1 and the asterisk, and the
alter table
syntax to define the constraint
You can get the opposite meaning by dragging the other way.
Many-to-Many Relationships¶
One of the important relationships in the WMDB is the many-to-many relationship between people (actors) and movies. A person can act in many movies and a movie can have many actors.
In the WMDB, we implemented that many-to-many relationship using the
credit
table. An entry in the credit
table is a pair of foreign
keys: an nm
to specify the actor and a tt
to specify the movie. An
(nm,tt)
pair like (123,120780)
means that person 123 (George
Clooney) acted in movie 120780 (Out of Sight).
One way to do that in dbdiagram is to define the three tables, exactly as you'd expect from WMDB.
Create a new diagram, say called "WMDB credit" and enter the following code:
table person {
nm int [pk]
name varchar(50)
}
table movie {
tt int [pk]
title varchar(60)
}
table credit {
nm int [ref: > person.nm]
tt int [ref: > movie.tt]
}
The result looks like this (once I rearranged the boxes a bit):
Notice the two one-to-many relationships that implement the many-to-many relationship.
The credit table makes sense (in my opinion) and has a reasonable name. Sometimes, though, we just want to focus on the many-to-many relationship, instead of thinking of it as two one-to-many relationships.
So, create a new diagram, maybe called "WMDB many-many" with code like this:
table person {
nm int [pk]
name varchar(40)
}
table movie {
tt int [pk]
title varchar(60)
}
Ref: "movie"."tt" <> "person"."nm"
You should see something like this diagram:
Notice the line with an asterisk at both ends, showing a many-to-many relationship.
Notice the syntax for the references between the two tables:
Ref: "movie"."tt" <> "person"."nm"
The <>
means many-to-many, while <
means 1-to-many and >
means many-to-1.
Try exporting the SQL code. You might get something like this:
CREATE TABLE `person` (
`nm` int PRIMARY KEY,
`name` varchar(40)
);
CREATE TABLE `movie` (
`tt` int PRIMARY KEY,
`title` varchar(60)
);
CREATE TABLE `movie_person` (
`movie_tt` int NOT NULL,
`person_nm` int NOT NULL,
PRIMARY KEY (`movie_tt`, `person_nm`)
);
ALTER TABLE `movie_person` ADD FOREIGN KEY (`movie_tt`) REFERENCES `movie` (`tt`);
ALTER TABLE `movie_person` ADD FOREIGN KEY (`person_nm`) REFERENCES `person` (`nm`);
Some important things to note:
- The downloaded code automatically creates an intermediate table! In
this case, the table is named
movie_person
, just combining the two table names. - The primary key of this intermediate table is the combination of the primary keys of both participating tables.
- There are two foreign key constraints created.
This is just like our credit
table!
This technique for many-to-many relationship is fine, and you are welcome to use it for your own ER diagrams. Two small drawbacks of this technique are that:
- you are limited to exactly one such relationship, but in principle there could be several many-to-many relationships between two entities, and
- you might want to put additional information in the intermediate table
For people and movies, where the people
entity could include
movie-goers and critics as well as actors, we can think of
many-to-many relationships like:
- person P likes movie M
- person P rates movie M (the intermediate table probably has a rating attribute, maybe some number of stars)
- person P acts in movie M (the intermediate
credit
table could have the name of the role/character, the salary, etc.) - person P has seen movie M (the intermediate table could even have a count: I've seen Shrek at least 8 times)
I'm sure you can come up with other examples.
One-to-One Relationships¶
One-to-one relationships are relatively rare in databases, because if we have a one-to-one relationship between two entities, we often would just combine the two into a single entity. Still, there are occasions when we might want two tables.
Imagine we have a user
table, with some common attributes and some
odd or unusual ones:
table user {
uid int [pk]
username varchar(20)
password char(60)
address varchar(100)
favorite_color varchar(20)
favorite_song varchar(20)
favorite_movie varchar(20)
favorite_food varchar(20)
favorite_book varchar(20)
}
Suppose in our application we need the first three all the time, the address occasionally, and the favorites almost never. Why drag the less-used attributes around all the time when they are rarely needed? Instead, we could put them in another table and join with that table on those rare occasions when we need that additional stuff.
Create a new diagram and enter the following code:
table user {
uid int [pk]
username varchar(20)
password char(60)
address varchar(100)
}
table favorites {
uid int [ref: - user.uid]
favorite_color varchar(20)
favorite_song varchar(20)
favorite_movie varchar(20)
favorite_food varchar(20)
favorite_book varchar(20)
}
The result might look like this:
Notice the 1
on both ends of the line, indicating a one-to-one
relationship. As you've guessed, the -
in the ref
note is what
does the trick:
table favorites {
uid int [ref: - user.uid]
And the exported SQL code looks like this
CREATE TABLE `user` (
`uid` int PRIMARY KEY,
`username` varchar(20),
`password` char(60),
`address` varchar(100)
);
CREATE TABLE `favorites` (
`uid` int PRIMARY KEY,
`favorite_color` varchar(20),
`favorite_song` varchar(20),
`favorite_movie` varchar(20),
`favorite_food` varchar(20),
`favorite_book` varchar(20)
);
ALTER TABLE `favorites` ADD FOREIGN KEY (`uid`) REFERENCES `user` (`uid`);
ISA Hierarchies¶
Very often in the world, there are entities that have subtypes. The subtypes have all the attributes of the supertype, but maybe with some extra attributes or different relationships.
This subtype situation is pronounced as if it were the two-word phrase "is a" smushed together. So a "cat is a mammal" becomes "cat ISA mammal". Some additional examples:
- An Employee ISA Person, as are Professors, Staff, and Students.
- A Professor ISA Employee, as is a Staff worker.
- An offcampus student ISA student
In some ER diagrams, ISA hierarchies are pictured using triangles:
Since we are using dbdiagram.io, we won't be using triangles. Instead, we'll work at a lower level, implementing the ISA hierarchy using tables
We have two choices:
- the subtypes can all be combined into one large table, with NULL
columns for the irrelevant columns (so the
dorm
column is NULL for an off-campus student). Or, - the subtypes can be put in different tables that are dynamically combined using JOIN and a shared key.
The tradeoffs are between the time cost of joins and the space cost of storing lots of NULLs. We will not worry excessively about which to do in this class. Think about the issue, make your best guess, and go ahead.
Let's consider a small ISA hierarchy where there are two kinds of
students: on-campus students, which have a dorm
attribute, and
off-campus students, which have a local-address
attribute. We will
also add a boolean is_on_campus
attribute that all students have,
though we could infer that information from whether the dorm is null
or not.
Combining them all is trivial. We just enter the one table as shown below.
Notice that we defined the enums separately:
enum yes_no {
no
yes
}
enum dorm_type {
BAT
BEB
CAZ
CED
CER
}
table student {
sid char(9) [pk, note: 'this is the b-number']
name varchar(40)
is_on_campus yes_no
dorm dorm_type [note: 'null for off-campus students']
local_address varchar(80) [note: 'null for on-campus students']
}
This is simple and easy. The only downside is extra columns and storing lots of nulls. But that's pretty minor.
Here's a screenshot of the dbdiagram. You can see again that hovering over an enum shows you the possible values, which is cool. Notice also how you can add notes to the table code! There's a lot more to DBML that we won't cover.
Here's the export SQL. Again, no surprises, but we can see what happens to the notes:
CREATE TABLE `student` (
`sid` char(9) PRIMARY KEY COMMENT 'this is the b-number',
`name` varchar(40),
`is_on_campus` enum,
`dorm` varchar(30) COMMENT 'null for off-campus students',
`local_address` varchar(80) COMMENT 'null for on-campus students'
);
The alternative representation as multiple tables is almost as
easy. We still probably want to have a boolean is_on_campus
attribute so we know where to look for the additional
information. Otherwise, it's what you'd expect.
enum yes_no {
no
yes
}
enum dorm_type {
BAT
BEB
CAZ
CED
CER
}
table student {
sid char(9) [pk, note: 'this is the b-number']
name varchar(40)
is_on_campus yes_no
}
table on_campus_student {
sid char(9) [pk, ref: - student.sid]
dorm dorm_type
}
table off_campus_student {
sid char(9) [pk, ref: - student.sid]
local_address varchar(80)
}
We used two one-to-one relationships here, yielding the following diagram:
The SQL is like this:
CREATE TABLE `student` (
`sid` char(9) PRIMARY KEY COMMENT 'this is the b-number',
`name` varchar(40),
`is_on_campus` ENUM ('no', 'yes')
);
CREATE TABLE `on_campus_student` (
`sid` char(9) PRIMARY KEY,
`dorm` ENUM ('BAT', 'BEB', 'CAZ', 'CED', 'CER')
);
CREATE TABLE `off_campus_student` (
`sid` char(9) PRIMARY KEY,
`local_address` varchar(80)
);
ALTER TABLE `on_campus_student` ADD FOREIGN KEY (`sid`) REFERENCES `student` (`sid`);
ALTER TABLE `off_campus_student` ADD FOREIGN KEY (`sid`) REFERENCES `student` (`sid`);
The downside here is that we have to do a join if we want the additional information, and we have to store the SID in each additional table, to enable the join. And if we don't often need the dorm/local-address information, we can usually skip the join. And, we don't have to store any nulls in the main table.
Auto Increment¶
One feature of MySQL (and many other DBMSs) is auto_increment
fields. These are really useful for things like the oid
(owner id)
or pid
(pet id) in a vet office example. You insert a new owner or
new pet, and the database table assigns it a unique number (basically
just incrementing an internal counter).
That feature is included in DBML. Just add increment
to the column
features, like this:
table owner {
oid int [pk, increment]
name varchar(50)
}
There are other column settings.
Conclusion¶
Having a picture is a nice way to understand and communicate the
design of your database. Lines between entities make the connections
clear and obvious, instead of requiring the reader to infer the
relationships from foreign key
constraints.
The dbdiagram.io website allows you to type in SQL-like syntax to define your entities and relationships, move the boxes around to yield a nice-looking layout, and export the result as a picture or as MySQL code.
-
Internally, enums like this are stored as small integers, so the first thing in the set (cat) is zero, the second thing (dog) is 1, and so forth. This is extremely space-efficient. However, when you sort the rows by the enum type (
kind
in this case), they are sorted by the numerical value, so if you wantcat
to preceeddog
in the sorted output, put them in sorted order in the enum. ↩