DDL: Data Definition Language

In this short reading, we'll see something about how to create a table, insert data into it, update the data, and delete it. In short, we'll learn how to create a database and store stuff in it.

Creating a table uses the "Data Definition Language" or DDL part of SQL. Other parts of SQL are known as the Data Manipulation Language or DML:

We usually won't make a distinction, and we'll call it all "SQL." In particular, formally a "query" is just SELECT statements, but informally, people will call INSERT and such a "query." Still, it's nice to know the formal terms.

Create Table

To create a table, we need to define:

  1. The name of the table
  2. one or more columns
  3. the names and datatypes of each column
  4. a primary key for the table. This isn't strictly necessary, but is very common, and useful for *joins

Primary Keys

A primary key is a column (or set of columns) that uniquely identfies a row. In the IMDB and the WMDB, there may possibly be more than one "George Clooney," but there cannot be more than one row with NM=123. NM is the primary key.

Example: Person Table

Let's start with a simple example:

CREATE TABLE person ( 
    nm int, 
    name varchar(30), 
    birthdate date, 
    addedby int, 
    primary key (nm) 
); 
  • The name of the table is person. Tables are typically given singular names, since each row is a person.
  • there are four columns: nm, name, birthdate and addedby
  • the "nm" is an integer: a unique identifier, like the IMDB has 123 as George Clooney's ID. That's important, in case there is another George Clooney.
  • The name is a variable-length string of characters up to 30 characters long: varchar(30).
  • The birthdate is a date, a special datatype in MySQL.
  • the addedby is an integer.
  • The primary key is the "nm" field, which means that (1) nm will uniquely identify a row, and (2) we will mostly be looking up records by "nm". The database arranges to make lookups by the primary key fast and efficient.

General Syntax

In general, the syntax is

CREATE TABLE tablename ( 
    col1name datatype, 
    col2name datatype, 
    ... 
    colname datatype, 
    PRIMARY KEY (col) 
) 

Re-Creating a Table

We usually define a table in a batch file, because (1) it's long and probably requires several attempts before we get it right, and (2) allows us to document assumptions and such.

When we have to revise our table (in our several attempts to get it right), we have to get rid of the old version before we can re-create. That's because MySQL won't allow you to create two tables with the same name. For example, if our batch just contains:

create table foo (
   a int,
   c varchar(30)  -- up to 30 characters
);

If we load that once, then decide that the 30 should really be 40, we can do:

  1. edit the batch file
  2. save it
  3. re-run it

But, in step 3, we'll get an error, because table foo already exists.

This is easily fixed. We just put a statement in our batch file that drops the table if it already exists. So, our batch file should be:

drop table if exists foo;

create table foo (
   a int,
   c varchar(30)  -- up to 30 characters
);

I always do my batch files for creating tables this way.

Datatypes

Datatypes are important because they determine

  • the amount of storage take up; it should be sufficient but not wasteful
  • the valid values, so "foo" is okay as a string but not as an integer
  • the meaning of operations (< the less-than operator, means different things for strings and integers and dates).

Simple Datatypes

  1. int for whole number things, like number of students. Can be negative.
  2. float for floating point numbers, like GPA.
  3. char for fixed-length character strings. two-letter state abbreviations might be char(2)
  4. varchar for character strings that are up to some limit, but might often be shorter.
  5. date for dates (like 1969-07-20), time for times (like 20:17), and datetime for both (like 1969-07-20 20:17).

Note the weird format for dates. That's the international standard. See XKCD on dates

Insert Data

Inserting a row is fairly straightforward: you list the columns you are specifying and then a tuple of their values:

INSERT INTO person(nm,name,birthdate,addedby) 
VALUES (123,'George Clooney','1961-05-06',1); 

You don't have to specify every value, so if you don't know his birthdate, you can do:

INSERT INTO person(nm,name,addedby) 
VALUES (123,'George Clooney',1) 

In that case, a NULL is used for his birthdate.

UPDATE Statement

To update a row, or a set of rows, we use an UPDATE statement with a WHERE clause. If we later learn that George Clooney was born on May 6th, 1961, we do this:

UPDATE person SET birthdate = '1961-05-06' 
WHERE nm = 123; 

What if we did this, instead?

UPDATE person SET birthdate = '1961-05-06' 
WHERE name = 'George Clooney'; 

That would update the birthdate of everyone who is named George Clooney, and there might be more than one!

In an UPDATE statement, every row that matches the boolean expression in the WHERE clause is updated.

There's even a mode for the MySQL client where it will prevent you from doing updates without using a key. The mode is called, believe it or not, i-am-a-dummy

DELETE statement

Deleting is similar to updating, because it uses a WHERE clause, and again, multiple matches are all deleted. So the following deletes the George Clooney we all know from the movies:

DELETE FROM person 
WHERE nm = 123; 

and the following deletes everyone with that name:

DELETE FROM person 
WHERE name = 'George Clooney'; 

Warning! MySQL will delete all matches and there is no "undelete". So the following typo, where we accidentally forgot the WHERE clause, will have catastrophic effect, because it will delete every row from the table:

DELETE FROM person;

This is like the Unix rm command.

Datatype Affects Meaning

Earlier, I said datatypes change the meaning of various operations. This is true in many languages.

In JavaScript, for example, the meaning of an expression like a+b depends on the datatypes of variables a and b. If they are both numbers, the meaning is add. If either of them is a string, the meaning is concatenate.

A cool effect of the MySQL interpreter (shell) is that you can try different expressions, just like we do in the Python interpreter. Here's an example:

MariaDB [cs304_db]> select 100 + 21;
+----------+
| 100 + 21 |
+----------+
|      121 |
+----------+
1 row in set (0.01 sec)

MariaDB [cs304_db]> select 100 < 21;
+----------+
| 100 < 21 |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [cs304_db]> select '100' < '21';
+--------------+
| '100' < '21' |
+--------------+
|            1 |
+--------------+
1 row in set (0.00 sec)

MariaDB [cs304_db]> 

I tried three expressions there. The first shows that we can use MySQL as a calculator, just as we do with Python. Note that the result is formatted like all SELECT statements, with a column name created from the expression, and the value shown in the cell below.

The second expression shows we can compute boolean expressions. Not surprisingly, the return value of 0 (zero) means false.

The third expression shows that we can compute boolean expressions on strings as well as on numbers. The return value of 1 (one) means true.

So 100 < 29 is false but '100' < '29' is true. The difference is the datatype.

In the cs304_db.cgm table, finding the min() value of mgdl found 100, since the mgdl column is varchar(4) and 100 is the earliest string, lexicographically. ("Lexicographically" is jargon for "dictionary order".)

In the cs304_db.cgm2 table, finding the min() value of mgdl found 29, since that column is now an integer, and 29 is the smallest number.

We can see the difference in the result of the describe command:

MariaDB [cs304_db]> describe cgm;
+---------+-------------+------+-----+---------+----------------+
| Field   | Type        | Null | Key | Default | Extra          |
+---------+-------------+------+-----+---------+----------------+
| user    | varchar(20) | NO   |     | NULL    |                |
| date    | varchar(29) | NO   |     | NULL    |                |
| mgdl    | varchar(4)  | NO   |     | NULL    |                |
| rec_num | int(10)     | NO   | PRI | NULL    | auto_increment |
+---------+-------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

MariaDB [cs304_db]> describe cgm2;
+---------+--------------+------+-----+---------+-------+
| Field   | Type         | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+-------+
| date    | datetime     | YES  | MUL | NULL    |       |
| mgdl    | mediumint(9) | YES  |     | NULL    |       |
| rec_num | int(11)      | NO   | PRI | 0       |       |
+---------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

Integer Datatypes and Space

In the last section, we saw that in cs304_db.cgm2 the datatype was mediumint(9). That datatype indirectly tells the database how much space (in bytes of storage on disk) to use for storing and representing that number. (Ignore the (9); we'll only discuss the mediumint part.)

If you've taken CS 240, you know that data is represented with bytes of storage. We don't have time to go into it too much, but the more bytes you are willing to store, the more possible values you can have. Hence, the larger the largest value can be (and, sometimes, the smaller the smallest).

Let's be very concrete. First, there are two kinds of numbers: signed and unsigned. An unsigned number is never negative. Negative values just can't be represented. That can be useful sometimes. If you specify someone's age as an unsigned integer, and a typo means a user tries to enter their age as "-4", they will get an error instead of success.

Here's an example of the storage requirements for unsigned integers and the resulting range:

type storage (bytes) min max
tinyint 1 0 255
smallint 2 0 65535
mediumint 3 0 16777215
int 4 0 4294967295
bigint 8 0 2^64 -1

With signed integers, we devote one bit to the sign bit, sacrificing about half of our max value, but greatly expanding our min.

You can get the full details about mysql integer types

By and large, you should not worry about these. If you're concerned, err on the side of allowing plenty of space.

However, it's always smart to be frugal on space. Why? Disks are gargantuan nowadays, and even a table of a million rows only saves 1MB of data by switching one numeric column from mediumint to smallint. Pretty small potatoes.

But 1MB is not negligible when we are pulling that data into memory, sorting/searching it, and so forth. The more data is moved around, the more time it takes. Again, memory is big nowadays, and processors are fast, so you should not be alarmed about space. These effects are relatively small.

But small is not zero.

So, I expect you in your assignments and projects to be thoughtful when you design your tables. But please don't worry about signed and unsigned integers; nothing in this will depend on that difference.

Datatypes: char and varchar

The char() datatype is for fixed-length character data. For example, a B-number is always the same length: a B and 8 digits. Storing that value requires exactly 9 characters, so char(9).

  • char(8) would not work; the last digit would not be stored.
  • char(10) would waste one byte in every row.

In general, fixed-length data is faster to access, so you should use char when it is reasonable to do so.

Other kinds of strings are variable length. That is, they are usually short, but occasionally long. For example, an email address is usually short: something like sanderso@wellesley.edu is 22 characters long. scott.anderson@wellesley.edu is 28 characters long. But they are often longer. Occasionally, people get silly and have very long email addresses. (The longest address on the A-Word-A-Day mailing list is 251 characters: AWAD stats)

Do we disallow long email addresses? Truncate them so they don't work? Do we use char(30) and to heck with people with longer addresses? Do we use char(251) and waste space for almost every email address?

An alternative is to use varchar. Say varchar(251) for an email address. That allows it to be up to 251 characters, but saves space in the usual case when it's shorter.

You should use char() and varchar() appropriately, showing reasonable thoughtfulness.

Note that int is for numbers that we might do arithmetic with. Numbers like zip codes and social security numbers would typically be char(5) and char(9), because we won't do arithmetic on them, and sorting them as strings is perfectly reasonable.

Plus, in Canada and lots of other places, zip codes include letters as well as numbers.

Dates and Times

MySQL has several useful fourth-dimension types:

  • date is like 2018-08-31
  • time is like 13:01:30
  • datetime is like 2018-08-31 13:01:30
  • can be parsed and formatted in many ways

Enum Types

Sometimes, you want to enforce that an element can only be one of a small, enumerated set of possible values. For example, our database of pets might have

create table pet (
    species enum('cat', 'dog', 'rabbit'),
    sex enum('male', 'female'),
    neutered enum('yes','no'),
    ...
);

These are treated as strings in queries, but the INSERT statements will give errors if the value is not one of the possible values.

Internally, the values are represented as numbers using 1 or 2 bytes, so ENUM is also very frugal on space, compared to the equivalent varchar. For example, the species enum above would represent cat as 0 (zero), dog as 1 (one) and rabbit as 2.

There are two side-effects of this representation. One is that you can assign values as numbers if you want, thought that's unlikely. The other is that they sort using their numerical values; equivalently, they sort the order they are given in the enum list. So, if I want the "cat" pets to precede the "dog" pets when I order by species, I need to make sure the enum list is in sorted order.

Set Types

ENUM is the right type for mutually exclusive values. The pet is either neutered or not, but never both.

Sometimes you want zero or more values from a set of possible values. For example, our pets could have some problematic behaviors:

create table pet (
    ...
    behaviors set('bites', 'scratches', 'hisses', 'howls')
    ...
);

I had a problematic cat once:

insert into pet1 values ('sushi','cat','male','yes','bites,scratches');

update pet1 set behaviors = 'bites,scratches,hisses' where name = 'sushi';

Like ENUM, the SET datatype has a compact internal representation: 1, 2, 3, 4, or 8 bytes, depending on the number of set members (64 members maximum).

Remember: the syntax of ENUM and SET look the same, but the semantic difference between them is that:

  • ENUM is exactly 1 from the list, and
  • SET is 0 or more from the list

Key Constraint

As we know from joins, keys are useful as "pointers" to data in another table. They give us unique identifiers for a row in a table.

A table can have more than one key. A table of Wellesley students might have Bnumber, email address, and other keys. Often those are dictated by the data.

Generally, a table should specify a primary key. If you don't, then MySQL will allow you to insert multiple rows, even ones that are identical to existing rows. That sometimes is what you want, but often not. To forbid duplicate values and ensure that an intended key is really a key, declare it so.

The following demonstrates the effect I mean:

use webdb;

drop table if exists key1;
create table key1 (
    a int,
    c varchar(40)
);

drop table if exists key2;
create table key2 (
    a int,
    c varchar(40),
    primary key (a)
);

-- two identical rows in key1. This works.
insert into key1(a,c) values (1,'Now is the winter of our discontent');
insert into key1(a,c) values (1,'Now is the winter of our discontent');

select * from key1;

-- two identical rows in key2. This fails on the second insert
insert into key2(a,c) values (1,'Now is the winter of our discontent');
insert into key2(a,c) values (1,'Now is the winter of our discontent');

select * from key2;

Here's the behavior:

$ mysql < multiple-rows.sql 
a   c
1   Now is the winter of our discontent
1   Now is the winter of our discontent
ERROR 1062 (23000) at line 24: Duplicate entry '1' for key 'PRIMARY'

So the first table worked fine and inserted two rows, but the second did not.

Very often, the second is the behavior we want.

Primary Keys

Just a bit more about primary keys.

  • Database tables are organized so that searching for rows based on the primary key is guaranteed to be as fast as possible.
  • but, primary key values must be unique, like SSN
  • Duplicate key errors are a good thing: you don't want data entry errors to cause your database to have bad data in it.

Names as Primary Keys

It's so tempting to make a person's name be a primary key. After all, that's what names are for: to say who we are talking about or addressing.

However, people's names make terrible primary keys, because names are not unique. (This varies by culture, but it is certainly the case in the United States.)

I could tell you many stories about my name.... I'll tell one or two in class.

Auto Increment

In the WMDB, we are (trying) to use the IMDB values for primary keys. But there are many times when you have a database where you need/want to have a primary key and none is handy.

Indeed, in the WMDB, when we insert a staff member -- specifically, when each of you inserts yourself as a staff member -- what unique UID should you have?

Wouldn't it be great if the table kept a hidden counter and just issued the next number as the UID, like taking a number at a bakery.

It does. It's called auto_increment. Here's how it works in the wmdb.staff table:

create table staff (
    uid int not null auto_increment,
    name varchar(30),
    primary key (uid),
    )

So, if "Emily Nguyen" inserts themself into the table and there happens to have been a prior "Emily Nguyen", the insert works, but the new "Emily Nguyen" has a differen UID from the earlier one.

Inserting Data with Auto Increment

How do you insert data into a table that has an auto_increment column? There are two choices:

  • You can override the auto_increment behavior and just specify a value, or
  • You can use the auto_increment behavior by either:
    • omitting the column in the INSERT statement, or
    • providing a value of either NULL or 0 for the column value

Here are examples of all of those.

If we want to insert 'Emily' into the Staff table, and we know that their UID is 234, we can do this:

insert into staff(uid,name) values (234, 'Emily');

Note that that is pretty rare. First, we run the risk that there already is or will be someone with UID 234, so we would get a primary key conflict. In general, if we have an auto_increment column, we should be using it as we intended. So, let's see how to do that.

All of the following are equivalent, inserting Emily and allowing MySQL to assign them the next UID for that table:

insert into staff(name) values ('Emily');
insert into staff(uid,name) values (NULL, 'Emily');
insert into staff(uid,name) values (0, 'Emily');

Finding an Auto_Increment value with last_insert_id()

You are probably wondering how to determine the UID that Emily got. One tempting way is the following:

select max(UID) from staff;

That will mostly work, but would fail if someone else is inserted into the staff table (maybe by some other CS 304 student) after we inserted Emily and before we computed the max(uid). Rare, but possible. We'll discuss this more later in the course when we discuss concurrency.

However, our MySQL connection keeps track of the last auto_increment value, and we can ask for it with the last_insert_id function:

select last_insert_id();

mysqldump

  • Not everything is done with batch files:
  • We could update rows using the MySQL shell.
  • We will have an online application that updates our database via the web.

So, if we need to save and restore a database, the state of the database is not in any file.

What do I mean by the state of a database? It means

  • all the create table statements
  • all the data in all the tables

If we insert, update, and delete some rows, we have changed the state of the database.

But

  • what if we want to "undo" some updates?
  • what if we want to save our state, in case we want to "undo"?
  • what if we want to transfer that state to another database?
  • how do we restore the database to some prior state?

All of these are quite common.

If you've ever looked at the "version history" of a Google Doc, you have seen this idea.

If you've ever used GIT, you've encountered this idea.

The tool to save (or "dump") the state of a database to a file is a command called mysqldump. Here's how to use it:

mysqldump mydb > dump_file.sql

For example, if I'm logged into the cs304guest account, and I'm about to do something I'm nervous about, I can dump the state of the database to a file. I often name the file something memorable, like this:

mysqldump cs304guest_db > before-datatype-change.sql

The contents of the dump is a batch file that contains all the drop table, create table and insert statements that are necessary to re-create the state of the database to that point.

So, if my experiment to change the datatype is a fiasco, I can restore my earlier version like this:

mysql < before-datatype-change.sql

(I dump the state of the databases nightly, so there is a safety net that will allow you to restore your database to an earlier point in time, if you mess up.)

Create Table Examples

We saw one example of the person table earlier in this reading. Here are a couple more.

Here's the Movie table from the WMDB:

CREATE TABLE movie ( 
    tt int, 
    title varchar(60), 
    `release` char(4), 
    director int,
    addedby int, 
    primary key (tt) 
); 

Here are some from an imaginary database with employees and departments. Don't worry about the decimal datatype, though you're welcome to use it if you want. decimal allows you do to exact calculations with numbers with a decimal point, like money.

drop table if exists employee;
drop table if exists department;

CREATE TABLE department (
       did integer auto_increment,
       dept_name varchar(50),
       chair integer, -- empno of current chair
       -- MySQL has special datatypes that Python doesn't
       -- this maxes out at 9,999,999.99: 9 digits w/2 past the decimal
       budget decimal(9,2),
       primary key (did)
);      

CREATE TABLE employee ( 
    empno integer auto_increment,
    name varchar(30), 
    hire_date date,
    did integer,  -- did of department
    last_login datetime,
    salary decimal(8,2),
    -- exempt employees are not eligible for overtime
    exempt enum('yes','no'),
    -- HR needs to track race/ethnicity for affirmative action reasons
    -- but employee can omit. Uses SET because can be multi-racial
    race set('white','black','hispanic','asian','pacific-islander',
        'native american','other','omitted'),
    primary key (empno) 
); 

Summary

  • Create a MySQL table by defining the table name, column names and types, and a primary key
  • primary key requires uniqueness but guarantees speed
  • Datatypes like int, char, varchar, date
  • Datatypes determine operations and storage costs
  • insert into table(col1,col2) values (val1,val2)
  • update table set col1 = val1, col2 = val2 where expression
  • delete from table where expression
  • mysqldump database > dumpfile.sql