
Normalization
Normalization is the idea of putting a database in standard normal form, which eliminates certain kinds of redundancies which can sometimes cause anomalies. There's a lot of formalism behind it, which can be confusing and daunting.
Reading¶
Please read the Wikipedia page on Database Normalization. It's fairly short (about 4 pages long, as of this writing). The following material (also fairly short), helps fill in a bit more about normalization.
The following notes are based on Chapter 6 of Kiefer, Bernstein and Lewis (KBL), through 6.6.
Idea: decompose (break up) relations using normal forms.
Goal: Eliminate problems caused by redundancy: update anomalies, insertion anomalies, and deletion anomalies.
Simple Example using People and Hobbies¶
What kind of redundancy are are we talking about? Consider a table that stores people and their hobbies.
CREATE TABLE People(SSN, Name, Addr, Hobby) PRIMARY KEY (SSN, Hobby);
Here's some sample data:
SSN | Name | Addr | Hobby |
---|---|---|---|
555 | Homer Simpson | 742 Evergreen Terrace, Springfield | Drinking |
555 | Homer Simpson | 742 Evergreen Terrace, Springfield | TeeVee |
555 | Homer Simpson | 742 Evergreen Terrace, Springfield | Eating |
666 | Bart Simpson | 742 Evergreen Terrace, Springfield | Mischief |
489 | Lisa Simpson | 742 Evergreen Terrace, Springfield | Reading |
489 | Lisa Simpson | 742 Evergreen Terrace, Springfield | Politics |
This has a lot of redundancy, because someone with 3 hobbies like Homer has three rows, so his address is stored three times.
What are other examples? Maybe Movie Genre? A movie could have a set
of genres associated with it, so the genre would have to be part of
the key. Or, even better, user-defined tags
, as is often done
nowadays.
Anomalies¶
Formally, the following kind of anomalies can arise:
update anomaly
Have to update all relevant tuples instead of just one. For example, if Homer's address changes, we have to update all the rows devoted to him, one for each hobby. All the relevant tuples would have to be locked to avoid race conditions.
(This is certainly why the UPDATE syntax allows us to update all matching rows in one query. Consider i-am-a-dummy mode of MySQL. )
insertion anomaly
What to do when inserting tuple without the extra data. For example, we have to insert Marge Simpson, but we don't (yet) have any information about her hobbies. What do we do?
- use null?
- But what if it's part of the primary key? (It typically will be: why?)
- What about subsequently added hobbies or genres? Should it replace the null, or just add a new row?
- Is there information we can't store? (E.g. suppose instead of
having a separate
customer
table, we put customer data in the rental table. Now suppose a customer doesn't currently have anything rented. What happens to the customer data?)
deletion anomaly
What do we do if we have to delete someone's hobby? When do we replace it with null because it's the last tuple for this person? Do we lose information when we delete a tuple? (Eg. delete the last student from a course.)
wasted storage
Redundant information has to be stored multiple times. Homer's address is stored three times in the example above.
Avoiding Anomalies¶
Most of these strike me as stuff no one would ever do. However, as the number of database tables and relationships goes up, it might well creep in. It's good to understand these conceptual problems and their (formal) solutions.
Solution: have separate relations with a 1-many mapping of hobbies for each person. Must be lossless. That means we can get all the information back by using a JOIN.
Extended Example using Invoices¶
I really like the normalization tutorial that Fred Coulson wrote because it starts with something that is fairly natural, rather than starting with something that is obviously flawed. (If that link is dead, try this Coulson Normalization Tutorial or this PDF of his Normalization Tutorial.)
I think he does a good job on this, so I won't spend class time reading it to you, but I'll be happy to address any questions. Do please take the time to read it.
This example will be somewhat complicated by the fact that we don't know what these normal forms are, but I think trying to do the formalism first is a recipe of confusion. Our goal will be to reduce redundancy.
Formalism¶
Coulson's tutorial mentions dependencies but mostly leaves the definition to your intuition. I think it's fine, but if you want more formalism, you can read more about functional dependencies.
Denormalization¶
Sometimes, it's better not to normalize, even to combine tables that were already separate. Criteria to consider about normalization:
- normalization makes complex queries slower, because we have to join
- normalization makes simple queries faster, because less data is moved around
- normalization makes simple updates faster, because less data is read and re-written
- normalization reduces storage, because redundancy is reduced
- normalization can sometimes increase storage, if the extra foreign keys that have to be stored outweigh the reduction in data storage
Many real-world databases are not normalized. So, normalization is a choice, not a requirement.