
Functional Dependencies¶
An important aspect of normalization is ensuring that the functional dependencies are such that the attributes depend on the key, the whole key, and nothing but the key (see 3NF, below). This is unfortunately a bit abstract, but that's because normalization is based on some formal mathematics.
First, some notation:
- A is an attribute, like SSN or Name
- X is a set of attributes, like {SSN, Name, Age}
- AB is a set of the listed attributes, like {SSN, Name}
- XY is a union of the sets: X ∪ Y
A Functional Dependency (X → Y) is where knowing X means you know Y. Intuitively, the attribute Y is implicit in X and predictable from X. That is, as a practical manner, if you know X, you can infer Y.
Terminology¶
See your textbook or the Wikipedia pages for the longer, more exact definitions:
- Superkey: uniquely identifies a row
- Candidate key: minimal superkey.
- Primary key: exactly what you think it is.
- Non-prime attribute: An attribute not found in any candidate key. For example, someone's zip code.
Normal Forms¶
(This is drawn from KBL section 6.5.) There's also information in the wikipedia page.
- 1NF - relational model. That is, this is not much constraint at all. A model in 1NF is represented by tables of attributes. There are no set-valued attributes. There are no repeating elements or groups of elements. Wikipedia's example (Customer ID, First Name, Surname, Telephone number) of first normal form is good.
- 2NF - No FDs from a sub-key to a non-prime attribute. If the key is only one attribute, 2NF is trivial, since there's no such thing as a sub-key. Wikipedia's example (employee, skill, location) of second normal form is good.
- 3NF - good, not perfect. No non-trivial FDs among non-prime attributes. Again, Wikipedia's example (Tournament,Year,Winner,Winner DOB) of third normal form is good.
- BCNF - good, possibly costly. Every non-trivial FD must depend on a superkey. Wikipedia's example (Court,start time, end time ,rate type) of Boyce-Code normal form at Wikipedia is complicated but good.
3NF¶
In practice, with normalization, we typically want our database to be in third normal form, 3NF. Our mantra is:
Attributes depend on the key, the whole key, and nothing but the key
At least one of the following holds for every FD X → A:
- A ∈ X, or
- X is a superkey, or
- A ∈ some key
The third item is somewhat inscrutable, but is dropped in BCNF. (The
above definition is equivalent to every non-prime attribute depends
on a key
.)
BCNF¶
Either one or the other of the following holds for every FD X → A:
- A ∈ X (trivial dependency)
- X is a superkey
In other words, you need a key to know anything you don't already know. All redundancy is gone.
Properties¶
BCNF seems ideal, but maintenance is sometimes better with 3NF, because some integrity checks would require a join in BCNF but not in 3NF. Example, if age less than 21, hobbies can't include drinking, but we'd need to join with the person table to check the age.
Losslessness¶
All of our modifications must be lossless. That means we must be able to re-construct the original relation using joins.
Note, if we take a relation r and divide it into two sets of columns, r1 and r2, then
r ⊆ r1 ⋈ r2...
is always true. Why? Because the information is in the selection criteria. Without that, you get a cross product. We're worried about losing information, not tuples.
A binary decomposition is lossless iff a tuple from one relation maps to only one tuple in the other: 1-many or many-1.