Welcome!
Everything is fine.

Relational Algebra, Normalization and Views

Plan
Announcements
Views, Relational Algebra and Normalization
Views
Discussion of Views
Relational Algebra
Notation
Union Compatible
Quiz Questions
Exercise on Relational Algebra
Query Optimization
Denormalization
Ethics Discussion
Summary

Plan¶

Announcements
Discuss
Recap Topics
Quiz Questions
Discuss Relational Algebra and do some exercises
Breakout session activities

Announcements¶

Grading status
Schedule:
- presenations on Tuesday
- Senior Poster Fair next Friday, 3-4 in H302
Project:
- Alpha should be a fairly complete app.

Views, Relational Algebra and Normalization¶

Today, we'll discuss three somewhat related topics:

views
relational algebra
normalization

Views¶

A view is like a virtual table:

use cs304_db; 
drop view if exists filmography; 

create view filmography(nm,name,title) as 
select nm,name,title 
from person 
    inner join credit using (nm) 
    inner join movie using (tt);

Discussion of Views¶

What are the pros and cons of a view versus creating a real filmography table?

Classic time/space tradeoff

real table saves time by not re-computing the query
view saves space by not storing data. Disk space is increasingly cheap, though.
real table has to keep the data updated, though. If the data changes slowly (actor filmographies) and is searched often, table can make sense.
maybe you update it as needed or less often than that, so real table might be out of date

Relational Algebra¶

What's the big deal?

a(b+c) vs ab+ac

algebraic rules let us transform one expression into another that means the same thing
relational algebra is the theoretical underpinning of relational databases
Thus, the DBMS can turn one query into another
The DBMS can estimate the run time of each such query, and
choose the fastest. This is called query optimization.

Tables are sets of tuples. Every expression denotes a set of tuples.

Notation¶

Naturally, relational algebra has its own notation:

select (sigma): σ_criteria(R)
project (pi): π_attributes(R)
union: R ∪ S
set difference: R - S
cross product: R × S
intersection: R ∩ S
join (bowtie): R ⋈_condition S

Join can be defined by cross-product and selection but is typically defined as a primitive.

Union Compatible¶

Same number of attributes and corresponding ones are the same type.

Doesn't have to be sensible. Example:

Town sign adding population, elevation and year it was established. These rows are *union-compatible*, since they consist of a string and a number.

Quiz Questions¶

I'll address your quiz questions

Exercise on Relational Algebra¶

These all use WMDB relations.

Note that it's okay to name parts of the expression using something like variables.

Names of people born on September 2, 1967.

 
π_name(  σ_{birthdate='1967-09-02'}(Person))

titles and years of all movies directed by Peter Jackson and released in the 1990s

 
M90 = π_tt(  σ_{release='199%'}(Movie))
 
MPJ = π_tt(  σ_{name='Peter Jackson'}(Movie ⋈ _nm=directorPerson))
 
π_title,year (Movie ⋈ (M90 ∩ MPJ ) )

Titles of movies directed by someone under 50 and released after 2010.

 
Mrecent = π_tt(  σ_{release>'2010'}(Movie))
 
MYoung = π_tt(  σ_{birthdate>1966}(Movie ⋈ _nm=directorPerson))
 
π_title (Movie ⋈ (Mrecent ∩ MYoung ) )

Titles of movies not directed by someone under 50 and released after 2010.

 
M2010 = π_tt(  σ_{release>'2010'}(Movie))
 
MYoung = π_tt(  σ_{birthdate>1966}(Movie ⋈ _nm=directorPerson))
 
π_title (Movie - Movie ⋈ (M2010 ∩ MYoung ) )

Query Optimization¶

What's the time complexity of each of the following? Discuss with elbow partner.

Looking up a person by NM
Looking up a person by NAME
Looking up a person by NAME if there's an index?
Should we have an index on name in the WMDB.person table?
Finding the intersection of two sets of tuples?

O(log N), and probably close to O(1)
O(N)
O(log N), and probably close to O(1)
We often look people up by name, so, yeah.
O(N log N), using the mergesort idea

Normalization

What's the point?

Avoiding redundancy lets us be frugal on space (less of an issue nowadays) but less space can also mean less time (less data to read from disk, transmit across networks, store in memory and so forth).
Avoiding redundancy prevents anomalies
Normalizing makes the database more flexible, because unanticipated queries are more possible.

Denormalization¶

Sometimes, it's better not to normalize, even to combine tables that were already separate. Criteria to consider:

normalization (decomposition) can makes complex queries slower because we have to do a join. A denormalized DB is pre-joined.
normalization can make simple queries faster because there's less data to move around, and more records can fit into a single disk block. Imagine trying to find a movie with a particular pattern of title (so you have to read every movie title), but the movie records also have big text blocks in them (a synopsis?).
normalization can make simple updates faster for the same reason.
normalization reduces storage, because it reduces duplication. For example, there's no reason to repeat all the information about a movie in each actor who appeared in the movie.
normalization can sometimes increase storage if the duplicate data is small (e.g. the name of a hobby) and the shared key is long (maybe an ID or email address).

The MongoDB version of WMDB is de-normalized

Ethics Discussion¶

Should we do it today or next Wednesday?

Today:

pro: some topics already suggested
pro: we are here today; next Wednesday is uncertain
con: less time pressure next Wednesday (maybe?)
con: maybe more topics next Wednesday
con: nicer classroom next Wednesday

Thoughts? Votes?

If today, here are the topics so far

Summary¶

Relational algebra is an important theoretical underpinning with immense practical value as well.
Normalization is a good idea most of the time, though not always.
Views are really useful.