Welcome!
Everything is fine.

Relational Algebra, Normalization and Views

Plan

  1. Announcements
  2. Discuss
  3. Recap Topics
  4. Quiz Questions
  5. Discuss Relational Algebra and do some exercises
  6. Breakout session activities

Announcements

  1. Grading status
  2. Schedule:
    • presenations on Tuesday
    • Senior Poster Fair next Friday, 3-4 in H302
  3. Project:
    • Alpha should be a fairly complete app.

Views, Relational Algebra and Normalization

Today, we'll discuss three somewhat related topics:

  • views
  • relational algebra
  • normalization

Views

A view is like a virtual table:

use cs304_db; 
drop view if exists filmography; 

create view filmography(nm,name,title) as 
select nm,name,title 
from person 
    inner join credit using (nm) 
    inner join movie using (tt); 

Discussion of Views

What are the pros and cons of a view versus creating a real filmography table?

  •  
  •  
  •  

Classic time/space tradeoff

  • real table saves time by not re-computing the query
  • view saves space by not storing data. Disk space is increasingly cheap, though.
  • real table has to keep the data updated, though. If the data changes slowly (actor filmographies) and is searched often, table can make sense.
  • maybe you update it as needed or less often than that, so real table might be out of date

Relational Algebra

What's the big deal?

a(b+c) vs ab+ac

  • algebraic rules let us transform one expression into another that means the same thing
  • relational algebra is the theoretical underpinning of relational databases
  • Thus, the DBMS can turn one query into another
  • The DBMS can estimate the run time of each such query, and
  • choose the fastest. This is called query optimization.

Tables are sets of tuples. Every expression denotes a set of tuples.

Notation

Naturally, relational algebra has its own notation:

  • select (sigma): σcriteria(R)
  • project (pi): πattributes(R)
  • union: R ∪ S
  • set difference: R - S
  • cross product: R × S
  • intersection: R ∩ S
  • join (bowtie): R ⋈condition S

Join can be defined by cross-product and selection but is typically defined as a primitive.

Union Compatible

Same number of attributes and corresponding ones are the same type.

Doesn't have to be sensible. Example:

Town sign adding population, elevation and year it was established. These rows are union-compatible, since they consist of a string and a number.

Quiz Questions

I'll address your quiz questions

Exercise on Relational Algebra

These all use WMDB relations.

Note that it's okay to name parts of the expression using something like variables.

  1. Names of people born on September 2, 1967.
     
    πname(  σbirthdate='1967-09-02'(Person)) 
    
  2. titles and years of all movies directed by Peter Jackson and released in the 1990s
     
    M90 = πtt(  σrelease='199%'(Movie))
    MPJ = πtt( σname='Peter Jackson'(Movie ⋈ nm=directorPerson))
    πtitle,year (Movie ⋈ (M90 ∩ MPJ ) )
  3. Titles of movies directed by someone under 50 and released after 2010.
     
    Mrecent = πtt(  σrelease>'2010'(Movie))
    MYoung = πtt( σbirthdate>1966(Movie ⋈ nm=directorPerson))
    πtitle (Movie ⋈ (Mrecent ∩ MYoung ) )
  4. Titles of movies not directed by someone under 50 and released after 2010.
     
    M2010 = πtt(  σrelease>'2010'(Movie))
    MYoung = πtt( σbirthdate>1966(Movie ⋈ nm=directorPerson))
    πtitle (Movie - Movie ⋈ (M2010 ∩ MYoung ) )

Query Optimization

What's the time complexity of each of the following? Discuss with elbow partner.

  • Looking up a person by NM
  • Looking up a person by NAME
  • Looking up a person by NAME if there's an index?
  • Should we have an index on name in the WMDB.person table?
  • Finding the intersection of two sets of tuples?
  • O(log N), and probably close to O(1)
  • O(N)
  • O(log N), and probably close to O(1)
  • We often look people up by name, so, yeah.
  • O(N log N), using the mergesort idea

Normalization

What's the point?

  • Avoiding redundancy lets us be frugal on space (less of an issue nowadays) but less space can also mean less time (less data to read from disk, transmit across networks, store in memory and so forth).
  • Avoiding redundancy prevents anomalies
  • Normalizing makes the database more flexible, because unanticipated queries are more possible.

Denormalization

Sometimes, it's better not to normalize, even to combine tables that were already separate. Criteria to consider:

  • normalization (decomposition) can makes complex queries slower because we have to do a join. A denormalized DB is pre-joined.
  • normalization can make simple queries faster because there's less data to move around, and more records can fit into a single disk block. Imagine trying to find a movie with a particular pattern of title (so you have to read every movie title), but the movie records also have big text blocks in them (a synopsis?).
  • normalization can make simple updates faster for the same reason.
  • normalization reduces storage, because it reduces duplication. For example, there's no reason to repeat all the information about a movie in each actor who appeared in the movie.
  • normalization can sometimes increase storage if the duplicate data is small (e.g. the name of a hobby) and the shared key is long (maybe an ID or email address).

The MongoDB version of WMDB is de-normalized

Ethics Discussion

Should we do it today or next Wednesday?

Today:

  • pro: some topics already suggested
  • pro: we are here today; next Wednesday is uncertain
  • con: less time pressure next Wednesday (maybe?)
  • con: maybe more topics next Wednesday
  • con: nicer classroom next Wednesday

Thoughts? Votes?

If today, here are the topics so far

Summary

  • Relational algebra is an important theoretical underpinning with immense practical value as well.
  • Normalization is a good idea most of the time, though not always.
  • Views are really useful.