Transaction Processing

Overview:

  • Introduce some terminology, then
  • look at some examples of transactions in real tasks

At a high level of abstraction, operations on the database consist of read operations (e.g. select) and write operations (e.g. update, insert, delete). A logically related sequence of operations can be called a transaction. These transactions come into the server concurrently from many sources.

Transaction processing deals with this concurrency.

When we deploy a Flask app, it can handle concurrent requests. Therefore, we will have to ensure that our coding doesn't suffer from deadlock and other anomalies. This reading discusses and explains some of those anomalies.

At the end of this reading, there are some important patterns to adopt in your SQL and Flask coding.

(Terminology note: this topic is sometimes called Online Transaction Processing (OLTP).)

Example

Let's start with a real world example. Imagine we are implementing a Flask app that will be a part of course registration. We write some code like this:

curs.execute('SELECT seats_available FROM course WHERE crn = %s',[crn])
n = curs.fetchone()[0]
if n > 0:
    curs.execute('INSERT INTO register(crn,bid) values (%s,%s)',
                 [crn,user_bid])
    curs.execute('UPDATE course SET seats_available = seats_available-1 
                  WHERE crn = %s', [crn])
    conn.commit()
else:
    flash('sorry, that class is full')

What could go wrong? This reading answers that question.

ACID

In supporting concurrency, we want to achieve the ACID properties. (Sorry, no Electric Kool-aid.)

  • A: Atomic. A transaction (several statements) occurs as if it were an atomic entity, happening at a moment in time. In particular, it either runs to completion and commits, or it aborts with no effect at all. That is, it's all or nothing.
  • C: Consistent. The database always appears to be consistent: no agent ever sees an inconsistent state (integrity constraints violated). An example of inconsistent state might be where Banner allows 26 people to register for a class with a limit of 25.
  • I: Isolation. The effect of a set of concurrent transactions is the same as some serialization (linear sequence) of that set.
  • D: Durability. The system must ensure that once a transaction commits, its effects remain in the database, even if the computer crashes. the Wikipedia article says it clearly: Durability can be achieved by flushing the transaction's log records to non-volatile storage before acknowledging commitment. We'll take this as an excuse to talk about RAID storage.

Note some new items of terminology:

  • commit: a transaction successfully completed and its effect is durable.
  • abort: a transaction did not complete, and decided to give up. It has no effect at all.
  • rollback: The changes of a partially completed transaction are undone. This implements an abort.

Isolation Sorting

Imagine an update to the database that reads an old value and writes a new value that is one more than the old one: an increment to a counter. The database could be counting the number of tickets sold or the number of seats filled on an airplane. Or suppose each transaction is a debit to a bank account, so it has to read the current balance, check to see that it's enough to cover the check, and then debit the account. Imagine that two of these transactions happen concurrently:

Question: What can go wrong? Clearly, if the two reads happen before either of the writes, both reads will be of the same value and the writes will both store the same value, but that will not be correct.

Just to be concrete, let's suppose the current value is 5. Both transactions read 5 and then both write 6, so after two concurrent increments, the final value is 6, when it should be 7.

Serializability

The term serializabilty means that a set of concurrent transactions is equivalent to some serial schedule. This improves concurrency without sacrificing isolation. That's what went wrong in the simultaneous increment problem, above. That is, reading each value followed by writing the updated value is not equivalent to any serialized schedule, where one transaction happened completely before the other.

Strict Two-Phase Locking

Strict two-phase locking is a way to achieve the ACID properties in a DBMS. The DBMS's concurrency control module has locks for each database datum, both read locks and write locks.

Question: why two kinds? Because we can allow two read locks to both be granted, but not two write locks, so the kind of lock matters. In fact, if either lock is a write lock, the second lock is refused. This table illustrates that, where cells have an X if the given pair of locks are incompatible.

    granted
    read write
requested read   X
write X X

This is called strict two-phase locking.

  • It's called two-phase, because you lock and unlock
  • It's called strict because locks are held until the transaction is done (not just until lock no longer needed)

What about abort? In the following example, the subscript indicates which transaction is doing the operation. What goes wrong?

What goes wrong is that the second transaction reads the value of X that the first transaction wrote, computes a new value based on it, then the first transaction was aborted, so second transaction is now based on something that didn't happen.

This trouble is avoided by ensuring that locks are held until the transaction commits or aborts. The set of transactions is serializable in the order in which they committed. In the scenario above, the second transaction shouldn't start (because it can't get a lock) until the first one either commits or aborts.

Dirty Read

A dirty read is reading a modified, uncommited value. In the following scenario, the second transaction read the value x, but that value didn't stick. Suppose the first transaction is writing a new bank balance, but the operation aborts, and the second transaction proceeds thinking there are x dollars in the account, but that's not correct.

Non-Repeatable Read

A non-repeatable read is reading a value twice and getting different values (or related values and getting inconsistent ones). Maybe the first read is to find out how many seats are left on the plane, and the second is to get the value so that we can decrement it. But when we re-read, we get a different value!

Lost Update

Lost update changes don't stick. (Like the parallel increments, above.) In the following scenario, the "y" that gets written has no effect. It might be nice to not have that check deducted from our bank account, but we wouldn't want the vendor to not record the fact that we paid!

Deadlock

In deadlock, we are looking at the effects of locking multiple items. For example, in making a plane reservation, we might need to lock both the reservations table (call it R) and the passengers table (call it P).

We have deadlock when neither of two concurrent transactions can proceed, because each holds a lock on some data the other needs:

In traffic, deadlock is often called gridlock: cars can't go because other other cars can't go.

Deadlock can happen in any system where a transaction can hold a lock while requesting another. What solutions can you think of? (We'll discuss your ideas in class.)

  1.  
  2.  
  3.  
  4.  

Dining Philosophers

The discusson of deadline usually leads to discussion of the dining philosophers problem. (When I learned it, they used chopsticks, not forks) The key ideas:

  • Processes acting in parallel or concurrently.
  • Need to share resources,
  • but have exclusive access to them
  • Want to avoid deadlock.
  • Also want to avoid starvation; that is, everyone eventually gets the resources they want.
  • We will not concern ourselves with the yuck factor of sharing eating implements 🍴

Locking in Relational Databases

There are two commonly implemented options for the granularity of locking in DBMSes:

  • lock whole table, based on SELECT, UPDATE, DELETE and INSERT
  • lock tuples.

The finer-grained option of just locking tuples allows for more concurrency, but a problem can be caused by phantoms: inserting a new tuple that wasn't locked in the original set. More-sophisticated databases handle this.

Isolation Levels

Many DBMSes allow the Database Administrator (a person) to choose the strictness of locking and its granularity. In increasing strength (less parallelizability).

  1. READ UNCOMMITTED: dirty reads possible. Don't get a read lock.
  2. READ COMMITTED: dirty reads impossible, but nonrepeatable reads are. Get a read lock, but release after read.
  3. REPEATABLE READ: nonrepeatable reads impossible, but phantoms are. lock all tuples read by a SELECT until end of transaction.
  4. SERIALIZEABLE: long-duration read locks on all needed tables.

Different transactions can run at different levels. In MySQL, you do this with the SET TRANSACTION statement.

Locking in MySQL

MySQL has statements to lock and unlock a set of tables: LOCK TABLES and UNLOCK TABLES. Note the plural: you lock all the tables you want to lock at once, not one at a time. That helps to avoid the deadlock situation we discussed above. The LOCK statement also unlocks any tables you currently have locked, so you can't lock one and then lock another without unlocking the first.

You can choose either a READ lock or a WRITE lock. As we saw when we talked about strict two-phase locking, two READ locks can run concurrently, but nothing can run concurrently with a transaction that has a WRITE lock.

Here are the new statements:

  • lock tables A, B, C read/write locks the given tables with either a read or write lock. We'll see an example in the demo.
  • unlock tables; unlocks all tables locked by the most recent lock statement.

You'll see these in the demo.

Transactions in MySQL

The ability to lock tables is simple, easy and useful, but significantly impedes progress, as our later demo will show. If we use InnoDB tables, we can do transactions. The start transaction statement creates a kind of virtual or parallel world, so that concurrent transactions can run in that parallel world (before the transaction started) and when the transaction commits, concurrent transactions run in the world after the transaction finished. Something like this:

The important thing is that the second transaction doesn't have to wait for the first to finish; it just runs in the before world. The transaction demo should make this clearer.

The necessary statments are:

  • start transaction starts a sequence of operations.
  • commit completes the sequence and switches to the after world
  • rollback aborts the sequence and restores the before world

Note that we must use the InnoDB type table, because only that table representation can do transactions in this way. The other tables work in an "autocommit" mode.

Durability

Returning to our ACID properties, let's look at how to ensure durability. The idea is to ensure that, even if there is a power outage or a disk failure, the database can be restored to a consistent state.

Write-ahead Log

One common approach is a write-ahead log, in which the DBMS records each transaction to the end of a transaction log and saves it to disk before modifying the database. Typically, the log is on a different disk, so that there isn't a single point of failure.

  • undo records: records previous value of changed items. To abort a transaction, just work backwards. Mark the beginning of a transaction, so that we can stop.
  • If recovering from a crash, only abort uncommitted transactions. Therefore, an entry is logged when a transaction commits or aborts.
  • periodically enter a checkpoint record, listing active transactions.

Learning More

There is lots of info at the MySQL site.

Practical Implications

The foregoing discussion is not purely theoretical. Our deployed Flask apps are multi-threaded and so we have to worry about concurrent database transactions. The following sections discuss several common situations that CS 304 students have to face and how to approach or solve them.

There are runnable demos in the downloads folder. Copy it in the usual way:

cd ~/cs304
cp -r ~cs340/pub/downloads/locking locking
cd locking

The directory has the following demos:

Baby Sitting Coop

The demos of locking and transactions use a

It uses a scenario in which we will implement a simple table of balances. Suppose these are babysitting hours in a co-op, but they could just as well be dollars. The idea is that if Al agrees to babysit for Chen for 3 hours, Al's balance goes up by 3 and Chen's goes down by 3. The total number of hours is fixed at, say, 100.

Here is the create-table-sit.sql file, which creates the baby-sitting co-op table.

-- Sets up a babysitting coop.

drop table if exists sit;
create table sit(
    name varchar(30) primary key,
    bal  int)
    engine = InnoDB;

insert into sit values
    ('al',20),
    ('bobby',20),
    ('chen',20),
    ('debby',20),
    ('ethan',20);

Here is the audit-sit.sql script, which checks that everything is kosher:

-- Audits the babysitting coop. The total number of hours should always be
-- 100 (or whatever the fixed value is).

select 'babysitting balance is', sum(bal), if(sum(bal)=100,'PASS','FAIL') 
from sit;

select * from sit;

Notice that SQL has an IF expression. Just a lagniappe for you; no need to learn it.

Lock Tables Demo

This demo shows how to use the lock tables statement that we learned earlier.

The following transfers one hour from bobby to al in two steps but without locking. This is not safe against concurrency. If an audit were to happen between the first and second step, it would fail: the books would not balance.

update sit set bal = bal - 1 where name='bobby';
update sit set bal = bal + 1 where name='al';

The following transfers one hour from bobby to al as an atomic transaction.

lock tables sit write; 
update sit set bal = bal - 1 where name='bobby'; 
update sit set bal = bal + 1 where name='al'; 
unlock tables; 

To run the demo, first create the tables:

mysql < create-table-sit.sql 

Then, do this:

  1. Have two bash terminals.
  2. Skip the lock tables and do the first update only.
  3. switch to the other bash terminal and run the audit script. The audit fails.
  4. do the second update.
  5. re-run the audit. This time it passes.
  6. repeat this procedure, but this time with locking; the audit script will stop until the lock is released.

Final notes:

  • You must lock all the tables you want in one statement. This avoids deadlock.
  • The LOCK TABLES statement implicitly unlocks any locked tables.
  • You cannot use a locked table more than once in a query; use aliases instead. Lock each alias.

Not every statement requires locking. Single statements don't, even if it updates multiple rows. To add 10 to each balance, you can do the following; no lock required:

update sit set balance = balance+10; 

Transaction Demo

The locking demo is cool, but significantly impeded progress. If we use InnoDB tables, we can do transactions. Here's an example of a transaction that is equivalent to the one in the locking demo.

start transaction; 
update sit set bal=bal+1 where name='al'; 
update sit set bal=bal-1 where name='bobby'; 
commit; 

To run the demo, start two terminals, and in one terminal, do the transction, one step at at time, and in the other terminal, run the audit script.

Isn't this cool?. Even cooler, you can abort a transaction partway through:

start transaction; 
update sit set bal=20; 
select * from sit; 
rollback; 
select * from sit; 

Not everything can be rolled back. Typically DDL can't.

Inserting Tuples with AUTO_INCREMENT

One common occurrence is inserting a new element into a table that has an auto_increment id column. Say, inserting a new user into a table of users. The question is: how do you find out what id was issued for this new user?

Here's a common coding pattern:

name = request.form.get('username')
curs.execute('insert into users(uid,username) values (NULL,%s)', 
              [name])
curs.execute('select max(uid) from users') 
newuid = curs.fetchone()[0] 

or even:

name = request.form.get('username')
curs.execute('insert into users(uid,username) values (NULL,%s)', 
              [name]) 
curs.execute('select uid from users where username = %s', [name]) 
newuid = curs.fetchone()[0] 

What's wrong with that?

  • What if two people sign up at the same time?
  • What if two people with the same name sign up at the same time?

This is often called a race condition. We'll demonstrate this with a multi-threaded Python script that creates those race conditions. (We'll talk more about threads next time). Threads are convenient here because they let us run code concurrently.

UID Race

The following code sets up two python threads that concurrently try to insert a name (Fred) into a table with auto_increment and then try to find out what the auto_increment value was. One uses the max(uid) approach and the other uses a MySQL function called last_insert_id().

There are two key parts of the code: (1) setting up the tables and (2) inserting the user. Since the tables are empty, one of those Freds should be UID 1 and the other should be UID 2.

Here's how the tables are set up:

create table user_race (
    uid integer auto_increment primary key,
    uname varchar(30)
);

Here's how the insert is done:

def insert_user(conn,name):
    '''inserts a user and returns the UID as determined by
computing the max or using last_insert_id'''
    curs = conn.cursor()
    curs.execute('insert into user_race(uname) values (%s)',
                 [name])
    conn.commit()
    if use_last_insert_id:
        curs.execute('select last_insert_id()')
    else:
        curs.execute('select max(uid) from user_race')
    row = curs.fetchone()
    return row[0]

The demo will just run the race in a shell and print the UIDs.

python uid_race.py

The moral is: use last_insert_id(). It works, and this is what it's for.

Insert if not Exists

Another common pattern is to want to insert something in a table, but only if it isn't already in the table. For example, suppose someone is choosing a username for your website. They pick hacker42. Is that taken? If not, they can be inserted with that username.

One way to do that is:

select uid,name from userpass where name = 'hacker42';
insert into userpass(uid,name) values (null,'hacker42');

with a suitable if statement between them. But that yields another race condition. It's a race if, by chance, two people both choose hacker42 at the same time. A small chance, yes, but more likely if the web app reports that hacker42 is available, and the user dithers about it for a while. That greatly extends the time between the select and the insert.

A solution that avoids locking is to just be optimistic and handle errors if we are wrong:

insert into userpass(uid,name) values (null,'hacker42');

Here's the key code from the demo:

def insert_user(conn,name):
    curs = conn.cursor()
    try:
        nr = curs.execute('insert into userpass(username) values (%s)',[name])
        conn.commit()
        return nr == 1
    except pymysql.IntegrityError as err:
        print('unable to insert {} due to {}'.format(name,repr(err)))
        return False

Now we'll run the demo in the shell and see how it works:

python insert_if_not_exists.py

Videos

Videos of all those demonstrations are on the videos page.

Extra Information

If you'd like to stop here, that's understandable, but I think the topic of dealing with disk failure is related, interesting, and worth a few more minutes of your time. The following is encouraged but not required: raid