Transaction Processing
Overview:
- Introduce some terminology, then
- look at some examples of transactions in real tasks
At a high level of abstraction, operations on the database consist of read operations (e.g. select) and write operations (e.g. update, insert, delete). A logically related sequence of operations can be called a transaction. These transactions come into the server concurrently from many sources. Transaction processing deals with this concurrency.
When we deploy a Flask app, it can handle concurrent requests. Therefore, we will have to ensure that our coding doesn't suffer from deadlock and other anomalies. This reading discusses and explains some of those anomalies.
At the end of this reading, there are some important patterns to adopt in your SQL and Flask coding.
(Terminology note: this topic is sometimes called Online Transaction Processing (OLTP).)
- Example
- ACID
- Isolation Sorting
- Serializability
- Strict Two-Phase Locking
- Dirty Read
- Non-Repeatable Read
- Lost Update
- Deadlock
- Dining Philosophers
- Locking in Relational Databases
- Isolation Levels
- Locking in MySQL
- Transactions in MySQL
- Durability
- Write-ahead Log
- Learning More
- Practical Implications
- Baby Sitting Coop
- Lock Tables Demo
- Transaction Demo
- Inserting Tuples with AUTO_INCREMENT
- UID Race
- Insert if not Exists
- Videos
- Extra Information
Example¶
Let's start with a real world example. Imagine we are implementing a Flask app that will be a part of course registration. We write some code like this:
curs.execute('SELECT seats_available FROM course WHERE crn = %s', [crn])
n = curs.fetchone()[0]
if n > 0:
    curs.execute('INSERT INTO register(crn,bid) values (%s,%s)',
                 [crn, user_bid])
    curs.execute('''UPDATE course SET seats_available = seats_available-1
                    WHERE crn = %s''', [crn])
    conn.commit()
else:
    flash('sorry, that class is full')
What could go wrong? This reading answers that question.
ACID¶
In supporting concurrency, we want to achieve the ACID properties. (Sorry, no Electric Kool-aid.)
- A: Atomic. A transaction (several statements) occurs as if it were an atomic entity, happening at a moment in time. In particular, it either runs to completion and commits, or it aborts with no effect at all. That is, it's all or nothing.
- C: Consistent. The database always appears to be consistent: no agent ever sees an inconsistent state (integrity constraints violated). An example of an inconsistent state might be where Banner allows 26 people to register for a class with a limit of 25.
- I: Isolation. The effect of a set of concurrent transactions is the same as some serialization (linear sequence) of that set.
- D: Durability. The system must ensure that once a transaction commits, its effects remain in the database, even if the computer crashes. The Wikipedia article says it clearly:
Durability can be achieved by flushing the transaction's log records to non-volatile storage before acknowledging commitment.
We'll take this as an excuse to talk about RAID storage.
Note some new items of terminology:
- commit: a transaction successfully completed and its effect is durable.
- abort: a transaction did not complete, and decided to give up. It has no effect at all.
- rollback: The changes of a partially completed transaction are undone. This implements an abort.
Isolation Sorting¶
Imagine an update to the database that reads an old value and writes a new value that is one more than the old one: an increment to a counter. The database could be counting the number of tickets sold or the number of seats filled on an airplane. Or suppose each transaction is a debit to a bank account, so it has to read the current balance, check to see that it's enough to cover the check, and then debit the account. Imagine that two of these transactions happen concurrently:
Question: What can go wrong? Clearly, if the two reads happen before either of the writes, both reads will be of the same value and the writes will both store the same value, but that will not be correct.
Just to be concrete, let's suppose the current value is 5. Both transactions read 5 and then both write 6, so after two concurrent increments, the final value is 6, when it should be 7.
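Here is that lost increment spelled out as a minimal Python sketch; the bad interleaving is forced by hand here, but a genuine race produces the same result:

```python
# Two "transactions" increment the same counter, interleaved badly:
# both read the old value before either one writes its update.
balance = 5

r1 = balance          # transaction 1 reads 5
r2 = balance          # transaction 2 reads 5, before transaction 1 writes

balance = r1 + 1      # transaction 1 writes 6
balance = r2 + 1      # transaction 2 also writes 6, clobbering the first

print(balance)        # 6, when two increments should have produced 7
```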
Serializability¶
The term serializability means that a set of concurrent transactions is equivalent to some serial schedule. This improves concurrency without sacrificing isolation. That's what went wrong in the simultaneous increment problem, above: reading each value followed by writing the updated value is not equivalent to any serialized schedule, where one transaction happened completely before the other.
Strict Two-Phase Locking¶
Strict two-phase locking is a way to achieve the ACID properties in a DBMS. The DBMS's concurrency control module has locks for each database datum, both read locks and write locks.
Question: why two kinds? Because we can allow two read locks to both be granted, but not two write locks, so the kind of lock matters. In fact, if either lock is a write lock, the second lock is refused. This table illustrates that, where cells have an X if the given pair of locks are incompatible.
| requested \ granted | read | write |
|---|---|---|
| read |   | X |
| write | X | X |
This is called strict two-phase locking.
- It's called two-phase because you lock and unlock.
- It's called strict because locks are held until the transaction is done (not just until the lock is no longer needed).
What about abort? In the following example, the subscript indicates which transaction is doing the operation. What goes wrong?
What goes wrong is that the second transaction reads the value of X that the first transaction wrote and computes a new value based on it; then the first transaction aborts, so the second transaction is now based on something that didn't happen.
This trouble is avoided by ensuring that locks are held until the transaction commits or aborts. The set of transactions is serializable in the order in which they committed. In the scenario above, the second transaction shouldn't start (because it can't get a lock) until the first one either commits or aborts.
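The compatibility rule from the table above is small enough to state as code. Here's a toy predicate (a sketch for illustration, not how a real lock manager is built):

```python
def compatible(granted, requested):
    """True if a lock of kind `requested` can be granted while a lock of
    kind `granted` is already held: only read/read is compatible."""
    return granted == 'read' and requested == 'read'

print(compatible('read', 'read'))    # True: two readers can share
print(compatible('read', 'write'))   # False: a write conflicts with a read
print(compatible('write', 'read'))   # False
print(compatible('write', 'write'))  # False
```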
Dirty Read¶
A dirty read is reading a modified, uncommitted value. In the following scenario, the second transaction read the value x, but that value didn't stick. Suppose the first transaction is writing a new bank balance, but the operation aborts, and the second transaction proceeds thinking there are x dollars in the account, but that's not correct.
Non-Repeatable Read¶
A non-repeatable read is reading a value twice and getting different values (or related values and getting inconsistent ones). Maybe the first read is to find out how many seats are left on the plane, and the second is to get the value so that we can decrement it. But when we re-read, we get a different value!
Lost Update¶
Lost update changes don't stick. (Like the parallel increments, above.) In the following scenario, the "y" that gets written has no effect. It might be nice to not have that check deducted from our bank account, but we wouldn't want the vendor to not record the fact that we paid!
Deadlock¶
In deadlock, we are looking at the effects of locking multiple items. For example, in making a plane reservation, we might need to lock both the reservations table (call it R) and the passengers table (call it P).
We have deadlock when neither of two concurrent transactions can proceed, because each holds a lock on some data the other needs:
In traffic, deadlock is often called gridlock: cars can't go because other cars can't go.
Deadlock can happen in any system where a transaction can hold a lock while requesting another. What solutions can you think of? (We'll discuss your ideas in class.)
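One classic solution can be sketched with Python threads standing in for transactions: make every transaction acquire its locks in one agreed-upon global order, so a circular wait can never form. (The lock names R and P here are just the tables from the example above.)

```python
import threading

locks = {'P': threading.Lock(), 'R': threading.Lock()}
done = []

def transaction(name, needed):
    # Acquire locks in a fixed global (alphabetical) order, regardless of
    # the order the transaction "wants" them; this rules out circular wait.
    for table in sorted(needed):
        locks[table].acquire()
    try:
        done.append(name)          # the actual work would go here
    finally:
        for table in sorted(needed, reverse=True):
            locks[table].release()

# These two would deadlock if each grabbed its first-listed lock and then
# waited for the other; with ordered acquisition, both always finish.
t1 = threading.Thread(target=transaction, args=('T1', ['R', 'P']))
t2 = threading.Thread(target=transaction, args=('T2', ['P', 'R']))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(done))   # ['T1', 'T2']
```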
Dining Philosophers¶
The discussion of deadlock usually leads to discussion of the dining philosophers problem. (When I learned it, they used chopsticks, not forks.) The key ideas:
- Processes acting in parallel or concurrently.
- Need to share resources,
- but have exclusive access to them
- Want to avoid deadlock.
- Also want to avoid starvation; that is, everyone eventually gets the resources they want.
- We will not concern ourselves with the yuck factor of sharing eating implements 🍴
Locking in Relational Databases¶
There are two commonly implemented options for the granularity of locking in DBMSes:
- lock whole table, based on SELECT, UPDATE, DELETE and INSERT
- lock tuples.
The finer-grained option of just locking tuples allows for more concurrency, but a problem can be caused by phantoms: inserting a new tuple that wasn't locked in the original set. More-sophisticated databases handle this.
Isolation Levels¶
Many DBMSes allow the Database Administrator (a person) to choose the strictness of locking and its granularity. In increasing order of strength (and decreasing parallelism):
- READ UNCOMMITTED: dirty reads possible. Don't get a read lock.
- READ COMMITTED: dirty reads impossible, but nonrepeatable reads are possible. Get a read lock, but release it after the read.
- REPEATABLE READ: nonrepeatable reads impossible, but phantoms are possible. Lock all tuples read by a SELECT until the end of the transaction.
- SERIALIZABLE: long-duration read locks on all needed tables.
Different transactions can run at different levels. In MySQL, you do this with the SET TRANSACTION statement.
Locking in MySQL¶
MySQL has statements to lock and unlock a set of tables: LOCK TABLES and UNLOCK TABLES. Note the plural: you lock all the tables you want to lock at once, not one at a time. That helps to avoid the deadlock situation we discussed above. The LOCK statement also unlocks any tables you currently have locked, so you can't lock one and then lock another without unlocking the first.
You can choose either a READ lock or a WRITE lock. As we saw when we talked about strict two-phase locking, two READ locks can run concurrently, but nothing can run concurrently with a transaction that has a WRITE lock.
Here are the new statements:
- lock tables A, B, C read/write locks the given tables with either a read or a write lock. We'll see an example in the demo.
- unlock tables; unlocks all tables locked by the most recent lock statement.
You'll see these in the demo later in this reading, and there's a video as well.
Transactions in MySQL¶
The ability to lock tables is simple, easy and useful, but significantly impedes progress, as our later demo will show. If we use InnoDB tables, we can do transactions. The start transaction statement creates a kind of virtual or parallel world, so that concurrent transactions can run in that parallel world (before the transaction started), and when the transaction commits, concurrent transactions run in the world after the transaction finished. Something like this:
The important thing is that the second transaction doesn't have to wait for the first to finish; it just runs in the before world. The transaction demo should make this clearer.
The necessary statements are:
- start transaction starts a sequence of operations.
- commit completes the sequence and switches to the after world.
- rollback aborts the sequence and restores the before world.
Note that we must use InnoDB tables, because only that storage engine can do transactions in this way. The other table types work in an "autocommit" mode.
You'll also see these in the demo later in this reading, and there's a video as well.
Durability¶
Returning to our ACID properties, let's look at how to ensure durability. The idea is to ensure that, even if there is a power outage or a disk failure, the database can be restored to a consistent state.
Write-ahead Log¶
One common approach is a write-ahead log, in which the DBMS records
each transaction to the end of a transaction log
and saves it
to disk before modifying the database. Typically, the log is on a
different disk, so that there isn't a single point of failure.
- undo records: records previous value of changed items. To abort a transaction, just work backwards. Mark the beginning of a transaction, so that we can stop.
- If recovering from a crash, only abort uncommitted transactions. Therefore, an entry is logged when a transaction commits or aborts.
- periodically enter a checkpoint record, listing active transactions.
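The undo-record idea above can be sketched in a few lines of Python; this is an in-memory toy for illustration, nothing like a real DBMS log format:

```python
db = {'x': 5}
undo_log = []   # each entry records (key, previous value)

def write(key, value):
    undo_log.append((key, db[key]))  # log the old value *before* changing it
    db[key] = value

def rollback():
    while undo_log:                  # work backwards through the log
        key, old = undo_log.pop()
        db[key] = old

write('x', 6)
write('x', 7)
rollback()
print(db['x'])   # back to 5, as if the transaction never happened
```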
Learning More¶
There is lots of info at the MySQL site.
Practical Implications¶
The foregoing discussion is not purely theoretical. Our deployed Flask apps are multi-threaded and so we have to worry about concurrent database transactions. The following sections discuss several common situations that CS 304 students have to face and how to approach or solve them.
There are runnable demos in the downloads folder. Copy it in the usual way:
cd ~/cs304
cp -r ~cs340/pub/downloads/locking locking
cd locking
The directory has the following demos. There are videos of each of these in our videos.
- lock tables demo shows how to lock tables.
- transaction demo shows how to use transactions.
- UID Race which shows the advantage of last_insert_id() over max()
- insert if not exists which shows the best way to insert something but only if it doesn't exist
Baby Sitting Coop¶
The demos of locking and transactions use a scenario in which we implement a simple table of balances. Suppose these are babysitting hours in a co-op, but they could just as well be dollars. The idea is that if Al agrees to babysit for Chen for 3 hours, Al's balance goes up by 3 and Chen's goes down by 3. The total number of hours is fixed at, say, 100.
Here is the create-table-sit.sql file, which creates the baby-sitting co-op table.
-- Sets up a babysitting coop.
drop table if exists sit;
create table sit(
name varchar(30) primary key,
bal int)
engine = InnoDB;
insert into sit values
('al',20),
('bobby',20),
('chen',20),
('debby',20),
('ethan',20);
Here is the audit-sit.sql script, which checks that everything is kosher:
-- Audits the babysitting coop. The total number of hours should always be
-- 100 (or whatever the fixed value is).
select 'babysitting balance is', sum(bal), if(sum(bal)=100,'PASS','FAIL')
from sit;
select * from sit;
Notice that SQL has an IF expression. Just a lagniappe for you; no need to learn it.
Lock Tables Demo¶
This demo shows how to use the lock tables statement that we learned earlier.
The following transfers one hour from bobby to al in two steps but without locking. This is not safe against concurrency. If an audit were to happen between the first and second step, it would fail: the books would not balance.
update sit set bal = bal - 1 where name='bobby';
update sit set bal = bal + 1 where name='al';
The following transfers one hour from bobby to al as an atomic transaction.
lock tables sit write;
update sit set bal = bal - 1 where name='bobby';
update sit set bal = bal + 1 where name='al';
unlock tables;
To run the demo, first create the tables:
mysql < create-table-sit.sql
Then, do this:
- Have two bash terminals.
- Skip the lock tables and do the first update only.
- Switch to the other bash terminal and run the audit script. The audit fails.
- Do the second update.
- Re-run the audit. This time it passes.
- Repeat this procedure, but this time with locking; the audit script will block until the lock is released.
Final notes:
- You must lock all the tables you want in one statement. This avoids deadlock.
- The LOCK TABLES statement implicitly unlocks any locked tables.
- You cannot use a locked table more than once in a query; use aliases instead. Lock each alias.
Not every statement requires locking. Single statements don't, even if they update multiple rows. To add 10 to each balance, you can do the following; no lock required:
update sit set bal = bal + 10;
Transaction Demo¶
The locking demo is cool, but it significantly impedes progress. If we use InnoDB tables, we can do transactions. Here's an example of a transaction that is equivalent to the one in the locking demo.
start transaction;
update sit set bal=bal+1 where name='al';
update sit set bal=bal-1 where name='bobby';
commit;
To run the demo, start two terminals; in one terminal, do the transaction one step at a time, and in the other terminal, run the audit script.
Isn't this cool? Even cooler, you can abort a transaction partway through:
start transaction;
update sit set bal=20;
select * from sit;
rollback;
select * from sit;
Not everything can be rolled back. Typically DDL can't.
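You can try the same commit/rollback behavior from Python. The sketch below uses the stdlib sqlite3 module so it runs without a server; this is an illustration of the idea, and with pymysql against an InnoDB table the commit and rollback calls work the same way:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table sit (name text primary key, bal int)')
conn.executemany('insert into sit values (?,?)', [('al', 20), ('bobby', 20)])
conn.commit()

# Transfer one hour from bobby to al, then change our mind: rollback
# undoes both updates, restoring the committed state.
conn.execute("update sit set bal = bal + 1 where name = 'al'")
conn.execute("update sit set bal = bal - 1 where name = 'bobby'")
conn.rollback()

print(dict(conn.execute('select name, bal from sit')))  # both back to 20
```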
Inserting Tuples with AUTO_INCREMENT¶
One common occurrence is inserting a new element into a table that has
an auto_increment
id column. Say, inserting a new user into a table
of users. The question is: how do you find out what id
was issued
for this new user?
Here's a common coding pattern:
name = request.form.get('username')
curs.execute('insert into users(uid,username) values (NULL,%s)',
             [name])
curs.execute('select max(uid) from users')
newuid = curs.fetchone()[0]
or even:
name = request.form.get('username')
curs.execute('insert into users(uid,username) values (NULL,%s)',
             [name])
curs.execute('select uid from users where username = %s', [name])
newuid = curs.fetchone()[0]
What's wrong with that?
- What if two people sign up at the same time?
- What if two people with the same name sign up at the same time?
This is often called a race condition. We'll demonstrate this with a multi-threaded Python script that creates those race conditions. (We'll talk more about threads next time). Threads are convenient here because they let us run code concurrently.
UID Race¶
The following code sets up two python threads that concurrently try to insert a name (Fred) into a table with auto_increment and then try to find out what the auto_increment value was. One uses the max(uid) approach and the other uses a MySQL function called last_insert_id().
There are two key parts of the code: (1) setting up the tables and (2) inserting the user. Since the tables are empty, one of those Freds should be UID 1 and the other should be UID 2.
Here's how the tables are set up:
create table user_race (
uid integer auto_increment primary key,
uname varchar(30)
);
Here's how the insert is done:
def insert_user(conn, name):
    '''inserts a user and returns the UID as determined by
    computing the max or using last_insert_id'''
    curs = conn.cursor()
    curs.execute('insert into user_race(uname) values (%s)',
                 [name])
    conn.commit()
    if use_last_insert_id:
        curs.execute('select last_insert_id()')
    else:
        curs.execute('select max(uid) from user_race')
    row = curs.fetchone()
    return row[0]
The demo will just run the race in a shell and print the UIDs.
python uid_race.py
The moral is: use last_insert_id(). It works, and this is what it's for.
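SQLite exposes the same idea through the DB-API cursor's lastrowid attribute, which makes a server-free sketch of the safe pattern possible (with pymysql you would select last_insert_id() instead, as in the demo above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table user_race '
             '(uid integer primary key autoincrement, uname text)')

curs = conn.cursor()
curs.execute('insert into user_race(uname) values (?)', ('Fred',))
conn.commit()

# lastrowid is per-connection, like MySQL's last_insert_id(), so a
# concurrent insert on another connection can't confuse it the way
# select max(uid) can.
print(curs.lastrowid)   # 1
```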
Insert if not Exists¶
Another common pattern is to want to insert something into a table, but only if it isn't already in the table. For example, suppose someone is choosing a username for your website. They pick hacker42. Is that taken? If not, they can be inserted with that username.
One way to do that is:
select uid,name from userpass where name = 'hacker42';
insert into userpass(uid,name) values (null,'hacker42');
with a suitable if statement between them. But that yields another race condition. It's a race if, by chance, two people both choose hacker42 at the same time. A small chance, yes, but more likely if the web app reports that hacker42 is available, and the user dithers about it for a while. That greatly extends the time between the select and the insert.
A solution that avoids locking is to just be optimistic and handle errors if we are wrong:
insert into userpass(uid,name) values (null,'hacker42');
Here's the key code from the demo:
def insert_user(conn, name):
    curs = conn.cursor()
    try:
        nr = curs.execute('insert into userpass(username) values (%s)', [name])
        conn.commit()
        return nr == 1
    except pymysql.IntegrityError as err:
        print('unable to insert {} due to {}'.format(name, repr(err)))
        return False
Now we'll run the demo in the shell and see how it works:
python insert_if_not_exists.py
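The same optimistic pattern can be tried without a server using the stdlib sqlite3 module, which raises its own IntegrityError on a duplicate key just as pymysql does; this is a sketch of the idea, not the course demo itself:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table userpass (name text primary key)')

def insert_user(conn, name):
    '''Optimistically insert; report failure if the name is taken.'''
    try:
        conn.execute('insert into userpass(name) values (?)', (name,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()          # the duplicate insert had no effect
        return False

print(insert_user(conn, 'hacker42'))   # True: first claim on the name wins
print(insert_user(conn, 'hacker42'))   # False: duplicate rejected
```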
Videos¶
Videos of all those demonstrations are on the videos page.
Extra Information¶
If you'd like to stop here, that's understandable, but I think the topic of dealing with disk failure is related, interesting, and worth a few more minutes of your time. The following is encouraged but not required: raid