Prepared Queries

This reading teaches us how to properly do parameterized queries -- that is, queries that contain values that are not known at the time of writing.

Static Queries

Let's start with a simple lookup query. Suppose we want to look up George Clooney's information from our Python program. (I'm omitting the code to cache the CNF credentials and to use the WMDB):

conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('select name,birthdate from person where nm = 123')
info = curs.fetchone()

That query is static. Assuming George doesn't change his name or birthdate, it'll always return the same information. Of course, we can have static queries that are more likely return different data from time to time. Here's similar code to find the youngest people (there might be ties) in the database:

conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('''select name,birthdate from person 
                where birthdate = (select min(birthdate) from person)'''
youngest = curs.fetchall()

Notice the use of a triple-quoted string to allow the SQL code to be nicely formatted and keep the line from being too long.

But those are both very limited, particularly the first. How many times will a non-trivial program want to look up George Clooney's info. Rarely.

We'll call these static queries because all the SQL code is known at the time that the code is written. We can test it in the MySQL shell, for debugging, since testing the MySQL shell will often give better error messages. The data may change because the database contents has changed, but the query code is static.

Dynamic Queries Done Wrong

Imagine we want to write a function to look up a person's info, given the person's NM. We then need to combine dynamic information (the unknown NM) with the rest of the SQL code.

The obvious but wrong thing to do is to combine the dynamic information with the SQL code using some kind of string interpolation. It's the purpose of this tutorial to show a better way and to convince you why the obvious way is wrong. It'll take us a couple of steps, but we'll get there.

Let's start with the wrong function, just to clarify what we mean:

def person_lookup_bad(conn, nm):
    sql = 'select name,birthdate from person where nm = {}'.format(nm)  # BAD
    curs = conn.dict_cursor(conn)
    curs.execute(sql)
    return curs.fetchone()

This will work, which is what makes it so tempting. So, what's wrong with it?

It's wrong because it allows a SQL INJECTION attack, which has been for decades one of the top ten security vulnerabilities on the web.

SQL Injection

To understand SQL Injection, let's start with a static query. The following query deletes George Clooney from the Person table:

conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('delete from person where nm = 123')
conn.commit()

That static SQL code is obviously too limited to be generally useful, so suppose we want to write a generic deletion function using a dynamic query:

def delete_by_nm_bad(conn, nm):
    curs = dbi.cursor(conn)
    sql = 'delete from person where nm = {}'.format(nm)  # BAD
    curs.execute(sql)
    conn.commit()

That seems okay; what's wrong with it? Let's focus on the SQL string that we are building. I will show the dynamic part like this. We have:

delete from person where nm = 123

I think we can all agree that there's nothing wrong with that; it's exactly how we intended the code to work. In fact, it's the same as the static query.

Suppose we make a small change. Instead of 123, the user specifies a different string:

delete from person where nm = 123 or 1 = 1

Now the dynamic part is not just a number but some SQL code that includes a boolean expression, 1 = 1, that is always true. Because it's always true, the SQL statement will delete everyone from the table.

But how could that happen in real life? Remember, we wrote a function, which means that someone else can invoke our function with an argument:

delete_by_nm_bad(conn, '123')          # okay
delete_by_nm_bad(conn, '123 or 1 = 1') # Eeek!

If we are on a coding team, would one of our teammates be so evil as to mis-use our delete function like this? Hopefully not. But if by some sequence of transfers of data, it's possible for a malicious user or hacker to supply that evil argument to our function, then we are in trouble.

This happens in real life. A web application takes some input from a user and, through a series of function calls, that input somehow gets to a SQL query or statement, and is used in a dynamic query and results in disaster.

How to solve this?

Solution #1: Sanitize Your Inputs

The obvious solution is to note that we are expecting the argument to our delete function to be an integer, so let's check for that.

def delete_by_nm_sanitized(conn, nm):
    try:
        val = int(nm)
    except:
        raise Exception('Rubbish! nm is not an int: {}'.format(nm))
    curs = dbi.cursor(conn)
    sql = 'delete from person where nm = {}'.format(val)  # okay now
    curs.execute(sql)
    conn.commit()

That will work and be safe.

Another SQL Injection

Sanitizing is great, but it can be tough sometimes. Let's start with another function, to delete people by name. I've separated the static part of the SQL (which I've called tmpl for template) from the rendered SQL, where the template has been combined with the dynamic data.

def delete_by_name_bad(conn, name):
    curs = dbi.cursor(conn)
    tmpl = "delete from person where name = '{}';"
    sql = tmpl.format(name) # BAD
    curs.execute(sql)
    conn.commit()

Let's focus on the SQL template with the interpolated dynamic value:

delete from person where name = 'Robert';

There are quotation marks in the template, surrounding the user's name. It even ends with an unnecessary semi-colon. (The semi-colon is unnecssary in PyMySQL because it knows where the end of the string is.) Everything looks kosher.

Here's the attack:

delete from person where name = 'Robert'; drop table person; -- ';

Take a moment to think about this. The attack supplies the closing quotation mark, a semi-colon to end the query, a drop table statement to destroy part of the database, and a comment character to comment out the closing quotation mark and semi-colon from the template (and anything else in the template).

This is the attack described in this famous XKCD cartoon:

Exploits of a Mom

Note that this attack can be mounted even if the SQL is a query rather than a delete. Imagine the web application just allows you to look up some information, rather than delete anything:

select * from person where name = 'Robert'; drop table person; -- '

We still have disaster.

Sanitizing is Hard

We could try to sanitize the input, and we could be pretty successful, but hackers can be devious. This is not the time to describe all the nefarious kinds of inputs that hackers have come up with to evade sanitizing code. There's a better way: Prepared Queries.

Prepared Queries

For each query and statement, databases do the following:

  • preparing the query:
    • parsing the query
    • planning the execution of that query
  • executing the plan

It turns out that parsing and planning doesn't always require a complete, fully specified SQL statement. The plan for doing this query:

select name, birthdate from person where nm = 123

is exactly the same as any other value for the NM. Therefore, the database can parse and plan based on SQL code with placeholders in it. For example:

select name, birthdate from person where nm = ?

The question mark is a placeholder. It allows the DBMS to parse and plan the execution, with a "fill this value in later" input.

This is very much how the Python compiler handles code like this:

def collatz(x):
    if (x % 2) == 0:
        return x/2
    else:
        return 3*x + 1

The Python compiler can compile that function, figuring out the necessary code to execute it, without knowing the value of x. The value can be supplied later. The same is true of Java, C and every other programming language.

In MySQL and all other capable DBMSes, the database can get a string (with placeholders) and a list of values, and execute the query as follows:

  • preparing the query:
    • parsing the query with placeholders
    • planning the execution of that query
  • executing the plan, dropping in the values

Placeholders in PyMySQL

In PyMySQL, the placeholder is %s. In other software I've used, the placeholder is sometimes ? and sometimes a named parameter. But we will be using %s.

I'm sure PyMySQL's choice was made because, historically, Python has had a string interpolation language that uses %s:

combined = '%s plus %s is %s' % (3, 4, 7)

The code above means exactly the same thing as:

combined = '{} plus {} is {}'.format(3, 4, 7)

However, the prepared statement syntax is not doing string interplation. The coincidence of syntax between PyMySQL's prepared queries and Python's string interpolation operator is misleading.

Prepared Queries

PyMySQL1 does prepared queries by have a two-argument version of the execute method, keeping the values separate from the static SQL and handing both to the DBMS separately:

curs.execute(sql_with_placeholders, list_of_values)

Armed with prepared queries, let's re-write some of the examples above. Here's how we can look up someone by NM:

def person_lookup_good(conn, nm):
    sql = 'select name,birthdate from person where nm = %s'
    curs = conn.dict_cursor(conn)
    curs.execute(sql, [nm])  # good
    info = curs.fetchone()
    return info

That was easy, and it doesn't require any further sanitizing of the inputs.

Here's the delete function:

def delete_by_nm_good(conn, nm):
    curs = dbi.cursor(conn)
    sql = 'delete from person where nm = %s'
    curs.execute(sql, [nm])
    conn.commit()

You can have any number of placeholders, and they correspond to values in order (just like function arguments and parameters). For example, if we have separate values for year, month and day of birth:

def lookup_by_birthdate_good(conn, year, month, date):
    curs = dbi.cursor(conn)
    sql = '''select nm, name, birthdate from person 
             where year(birthdate) = %s 
             and month(birthdate) = %s 
             and date(birthdate) = %s '''
    curs.execute(sql, [year, month, date])

Prepared Queries for Efficiency

In some contexts, a series of identical queries can be done more efficiently using prepared queries. The idea is a variation on how the DBMS works. We said it does this:

  • prepare the query:
    • parsing the query with placeholders
    • planning the execution of that query
  • executing the plan, dropping in the values

Suppose the DBMS prepares the following query:

select name, birthdate from person where nm = %s

and then we give it a list of NM values: (123, 234, 345 ...)

Then the execution looks like:

  • prepare the query:
    • parsing the query with placeholders
    • planning the execution of that query
  • executing the plan, dropping in 123
  • executing the plan, dropping in 234
  • executing the plan, dropping in 345
  • ...

The preparation only has to be done once, and the plan executed many times. If preparation is expensive, this saves time.

I believe that PyMySQL doesn't do this optimization, but I have worked with APIs that do. Something to keep in mind.

Summary

  • SQL injection is a real problem
  • It happens when untrustworthy data is parsed and executed with the SQL code
  • It can be avoided by sanitizing the untrustworthy data, but even better...
  • Use prepared queries to keep the untrustworthy data from being parsed at all.

  1. to be honest, PyMySQL doesn't do real prepared queries. To do so, the API would have to send the first argument across the network separately from the second argument, where the first argument would be parsed and compiled and then combined with the second for execution. Instead, PyMySQL fakes prepared queries by escaping the SQL (essentially using the "sanitize the inputs" approach) and sending a single string to the DBMS. However, we will pretend that PyMySQL is doing the right thing and rely on it. I am not aware of any exploits of PyMySQL. We will even continue to call this "prepared queries", even though that's not really accurate. But I want to be honest with you about what PyMySQL is doing.