Prepared Queries
This reading teaches us how to properly do parameterized queries -- that is, queries that contain values that are not known at the time of writing.
Static Queries¶
Let's start with a simple lookup query. Suppose we want to look up George Clooney's information from our Python program. (I'm omitting the code to cache the CNF credentials and to use the WMDB):
conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('select name,birthdate from person where nm = 123')
info = curs.fetchone()
That query is static. Assuming George doesn't change his name or birthdate, it'll always return the same information. Of course, we can have static queries that are more likely return different data from time to time. Here's similar code to find the youngest people (there might be ties) in the database:
conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('''select name,birthdate from person
where birthdate = (select min(birthdate) from person)'''
youngest = curs.fetchall()
Notice the use of a triple-quoted string to allow the SQL code to be nicely formatted and keep the line from being too long.
But those are both very limited, particularly the first. How many times will a non-trivial program want to look up George Clooney's info. Rarely.
We'll call these static queries because all the SQL code is known at the time that the code is written. We can test it in the MySQL shell, for debugging, since testing the MySQL shell will often give better error messages. The data may change because the database contents has changed, but the query code is static.
Dynamic Queries Done Wrong¶
Imagine we want to write a function to look up a person's info, given the person's NM. We then need to combine dynamic information (the unknown NM) with the rest of the SQL code.
The obvious but wrong thing to do is to combine the dynamic information with the SQL code using some kind of string interpolation. It's the purpose of this tutorial to show a better way and to convince you why the obvious way is wrong. It'll take us a couple of steps, but we'll get there.
Let's start with the wrong function, just to clarify what we mean:
def person_lookup_bad(conn, nm):
sql = 'select name,birthdate from person where nm = {}'.format(nm) # BAD
curs = conn.dict_cursor(conn)
curs.execute(sql)
return curs.fetchone()
This will work, which is what makes it so tempting. So, what's wrong with it?
It's wrong because it allows a SQL INJECTION attack, which has been for decades one of the top ten security vulnerabilities on the web.
SQL Injection¶
To understand SQL Injection, let's start with a static query. The following query deletes George Clooney from the Person table:
conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('delete from person where nm = 123')
conn.commit()
That static SQL code is obviously too limited to be generally useful, so suppose we want to write a generic deletion function using a dynamic query:
def delete_by_nm_bad(conn, nm):
curs = dbi.cursor(conn)
sql = 'delete from person where nm = {}'.format(nm) # BAD
curs.execute(sql)
conn.commit()
That seems okay; what's wrong with it? Let's focus on the SQL string that we are building. I will show the dynamic part like this. We have:
delete from person where nm = 123
I think we can all agree that there's nothing wrong with that; it's exactly how we intended the code to work. In fact, it's the same as the static query.
Suppose we make a small change. Instead of 123, the user specifies a different string:
delete from person where nm = 123 or 1 = 1
Now the dynamic part is not just a number but some SQL code that
includes a boolean expression, 1 = 1
, that is always true. Because
it's always true, the SQL statement will delete everyone from the
table.
But how could that happen in real life? Remember, we wrote a function, which means that someone else can invoke our function with an argument:
delete_by_nm_bad(conn, '123') # okay
delete_by_nm_bad(conn, '123 or 1 = 1') # Eeek!
If we are on a coding team, would one of our teammates be so evil as
to mis-use our delete
function like this? Hopefully not. But if by
some sequence of transfers of data, it's possible for a malicious user
or hacker to supply that evil argument to our function, then we are
in trouble.
This happens in real life. A web application takes some input from a user and, through a series of function calls, that input somehow gets to a SQL query or statement, and is used in a dynamic query and results in disaster.
How to solve this?
Solution #1: Sanitize Your Inputs¶
The obvious solution is to note that we are expecting the argument to our delete function to be an integer, so let's check for that.
def delete_by_nm_sanitized(conn, nm):
try:
val = int(nm)
except:
raise Exception('Rubbish! nm is not an int: {}'.format(nm))
curs = dbi.cursor(conn)
sql = 'delete from person where nm = {}'.format(val) # okay now
curs.execute(sql)
conn.commit()
That will work and be safe.
Another SQL Injection¶
Sanitizing is great, but it can be tough sometimes. Let's start with
another function, to delete people by name. I've separated the static
part of the SQL (which I've called tmpl
for template) from the
rendered SQL, where the template has been combined with the dynamic
data.
def delete_by_name_bad(conn, name):
curs = dbi.cursor(conn)
tmpl = "delete from person where name = '{}';"
sql = tmpl.format(name) # BAD
curs.execute(sql)
conn.commit()
Let's focus on the SQL template with the interpolated dynamic value:
delete from person where name = 'Robert';
There are quotation marks in the template, surrounding the user's name. It even ends with an unnecessary semi-colon. (The semi-colon is unnecssary in PyMySQL because it knows where the end of the string is.) Everything looks kosher.
Here's the attack:
delete from person where name = 'Robert'; drop table person; -- ';
Take a moment to think about this. The attack supplies the closing
quotation mark, a semi-colon to end the query, a drop table
statement to destroy part of the database, and a comment character to
comment out the closing quotation mark and semi-colon from the
template (and anything else in the template).
This is the attack described in this famous XKCD cartoon:
Note that this attack can be mounted even if the SQL is a query rather than a delete. Imagine the web application just allows you to look up some information, rather than delete anything:
select * from person where name = 'Robert'; drop table person; -- '
We still have disaster.
Sanitizing is Hard¶
We could try to sanitize the input, and we could be pretty successful, but hackers can be devious. This is not the time to describe all the nefarious kinds of inputs that hackers have come up with to evade sanitizing code. There's a better way: Prepared Queries.
Prepared Queries¶
For each query and statement, databases do the following:
- preparing the query:
- parsing the query
- planning the execution of that query
- executing the plan
It turns out that parsing and planning doesn't always require a complete, fully specified SQL statement. The plan for doing this query:
select name, birthdate from person where nm = 123
is exactly the same as any other value for the NM. Therefore, the database can parse and plan based on SQL code with placeholders in it. For example:
select name, birthdate from person where nm = ?
The question mark is a placeholder. It allows the DBMS to parse and plan the execution, with a "fill this value in later" input.
This is very much how the Python compiler handles code like this:
def collatz(x):
if (x % 2) == 0:
return x/2
else:
return 3*x + 1
The Python compiler can compile that function, figuring out the
necessary code to execute it, without knowing the value of x
. The
value can be supplied later. The same is true of Java, C and every
other programming language.
In MySQL and all other capable DBMSes, the database can get a string (with placeholders) and a list of values, and execute the query as follows:
- preparing the query:
- parsing the query with placeholders
- planning the execution of that query
- executing the plan, dropping in the values
Placeholders in PyMySQL¶
In PyMySQL, the placeholder is %s
. In other software I've used, the
placeholder is sometimes ?
and sometimes a named parameter. But we
will be using %s
.
I'm sure PyMySQL's choice was made because, historically, Python has
had a string interpolation language that uses %s
:
combined = '%s plus %s is %s' % (3, 4, 7)
The code above means exactly the same thing as:
combined = '{} plus {} is {}'.format(3, 4, 7)
However, the prepared statement syntax is not doing string interplation. The coincidence of syntax between PyMySQL's prepared queries and Python's string interpolation operator is misleading.
Prepared Queries¶
PyMySQL1 does prepared queries by have a two-argument version of the
execute
method, keeping the values separate from the static SQL and
handing both to the DBMS separately:
curs.execute(sql_with_placeholders, list_of_values)
Armed with prepared queries, let's re-write some of the examples above. Here's how we can look up someone by NM:
def person_lookup_good(conn, nm):
sql = 'select name,birthdate from person where nm = %s'
curs = conn.dict_cursor(conn)
curs.execute(sql, [nm]) # good
info = curs.fetchone()
return info
That was easy, and it doesn't require any further sanitizing of the inputs.
Here's the delete function:
def delete_by_nm_good(conn, nm):
curs = dbi.cursor(conn)
sql = 'delete from person where nm = %s'
curs.execute(sql, [nm])
conn.commit()
You can have any number of placeholders, and they correspond to values in order (just like function arguments and parameters). For example, if we have separate values for year, month and day of birth:
def lookup_by_birthdate_good(conn, year, month, date):
curs = dbi.cursor(conn)
sql = '''select nm, name, birthdate from person
where year(birthdate) = %s
and month(birthdate) = %s
and date(birthdate) = %s '''
curs.execute(sql, [year, month, date])
Prepared Queries for Efficiency¶
In some contexts, a series of identical queries can be done more efficiently using prepared queries. The idea is a variation on how the DBMS works. We said it does this:
- prepare the query:
- parsing the query with placeholders
- planning the execution of that query
- executing the plan, dropping in the values
Suppose the DBMS prepares the following query:
select name, birthdate from person where nm = %s
and then we give it a list of NM values: (123, 234, 345 ...)
Then the execution looks like:
- prepare the query:
- parsing the query with placeholders
- planning the execution of that query
- executing the plan, dropping in 123
- executing the plan, dropping in 234
- executing the plan, dropping in 345
- ...
The preparation only has to be done once, and the plan executed many times. If preparation is expensive, this saves time.
I believe that PyMySQL doesn't do this optimization, but I have worked with APIs that do. Something to keep in mind.
Summary¶
- SQL injection is a real problem
- It happens when untrustworthy data is parsed and executed with the SQL code
- It can be avoided by sanitizing the untrustworthy data, but even better...
- Use prepared queries to keep the untrustworthy data from being parsed at all.
-
to be honest, PyMySQL doesn't do real prepared queries. To do so, the API would have to send the first argument across the network separately from the second argument, where the first argument would be parsed and compiled and then combined with the second for execution. Instead, PyMySQL fakes prepared queries by escaping the SQL (essentially using the "sanitize the inputs" approach) and sending a single string to the DBMS. However, we will pretend that PyMySQL is doing the right thing and rely on it. I am not aware of any exploits of PyMySQL. We will even continue to call this "prepared queries", even though that's not really accurate. But I want to be honest with you about what PyMySQL is doing. ↩