PyMySQL: The Python Database API
This discussion of the PyMySQL Python module assumes some background knowledge of Python that you may not have gotten in CS 111. So, I strongly urge you to read my Python 102 description of some intermediate Python concepts and techniques.
If you're feeling quite shaky on your knowledge of Python (maybe it's been a long time since you took CS 111 or you skipped CS 111 because of a high-school course in Python), I strongly recommend you start with my Python 101 crash course in Python.
We'll have a brief digression about modules and APIs and then get into PyMySQL.
- Code
- Server Time
- APIs
- Virtual Environments
- PyMySQL
  - Overview
  - Connecting to the Database Management System (DBMS)
  - The Connection Object
  - Using a Database
  - Cursors
  - Using this Script
  - Prepared Queries
    - %s as placeholder
    - Bobby Tables
    - Complete Example of Prepared Queries
  - Commitment and Updates
  - None and NULL
  - Summary
  - Tuples versus Dictionaries
Code¶
All the code examples come from files in the
~cs304/pub/downloads/pymysql
folder.
Server Time¶
This is a very simple script, but it's useful in its own right, and it
illustrates some important ideas. The following code is in a file called
servertime.py
# Script to show the time on the server
# Written by Scott D. Anderson
# scott.anderson@acm.org

import sys
from datetime import datetime

def now():
    """Returns a string for the current day and time.

    The output is in something close to Internet format. It's not really
    Internet format because it neither converts to UTC nor
    appends the time zone. However, it will work nicely for MySQL.
    """
    now = datetime.now()
    return now.strftime("%Y-%m-%d %H:%M:%S")

if __name__ == '__main__':
    print(now())
Here's how we could use it from a Unix shell:
$ python servertime.py
2020-03-01 15:13:37
Very cool, particularly if you're not familiar with the Unix date
command.
If we import that file, it provides the now()
function, but doesn't
print anything. That is, the part after the if
statement doesn't run.
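To see that guard in action, here's a self-contained sketch. Because this example has to run anywhere, it writes a tiny servertime-style module to a temporary folder and imports it from there; the module name servertime_demo is made up for this illustration (the real file is servertime.py in the course downloads). Importing prints nothing, but the now() function is available afterwards.

```python
import os
import sys
import tempfile

# A tiny stand-in for servertime.py, written to a temp folder so this
# sketch is self-contained.
code = '''\
from datetime import datetime

def now():
    """Return the current day and time as a MySQL-friendly string."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

if __name__ == '__main__':
    print(now())
'''

folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'servertime_demo.py'), 'w') as f:
    f.write(code)

sys.path.insert(0, folder)
import servertime_demo   # nothing is printed: the __main__ block is skipped

print(servertime_demo.now())   # the function works when we call it ourselves
```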
APIs¶
An API is an Application Programming Interface, which describes how one part of a multi-part program talks to another.
The servertime
module is a ludicrously small example of that. When
we import
the servertime module into another Python file, we can
then use a function called servertime.now().
In general, a Python module might create a large set of functions, objects, classes and methods that the importing code can use to get something useful done.
Virtual Environments¶
The servertime.py
script imports the datetime
module, which is part of
the standard set of Python packages, so you can assume it's installed on
your system. However, if you need other packages installed, you'll either
need to be an administrator or use a virtual environment. Here is more
on virtual environments.
We will use virtual environments in CS 304 and install pymysql
and
flask
(as well as some other packages that we will get to later).
Python Database Connection
Now that we know a little Python and we can import the PyMySQL module, we will turn to that module, which lets us talk to a database back end.
PyMySQL¶
PyMySQL is a Python module that provides functions, objects and methods (an API) to interact with a MySQL DBMS. You should certainly spend a little time with one of the following:
- PyMySQL tutorial at TutorialsPoint. This is readable and almost complete. Note that it sometimes refers to PyMySQL and sometimes to MySQLdb, which is a version of the module that works with Python 2. Treat the two terms as interchangeable.
- PyMySQL Official Documentation This is comprehensive, but a reference, not a tutorial. However, there is a User Guide with some nice examples.
Overview¶
To understand how we will work with databases from Python, we need to learn the following:
- How to connect to the DBMS, either with username/password or with a credentials file
- How to use a particular database: either our own personal database or WMDB, and eventually the project database. This is just like the use statement we employed in our SQL batch files
- How to execute queries
- How to do a prepared query, which is a parameterized query, allowing parameters from untrusted sources
- How to execute updates, including commit

We may also learn some topics that are useful, but not essential for CS 304, such as command-line arguments.
Connecting to the Database Management System (DBMS)¶
When anyone connects to MySQL, they have to have a username and
password. That has been invisible to you in CS 304 because (1) I created
your MySQL accounts for you, and (2) the MySQL client (the mysql
command) doesn't ask you for a username and password. Instead, it reads
them from a credentials file called ~/.my.cnf
. Recall that ~/
means
that the file is in your home directory and that the leading dot in the
filename means that it doesn't show up in a normal ls
command. Instead,
you have to say
ls -A ~/.my.cnf
Each of you has your own unique version of that file, with your own username and (randomly generated) password in it.
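For reference, a MySQL credentials file is a plain option file; a minimal sketch looks like the following. The values here are invented (yours were generated for you), but the [client] group and the user and password keys are the standard option-file names, read both by the mysql command and by PyMySQL when we pass read_default_file.

```ini
[client]
user = wendy25
password = not-a-real-password
```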
When we connect to the DBMS from a Python script, we also have to supply a
username and password. We'll learn how to do that using the PyMySQL
package, and an alternative way using a home-grown package called
cs304dbi
.
The PyMySQL function to connect to the DBMS is called connect()
and it
can either take a username/password or the name of a credentials file. The
function returns an object called a connection
that has methods that we
will use. For now, we'll just get the connection object, and we will
typically put it in a variable called conn
.
Connecting to the DBMS using the PyMySQL Module
Fortunately, importing the module doesn't require any perverse capitalization:
import pymysql
conn = pymysql.connect(user='cs304guest',
                       password='secret', # not really
                       host='localhost',
                       database='wmdb')
There are a few reasons not to like the code above:
- If the code will be public (which it will in CS 304, since we will use GitHub), it's a security risk to have our username and password posted publicly
- We have to copy/paste the username and password to everyplace we connect, and we will connect from each of the many routes in our Flask app.
- It's tedious to modify the username/password since it gets copied to many places.
An improvement is to use the credentials file:
import pymysql
conn = pymysql.connect(read_default_file='~/.my.cnf', database='wmdb')
This is much better! The only downside of this is that, since our Flask
app will run for an extended duration and will connect many times during
its run, it'll re-read the .my.cnf
file many times, even though the file
will not change. So, an improvement is to factor the reading of the
credential file (and caching the values) from connecting to the
database. Furthermore, we can select the database and store that in the
cached credentials and have a connect()
function that (by default) uses
the cached information. Here's how it would work:
import pymysql
import cs304dbi as dbi
dbi.conf('wmdb') # only once
conn = dbi.connect() # as often as necessary
That looks like a step backwards, since it's two lines instead of one,
but the function to read the CNF file and cache the results and use
the database (the conf
function) only needs to happen once, so we
can do that when the app starts and never have to do it again. Then,
we can connect with just one simple line:
conn = dbi.connect()
The Connection Object¶
The examples above all created a connection object, which I
consistently stored in a variable called conn
. (You can, of course,
store it in any variable name you want, but I'll use conn
.)
What is this thing? Consider placing a phone call to someone, say your grandmother. Once grandma picks up, your phone is connected to her. You can ask her questions and she can respond. You can hand the phone to your mother, father, brother or sister and they can also talk to grandma. The telephone handset is an object that allows anyone to talk to grandma, and you can pass that around. It represents the connection. (Sometimes, there's a lot of static. You have a bad connection. You can hang up and call again, to get a new connection.)
The connection object represents our connection to the MySQL server. Other people can connect to the server as well (just like your cousins can all call grandma). They get their own connection objects.
We use the connection object for just a few operations:
- selecting what database we are using, and
- creating cursors
Using a Database¶
Once we connect, we may want to switch databases, either because our code
is working with multiple databases, or because the database that is set in
our ~/.my.cnf
file is not the one we want. (Of course, if we always
want the other database, we could just edit the credentials file. It's
just an ASCII file, and we can edit it with VS Code.)
In raw PyMySQL, you can use a method on the connection object:
conn.select_db('wmdb')
If you're using cs304dbi
with cached credentials and you want the next
connection to use a different database, just modify the cached credentials
before connecting. We saw that above:
dbi.conf('wmdb') # only once
conn = dbi.connect()
This is very convenient, and since the first steps (reading and caching the CNF and modifying it to use a different database) are conceptually distinct from making the connection, we will often modularize these in our Flask apps. We'll cover that when we get to it.
Cursors¶
Once we have a connection object, we need to be able to do queries and read the results. Both are done with special objects called cursors. (Indeed, every database API that I'm familiar with has cursors, so this is a very general notion, because of the issues we'll discuss now.) We get a cursor by requesting one from a connection object. Cursors are cheap, so feel free to get one for every query.
Doing a simple query is fairly easy. There is an execute
method on a
cursor that will send the given query string to the DBMS and read the
results.
Conceptually, reading the results seems easy but it comes in two
styles. One way is just to get all the results as a (possibly large)
list/array. To do that, there's a fetchall()
method on a cursor. But
because the results could be large and we might not want all of them, or
because we want to do something with each one before going on to the next,
or similar reasons, we can get the results one row at a time. There's a
fetchone()
method on a cursor to do that. If there are no more rows, the
fetchone()
method returns None
. (This API might remind you of Java's
iterators, and that's the correct intuition.)
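The fetchone() loop pattern looks like this. Since a live database isn't available on this page, the FakeCursor class below is a made-up stand-in that replays canned rows; with a real PyMySQL cursor you would call curs.execute(...) first, and the loop itself would be identical.

```python
class FakeCursor:
    """Stand-in for a tuple cursor, replaying canned rows."""
    def __init__(self, rows):
        self.rows = list(rows)
    def fetchone(self):
        # Like a real cursor, returns None when the results are exhausted
        return self.rows.pop(0) if self.rows else None

curs = FakeCursor([('Golden Tate', '1988-08-02'),
                   ('Richard Sherman', '1988-03-30')])

# The standard fetchone() loop: read rows until None
while True:
    row = curs.fetchone()
    if row is None:
        break
    print('{} born on {}'.format(row[0], row[1]))
```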
Finally, we need to discuss how each row is represented in Python. Cursors can either represent a row as a tuple or a dictionary. You specify this when you request a cursor from the connection.
Here's an example (simplescript.py) of a tuple cursor and the fetchone() method:

import cs304dbi as dbi

dsn = dbi.conf('wmdb')
conn = dbi.connect()
curs = dbi.cursor(conn)
curs.execute('''
    select name,birthdate
    from person
    where nm=123''')
row = curs.fetchone()
print('{} born on {}'.format(row[0],row[1]))
In that case, we know that (because the query is using NM
which is a
key) there can be only 0 or 1 result, so fetchone()
is a reasonable
choice.
The peopleDBI.py
file shows an alternative that gets multiple
results, represented as dictionaries:
'''Lists the first few names in the person table'''

import cs304dbi as dbi

dbi.conf('wmdb')
conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('''
    select nm,name,birthdate
    from person
    where nm < 10''')
for row in curs.fetchall():
    print('{} born on {}'.format(row['name'],row['birthdate']))
That script works, but it's not great coding style. people.py
is a
Python file that can also be used as a module in a larger system,
providing a function that returns a list of all the people in the
database:
import cs304dbi as dbi

# ==========================================================
# The functions that do most of the work.

def getPeople(conn):
    '''Returns the name and birthdate of all the entries in
    the person table, as a list of dictionaries.
    '''
    curs = dbi.dict_cursor(conn)
    curs.execute('select name,birthdate from person')
    return curs.fetchall()

# ==========================================================
# This starts the ball rolling, *if* the file is run as a
# script, rather than just being imported.

if __name__ == '__main__':
    dbi.conf('wmdb')
    conn = dbi.connect()
    pl = getPeople(conn)
    for person in pl:
        print('{name} born on {date}'
              .format(name=person['name'],
                      date=person['birthdate']))
Using this Script¶
You can use this script from the Unix command line very easily. Just activate your virtual environment and run python with the script, like this:
$ source path/to/venv/bin/activate
$ python people.py
Golden Tate born on 1988-08-02
Richard Sherman born on 1988-03-30
Channing Tatum born on 1980-04-26
Amanda Bynes born on 1986-04-03
Ann Fletcher born on 1966-05-01
Michael Sucsy born on 1973-02-14
...
Prepared Queries¶
The script above is okay, but it always returns everyone. What if we wanted just the people born in September? What if we wanted the user to be able to choose the month that they cared about? For that, we need a way to parameterize a query.
It turns out that simple-minded ways of combining a user's input with other SQL code and executing the resulting query can result in security vulnerabilities. Specifically, they invite a perennial top-ten vulnerability called SQL injection, in which the attacker tricks your code into executing some SQL code that comes from the attacker.
One important way to foil a SQL injection attack is to use prepared queries, which is a safe way to parameterize a query with untrustworthy data.
Suppose you have a query like this:
select * from person where month(birthdate) = <param>
Furthermore, you have a value that came from the user, an untrustworthy
source. Suppose it's in a variable called sought_month.
The wrong way to combine the two is to combine them as strings and then execute the result:
# this is wrong; don't do this
sql = 'select * from person where month(birthdate) = {}'.format(sought_month)
curs.execute(sql) # don't do this
The right way to do a parameterized query is to keep the trustworthy SQL code separate from the untrustworthy parameters.
PyMySQL does prepared queries as follows.[1] The execute
method can take an optional second argument which is a list of
parameter values. Each parameter value is matched up with a
placeholder in the SQL string that is the first argument. In
PyMySQL, the placeholder is %s
:
sql = 'select * from person where month(birthdate) = %s'
vals = [sought_month]
curs.execute(sql, vals) # a prepared query
If you need to have more than one parameter, just use multiple placeholders. The placeholders are matched up with parameters from left to right, just like function calls:
sql = 'select * from person where month(birthdate) = %s and year(birthdate) = %s'
vals = [ sought_month, sought_year ]
curs.execute(sql, vals) # a prepared query
%s as placeholder¶
Different APIs use different placeholder characters. Some use a
question mark. Some allow named parameters. In some, like PyMySQL, the
placeholder is %s
. This is an unfortunate choice, since %s
is
also used for a kind of string formatting, which we are not doing.
I'm sure that %s was chosen because programmers were already familiar
with %s as a kind of placeholder in string formatting, but we want to
avoid string formatting.
Bobby Tables¶
Here's a famous XKCD cartoon (number 327, "Exploits of a Mom") illustrating the point:
The punch line talks about "sanitizing your database inputs," but even
better is using prepared queries, because they absolutely keep the
nefarious user's inputs from being treated as SQL code, which is the crux
of a SQL injection attack.
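To see concretely what goes wrong, here's a sketch using the cartoon's famous input. With string formatting, the attacker's text becomes part of the SQL itself; with a prepared query, the same text is passed separately in the parameter list and treated purely as data. (No database is involved here, and the student table is made up; we're only inspecting the strings that would be sent.)

```python
# The attacker's "name", as in the cartoon
name = "Robert'); DROP TABLE Students;--"

# Wrong: string formatting splices the attack into the SQL string
bad_sql = "insert into student(name) values ('{}')".format(name)
print(bad_sql)
# The DROP TABLE statement is now inside the SQL we'd send to the server
assert 'DROP TABLE' in bad_sql

# Right: with curs.execute(safe_sql, params), the SQL and the untrusted
# value never get mixed; the value stays in the parameter list
safe_sql = "insert into student(name) values (%s)"
params = [name]
assert 'DROP TABLE' not in safe_sql
```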
Complete Example of Prepared Queries¶
Here's the complete example for peopleBornInMonth.py:

import sys
import cs304dbi as dbi

# ==========================================================
# The functions that do most of the work.

def peopleBornInMonth(conn,month):
    '''Returns the name and birthdate of people born in given month,
    as a list of dictionaries.
    '''
    curs = dbi.dict_cursor(conn)
    # this uses a prepared query, since the 'month' comes
    # from an untrustworthy source
    curs.execute('''
        select name,birthdate from person
        where month(birthdate) = %s''',
        [month])
    return curs.fetchall()

# ==========================================================
# This starts the ball rolling, *if* the file is run as a
# script, rather than just being imported.

def usage():
    print('''Usage: {script} month
Prints actors born in given month, a number from 1-12'''
          .format(script=sys.argv[0]), file=sys.stderr)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        usage()
    elif not( sys.argv[1].isdigit() and
              1 <= int(sys.argv[1]) <= 12):
        usage()
    else:
        dbi.conf('wmdb')
        conn = dbi.connect()
        people = peopleBornInMonth(conn,int(sys.argv[1]))
        for person in people:
            print('{} born on {}'
                  .format(person['name'], person['birthdate']))
The example above allows the user to run the script with a command line argument that is the sought month. If you don't know about command-line arguments in Python, you can read more about it below, but it's not essential for this topic.
Note also the use of triple-quoted strings and the ability to spread strings over multiple lines makes it much easier to embed another language (SQL code) within Python, in a readable way. I strongly suggest using that anytime your embedded SQL code is anything but trivial.
We could use these same ideas to create a script that would easily let us insert data into the database.
Commitment and Updates¶
There's an important difference between the default behavior of the
PyMySQL client and the mysql
client. The mysql
client
automatically commits any changes (makes them permanent). The
PyMySQL default behavior requires you to commit the changes by
calling the commit() method on the connection object.
In raw PyMySQL:
conn = pymysql.connect(...)
curs = conn.cursor()
curs.execute('delete from pet')
conn.commit()
The personInsert.py
script is an example. Note that the connection is
done with just one line, since we want to use the default credentials file
and not use
a different database. So all we need is
conn = dbi.connect()
Here's the relevant function in the personInsert.py
file:
def personInsert(conn, nm, name, birthdate=None):
    '''Inserts the given person'''
    curs = dbi.cursor(conn)
    # nr is number of rows affected. Should be 1
    nr = curs.execute('''
        insert into person(nm,name,birthdate,addedby)
        values (%s,%s,%s,%s)''',
        [nm,name,birthdate,STAFFID])
    conn.commit()
    return nr
Notice that because the .commit()
method is on the connection
object, not the cursor, it's convenient to pass connection objects
as arguments to functions that need to update the database,
because then they can commit their changes. It's easy enough for them
to create a cursor.
None and NULL¶
We know that SQL has a special value NULL
that is used to represent
unknown or missing values. Note that NULL
is different from the
string 'NULL'
. We can store a NULL
in a numeric column (like, say,
director
) to say that it's unknown, but we can't store the string
'NULL'
in a numeric column. (It would get converted to zero.)
We know that Python has a special value None
that is different (has
a different datatype) than any other value. See the Python
documentation on None. In short, it plays a similar role in the
Python language that NULL plays in SQL.
The PyMySQL API translates between None
and NULL
. So if you look
up the director of a movie where the database doesn't have one, you
get None
:
curs = dbi.dict_cursor(conn)
curs.execute('select director from movie where director is NULL')
row = curs.fetchone()
dir = row['director']
dir == None # True
Similarly, to set the value to NULL
you would use None
:
curs = dbi.cursor(conn)
curs.execute('update movie set director = %s where tt = %s',
             [None,some_tt])
conn.commit()
Summary¶
Here are the essential steps to using a database from Python:
- import pymysql or cs304dbi
- connect to the database using appropriate credentials
- create a cursor of the kind (tuple or dictionary) you want
- execute the SQL statement you want, using the cursor. If it needs
  parameters from an untrustworthy source, use a prepared query:
  - curs.execute(SQL) for queries with no parameters, or
  - curs.execute(SQL_with_N_placeholders, list_of_N_values) for queries with N dynamic values
- fetch the results with either fetchone() or fetchall() on the cursor
Tuples versus Dictionaries¶
Should you use tuple cursors or dictionary cursors? Some considerations I use:
Number of Columns
If I'm getting more than one or two columns of data, using results as tuples means I have to do some tedious and error-prone counting to get the indexes right:
curs = dbi.cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    print(row[3]) # which column is this?
But using dictionaries, the code is practically self-documenting:
curs = dbi.dict_cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    print(row['y']) # no need to say that this is column y
By the way, I use *
for the columns all the time in the MySQL shell
(client) but never in production code. I prefer to specify exactly what
data I am returning. Also, since the order of the columns specifies the
numbering of the elements of a tuple, *
is particularly hard to use.
Alternatively, you can use destructuring assignment to take apart the tuple:
curs = dbi.cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    (a,x,b,y,c,z) = row
    print(y) # there's no question that this is column y
That also works well, but if we have to pass the data to another function, we have to pass 6 values instead of a single dictionary, so on balance I prefer dictionaries.
Amount of Data
Dictionaries are not quite as compact as tuples, and sometimes that extra overhead matters. When I process all 300K rows of data in the diabetes database, Python runs out of memory if I use dictionary cursors, but it works fine if I use tuples.
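You can get a rough sense of the per-row overhead with sys.getsizeof. (These are shallow sizes only, and the exact numbers vary by Python version, so treat this as an illustration rather than a measurement.)

```python
import sys

# One row of results, represented both ways
row_tuple = ('Golden Tate', '1988-08-02')
row_dict = {'name': 'Golden Tate', 'birthdate': '1988-08-02'}

print(sys.getsizeof(row_tuple))  # shallow size of the tuple, in bytes
print(sys.getsizeof(row_dict))   # the dict is noticeably larger

# Multiplied by 300K rows, the difference can matter
assert sys.getsizeof(row_tuple) < sys.getsizeof(row_dict)
```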
[1] This is a slight fib. PyMySQL doesn't do real prepared queries, but it does a good job of simulating them. We will treat it as if it does. Other Python MySQL API modules do proper prepared queries, but they are harder to install, so we will stick to PyMySQL.