PyMySQL: The Python Database API

This discussion of the PyMySQL Python module assumes some background knowledge of Python that you may not have gotten in CS 111. So, I strongly urge you to read my Python 102 description of some intermediate Python concepts and techniques.

If you're feeling quite shaky on your knowledge of Python (maybe it's been a long time since you took CS 111 or you skipped CS 111 because of a high-school course in Python), I strongly recommend you start with my Python 101 crash course in Python.

We'll have a brief digression about modules and APIs and then get into PyMySQL.

Code

All the code examples come from files in the ~cs304/pub/downloads/pymysql folder.

Server Time

This is a very simple script, but it's useful in its own right, and it illustrates some important ideas. The following code is in a file called servertime.py:

# Script to show the time on the server

# Written by Scott D. Anderson
# scott.anderson@acm.org

from datetime import datetime

def now():
    """Returns a string for the current day and time.

    The output is in something close to Internet format. It's not really
    Internet format because it neither converts to UTC nor
    appends the time zone.  However, it will work nicely for MySQL.
    """
    now = datetime.now()
    return now.strftime("%Y-%m-%d %H:%M:%S")

if __name__ == '__main__':
    print(now())

Here's how we could use it from a Unix shell:

$  python servertime.py
2020-03-01 15:13:37 

Very cool, particularly if you're not familiar with the Unix date command.

If we import that file, it provides the now() function, but doesn't print anything. That is, the part after the if statement doesn't run.

APIs

An API is an Application Programming Interface, which means how one part of a multi-part program talks to another part.

The servertime module is a ludicrously small example of that. When we import the servertime module into another Python file, we can then use a function called servertime.now().
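For example, here's a two-line sketch of using it from another file in the same folder:

import servertime

print(servertime.now())    # prints something like 2020-03-01 15:13:37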

In general, a Python module might create a large set of functions, objects, classes and methods that the importing code can use to get something useful done.

Virtual Environments

The servertime.py script imports the datetime module, which is part of the standard set of Python packages, so you can assume it's installed on your system. However, if you need other packages installed, you'll either need to be an administrator or use a virtual environment. Here is more on virtual environments.

We will use virtual environments in CS 304 and install pymysql and flask (as well as some other packages that we will get to later).
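For example, here's a sketch of creating and activating a virtual environment and installing those packages (your folder names may differ):

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install pymysql flask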

Python Database Connection

Now that we know a little Python and we can import the PyMySQL module, we will turn to that module, which lets us talk to a database back end.

PyMySQL

PyMySQL is a Python module that provides functions, objects and methods (an API) to interact with a MySQL DBMS. You should certainly spend a little time with one of the following:

  • PyMySQL tutorial at TutorialsPoint. This is readable and almost complete. Note that it sometimes refers to PyMySQL and sometimes to MySQLdb, which is a version of the module that works with Python 2. Treat the two terms as interchangeable.
  • PyMySQL Official Documentation This is comprehensive, but a reference, not a tutorial. However, there is a User Guide with some nice examples.

Overview

To understand how we will work with databases from Python, we need to learn the following:

  • How to connect to the DBMS, either with username/password or with a credentials file
  • How to use a particular database, either our own personal database or WMDB, and eventually the project database. This is just like the use statement we employed in our sql batch files
  • How to execute queries
  • How to do prepared queries, which are parameterized queries that safely accept parameters from untrusted sources
  • How to execute updates, including commit

We may also cover some topics that are useful but not essential for CS 304, such as handling command-line arguments.

Connecting to the Database Management System (DBMS)

When anyone connects to MySQL, they have to have a username and password. That has been invisible to you in CS 304 because (1) I created your MySQL accounts for you, and (2) the MySQL client (the mysql command) doesn't ask you for a username and password. Instead, it reads them from a credentials file called ~/.my.cnf. Recall that ~/ means that the file is in your home directory and that the leading dot in the filename means that it doesn't show up in a normal ls command. Instead, you have to say

ls -A ~/.my.cnf

Each of you has your own unique version of that file, with your own username and (randomly generated) password in it.
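The contents are in the standard MySQL options-file format, something like this (with placeholder values, of course):

[client]
user = yourusername
password = yourrandompassword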

When we connect to the DBMS from a Python script, we also have to supply a username and password. We'll learn how to do that using the PyMySQL package, and an alternative way using a home-grown package called cs304dbi.

The PyMySQL function to connect to the DBMS is called connect() and it can either take a username/password or the name of a credentials file. The function returns an object called a connection that has methods that we will use. For now, we'll just get the connection object, and we will typically put it in a variable called conn.

Connecting to the DBMS using the PyMySQL Module

Fortunately, importing the module doesn't require any perverse capitalization:

import pymysql

conn = pymysql.connect(user='cs304guest',
                       password='secret',  # not really
                       host='localhost',
                       database='wmdb')

There are a few reasons not to like the code above:

  1. If the code will be public (which it will be in CS 304, since we will use GitHub), it's a security risk to have our username and password posted publicly.
  2. We have to copy/paste the username and password to every place we connect, and we will connect from each of the many routes in our Flask app.
  3. It's tedious to modify the username/password, since it gets copied to many places.

An improvement is to use the credentials file:

import pymysql

conn = pymysql.connect(read_default_file='~/.my.cnf', database='wmdb')

This is much better! The only downside is that, since our Flask app will run for an extended duration and will connect many times during its run, it'll re-read the .my.cnf file many times, even though the file won't change. So, an improvement is to factor the reading of the credentials file (and the caching of its values) out from connecting to the database. Furthermore, we can select the database, store that in the cached credentials, and have a connect() function that (by default) uses the cached information. Here's how it would work:

import pymysql
import cs304dbi as dbi

dbi.conf('wmdb')  # only once

conn = dbi.connect() # as often as necessary

That looks like a step backwards, since it's two lines instead of one, but the function to read the CNF file and cache the results and use the database (the conf function) only needs to happen once, so we can do that when the app starts and never have to do it again. Then, we can connect with just one simple line:

conn = dbi.connect()

The Connection Object

The examples above all created a connection object, which I consistently stored in a variable called conn. (You can, of course, store it in any variable name you want, but I'll use conn.)

What is this thing? Consider placing a phone call to someone, say your grandmother. Once grandma picks up, your phone is connected to her. You can ask her questions and she can respond. You can hand the phone to your mother, father, brother or sister and they can also talk to grandma. The telephone handset is an object that allows anyone to talk to grandma, and you can pass it around. It represents the connection. (Sometimes, there's a lot of static. You have a bad connection. You can hang up and call again, to get a new connection.)

The connection object represents our connection to the MySQL server. Other people can connect to the server as well (just like your cousins can also call grandma). They get their own connection objects.

We use the connection object for just a few operations:

  • selecting what database we are using, and
  • creating cursors

Using a Database

Once we connect, we may want to switch databases, either because our code is working with multiple databases, or because the database that is set in our ~/.my.cnf file is not the one we want. (Of course, if we always want the other database, we could just edit the credentials file. It's just an ASCII file, and we can edit it with VS Code.)

In raw PyMySQL, you can use a method on the connection object:

conn.select_db('wmdb')

If you're using cs304dbi with cached credentials and you want the next connection to use a different database, just modify the cached credentials before connecting. We saw that above:

dbi.conf('wmdb')  # only once
conn = dbi.connect()

This is very convenient, and since the first steps (reading and caching the CNF and modifying it to use a different database) are conceptually distinct from making the connection, we will often modularize these in our Flask apps. We'll cover that when we get to it.

Cursors

Once we have a connection object, we need to be able to do queries and read the results. Both are done with special objects called cursors. (Indeed, every database API that I'm familiar with has cursors, so this is a very general notion, because of the issues we'll discuss now.) We get a cursor by requesting one from a connection object. Cursors are cheap, so feel free to get one for every query.
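For example (a sketch, anticipating the examples below):

curs = conn.cursor()       # raw PyMySQL: a tuple cursor
curs = dbi.cursor(conn)    # the equivalent using the cs304dbi helper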

Doing a simple query is fairly easy. There is an execute method on a cursor that will send the given query string to the DBMS and read the results.

Conceptually, reading the results seems easy but it comes in two styles. One way is just to get all the results as a (possibly large) list/array. To do that, there's a fetchall() method on a cursor. But because the results could be large and we might not want all of them, or because we want to do something with each one before going on to the next, or similar reasons, we can get the results one row at a time. There's a fetchone() method on a cursor to do that. If there are no more rows, the fetchone() method returns None. (This API might remind you of Java's iterators, and that's the correct intuition.)
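Here's a sketch of the row-at-a-time style, assuming curs has already executed a query:

row = curs.fetchone()
while row is not None:
    print(row)             # or whatever per-row processing you need
    row = curs.fetchone()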

Finally, we need to discuss how each row is represented in Python. Cursors can either represent a row as a tuple or a dictionary. You specify this when you request a cursor from the connection.

Here's an example (simplescript.py) that shows a tuple cursor and the fetchone() method:

import cs304dbi as dbi

dbi.conf('wmdb')
conn = dbi.connect()
curs = dbi.cursor(conn)
curs.execute('''
    select name,birthdate
    from person
    where nm=123''')
row = curs.fetchone()
print('{} born on {}'.format(row[0],row[1]))

In that case, we know that (because the query uses nm, which is a key) there can be only 0 or 1 results, so fetchone() is a reasonable choice.
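If there were no such person, fetchone() would return None, so robust code would check for that. A sketch:

row = curs.fetchone()
if row is None:
    print('no person with that nm')
else:
    print('{} born on {}'.format(row[0], row[1]))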

The peopleDBI.py file shows an alternative that gets multiple results, represented as dictionaries:

'''Lists the first few names in the person table'''

import cs304dbi as dbi

dbi.conf('wmdb')
conn = dbi.connect()
curs = dbi.dict_cursor(conn)
curs.execute('''
    select nm,name,birthdate
    from person
    where nm < 10''')
for row in curs.fetchall():
    print('{} born on {}'.format(row['name'],row['birthdate']))

That script works, but it's not great coding style. people.py is a Python file that can also be used as a module in a larger system, providing a function that returns a list of all the people in the database:

import cs304dbi as dbi

# ==========================================================
# The functions that do most of the work.

def getPeople(conn):
    '''Returns the name and birthdate of all the entries in
the person table, as a list of dictionaries.
    '''
    curs = dbi.dict_cursor(conn)
    curs.execute('select name,birthdate from person')
    return curs.fetchall()

# ==========================================================
# This starts the ball rolling, *if* the file is run as a
# script, rather than just being imported.    

if __name__ == '__main__':
    dbi.conf('wmdb')
    conn = dbi.connect()
    pl = getPeople(conn)
    for person in pl:
        print('{name} born on {date}'
              .format(name=person['name'],
                      date=person['birthdate']))

Using this Script

You can use this script from the Unix command line very easily. Just activate your virtual environment and run python with the script, like this:

$ source path/to/venv/bin/activate
$ python people.py 
Golden Tate born on 1988-08-02 
Richard Sherman born on 1988-03-30 
Channing Tatum born on 1980-04-26 
Amanda Bynes born on 1986-04-03 
Ann Fletcher born on 1966-05-01 
Michael Sucsy born on 1973-02-14 
... 

Prepared Queries

The script above is okay, but it always returns everyone. What if we wanted just the people born in September? What if we wanted the user to be able to choose the month they care about? For that, we need a way to parameterize a query.

It turns out that simple-minded ways of combining a user's input with other SQL code and executing the resulting query can result in security vulnerabilities. Specifically, they open you up to a perennial top-ten vulnerability called SQL injection, in which an attacker tricks your code into executing SQL code that the attacker supplies.

One important way to foil a SQL injection attack is to use prepared queries, which is a safe way to parameterize a query with untrustworthy data.

Suppose you have a query like this:

select * from person where month(birthdate) = <param>

Furthermore, suppose you have a value that came from the user, an untrustworthy source, in a variable called sought_month.

The wrong way to combine the two is to paste them together as strings and then execute the result:

# this is wrong; don't do this
sql = 'select * from person where month(birthdate) = {}'.format(sought_month)
curs.execute(sql) # don't do this
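To see why, consider what a malicious user could supply instead of a month number (a hypothetical attack value):

# hypothetical malicious input
sought_month = '9 or 1=1'
# string formatting then builds this query, which matches every row:
#   select * from person where month(birthdate) = 9 or 1=1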

The right way to do a parameterized query is to keep the trustworthy SQL code separate from the untrustworthy parameters.

PyMySQL does prepared queries as follows.¹ The execute method can take an optional second argument, which is a list of parameter values. Each parameter value is matched up with a placeholder in the SQL string that is the first argument. In PyMySQL, the placeholder is %s:

sql = 'select * from person where month(birthdate) = %s'
vals = [sought_month]
curs.execute(sql, vals)  # a prepared query

If you need to have more than one parameter, just use multiple placeholders. The placeholders are matched up with parameters from left to right, just like function calls:

sql = 'select * from person where month(birthdate) = %s and year(birthdate) = %s'
vals = [ sought_month, sought_year ]
curs.execute(sql, vals) # a prepared query

%s as placeholder

Different APIs use different placeholder characters. Some use a question mark. Some allow named parameters. In some, like PyMySQL, the placeholder is %s. This is an unfortunate choice, since %s is also used for a kind of string formatting, which we are not doing. I'm sure that %s was chosen because programmers were already familiar with %s as a kind of placeholder in string formatting, but we want to avoid string formatting.
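To make the distinction concrete, here's a sketch (who is a hypothetical variable holding a user-supplied name):

# prepared query: PyMySQL matches %s with the value and handles it safely
curs.execute('select * from person where name = %s', [who])

# string formatting: Python itself fills in %s -- never do this with untrusted data
sql = 'select * from person where name = %s' % who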

Bobby Tables

Here's a famous XKCD cartoon illustrating the point:

XKCD cartoon about SQL injection

The punch line talks about sanitizing your database inputs, but even better is using prepared queries, because they absolutely keep the nefarious user's inputs from being treated as SQL code, which is the crux of a SQL injection attack.

Complete Example of Prepared Queries

Here's the complete example, peopleBornInMonth.py:

import cs304dbi as dbi

# ==========================================================
# The functions that do most of the work.

def peopleBornInMonth(conn,month):
    '''Returns the name and birthdate of people born in given month,
    as a list of dictionaries
    '''
    curs = dbi.dict_cursor(conn)
    # this uses a prepared query, since the 'month' comes
    # from an untrustworthy source
    curs.execute('''
        select name,birthdate from person 
        where month(birthdate) = %s''',
                 [month])
    return curs.fetchall()

# ==========================================================
# This starts the ball rolling, *if* the file is run as a
# script, rather than just being imported.    

def usage():
    print('''Usage: {script} month
    Prints actors born in given month, a number from 1-12'''
            .format(script=sys.argv[0]), file=sys.stderr)

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 2:
        usage()
    elif not( sys.argv[1].isdigit() and
              1 <= int(sys.argv[1]) <= 12):
        usage()
    else:
        dbi.conf('wmdb')
        conn = dbi.connect()
        people = peopleBornInMonth(conn,int(sys.argv[1]))
        for person in people:
            print('{} born on {}'
                  .format(person['name'], person['birthdate']))

The example above allows the user to run the script with a command line argument that is the sought month. If you don't know about command-line arguments in Python, you can read more about it below, but it's not essential for this topic.

Note also that the use of triple-quoted strings, which can spread over multiple lines, makes it much easier to embed another language (SQL code) within Python in a readable way. I strongly suggest using them any time your embedded SQL code is anything but trivial.

We could use these same ideas to create a script that would easily let us insert data into the database.

Commitment and Updates

There's an important difference between the default behavior of PyMySQL and that of the mysql client. The mysql client automatically commits any changes (makes them permanent). PyMySQL's default behavior requires you to commit the changes by calling the commit() method on the connection object.

In raw PyMySQL:

conn = pymysql.connect(...) 
curs = conn.cursor()
curs.execute('delete from pet')
conn.commit()

The personInsert.py script is an example. Note that the connection is done with just one line, since we want to use the default credentials file and not use a different database. So all we need is

conn = dbi.connect()

Here's the relevant function in the personInsert.py file (STAFFID is a constant defined elsewhere in that file):

def personInsert(conn, nm, name, birthdate=None):
    '''Inserts the given person'''
    curs = dbi.cursor(conn)
    # nr is number of rows affected. Should be 1
    nr = curs.execute('''
        insert into person(nm,name,birthdate,addedby) 
        values (%s,%s,%s,%s)''',
                      [nm,name,birthdate,STAFFID])
    conn.commit()
    return nr

Notice that because the .commit() method is on the connection object, not the cursor, it's convenient to pass connection objects as arguments to functions that need to update the database, because then they can commit their changes. It's easy enough for them to create a cursor.
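A hypothetical use (the nm and name values are made up, and dbi.conf() is assumed to have been called already, as in the script):

conn = dbi.connect()
nr = personInsert(conn, 9999, 'Test Person', '2000-01-01')
print(nr)    # prints 1 if one row was inserted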

None and NULL

We know that SQL has a special value NULL that is used to represent unknown or missing values. Note that NULL is different from the string 'NULL'. We can store a NULL in a numeric column (like, say, director) to say that it's unknown, but we can't store the string 'NULL' in a numeric column. (It would get converted to zero.)

We know that Python has a special value None that is different (has a different datatype) from any other value (see the Python documentation for None). In short, it plays a role in Python similar to the one NULL plays in SQL.

The PyMySQL API translates between None and NULL. So if you look up the director of a movie where the database doesn't have one, you get None:

curs = dbi.dict_cursor(conn)
curs.execute('select director from movie where director is NULL')
row = curs.fetchone()
director = row['director']
director is None   # True (idiomatic Python uses 'is' to test for None)

Similarly, to set the value to NULL you would use None:

curs = dbi.cursor(conn)
curs.execute('update movie set director = %s where tt = %s',
             [None,some_tt])
conn.commit()

Summary

Here are the essential steps to using a database from Python:

  1. import pymysql or cs304dbi
  2. connect to the database using appropriate credentials
  3. create a cursor of the kind (tuple or dictionary) you want
  4. execute the SQL statement you want, using the cursor. If it needs parameters from an untrustworthy source, use a prepared query:
       • curs.execute(SQL) for queries with no parameters, or
       • curs.execute(SQL_with_N_placeholders, list_of_N_values) for queries with N dynamic values
  5. fetch the results with either fetchone() or fetchall() on the cursor
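
Here's a minimal sketch that puts those steps together, using the cs304dbi helper and the wmdb examples from above:

import cs304dbi as dbi

dbi.conf('wmdb')                  # steps 1-2: credentials and database
conn = dbi.connect()
curs = dbi.dict_cursor(conn)      # step 3: a dictionary cursor
curs.execute('''select name, birthdate from person
                where nm = %s''', [123])   # step 4: a prepared query
row = curs.fetchone()             # step 5: fetch the (single) result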

Tuples versus Dictionaries

Should you use tuple cursors or dictionary cursors? Some considerations I use:

Number of Columns

If I'm getting more than one or two columns of data, using results as tuples means I have to do some tedious and error-prone counting to get the indexes right:

curs = dbi.cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    print(row[3]) # which column is this?

But using dictionaries, the code is practically self-documenting:

curs = dbi.dict_cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    print(row['y']) # no need to say that this is column y

By the way, I use * for the columns all the time in the MySQL shell (client) but never in production code. I prefer to specify exactly what data I am returning. Also, since the order of the columns determines the numbering of the elements of a tuple, * is particularly hard to use with tuple cursors.

Alternatively, you can use destructuring assignment to take apart the tuple:

curs = dbi.cursor(conn)
curs.execute('select a,x,b,y,c,z from tab')
rows = curs.fetchall()
for row in rows:
    (a,x,b,y,c,z) = row
    print(y) # there's no question that this is column y

That also works well, but if we have to pass the data to another function, we have to pass 6 values instead of a single dictionary, so on balance I prefer dictionaries.

Amount of Data

Dictionaries are not quite as compact as tuples, and sometimes that extra data matters. When I process all 300K rows of data in the diabetes database, Python runs out of memory if I use dictionary cursors, but it works fine if I use tuples.


  1. This is a slight fib. PyMySQL doesn't do real prepared queries, but it does a good job of simulating them. We will treat it as if it does. Other Python MySQL API modules do proper prepared queries, but they are harder to install, so we will stick to PyMySQL.