cs304 logo

Class on MySQL and UTF8

Today, we'll look at strings in our database that are not all-ASCII.

Goal

By the end of today, you will be able to:

Concepts

Today's Example

Use curl to download utf8.tar.

DBI

First, we'll enhance the connection to the database. Because it's a bit more work now, and work that we always have to do, we'll put this function in a new module that I've called dbi for DataBase Interface.

Here's the function that gets a connection:

def getConn(db):
    conn = MySQLdb.connect(host='localhost',
                           user='ubuntu',
                           passwd='',
                           db=db)
    conn.set_character_set('utf8')
    curs = conn.cursor()
    curs.execute('set names utf8;')
    curs.execute('set character set utf8;')
    curs.execute('set character_set_connection=utf8;')
    return conn

These say that when we send stuff to the client, we want it sent using UTF8, which is a particular encoding of Unicode. UTF-8 is an extremely common encoding, and one that is not going to break if someone uses 💩 (pile of poo). Latin-1 is a different encoding and can't handle pile of poo.

Unicode strings

The MySQLdb api still gives us byte strings instead of unicode strings. So, we need to convert them. Since any Python code that uses non-ASCII needs these functions, we'll add them to our dbi module.

First a generic converter. This function assumes that the byte string is represented using UTF-8.

def utf8(val):
    return unicode(val,'utf8') if type(val) is str else val

Rows

Rows can be represented as either tuples or dictionaries, so we should be able to convert either kind:

def dict2utf8(dic):
    '''Because dictionaries are mutable,
this mutates the dictionary;
it also returns it'''
    for k,v in dic.iteritems():
        dic[k] = utf8(v)
    return dic

def tuple2utf8(tup):
    '''returns a new tuple, with byte strings
converted to unicode strings'''
    return tuple(map(utf8,tup))

For convenience, let's write a generic conversion function:

def row2utf8(row):
    if type(row) is tuple:
        return tuple2utf8(row)
    elif type(row) is dict:
        return dict2utf8(row)
    else:
        raise TypeError('row is of unhandled type')

That's the meat of the dbi.py module.

  1. use diff to compare wmdb-ascii.sql with wmdb.sql.
  2. load the wmdb.sql file into MySQL
  3. read the testing code end of the dbi.py module.
  4. run the dbi.py script to see it in action.

people.py

How does our custom database interaction module have to change? First, when we get our list of people, we need to convert them all to unicode strings:

def getPeople(conn):
    '''Returns a list of rows, as dictionaries.'''
    curs = conn.cursor(MySQLdb.cursors.DictCursor)
    curs.execute('select name,birthdate from person')
    all = curs.fetchall()
    for p in all:
        dbi.row2utf8(p)
    return all

That's pretty much it.

  1. read the testing code end of the people.py module.
  2. run the people.py script to see it in action.
  3. pipe the output through grep to focus on just Chalamet:
    python people.py | grep Chalamet
    

Browser Charset

We have to tell the browser that we are using UTF-8 (as opposed to Latin1 or ASCII). We've actually been doing this all along. Look at the top of templates/people-list.html and you'll see:

<meta charset="utf-8">

See more about the meta tag.

Our Flask App

How does our app have to change? Not at all! All the ugliness has been hidden.

@app.route('/people/')
def displayPeople():
    conn = dbi.getConn('wmdb')
    # we could also write a different query
    # getting a subset of the people
    # and render it with the same template
    all = people.getPeople(conn)
    now = servertime.now()
    desc = 'All people as of {}'.format(now), 
    return render_template('people-list.html', 
                          desc=desc,
                          people=all)
  1. run the app
  2. click on the people list link
  3. search for Chalamet to see that he's handled correctly.

Summary

Summer Example

You may find the ete.py example interesting:

ete_str = '\xc3\xa9t\xc3\xa9'
ete_utf8 = unicode(ete_str,'utf8')
ete_latin1 = ete_utf8.encode('latin1')

print 'byte string',len(ete_str),ete_str
print 'utf8',len(ete_utf8),ete_utf8.encode('utf8')
print 'latin1',len(ete_latin1),ete_latin1
  1. Run the example
  2. Consider the lengths of the strings. Is that what you expected?
  3. Consider the output. What's wrong with the latin1 string?
  4. Do
    printenv LC_ALL

    This shows that our shell is expecting UTF-8, not Latin1.