CS304: MySQL and UTF8

Class on MySQL and UTF8

Today, we'll look at strings in our database that are not all-ASCII.

Goal

By the end of today, you will be able to:

Read database values that contain UTF8-encoded strings
Render a template that displays them.

Concepts

In Python2, there's a difference between byte strings and unicode strings. Unicode strings are representation-neutral and can handle any unicode character. Byte strings are sequences of bytes, and so require knowing the representational encoding to understand them.
With plain ASCII characters, they look like this:
- str1 = 'hello'
- str2 = u'hello'
With non-ASCII characters, they might look like this (representing a single Unicode code-point, namely é (latin small letter e with accent acute)
- e1 = '\xc3\xa9'
- e2 = u'\U00e9'
With extremely non-ASCII characters, they might look like this (representing a single Unicode code-point, namely 💩 (pile of poo)
- poo1 = '\xf0\x9f\x92\xa9'
- poo2 = u'\U0001f4a9'
(In Python3, there are only unicode strings; things are very different, though you still have to know about encodings.)
You still need to know something about representations (encodings/decodings) in order to print unicode strings.
Later in the course, we'll learn about Unicode and different encodings (representations) of international character sets.
You can use several MySQL statements to tell the database what encoding you want the result sent as.
You then need to use those encodings to convert the byte strings to unicode strings.
We can directly render unicode strings into our HTML templates and send them to the browse as UTF-8.

Today's Example

Use curl to download utf8.tar.

DBI

First, we'll enhance the connection to the database. Because it's a bit more work now, and work that we always have to do, we'll put this function in a new module that I've called dbi for DataBase Interface.

Here's the function that gets a connection:

def getConn(db):
    conn = MySQLdb.connect(host='localhost',
                           user='ubuntu',
                           passwd='',
                           db=db)
    conn.set_character_set('utf8')
    curs = conn.cursor()
    curs.execute('set names utf8;')
    curs.execute('set character set utf8;')
    curs.execute('set character_set_connection=utf8;')
    return conn

These say that when we send stuff to the client, we want it sent using UTF8, which is a particular encoding of Unicode. UTF-8 is an extremely common encoding, and one that is not going to break if someone uses 💩 (pile of poo). Latin-1 is a different encoding and can't handle pile of poo.

Unicode strings

The MySQLdb api still gives us byte strings instead of unicode strings. So, we need to convert them. Since any Python code that uses non-ASCII needs these functions, we'll add them to our dbi module.

First a generic converter. This function assumes that the byte string is represented using UTF-8.

def utf8(val):
    return unicode(val,'utf8') if type(val) is str else val

Rows

Rows can be represented as either tuples or dictionaries, so we should be able to convert either kind:

def dict2utf8(dic):
    '''Because dictionaries are mutable,
this mutates the dictionary;
it also returns it'''
    for k,v in dic.iteritems():
        dic[k] = utf8(v)
    return dic

def tuple2utf8(tup):
    '''returns a new tuple, with byte strings
converted to unicode strings'''
    return tuple(map(utf8,tup))

For convenience, let's write a generic conversion function:

def row2utf8(row):
    if type(row) is tuple:
        return tuple2utf8(row)
    elif type(row) is dict:
        return dict2utf8(row)
    else:
        raise TypeError('row is of unhandled type')

That's the meat of the dbi.py module.

use diff to compare wmdb-ascii.sql with wmdb.sql.
load the wmdb.sql file into MySQL
read the testing code end of the dbi.py module.
run the dbi.py script to see it in action.

people.py

How does our custom database interaction module have to change? First, when we get our list of people, we need to convert them all to unicode strings:

def getPeople(conn):
    '''Returns a list of rows, as dictionaries.'''
    curs = conn.cursor(MySQLdb.cursors.DictCursor)
    curs.execute('select name,birthdate from person')
    all = curs.fetchall()
    for p in all:
        dbi.row2utf8(p)
    return all

That's pretty much it.

read the testing code end of the people.py module.
run the people.py script to see it in action.
pipe the output through grep to focus on just Chalamet:
```
python people.py | grep Chalamet
```

Browser Charset

We have to tell the browser that we are using UTF-8 (as opposed to Latin1 or ASCII). We've actually been doing this all along. Look at the top of templates/people-list.html and you'll see:

<meta charset="utf-8">

See more about the meta tag.

Our Flask App

How does our app have to change? Not at all! All the ugliness has been hidden.

@app.route('/people/')
def displayPeople():
    conn = dbi.getConn('wmdb')
    # we could also write a different query
    # getting a subset of the people
    # and render it with the same template
    all = people.getPeople(conn)
    now = servertime.now()
    desc = 'All people as of {}'.format(now), 
    return render_template('people-list.html', 
                          desc=desc,
                          people=all)

run the app
click on the people list link
search for Chalamet to see that he's handled correctly.

Summary

Use UTF-8 throughout, and you'll probably be safe.

Summer Example

You may find the ete.py example interesting:

ete_str = '\xc3\xa9t\xc3\xa9'
ete_utf8 = unicode(ete_str,'utf8')
ete_latin1 = ete_utf8.encode('latin1')

print 'byte string',len(ete_str),ete_str
print 'utf8',len(ete_utf8),ete_utf8.encode('utf8')
print 'latin1',len(ete_latin1),ete_latin1

Run the example
Consider the lengths of the strings. Is that what you expected?
Consider the output. What's wrong with the latin1 string?
Do
```
printenv LC_ALL
```
This shows that our shell is expecting UTF-8, not Latin1.