Today, we'll look at strings in our database that are not all-ASCII.
By the end of today, you will be able to:
str1 = 'hello'
str2 = u'hello'
e1 = '\xc3\xa9'
e2 = u'\U00e9'
poo1 = '\xf0\x9f\x92\xa9'
poo2 = u'\U0001f4a9'
render
unicode strings into our HTML
templates and send them to the browse as UTF-8.Use curl
to download utf8.tar.
First, we'll enhance the connection to the database. Because it's a bit
more work now, and work that we always have to do, we'll put this function
in a new module that I've called dbi
for DataBase Interface.
Here's the function that gets a connection:
def getConn(db): conn = MySQLdb.connect(host='localhost', user='ubuntu', passwd='', db=db) conn.set_character_set('utf8') curs = conn.cursor() curs.execute('set names utf8;') curs.execute('set character set utf8;') curs.execute('set character_set_connection=utf8;') return conn
These say that when we send stuff to the client, we want it sent using UTF8, which is a particular encoding of Unicode. UTF-8 is an extremely common encoding, and one that is not going to break if someone uses 💩 (pile of poo). Latin-1 is a different encoding and can't handle pile of poo.
The MySQLdb
api still gives us byte strings
instead of unicode strings. So, we need to convert them. Since
any Python code that uses non-ASCII needs these functions, we'll add them
to our dbi
module.
First a generic converter. This function assumes that the byte string is represented using UTF-8.
def utf8(val): return unicode(val,'utf8') if type(val) is str else val
Rows can be represented as either tuples or dictionaries, so we should be able to convert either kind:
def dict2utf8(dic): '''Because dictionaries are mutable, this mutates the dictionary; it also returns it''' for k,v in dic.iteritems(): dic[k] = utf8(v) return dic def tuple2utf8(tup): '''returns a new tuple, with byte strings converted to unicode strings''' return tuple(map(utf8,tup))
For convenience, let's write a generic conversion function:
def row2utf8(row): if type(row) is tuple: return tuple2utf8(row) elif type(row) is dict: return dict2utf8(row) else: raise TypeError('row is of unhandled type')
That's the meat of the dbi.py
module.
diff
to compare wmdb-ascii.sql
with wmdb.sql
.
wmdb.sql
file into MySQL
dbi.py
module.
dbi.py
script to see it in action.
How does our custom database interaction module have to change? First, when we get our list of people, we need to convert them all to unicode strings:
def getPeople(conn): '''Returns a list of rows, as dictionaries.''' curs = conn.cursor(MySQLdb.cursors.DictCursor) curs.execute('select name,birthdate from person') all = curs.fetchall() for p in all: dbi.row2utf8(p) return all
That's pretty much it.
people.py
module.
people.py
script to see it in action.
python people.py | grep Chalamet
We have to tell the browser that we are using UTF-8 (as opposed to
Latin1 or ASCII). We've actually been doing this all along. Look at the
top of templates/people-list.html
and you'll see:
<meta charset="utf-8">
See more about the meta tag.
How does our app have to change? Not at all! All the ugliness has been hidden.
@app.route('/people/') def displayPeople(): conn = dbi.getConn('wmdb') # we could also write a different query # getting a subset of the people # and render it with the same template all = people.getPeople(conn) now = servertime.now() desc = 'All people as of {}'.format(now), return render_template('people-list.html', desc=desc, people=all)
You may find the ete.py
example interesting:
ete_str = '\xc3\xa9t\xc3\xa9' ete_utf8 = unicode(ete_str,'utf8') ete_latin1 = ete_utf8.encode('latin1') print 'byte string',len(ete_str),ete_str print 'utf8',len(ete_utf8),ete_utf8.encode('utf8') print 'latin1',len(ete_latin1),ete_latin1
printenv LC_ALL
This shows that our shell is expecting UTF-8, not Latin1.